Abstract
New drug development is costly and time-consuming for the pharmaceutical industry. However, the pharmaceutical community has long neglected one obstacle to it: regulatory affairs translation that is relatively poor in quality, expensive, and slow. This study is the first to design a tailored, lightweight large language model (LLM), PhT-LM, to improve regulatory affairs translation and reduce translation costs. A translation dataset of 34,769 bilingual pairs was built by crawling, cleaning, and verifying bilingual documents from the official websites of competent regulatory authorities in China and of international organizations. The open-source Qwen-1_8B-Chat model was then chosen as the base model and fine-tuned on this dataset using low-rank adaptation (LoRA). Finally, retrieval-augmented generation (RAG) was applied to further enhance the model's translation performance. On a self-constructed training corpus, this lightweight model achieved a mean BLEU-4 score of 36.018 and a mean CHRF score of 58.047, improvements of 16% to 65% over popular general-purpose large language models, with a favorable cost-benefit analysis. Human evaluation further confirmed the model's quality, particularly its superiority in English-Chinese translation tasks. Our model offers the pharmaceutical industry worldwide a promising tool for translating regulatory affairs documents with high quality, greater efficiency, and lower cost.
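To illustrate the retrieval-augmented step in general terms, the following is a minimal sketch, not the implementation used for PhT-LM. It assumes a small in-memory translation memory, string similarity from Python's standard `difflib` as the retriever, and a plain-text prompt format. The names `MEMORY`, `retrieve`, and `build_prompt`, as well as the example sentence pairs, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory of (English, Chinese) pairs.
# Illustrative examples only; not drawn from the study's dataset.
MEMORY = [
    ("The applicant shall submit stability data.",
     "申请人应提交稳定性数据。"),
    ("The marketing authorization holder is responsible for pharmacovigilance.",
     "药品上市许可持有人负责药物警戒。"),
    ("Clinical trials shall comply with good clinical practice.",
     "临床试验应符合药物临床试验质量管理规范。"),
]


def retrieve(query, memory, k=2):
    """Return the k pairs whose source side is most similar to the query.

    Similarity here is difflib's character-level ratio, an assumption made
    for simplicity; a real system might use embeddings or BM25 instead.
    """
    def score(pair):
        return SequenceMatcher(None, query.lower(), pair[0].lower()).ratio()

    return sorted(memory, key=score, reverse=True)[:k]


def build_prompt(query, memory, k=2):
    """Prepend the retrieved pairs as reference translations to the request."""
    lines = ["Translate the English sentence into Chinese. "
             "Use the reference translations for terminology."]
    for src, tgt in retrieve(query, memory, k):
        lines.append(f"English: {src}")
        lines.append(f"Chinese: {tgt}")
    lines.append(f"English: {query}")
    lines.append("Chinese:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "The applicant shall submit stability data for the drug product.",
        MEMORY,
    ))
```

The resulting prompt would then be passed to the fine-tuned model. The design idea is that retrieved pairs supply domain terminology at inference time, so the model does not have to rely on its parameters alone.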