A lightweight large language model for regulatory affairs translation in pharmaceutical industry

Abstract

New drug development is a costly and time-consuming undertaking in the pharmaceutical industry. Regulatory affairs translation that is relatively poor in quality, expensive, and slow hinders this process, yet the problem has long been neglected by the pharmaceutical community. This study is the first to design a tailored lightweight large language model (LLM), PhT-LM, to improve regulatory affairs translation and reduce translation costs. Bilingual documents were crawled from the official websites of competent regulatory authorities in China and of international organizations, then cleaned and verified, yielding a translation dataset of 34,769 bilingual entries. Next, the open-source Qwen-1_8B-Chat model was chosen as the base model and fine-tuned on this dataset using the low-rank adaptation (LoRA) technique. Finally, retrieval-augmented generation (RAG) was used to further enhance the model's translation performance. On a self-constructed training corpus, this lightweight model achieved a mean BLEU-4 score of 36.018 and a mean CHRF score of 58.047, an improvement of 16% to 65% over popular general-purpose large language models, together with a favorable cost-benefit analysis. Human evaluation further confirmed the model's quality, particularly its superiority in English-Chinese translation tasks. Our model offers the pharmaceutical industry worldwide a promising tool for translating regulatory affairs documents with high quality, greater efficiency, and lower cost.
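
The retrieval-augmented step mentioned in the abstract can be illustrated with a minimal sketch. This is not the PhT-LM implementation: the abstract does not describe the retriever, so the character n-gram Dice similarity, the prompt layout, and all names below (`similarity`, `retrieve`, `build_prompt`, `MEMORY`) are assumptions made for illustration. The idea shown is only the general one: look up the most similar verified bilingual pairs and place them in the prompt ahead of the sentence to be translated.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 0)))


def similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram multisets, in [0.0, 1.0]."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2 * overlap / total


def retrieve(query: str, memory: list, k: int = 2) -> list:
    """Return the k bilingual pairs whose source side is most similar to the query."""
    ranked = sorted(memory, key=lambda pair: similarity(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, memory: list, k: int = 2) -> str:
    """Assemble a few-shot translation prompt from the retrieved pairs (assumed layout)."""
    lines = ["Translate the regulatory text from English to Chinese.", ""]
    for src, tgt in retrieve(query, memory, k):
        lines += [f"English: {src}", f"Chinese: {tgt}", ""]
    lines += [f"English: {query}", "Chinese:"]
    return "\n".join(lines)


# Hypothetical translation memory; a real one would hold the verified bilingual dataset.
MEMORY = [
    ("marketing authorization holder", "药品上市许可持有人"),
    ("clinical trial application", "临床试验申请"),
    ("good manufacturing practice", "药品生产质量管理规范"),
]

if __name__ == "__main__":
    print(build_prompt("The marketing authorization holder shall submit a report.", MEMORY, k=1))
```

The resulting prompt string would then be passed to the fine-tuned model. A production system would more likely use an embedding-based retriever, but the surface-level similarity here keeps the sketch dependency-free.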
