Domain and Language adaptive pre-training of BERT models for Korean-English bilingual clinical text analysis


Abstract

OBJECTIVE: To develop bilingual Korean-English medical language models through domain- and language-adaptive pre-training and evaluate their performance in clinical text analysis tasks, specifically semantic similarity and multi-label classification. METHODS: A bilingual corpus comprising Korean (medical textbooks and online health articles) and English (medical textbooks, health-related articles, and MIMIC-IV EHRs) clinical texts was constructed. Three BERT-based foundation models (Korean Medical [KM-BERT], English Biomedical [BioBERT], and multilingual general domain [M-BERT]) underwent additional pre-training using a newly created bilingual WordPiece vocabulary (45,000 tokens). Model performance was assessed intrinsically on the medical semantic textual similarity (MedSTS) benchmark and extrinsically through multi-label classification of chest computed tomography (CT) reports from tertiary hospitals. Macro F1 scores and Pearson's correlation coefficients were used as primary evaluation metrics. RESULTS: After bilingual pre-training, the Korean semantic similarity performance of bi-BioBERT improved markedly, with the Pearson correlation coefficient rising from 0.190 to 0.871. In the multi-label classification of chest CT reports, all bilingual models outperformed their respective foundation models; bi-KM-BERT achieved the highest Macro F1 score in both internal (0.9460 vs. 0.8902 for KM-BERT) and external validation (0.9288 vs. 0.8495 for KM-BERT). However, bi-KM-BERT and bi-M-BERT showed semantic similarity declines in Korean tasks, indicating catastrophic forgetting. Nevertheless, gradient-based token-importance heatmaps confirmed that the bilingual models captured critical cross-lingual medical contexts more effectively. CONCLUSION: The findings underscore that careful bilingual vocabulary curation and targeted domain-adaptive pre-training enhance natural language processing (NLP) performance in multilingual clinical environments, even with modest training resources.
Continual-learning strategies should be explored to mitigate minor forgetting effects. Domain- and language-adaptive pre-training of bilingual medical corpora improves NLP model performance in multilingual clinical settings, thereby providing a scalable strategy for enhancing clinical text analysis capabilities in resource-limited bilingual contexts. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-025-03262-7.
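The abstract names Macro F1 and Pearson's correlation coefficient as the primary evaluation metrics. A minimal from-scratch sketch of both, for illustration only (the inputs are hypothetical; the paper's actual evaluation pipeline is not reproduced here):

```python
# Illustrative implementations of the two metrics named in the abstract.
# Inputs are hypothetical examples, not data from the study.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores,
    e.g. gold vs. predicted MedSTS similarity scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def macro_f1(y_true, y_pred, n_labels):
    """Macro-averaged F1 over multi-label binary indicator vectors,
    e.g. one finding label per position in a chest CT report."""
    f1s = []
    for k in range(n_labels):
        tp = sum(t[k] and p[k] for t, p in zip(y_true, y_pred))
        fp = sum((not t[k]) and p[k] for t, p in zip(y_true, y_pred))
        fn = sum(t[k] and (not p[k]) for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Macro averaging weights every label equally, regardless of frequency.
    return sum(f1s) / n_labels
```

Macro averaging is the natural choice here because rare radiology findings would otherwise be swamped by frequent ones under micro averaging.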
