Biomedical Text Normalization through Generative Modeling

基于生成模型的生物医学文本规范化

阅读:1

Abstract

OBJECTIVE: Around 80% of electronic health record (EHR) data consists of unstructured medical language text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. Large language models' (LLMs) ability understand context and semantic variations makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using large-language models. METHODS: We implemented four LLM-based normalization strategies ( Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation based normalization [RAGnorm]) and one baseline approach using TF-IDF based String Matching. We evaluated performance across three datasets of SNOMED-mapped condition terms: (1) an oncology-specific dataset, (2) a representative sample of institutional medical conditions, and (3) a dataset of commonly occurring condition codes (>1000 uses) from our institution. We measured performance by recording the mean shortest path length between predicted and true SNOMED CT terms. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalizes terms to the Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms. RESULTS: We found that RAGnorm was the most effective throughout each dataset, achieving a mean shortest path length of 0.21 for the domain-specific dataset, 0.58 for the sampled dataset, and 0.90 for the top terms dataset. It achieved a micro F1 score of 88.01 on task 4 of the TAC2017 conference, surpassing all other models without viewing the provided training data. CONCLUSION: We find that retrieval-focused approaches overcome traditional LLM limitations for this task. RAGnorm and related retrieval techniques should be explored further for the normalization of biomedical free text.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。