Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example

Abstract

BACKGROUND: Real-world data collection in oncology remains a challenge because medical notes are complex and unstructured. Recently, large language models (LLMs) have demonstrated success in extracting information from free-text data across various domains. This study evaluates the performance of multiple small LLMs as information extractors on Polish medical notes.

MATERIALS AND METHODS: Electronic health records (EHRs) of 302 bone sarcoma patients treated in a reference center between 2016 and 2022 were selected. Five variables (pathology type, tumor size, localization, grade, and primary resection) were annotated by an experienced oncologist. Four LLMs were queried with multiple prompting techniques and asked to return the value of each variable inside an XML tag. Among values that did not match the annotation, we additionally distinguished valid results, i.e., those in the expected format that contained a keyword or phrase from a per-variable, expert-devised list. An ensemble voting approach was then applied, selecting values that appeared in the majority of valid outputs.

RESULTS: Single-model accuracy was modest (17.5%-30.3%) and highly prompt-dependent. Tumor localization was the easiest variable to extract, with an accuracy of up to 36.2%. The majority of non-concordant values were non-valid. The voting strategy improved performance significantly, reaching 83.6% overall accuracy and peaking at 90.0% for the resection type variable.

CONCLUSIONS: Our study highlights the potential of lightweight LLMs for automating data extraction from medical notes, which could significantly accelerate clinical research. A single small LLM is not yet sufficient for real use cases in non-English settings; however, prompt engineering and ensemble methods can greatly improve performance.
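The extraction-and-voting procedure described in the methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag name `grade`, the keyword tuple `GRADE_KEYWORDS`, and the helper names `extract_tag`, `is_valid`, and `vote` are hypothetical, and the rule used here (a strict majority among valid outputs, otherwise no answer) is only one plausible reading of "values appearing in the majority of valid outputs".

```python
import re
from collections import Counter
from typing import List, Optional, Sequence

# Hypothetical keyword list for one variable; the study's expert-devised
# lists are not given in the abstract.
GRADE_KEYWORDS = ("g1", "g2", "g3")


def extract_tag(output: str, tag: str) -> Optional[str]:
    """Return the text inside <tag>...</tag>, or None if the tag is absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, output, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


def is_valid(value: Optional[str], keywords: Sequence[str]) -> bool:
    """A value is valid if it was extracted and contains a listed keyword."""
    if not value:
        return False
    lowered = value.lower()
    return any(keyword in lowered for keyword in keywords)


def vote(outputs: List[str], tag: str, keywords: Sequence[str]) -> Optional[str]:
    """Return the value held by a strict majority of valid outputs, else None."""
    values = [extract_tag(output, tag) for output in outputs]
    valid = [v.lower() for v in values if is_valid(v, keywords)]
    if not valid:
        return None
    value, count = Counter(valid).most_common(1)[0]
    return value if count > len(valid) / 2 else None


if __name__ == "__main__":
    # Made-up outputs from several model/prompt combinations for one note.
    outputs = [
        "<grade>G2</grade>",
        "The grade is <grade> g2 </grade>",
        "<grade>unknown</grade>",  # extracted, but contains no keyword
        "no tag here",             # wrong format, nothing extracted
        "<grade>G3</grade>",
    ]
    print(vote(outputs, "grade", GRADE_KEYWORDS))  # g2
```

Filtering before voting means malformed or off-vocabulary outputs cannot outvote well-formed ones, which would fit the reported pattern that most non-concordant values were non-valid while the ensemble still reached high accuracy.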
