Benchmarking large language models for pathogen-disease classification in post-acute infection syndromes

Abstract

Post-Acute Infection Syndromes (PAIS) are medical conditions that persist following acute infections from pathogens such as SARS-CoV-2, Epstein-Barr virus, and Influenza virus. Despite growing global awareness of PAIS and the exponential increase in biomedical literature, only a small fraction of this literature pertains specifically to PAIS, making the identification of pathogen-disease associations within such a vast, heterogeneous, and unstructured corpus a significant challenge for researchers. This study evaluated the effectiveness of large language models (LLMs) in extracting these associations through a binary classification task using a curated dataset of 1000 manually labeled PubMed abstracts. We benchmarked a wide range of open-source LLMs of varying sizes (4B-70B parameters), including generalist, reasoning, and biomedical-specific models. We also investigated the extent to which prompting strategies such as zero-shot, few-shot, and Chain of Thought (CoT) methods can improve classification performance. Our results indicate that model performance varied by size, architecture, and prompting strategy. Zero-shot prompting produced the most reliable results: Mistral-Small-Instruct-2409 and Llama-3.1-Nemotron-70B-Instruct achieved balanced accuracy scores of 0.81 and 0.80, respectively, along with macro-F1 scores of up to 0.80, while maintaining minimal invalid outputs. While few-shot and CoT prompting often degraded performance in generalist models, reasoning models such as DeepSeek-R1-Distill-Llama-70B and QwQ-32B demonstrated improved accuracy and consistency when provided with additional context.
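
The evaluation reported above rests on three quantities: balanced accuracy, macro-F1, and the share of invalid outputs. The following is a minimal sketch of how these could be computed for a binary task, not the authors' evaluation code. The function names, the accepted reply strings ("yes"/"no"), and the choice to count an invalid reply as a miss for its true class are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def parse_label(raw: str) -> Optional[int]:
    """Map a raw model reply to 1 (association), 0 (none), or None (invalid).

    The accepted strings are hypothetical; the abstract does not specify
    the answer format used in the study.
    """
    text = raw.strip().lower()
    if text in {"yes", "1", "true"}:
        return 1
    if text in {"no", "0", "false"}:
        return 0
    return None


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[Optional[int]]
) -> Dict[str, float]:
    """Compute balanced accuracy, macro-F1, and invalid-output rate.

    Assumption: an invalid prediction (None) is never counted as correct,
    so it lowers the recall of its true class.
    """
    recalls, f1s = [], []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        support = sum(1 for t in y_true if t == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        recall = _safe_div(tp, support)
        precision = _safe_div(tp, predicted)
        recalls.append(recall)
        f1s.append(_safe_div(2 * precision * recall, precision + recall))
    invalid = sum(1 for p in y_pred if p is None)
    return {
        "balanced_accuracy": sum(recalls) / 2,  # mean of per-class recall
        "macro_f1": sum(f1s) / 2,               # unweighted mean of per-class F1
        "invalid_rate": _safe_div(invalid, len(y_pred)),
    }


if __name__ == "__main__":
    # Toy labels and replies, invented purely to exercise the functions.
    gold = [1, 1, 0, 0]
    replies = ["Yes", "no", "No", "maybe"]
    print(binary_metrics(gold, [parse_label(r) for r in replies]))
```

Balanced accuracy and macro-F1 both weight the two classes equally, so a model that always predicts the majority class scores poorly on them, which matters when few abstracts describe a true pathogen-disease association.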
