Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification

Abstract

Classifying cancer genetic variants based on clinical actionability is crucial yet challenging in precision oncology. Large language models (LLMs) offer potential solutions, but their performance remains underexplored. This study evaluates GPT-4o, Llama 3.1, and Qwen 2.5 in classifying genetic variants from the OncoKB and CIViC databases, as well as a real-world dataset derived from FoundationOne CDx reports. GPT-4o achieved the highest accuracy (0.7318) in distinguishing clinically relevant variants from variants of unknown clinical significance (VUS), outperforming Qwen 2.5 (0.5731) and Llama 3.1 (0.4976). LLMs demonstrated better concordance with expert annotations for variants with strong clinical evidence but exhibited greater inconsistencies for those with weaker evidence. All three models showed a tendency to assign variants to higher evidence levels, suggesting a propensity for overclassification. Prompt engineering significantly improved accuracy, while retrieval-augmented generation (RAG) further enhanced performance. Stability analysis across 100 iterations revealed greater consistency with the CIViC system than with OncoKB. These findings highlight the promise of LLMs in cancer genetic variant classification while underscoring the need for further optimization to improve accuracy, consistency, and clinical applicability.
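The headline accuracy numbers above come from comparing model-assigned labels against expert annotations. A minimal sketch of such an evaluation loop is below; the variant list, labels, and `mock_llm_classify` function are hypothetical stand-ins (a real benchmark would prompt GPT-4o, Llama 3.1, or Qwen 2.5 and parse the response against OncoKB/CIViC annotations):

```python
# Hypothetical expert-annotated variants: (gene, variant, expert_label).
# Labels follow the binary split used in the study:
# clinically relevant vs. variant of unknown significance (VUS).
VARIANTS = [
    ("BRAF", "V600E", "clinically_relevant"),
    ("EGFR", "L858R", "clinically_relevant"),
    ("TP53", "P72R", "VUS"),
    ("KRAS", "G12C", "clinically_relevant"),
    ("ATM", "F858L", "VUS"),
]

def mock_llm_classify(gene: str, variant: str) -> str:
    """Stand-in for an LLM call. Mimics the overclassification
    tendency noted in the abstract: one VUS is promoted to
    'clinically_relevant'."""
    predicted_relevant = {
        ("BRAF", "V600E"), ("EGFR", "L858R"),
        ("KRAS", "G12C"), ("ATM", "F858L"),  # overcalled VUS
    }
    return ("clinically_relevant"
            if (gene, variant) in predicted_relevant else "VUS")

def accuracy(dataset, classify) -> float:
    """Fraction of variants where the model agrees with the expert label."""
    correct = sum(classify(g, v) == label for g, v, label in dataset)
    return correct / len(dataset)

print(f"accuracy = {accuracy(VARIANTS, mock_llm_classify):.4f}")
# → accuracy = 0.8000
```

The same loop, repeated over many sampled generations, would also support the 100-iteration stability analysis described above by measuring how often repeated calls return the same label.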
