Benchmarking large language models for predictive modeling in biomedical research with a focus on reproductive health


Abstract

Large language models (LLMs) are increasingly used for code generation and data analysis. This study assesses LLM performance on four predictive tasks from three DREAM challenges: gestational age regression from transcriptomics, gestational age regression from DNA methylation, and classification of preterm birth and early preterm birth from microbiome data. We prompt LLMs with task descriptions, data locations, and target outcomes, then run the LLM-generated code to fit prediction models and measure accuracy on held-out test sets. Among the eight LLMs tested, o3-mini-high, GPT-4o, DeepSeek-R1, and Gemini 2.0 complete at least one task. R code generation succeeds more often (14/16) than Python (7/16). OpenAI's o3-mini-high outperforms the others, completing 7/8 tasks. Test-set performance of the top LLM-generated models matches or exceeds the median participating team on all four tasks and surpasses the top-performing team on one task (p = 0.02). These findings underscore the potential of LLMs to democratize predictive modeling in omics and increase research output.
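The benchmarking protocol the abstract describes (prompt with task description, data location, and target; execute the returned code; score on a test set) can be sketched roughly as below. This is a minimal illustration, not the study's actual pipeline: the prompt wording, the function names (`build_prompt`, `generate_code`, `run_generated_code`), and the stand-in "generated" script (a trivial mean predictor scored by RMSE) are all hypothetical; a real run would call a model API and evaluate on the DREAM challenge datasets.

```python
import math

def build_prompt(task, data_path, target):
    """Assemble a prompt of the kind the study describes: a task
    description, the data location, and the target outcome.
    (Wording here is illustrative, not the study's actual prompt.)"""
    return (
        f"Task: {task}\n"
        f"Data: {data_path}\n"
        f"Target: {target}\n"
        "Write code that fits a prediction model and reports test accuracy."
    )

def generate_code(prompt):
    """Stand-in for the LLM call. A real pipeline would send the prompt
    to a model API and receive code back; here we return a fixed toy
    script (a mean predictor) so the sketch runs end to end."""
    return (
        "def fit_and_predict(train_y, test_n):\n"
        "    mean = sum(train_y) / len(train_y)\n"
        "    return [mean] * test_n\n"
    )

def run_generated_code(code, train_y, test_y):
    """Execute the generated code in a fresh namespace and score its
    predictions with RMSE, mirroring the run-and-evaluate step."""
    ns = {}
    exec(code, ns)  # in practice this should be sandboxed
    preds = ns["fit_and_predict"](train_y, len(test_y))
    sq_err = sum((p - y) ** 2 for p, y in zip(preds, test_y))
    return math.sqrt(sq_err / len(test_y))

prompt = build_prompt("gestational age regression",
                      "data/train.csv", "gestational_age")
code = generate_code(prompt)
print(run_generated_code(code, [38.0, 40.0, 39.0], [39.5, 38.5]))  # RMSE 0.5
```

Executing model-written code directly (as `exec` does here) is the risky step of any such pipeline; the study's setup implies running generated R or Python scripts against the challenge data, which in practice calls for an isolated environment.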
