Validation of large language models (Llama 3 and ChatGPT-4o mini) for title and abstract screening in biomedical systematic reviews

大型语言模型(Llama 3 和 ChatGPT-4o mini)在生物医学系统评价标题和摘要筛选中的验证

阅读:3

Abstract

With the increasing volume of scientific literature, there is a need to streamline the screening process for titles and abstracts in systematic reviews, reduce the workload for reviewers, and minimize errors. This study validated artificial intelligence (AI) tools, specifically Llama 3 70B via Groq's application programming interface (API) and ChatGPT-4o mini via OpenAI's API, for automating this process in biomedical research. It compared these AI tools with human reviewers using 1,081 articles after duplicate removal. Each AI model was tested in three configurations to assess sensitivity, specificity, predictive values, and likelihood ratios. The Llama 3 model's LLA_2 configuration achieved 77.5% sensitivity and 91.4% specificity, with 90.2% accuracy, a positive predictive value (PPV) of 44.3%, and a negative predictive value (NPV) of 97.9%. The ChatGPT-4o mini model's CHAT_2 configuration showed 56.2% sensitivity, 95.1% specificity, 92.0% accuracy, a PPV of 50.6%, and an NPV of 96.1%. Both models demonstrated strong specificity, with CHAT_2 having higher overall accuracy. Despite these promising results, manual validation remains necessary to address false positives and negatives, ensuring that no important studies are overlooked. This study suggests that AI can significantly enhance efficiency and accuracy in systematic reviews, potentially revolutionizing not only biomedical research but also other fields requiring extensive literature reviews.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。