Interactive active learning for literature screening: finetuning GPT with DeepSeek reasoning for cross-domain generalization


Abstract

OBJECTIVE: Automated literature screening in biomedical research is often hindered by domain shifts and scarce labeled data, which limit model accuracy and generalizability. While large language models (LLMs) perform well in zero-shot settings, they often fail to capture complex, domain-specific reasoning patterns. To address this limitation, this study investigates whether an interactive, weakly supervised learning framework that combines the fine-tuning adaptability of GPT (generative pre-trained transformer) models with DeepSeek's reasoning capabilities can improve literature screening performance across biomedical domains. MATERIALS AND METHODS: We developed an active learning framework that leverages model disagreement between GPT-4o and DeepSeek to improve literature screening performance. The process began with a labeled corpus of 6331 articles on large language models, from which a model disagreement analysis identified cases where GPT-4o misclassified an article while DeepSeek produced the correct prediction. Three GPT variants (GPT-4o, GPT-4o-mini, and GPT-4.1-nano) were fine-tuned under standard supervised learning settings using these disagreement-based samples. Fine-tuning prompts incorporated classification labels and, when available, rationale traces generated by DeepSeek to provide reasoning-augmented weak supervision. Model performance was evaluated on an independent benchmark set of 291 annotated articles spanning 10 topic queries in cancer immunotherapy and LLMs in medicine, using standard evaluation metrics, with recall as the primary measure. RESULTS: Fine-tuning GPT models on disagreement-based examples significantly improved performance. GPT-4o-mini achieved the best overall results after fine-tuning, attaining the highest F1 score (0.93, P < .001) and recall (0.95, P < .001). Across the biomedical topics, fine-tuned models consistently outperformed their zero-shot counterparts without increasing reviewer workload.
DISCUSSION: These findings demonstrate the effectiveness of disagreement-driven active learning in enhancing GPT-based biomedical literature screening. Lightweight models such as GPT-4o-mini benefit most from targeted, reasoning-enriched training, highlighting their suitability for scalable deployment. CONCLUSION: This study introduces an interactive active learning framework that leverages fine-tuned LLMs with reasoning capabilities to enhance literature screening. The approach offers a scalable path to more efficient and reliable information retrieval in systematic reviews.
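The core selection step described in the abstract (keep articles where GPT-4o misclassified but DeepSeek was correct, then build reasoning-augmented fine-tuning examples) can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields, prompt wording, and helper names are assumptions for demonstration only.

```python
# Hedged sketch of disagreement-based sample selection for
# reasoning-augmented fine-tuning. All field names are illustrative.

def select_disagreement_samples(records):
    """Keep records where GPT-4o was wrong but DeepSeek matched the gold label."""
    return [
        r for r in records
        if r["gpt_pred"] != r["label"] and r["deepseek_pred"] == r["label"]
    ]

def to_finetune_example(r):
    """Build a supervised fine-tuning pair; prepend DeepSeek's rationale
    as weak supervision when one is available."""
    target = f"Label: {r['label']}"
    if r.get("rationale"):
        target = f"Reasoning: {r['rationale']}\n{target}"
    return {
        "prompt": f"Classify this article for inclusion:\n{r['abstract']}",
        "completion": target,
    }

records = [
    {"abstract": "CAR-T therapy outcomes in solid tumors...", "label": "include",
     "gpt_pred": "exclude", "deepseek_pred": "include",
     "rationale": "Reports immunotherapy outcomes relevant to the query."},
    {"abstract": "A survey of general LLM benchmarks...", "label": "exclude",
     "gpt_pred": "exclude", "deepseek_pred": "exclude", "rationale": None},
]

samples = select_disagreement_samples(records)  # only the first record qualifies
dataset = [to_finetune_example(r) for r in samples]
print(len(dataset))  # 1
```

In this toy run, only the first record survives selection, and its fine-tuning target carries the DeepSeek rationale ahead of the label, mirroring the "reasoning-augmented weak supervision" the abstract describes.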
