Medical QA dialogue datasets in RAG system performance evaluation and ChatGPT optimization


Abstract

This study evaluates the effectiveness of Chinese doctor-patient dialogues as retrieval sources for Retrieval-Augmented Generation (RAG) in clinical question answering. Using ChatGPT-3.5 as a baseline and extending to GPT-4o and GPT-5, we compare multiple retrieval pipelines, including dense retrieval, Cross-Encoder reranking, Reciprocal Rank Fusion (RRF), and Cascade RRF→Rerank. Experimental results show that dialogue-based retrieval significantly improves generation quality relative to direct prompting (e.g., ROUGE-1-f: +12.6%, BERTScore_F1: +1.5%, p < 0.05). Among retrieval strategies, Rerank-only provides the best accuracy-latency balance, while the cascade pipeline introduces noise and yields no additional benefit. Under identical retrieval settings, GPT-4o achieves stronger automatic metrics and 4-5× lower latency, whereas GPT-5 receives slightly higher human preference scores (+0.08, p < 0.001), indicating a trade-off between efficiency and perceived coherence. Expert evaluation further confirms improvements in readability, accuracy, and authenticity (all p < 0.001). These findings highlight that data representation and metadata structure have a greater impact on RAG performance than retrieval algorithm complexity, offering practical guidance for reliable medical QA deployment.
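For readers unfamiliar with Reciprocal Rank Fusion (RRF), one of the retrieval strategies compared above, the following is a minimal sketch of the standard RRF scoring rule. The smoothing constant `k = 60` and the toy document IDs are illustrative assumptions, not values or data from the study.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document's fused score is the sum over input lists of
    1 / (k + rank), where rank is its 1-based position in that list.
    Documents appearing high in multiple lists accumulate the most score.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse a dense-retrieval ranking with a reranker's ranking.
dense = ["d1", "d2", "d3"]
rerank = ["d1", "d3", "d4"]
print(rrf_fuse([dense, rerank]))  # → ['d1', 'd3', 'd2', 'd4']
```

In a cascade pipeline of the kind evaluated in the abstract, the fused list would then be passed to a Cross-Encoder reranker; the study's finding is that this extra stage added noise rather than accuracy.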
