Abstract
BACKGROUND: Applying large language models to medicine faces critical trust challenges in diagnostic reasoning. Existing approaches often fail to generalize across models and datasets, particularly those covering a wide range of diseases and diverse patient records. This study aims to develop a clinical framework that generalizes across models, improving diagnostic performance while providing explainable reasoning.

METHODS: We introduce a structured clinical approach that replicates real-world diagnostic workflows. Patient narratives are first transformed into labeled clinical components. A validation mechanism then checks model-generated diagnoses against a disease knowledge algorithm. Additionally, a stepwise decision-making model simulates consultations that progress from junior to senior clinicians, refining the diagnostic reasoning. The framework is evaluated across multiple large language models and clinical reasoning datasets using standard diagnostic accuracy metrics.

RESULTS: Here we show that our approach outperforms existing prompting methods across six large language models and two clinical datasets. One model achieves the highest diagnostic F1 scores (0.93 on NEJM, 0.95 on MedCaseReasoning) with minimal misclassification (1 false positive and 3 false negatives); it also attains the best text-based reasoning scores on NEJM, demonstrating effective, explainable clinical outputs. When validated on real-time electronic health record data, the method achieves high diagnostic accuracy (0.91) and human-like rationales (rated 4.5 out of 5), supporting its applicability in real-world clinical settings.

CONCLUSIONS: These findings demonstrate the robustness and generalizability of our framework, highlighting its potential for reliable, scalable, and explainable clinical decision support across diverse models and datasets.