Abstract
BACKGROUND: Clinical reasoning in infectious diseases relies on validated evidence. Large language models (LLMs) are being introduced into diagnosis, antimicrobial stewardship, and guideline interpretation before their safety and reliability have been established.

METHODS: This review, registered in PROSPERO (CRD420251155354), evaluated studies using GPT, Claude, Gemini, and retrieval-augmented or agentic systems for infectious disease decision-making. PubMed, CENTRAL, Scopus, and Web of Science were searched from January 2018 to September 2025. Two reviewers independently screened studies and extracted data. Risk of bias was assessed with QUADAS-AI.

FINDINGS: Thirty-one studies met the inclusion criteria. Most were cross-sectional (61%) and vignette-based (68%). Only 32% used real clinical data, and only 23% had low risk of bias. Safety issues were reported in 90% of studies: incomplete responses (61%), unsafe advice (23-32%), and fabricated content (32%). In antimicrobial stewardship, agreement with infectious-disease specialists was approximately 50%. Diagnostic sensitivity for structured infections was 80-100%. Retrieval-augmented systems increased specificity from 35% to 75% and reduced hallucinations. Proprietary models outperformed open-source models but did not reach expert accuracy.

INTERPRETATION: LLMs perform well on defined diagnostic tasks but remain unreliable for autonomous clinical use. High error rates, inconsistent reasoning, and fabricated content require expert oversight and external validation before deployment.