Abstract
This study compared the performance of three large language models (ChatGPT-4o, OpenAI O1, and OpenAI O3 mini) in delivering accurate, guideline-compliant recommendations for pneumonia management. By assessing both general and guideline-focused questions, the investigation sought to elucidate each model's strengths, limitations, and capacity to self-correct in response to expert feedback. Fifty pneumonia-related questions (30 general, 20 guideline-based) were posed to the three models. Ten infectious disease specialists independently scored each response for accuracy on a 5-point scale. The two chain-of-thought models (OpenAI O1 and OpenAI O3 mini) were further tested for self-correction when a response was initially rated "poor," with re-evaluations conducted one week later to reduce recall bias. Statistical analyses included nonparametric tests, ANOVA, and Fleiss' kappa for inter-rater reliability. OpenAI O1 achieved the highest overall accuracy, followed by OpenAI O3 mini; ChatGPT-4o scored lowest. For responses initially rated "poor," both O1 and O3 mini improved significantly after targeted prompts, reflecting the advantages of chain-of-thought reasoning. ChatGPT-4o showed limited gains upon re-prompting and provided more concise but sometimes incomplete information. OpenAI O1 and O3 mini offered superior guideline-aligned recommendations and benefited from self-correction, whereas ChatGPT-4o's direct-answer approach produced moderate or poor outcomes on complex pneumonia queries. Incorporating chain-of-thought mechanisms appears critical for refining clinical guidance. These findings suggest that advanced large language models can support pneumonia management by providing accurate, up-to-date information, particularly when equipped to iteratively refine their outputs in response to expert feedback.
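For readers who wish to reproduce the inter-rater reliability analysis, the following is a minimal sketch of the Fleiss' kappa computation, assuming the specialists' ratings are arranged as an items-by-raters matrix; the `ratings` array here is illustrative random data, not the study's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative ratings only (not the study's data): each row is one of the
# 50 questions, each column one of the 10 specialists, and each entry a
# score on the 5-point accuracy scale (1 = poor ... 5 = excellent).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(50, 10))

# Convert raw ratings into an (n_items, n_categories) count table,
# the input format that fleiss_kappa expects.
table, categories = aggregate_raters(ratings)

# Fleiss' kappa: chance-corrected agreement among multiple raters.
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```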