Abstract
BACKGROUND: There is an unprecedented increase in the use of Generative AI in medical education. There is a need to assess these models' accuracy to ensure patient safety. This study assesses the accuracy of ChatGPT, Gemini, and Copilot in answering multiple-choice questions (MCQs) compared to a qualified medical teacher. METHODS: This study randomly selected 40 Multiple Choice Questions (MCQs) from past United States Medical Licensing Examination (USMLE) and asked for answers to three LLMs: ChatGPT, Gemini, and Copilot. The results of an LLM are then compared with those of a qualified medical teacher and with responses from other LLMs. The Fleiss' Kappa Test was used to determine the concordance between four responders (3 LLMs + 1 Medical Teacher). In case of poor agreement between responders, Cohen's Kappa test was performed to assess the agreement between responders. RESULTS: ChatGPT demonstrated the highest accuracy (70%, Cohen's Kappa = 0.84), followed by Copilot (60%, Cohen's Kappa = 0.69), while Gemini showed the lowest accuracy (50%, Cohen's Kappa = 0.53). The Fleiss' Kappa value of -0.056 indicated significant disagreement among all four responders. CONCLUSION: The study provides an approach for assessing the accuracy of different LLMs. The study concludes that ChatGPT is far superior (70%) to other LLMs when asked medical questions across different specialties, while contrary to expectations, Gemini (50%) performed poorly. When compared with medical teachers, the low accuracy of LLMs suggests that general-purpose LLMs should be used with caution in medical education.