Human Expertise Outperforms Artificial Intelligence in Medical Education Assessments: MCQ Creation Highlights the Irreplaceable Role of Teachers



Abstract

INTRODUCTION: Multiple-choice questions (MCQs) are vital assessment tools in education because they allow direct measurement of knowledge, skills, and competencies across a wide range of disciplines. While artificial intelligence (AI) holds promise as a supplementary tool in medical education, particularly for generating large volumes of practice questions, it cannot yet replace the nuanced, expert-driven process of question creation that human educators provide. This study examines the gap between AI-generated and expert-written MCQs, particularly with regard to difficulty index, discrimination index, and distractor efficiency.

MATERIALS AND METHODS: A total of 50 medical students received a set of 50 randomized, blinded MCQs validated by human physiology experts. Of these, 25 were generated by AI and the remaining 25 were written by qualified, experienced professors. Using the item response theory (IRT) framework, we calculated key metrics including item reliability, difficulty index, discrimination index, and distractor functionality.

RESULTS: The difficulty index of AI-generated MCQs (mean = 0.62, SD = 0.14) was comparable to that of expert-generated questions, with no statistically significant difference (p = 0.45). However, significant differences emerged in other key quality metrics. The discrimination index, which reflects a question's ability to distinguish between high- and low-performing students, was notably higher for expert-created MCQs (mean = 0.48, SD = 0.12) than for AI-generated ones (mean = 0.32, SD = 0.10), indicating a moderate-to-large effect (p = 0.0082, chi-square = 11.7, df = 3). Similarly, distractor efficiency (DE), which evaluates the effectiveness of incorrect answer options, was significantly greater in expert-authored questions (mean = 0.24, SD = 7.2) than in AI-generated items (mean = 0.4, SD = 8.1), with a moderate effect size (p = 0.0001, chi-square = 26.2, df = 2). These findings suggest that while AI can replicate human-level difficulty, expert involvement remains crucial for ensuring high-quality discrimination and distractor performance in MCQ design.

CONCLUSION: AI holds promise, particularly in generating questions of appropriate difficulty, but human expertise remains essential for crafting high-quality assessments that effectively differentiate between levels of student performance and challenge students' critical thinking. As AI technology continues to evolve, ongoing research and careful implementation will be essential to ensure that AI contributes positively to medical education.
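For readers unfamiliar with the three item metrics named in the abstract, the following is a minimal Python sketch of how they are commonly computed in classical item analysis. It is not the authors' method: the abstract reports an IRT framework and does not give its formulas. The function name `item_analysis`, the 27% upper/lower grouping, the 5% threshold for a "functional" distractor, and the toy response data are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def item_analysis(
    responses: Sequence[Sequence[str]],
    key: Sequence[str],
    options: str = "ABCD",
    group_fraction: float = 0.27,        # assumed convention: top/bottom 27%
    functional_threshold: float = 0.05,  # assumed convention: chosen by >= 5%
) -> List[Dict[str, float]]:
    """Return difficulty, discrimination and distractor efficiency per item.

    responses[s][i] is the option student s selected on item i;
    key[i] is the correct option for item i.
    """
    n_students = len(responses)
    n_items = len(key)

    # Rank students by total score to form the upper and lower groups.
    totals = [sum(row[i] == key[i] for i in range(n_items)) for row in responses]
    ranked = sorted(range(n_students), key=lambda s: totals[s], reverse=True)
    group_size = max(1, round(group_fraction * n_students))
    upper, lower = ranked[:group_size], ranked[-group_size:]

    stats = []
    for i in range(n_items):
        # Difficulty index: proportion of all students answering correctly.
        difficulty = sum(row[i] == key[i] for row in responses) / n_students

        # Discrimination index: (upper correct - lower correct) / group size.
        upper_correct = sum(responses[s][i] == key[i] for s in upper)
        lower_correct = sum(responses[s][i] == key[i] for s in lower)
        discrimination = (upper_correct - lower_correct) / group_size

        # Distractor efficiency: share of wrong options that attract
        # at least `functional_threshold` of the students.
        distractors = [o for o in options if o != key[i]]
        functional = sum(
            sum(row[i] == o for row in responses) / n_students
            >= functional_threshold
            for o in distractors
        )
        stats.append(
            {
                "difficulty": difficulty,
                "discrimination": discrimination,
                "distractor_efficiency": functional / len(distractors),
            }
        )
    return stats


# Toy data (hypothetical): 10 students, 3 items, one string of choices each.
key = ["A", "B", "C"]
responses = ["ABC", "ABC", "ABC", "ABD", "ABA",
             "ACC", "BBC", "ACD", "BBD", "BCD"]

stats = item_analysis(responses, key)
for number, item in enumerate(stats, start=1):
    print(number, {name: round(value, 2) for name, value in item.items()})
```

In this toy data the third item is answered correctly by every student in the upper group and by none in the lower group, so its discrimination index is 1.0, while the first two items are each answered correctly by one lower-group student and score lower. Real analyses often use different group cut-offs or a point-biserial correlation instead, so the thresholds here should be treated as adjustable parameters.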
