Evaluating the ability of AI models to generate level-specific medical MCQs with variable difficulty


Abstract

OBJECTIVE: Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is increasingly applied in medical education to automate assessment design. However, concerns persist regarding the content accuracy, cognitive depth, and psychometric validity of AI-generated multiple-choice questions (MCQs).

METHODS: A mixed-methods study was conducted to evaluate a structured prompt guiding ChatGPT in generating clinically relevant, single-best-answer MCQs in pediatrics. The prompt specified the item count, subdomain distribution, target difficulty, and examinee level. After seven iterative refinements, ChatGPT produced 100 MCQs aimed at clinical-year medical students. Two blinded experts independently rated each item for clarity, content accuracy, clinical realism, distractor plausibility, and cognitive alignment.

RESULTS: All items were structurally coherent and linguistically sound. Content accuracy was rated high for 87% of items, and stem clarity for 82%. Distractor plausibility was acceptable in 77% of items, while 23% contained at least one implausible distractor. Agreement between AI-predicted and expert-rated difficulty was low (Cohen's κ = 0.06), suggesting that the model's difficulty calibration is poor.
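
The headline agreement figure is a chance-corrected statistic. As a hedged illustration (not code or data from the study), the sketch below shows how Cohen's kappa between AI-predicted and expert-rated difficulty labels could be computed; the three-level difficulty scale and the toy labels are assumptions made for this example.

```python
# Illustrative sketch only: computing Cohen's kappa between the difficulty
# level the model assigned to each MCQ and the level expert raters assigned.
# The 3-point scale (0 = easy, 1 = moderate, 2 = hard) and the toy labels
# below are assumptions, not data from the study.
from sklearn.metrics import cohen_kappa_score

ai_predicted = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]  # hypothetical model labels
expert_rated = [0, 2, 1, 1, 1, 2, 2, 0, 1, 1]  # hypothetical expert labels

# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e is the agreement expected by chance from the marginal label frequencies.
# For these toy labels, p_o = 0.40 and p_e = 0.35, so kappa ≈ 0.08 —
# close to the chance-level agreement (κ = 0.06) reported in the abstract.
kappa = cohen_kappa_score(ai_predicted, expert_rated)
print(f"Cohen's kappa = {kappa:.2f}")
```

A kappa near 0 means the model's difficulty predictions agree with expert ratings at roughly chance level, even when raw percent agreement looks moderate, which is why the abstract interprets κ = 0.06 as limited calibration.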
