Abstract
OBJECTIVE: Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is increasingly applied in medical education to automate assessment design. However, concerns persist regarding the content accuracy, cognitive depth, and psychometric validity of AI-generated multiple-choice questions (MCQs). METHODS: A mixed-methods study evaluated a structured prompt guiding ChatGPT to generate clinically relevant, single-best-answer MCQs in pediatrics. The prompt specified item count, subdomain distribution, difficulty, and examinee level. After seven iterative refinements, ChatGPT produced 100 MCQs targeting clinical-year medical students. Two blinded experts independently rated each item for clarity, content accuracy, clinical realism, distractor plausibility, and cognitive alignment. RESULTS: All items were structurally coherent and linguistically sound. Content accuracy was rated high in 87% of items and stem clarity in 82%. Distractor plausibility was acceptable in 77% of items; the remaining 23% contained at least one implausible distractor. Agreement between AI-predicted and expert-rated difficulty was low (κ = 0.06), suggesting limited difficulty calibration.
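As a minimal sketch of how the reported agreement statistic could be computed, the snippet below uses scikit-learn's cohen_kappa_score on AI-predicted versus expert-rated difficulty labels. It assumes an unweighted Cohen's kappa over a three-level difficulty scale; the labels shown are invented placeholders for illustration, not study data.

```python
# Hypothetical illustration: agreement between AI-predicted and expert-rated
# item difficulty, summarized with Cohen's kappa (as the kappa reported above).
# The labels below are invented placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

# Three-level difficulty ratings (easy / moderate / hard) for a handful of items.
ai_predicted = ["easy", "moderate", "hard", "moderate", "easy", "hard"]
expert_rated = ["moderate", "easy", "moderate", "moderate", "hard", "easy"]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance:
# kappa = (p_observed - p_expected) / (1 - p_expected)
kappa = cohen_kappa_score(ai_predicted, expert_rated)
print(f"kappa = {kappa:.2f}")  # values near 0 indicate roughly chance-level agreement
```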