Abstract
BACKGROUND: Virtual patients (VPs) are effective in improving clinical reasoning skills; however, traditional VP platforms often lack individualized feedback mechanisms. Advances in large language models (LLMs) enable automated analysis of student-VP interactions, providing scalable feedback on clinical performance. While artificial intelligence (AI)-enhanced social robotic VP platforms show promise for clinical reasoning training, no studies have examined whether AI-generated feedback integrated into such platforms improves clinical performance in standardized assessments.

OBJECTIVE: This study evaluated whether AI-generated postconsultation feedback integrated into social robotic VP interactions improves medical students' clinical performance, with an emphasis on medical history taking and communication.

METHODS: A quasi-experimental study with 115 sixth-semester medical students (115/157, 73.2% of eligible students) was conducted at Karolinska Institutet, Stockholm, Sweden, during spring 2025. Students were allocated by hospital site to receive (n=61, 53%) or not receive (n=54, 47%) AI-generated feedback following interactions with a Social AI-Enhanced Robotic Interface. All students completed 9 VP cases; the intervention group received approximately 1 page of structured feedback after each case. The feedback system used multiple LLMs in a 2-stage algorithm: first assessing the student-VP dialogue against an assessment rubric, then generating structured feedback on history-taking performance. Both groups participated in case-specific follow-up seminars led by consultant rheumatologists after each VP encounter. Clinical performance was assessed through an 8-minute objective structured clinical examination (OSCE)-based evaluation with a standardized patient portraying axial spondylarthritis, scored by a blinded consultant rheumatologist using a 10-point rubric across 5 domains: communication at consultation start, generic medical history, targeted medical history, diagnostics and management reasoning, and communication at consultation end.

RESULTS: Students receiving AI-generated feedback achieved significantly higher total OSCE scores (mean 7.39, SD 0.86 vs mean 6.68, SD 1.04 points; mean difference 0.70; 95% CI 0.35-1.06; P<.001; Cohen d=0.74). Domain-specific analysis revealed a significant improvement in generic medical history after Bonferroni correction (mean 2.46, SD 0.65 vs mean 2.03, SD 0.79 points; P=.004; r=0.27), while the other domains showed no significant differences: communication at consultation start (P=.13; r=0.14), targeted medical history (P=.60; r=0.05), diagnostics and management reasoning (P=.14; r=0.14), and communication at consultation end (P=.31; r=0.09). Pass rates were significantly higher in the feedback group (96.7% vs 79.6%; odds ratio 7.55, 95% CI 1.51-72.2; P=.006), yielding a number needed to assess of 6 students, that is, for every 6 students receiving feedback, 1 additional student passed the assessment.

CONCLUSIONS: AI-generated feedback following social robotic VP interactions significantly improved medical students' OSCE-based performance, particularly in generic medical history taking. These findings support integrating validated AI feedback systems as a supplement to expert-led teaching during VP simulations and demonstrate the feasibility of scalable, automated feedback in medical education. The domain-specific improvement in generic medical history highlights the importance of targeted, competency-specific feedback design in VP platforms.
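The 2-stage feedback algorithm described in the METHODS can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual implementation: it assumes an OpenAI-style chat-completions client, and the model name ("gpt-4o"), rubric text, prompts, and function names (assess_dialogue, generate_feedback) are all hypothetical placeholders.

```python
# Minimal sketch of a 2-stage LLM feedback pipeline (illustrative only).
# Assumes an OpenAI-style chat-completions API; the model name, rubric,
# and prompts are hypothetical placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """\
Rate the student on each item (0-2): opens the consultation appropriately;
covers generic medical history (medications, allergies, family and social
history); asks targeted questions for the presenting complaint; closes the
consultation appropriately."""


def assess_dialogue(transcript: str) -> str:
    """Stage 1: score the student-VP dialogue against the assessment rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a clinical examiner. Score the transcript "
                        "strictly against this rubric:\n" + RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


def generate_feedback(transcript: str, assessment: str) -> str:
    """Stage 2: turn the rubric-based assessment into ~1 page of structured feedback."""
    response = client.chat.completions.create(
        model="gpt-4o",  # a second call; the study reports using multiple LLMs
        messages=[
            {"role": "system",
             "content": "You are a supportive clinical tutor. Write about one "
                        "page of structured feedback on history taking, grounded "
                        "only in the assessment and transcript provided."},
            {"role": "user",
             "content": f"Assessment:\n{assessment}\n\nTranscript:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    dialogue = open("student_vp_dialogue.txt").read()  # hypothetical input file
    print(generate_feedback(dialogue, assess_dialogue(dialogue)))
```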
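As a worked check of the reported number needed to assess, assuming the standard number-needed-to-treat definition (the reciprocal of the absolute difference in pass rates):

\[
\mathrm{NNA} = \frac{1}{p_{\text{feedback}} - p_{\text{control}}} = \frac{1}{0.967 - 0.796} = \frac{1}{0.171} \approx 5.8,
\]

which rounds up to the 6 students reported in the RESULTS.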