Abstract
OBJECTIVE: Artificial intelligence (AI), particularly large language models such as ChatGPT (OpenAI, San Francisco, CA), is reshaping clinical care and medical education. This study evaluated the impact of a ChatGPT 3.5-generated, case-based curriculum on internal medicine residents' understanding of professionalism in a US residency program.

METHODS: A single-group, pre-post intervention pilot study was conducted from August 2024 to February 2025 at a US internal medicine residency program (IRB exempt, E24149). Residents from postgraduate year (PGY)-1 through PGY-3 participated in a three-week professionalism curriculum integrated into Friday ambulatory didactics. Weekly modules featured ChatGPT 3.5-generated case scenarios aligned with the Penn State Questionnaire on Professionalism (PSQP) domains and reviewed by three faculty members for clinical and ethical relevance. Residents completed one module per week via Qualtrics (Provo, UT) and received immediate feedback. The validated 36-item PSQP was administered anonymously before and after the intervention. Pre-post differences were analyzed using unpaired t-tests adjusted for clustering on baseline characteristics, with sensitivity analyses using log-transformed scores. Propensity score matching and cluster-adjusted logistic regression were used for subgroup analyses. Statistical significance was set at p < 0.05.

RESULTS: A total of 37 residents completed the pre-survey and 33 completed the post-survey. The mean age was 28.9 years (SD: 3.4), with a balanced gender distribution (18 males, 19 females); 59% were non-US citizens. Residents were evenly distributed across PGY levels. After matching by age and sex, covariate balance was achieved. Although all professionalism domains improved post-intervention, the overall changes were not statistically significant. Female residents showed significant gains in duty (p = 0.004), accountability (p = 0.037), honor (p = 0.028), and altruism (p = 0.017), with some effects persisting after matching. No significant changes were observed in male residents. Trends toward more "much" or "great deal" responses were seen across most PSQP items (post: 61-77% vs. pre: 35-70%), with notable gains in corrective action (p = 0.006), attending seminars (p = 0.003), and upholding scientific standards.

CONCLUSION: This pilot is among the first to evaluate a ChatGPT 3.5-generated professionalism curriculum using the validated PSQP. Although overall changes were not statistically significant, meaningful gains in specific domains among female residents suggest educational benefit and support gender-responsive instructional design. The low-cost, scalable format may serve as a template for institutions seeking to implement professionalism training with limited resources. Further multi-institutional studies with paired designs and long-term follow-up are warranted.