Abstract
IMPORTANCE: The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. Because LLM chatbots are trained to align with user input, they may have difficulty responding appropriately to psychotic content. OBJECTIVE: To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms. DESIGN: A cross-sectional study of ChatGPT responses to psychotic and control prompts, with blinded clinician ratings of response appropriateness. SETTING: ChatGPT web application accessed on 8/28-8/29/2025, testing three product versions: GPT-5 Auto (current paid default), GPT-4o (previous paid default), and "Free" (the version accessible without subscription or account). MAIN OUTCOMES AND MEASURES: We presented 158 unique prompts (79 control and 79 psychotic, generated based on the Structured Interview for Psychosis-Risk Syndromes) to the three product versions, yielding 474 prompt-response pairs. Blinded clinicians assigned each pair an appropriateness rating (0 = completely appropriate, 1 = somewhat appropriate, 2 = completely inappropriate) via a standardized rubric. We hypothesized a priori that psychotic prompts would be more likely than control prompts to elicit inappropriate responses, both across and within product versions. RESULTS: In the primary (across-version) analysis, psychotic prompts were 25.84 times more likely than control prompts to elicit inappropriate responses with "Free" ChatGPT (95% CI 12.45 to 53.66, p < 0.001). GPT-5 Auto reduced this risk somewhat (OR for interaction term 0.33, 95% CI 0.16 to 0.68, p = 0.005) yet still generated inappropriate responses at a greatly elevated rate (implied OR 8.53, 95% CI 3.05 to 23.84).
In the secondary (within-version) analysis, ORs were 9.08 for GPT-5 Auto (95% CI 4.24 to 21.02), 14.15 for GPT-4o (95% CI 6.12 to 37.23), and 43.37 for "Free" (95% CI 18.44 to 112.80). In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions. CONCLUSIONS AND RELEVANCE: No tested version of ChatGPT reliably generated appropriate responses to psychotic content.