Abstract
BACKGROUND: The deployment of large language models (LLMs) in mental health therapy presents a compelling yet deeply fraught opportunity to address widespread disparities in access to psychological care. Recent empirical evidence reveals that these AI systems exhibit substantial shortcomings when confronted with complex clinical contexts.

METHODS: This paper synthesizes key findings from a critical analysis of LLMs operating in therapeutic roles and argues for the urgent establishment of comprehensive risk management frameworks, policy interventions, and ethical protocols governing their use.

RESULTS: LLMs tested in simulated therapeutic settings frequently exhibited stigmatizing attitudes toward mental health conditions and responded inappropriately to acute clinical presentations such as suicidal ideation, psychosis, and delusions. Real-world evaluations reinforce these concerns: some studies found that therapy and companion bots endorsed unsafe or harmful suggestions in adolescent crisis vignettes, while others reported inadequate chatbot responses to queries about self-harm and sexual assault, prompting concern from clinicians, disappointment from patients, and calls for stronger oversight from policymakers. These failures contravene fundamental principles of safe clinical practice, including non-maleficence, therapeutic alliance, and evidence-based care. Moreover, LLMs lack the emotional intelligence, contextual grounding, and ethical accountability that underpin the professional responsibilities of human therapists. Their propensity for sycophantic or non-directive responses, driven by alignment objectives rather than clinical efficacy, further undermines their therapeutic utility.

CONCLUSIONS: This analysis highlights barriers to the replacement of human therapists with autonomous AI systems. It also calls attention to the regulatory vacuum surrounding LLM-based wellness and therapy applications, many of which are widely accessible and unvetted. Recommendations include professional standards, transparency in training and deployment, robust privacy protections, and clinician oversight. The findings underscore the need to redefine the role of AI in mental health care as supportive rather than substitutive.