Abstract
BACKGROUND: Artificial intelligence (AI) is rapidly entering mental health care, but most models remain proof-of-concept, with limited external validation and substantial risk of overfitting.

METHODS: This scoping review of reviews adhered to the PRISMA-ScR checklist and Joanna Briggs Institute guidance. We searched MEDLINE, Embase, PsycINFO, and IEEE Xplore. Eligible publications encompassed systematic, scoping, narrative, integrative, meta-analytic, and patent reviews. Findings were synthesised thematically.

RESULTS: Thirty-one reviews were included. Evidence was concentrated on depression and anxiety; schizophrenia, bipolar disorder, perinatal mental health, autism spectrum conditions, older adults, nurses, and allied professionals were under-represented. Across screening, diagnosis/classification, and risk prediction, high accuracy was frequently reported under internal validation; in prior syntheses, typical internal areas under the curve (AUCs) clustered around 0.80-0.88, whereas externally or prospectively validated performance was scarce and typically attenuated. Signals were strongest for narrow, feedback-rich tasks, with greater decay for general-purpose models and longer prediction horizons. Conversational agents produced small-to-moderate short-term improvements in depressive symptoms (standardised mean difference [SMD] ≈0.2-0.6); effects for anxiety and stress were smaller or inconsistent and varied with comparator stringency, follow-up (≤8-12 weeks vs longer), and the degree of human guidance. Most chatbot evaluations were short and small-scale, with few randomised or pragmatic trials and limited data on durability beyond 12 weeks. Real-world implementation was limited; several reviews identified usability and electronic health record (EHR) integration as prerequisites for adoption, and explainability alone rarely conferred actionability without clinician training. Ethical readiness was incomplete: privacy and bias were commonly discussed, but accountability, post-deployment monitoring, and crisis-escalation protocols were inconsistently specified. Economic evaluations were uncommon and rarely accounted for integration, maintenance, or re-training costs. Workforce outcomes (literacy, confidence, readiness) were infrequently measured. Internal and external metrics were not pooled.

CONCLUSIONS: AI applications span the mental health care continuum but remain early in translation. Performance that appears strong under internal validation often attenuates on external or prospective testing; symptomatic gains are concentrated in depression and anxiety and may diminish over longer follow-up; and adoption is constrained by usability, EHR integration, and incomplete governance. The cross-review signal highlights consistent gaps in accountability, post-deployment monitoring and crisis escalation, equity reporting, workforce readiness, and life-cycle economics (including integration, monitoring, and re-training costs). Addressing these gaps will require externally validated and monitored deployments, routine content and guardrail audits of chatbots with human escalation, predefined subgroup-performance and bias auditing, and implementation strategies that pair explainability with clinician training and measure workforce endpoints; together, these steps would better align the evidence base with safe, effective, and sustainable clinical use.