Abstract
Medical artificial intelligence (AI) is being rapidly deployed in clinical practice, yet its real-world effectiveness across diverse patient populations remains poorly characterized. We conducted a systematic review combining automated screening (fine-tuned BERT-PubMed classifiers) with manual validation to identify studies of mature medical AI models deployed in healthcare facilities worldwide. We included 171 studies at the "device-into-practice" stage with sufficient demographic and performance data, representing 209,772 patients. Patient access to these models showed marked demographic disparities. Geographic concentration was extreme (Dagum-Gini coefficient 0.97, P < .001): 95.1% of patient cohorts (studies) came from high-income (62.2%) or upper-middle-income (32.9%) countries, primarily China (28.7%) and the United States (18.9%), and none came from low-income countries. Racial representation was dominated by White (49.1%) and Asian (42.6%) patients, and 63.8% of studies exhibited moderate-to-high sex imbalance. Across all studies, AI models outperformed human practitioners (81.7% vs. 77.8% accuracy, P < .001), but this advantage was confined to in-distribution applications (same geographic/demographic context: 82.9% vs. 77.3%, P < .001) and disappeared in out-of-distribution deployments (cross-geographic/demographic contexts: 74.1% vs. 76.3%, P = .45). In underrepresented populations, AI performance was not significantly different from that of human practitioners. Overall, mature medical AI models are deployed predominantly in economically advantaged settings, and their performance advantages are concentrated in well-represented demographic groups. These findings highlight a digital divide in both access and effectiveness and underscore the need for demographic-specific validation.