Abstract
Objectives: Artificial intelligence (AI) symptom-checker apps are proliferating, yet their everyday usability and transparency remain under-examined. This study provides a triangulated evaluation of three widely used AI-powered mHealth apps: ADA, Mediktor, and WebMD.

Methods: Five usability experts applied a 13-item AI-specific heuristic checklist. In parallel, thirty lay users (aged 18-65 years) completed five health-scenario tasks on each app while task success, errors, completion time, and System Usability Scale (SUS) ratings were recorded. A repeated-measures ANOVA followed by paired-sample t-tests compared SUS scores across the three applications.

Results: The analysis revealed statistically significant differences in usability across the apps. ADA achieved a significantly higher mean SUS score than both Mediktor (p = 0.0004) and WebMD (p < 0.001), and Mediktor in turn outperformed WebMD (p = 0.0009). Issues common to all three apps included vague AI outputs, limited feedback on input errors, and inconsistent navigation. Each application also failed key explainability heuristics, offering neither confidence scores nor interpretable rationales for its AI-generated recommendations.

Conclusions: Even highly rated AI mHealth apps show critical gaps in explainability and error handling. Embedding explainable AI (XAI) cues such as confidence indicators and transparent justifications, together with stronger input validation, can enhance user trust, safety, and adoption in real-world healthcare contexts.
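The statistical comparison described in the Methods could be reproduced along the lines of the minimal Python sketch below: a repeated-measures ANOVA on SUS scores with app as the within-subject factor, followed by pairwise paired-sample t-tests. The file name, column names, and long-format layout are illustrative assumptions and are not taken from the study.

```python
# Sketch of the reported analysis: repeated-measures ANOVA on SUS scores,
# then pairwise paired-sample t-tests. Data layout is a hypothetical example:
# one row per participant per app, columns: participant, app, sus.
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

sus = pd.read_csv("sus_scores.csv")  # hypothetical file: 30 users x 3 apps

# Repeated-measures ANOVA with "app" as the within-subject factor.
anova = AnovaRM(sus, depvar="sus", subject="participant", within=["app"]).fit()
print(anova.anova_table)

# Pairwise paired-sample t-tests between the three apps.
wide = sus.pivot(index="participant", columns="app", values="sus")
for a, b in combinations(wide.columns, 2):
    t, p = ttest_rel(wide[a], wide[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f}")
```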