Abstract
BACKGROUND: Major depressive disorder (MDD) remains underdiagnosed worldwide, partly because diagnosis relies on self-reported symptoms and clinician-administered interviews.

OBJECTIVE: This study examined whether a speech-based classification model using emotionally and thematically varied image-description tasks could distinguish individuals with MDD from healthy controls.

METHODS: A total of 120 participants (59 with MDD, 61 healthy controls) completed four speech tasks: describing three emotionally valenced images (positive, neutral, negative) and one Thematic Apperception Test (TAT) stimulus. Speech responses were segmented, and 23 acoustic features were extracted per sample. Classification was performed with a long short-term memory (LSTM) neural network, and SHapley Additive exPlanations (SHAP) were applied for feature interpretation. Four traditional machine learning models (support vector machine, decision tree, k-nearest neighbour, and random forest) served as comparators. Within-subject variation in speech duration was assessed with repeated-measures analysis of variance (ANOVA).

FINDINGS: The LSTM model outperformed the traditional classifiers, capturing temporal and dynamic patterns in speech. The positive-valence image task achieved the highest accuracy (87.5%), followed by the negative-valence (85.0%), TAT (84.2%), and neutral-valence (81.7%) tasks. SHAP analysis highlighted task-specific contributions of pitch-related and spectral features. Significant differences in speech duration across tasks (p<0.01) indicated that affective valence influenced speech production.

CONCLUSIONS: Emotionally enriched and thematically ambiguous tasks enhanced automated MDD detection, with positive-valence stimuli providing the greatest discriminative power. SHAP interpretation underscored the importance of tailoring models to different speech inputs.
CLINICAL IMPLICATIONS: Speech-based models incorporating emotionally evocative and projective stimuli offer a scalable, non-invasive approach for early depression screening. Their reliance on natural speech supports cross-cultural application and reduces stigma and literacy barriers. Broader validation is needed to facilitate integration into routine screening and monitoring.
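The comparator setup described in the METHODS section can be sketched as follows. This is a minimal illustration, not the study's code: the feature matrix here is synthetic random data standing in for the 23 acoustic features per sample, the labels are random, and the train/test split and default hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 120 "participants" x 23 acoustic features (random, for illustration only)
X = rng.normal(size=(120, 23))
y = rng.integers(0, 2, size=120)  # 0 = healthy control, 1 = MDD (random labels here)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The four traditional comparators named in the abstract, with default settings
comparators = {
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record held-out accuracy
accuracies = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in comparators.items()
}
```

With random labels these accuracies hover near chance; in the study, the LSTM operating on the temporal feature sequences outperformed all four of these static-feature baselines.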