Abstract
Pain assessment in non-verbal patients, including neonates and unconscious adults, remains a critical challenge in clinical practice: current pain scales rely heavily on observer interpretation, lack objectivity, and introduce significant inter-rater variability. We propose a novel multimodal deep learning framework that estimates continuous pain intensity by fusing non-speech audio cues with facial expressions, addressing the need for objective pain assessment in populations unable to self-report. We developed a cross-modal attention-based fusion network that combines spectrogram-derived audio embeddings with facial action unit features. The model was trained and validated on 3,247 audio-visual recordings from 428 subjects (215 neonates and 213 adults) spanning three distinct pain intensity levels. A ResNet-based audio encoder processes mel-spectrograms and a facial landmark convolutional neural network analyzes expressions; the two streams are integrated through a transformer-based fusion module that learns complementary relationships between modalities. Our model achieved a mean absolute error of 0.89 on a 0–10 pain scale, significantly outperforming audio-only (mean absolute error 1.47; 39% improvement) and visual-only (mean absolute error 1.23; 28% improvement) baselines. Cross-age-group validation demonstrated robust generalization, with mean absolute errors of 0.94 for neonates and 0.91 for adults. The model maintained a Pearson correlation coefficient of 0.89 with ground-truth annotations and achieved 81.4% accuracy on three-class pain categorization. Audio-visual fusion thus significantly enhances pain estimation accuracy across diverse age groups and clinical scenarios, offering substantial potential for objective, automated pain monitoring in clinical settings, particularly for vulnerable patients unable to self-report pain.
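To make the fusion idea concrete, the sketch below shows one way spectrogram-derived audio embeddings and facial action unit features could be projected into a shared space and combined through transformer-based cross-modal attention. This is an illustrative assumption in PyTorch, not the authors' implementation; the dimensions, module names, pooling, and output scaling are hypothetical choices.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention fusion of audio and facial features (hypothetical sketch)."""

    def __init__(self, audio_dim=512, face_dim=136, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.face_proj = nn.Linear(face_dim, d_model)
        # Learnable embeddings marking which modality each token came from.
        self.modality_embed = nn.Embedding(2, d_model)
        # A transformer encoder lets audio and face tokens attend to each other.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Regression head mapping the pooled joint representation to a 0-10 pain score.
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio_feats, face_feats):
        # audio_feats: (batch, T_a, audio_dim), e.g. ResNet embeddings of mel-spectrogram frames
        # face_feats:  (batch, T_f, face_dim),  e.g. facial action unit / landmark features per video frame
        a = self.audio_proj(audio_feats) + self.modality_embed.weight[0]
        f = self.face_proj(face_feats) + self.modality_embed.weight[1]
        fused = self.fusion(torch.cat([a, f], dim=1))  # joint attention over both modalities
        pooled = fused.mean(dim=1)                     # average over time and modality tokens
        return 10.0 * torch.sigmoid(self.head(pooled)).squeeze(-1)  # bound output to the 0-10 scale


if __name__ == "__main__":
    model = CrossModalFusion()
    audio = torch.randn(2, 50, 512)  # 2 clips, 50 audio frames of 512-d embeddings
    face = torch.randn(2, 30, 136)   # 2 clips, 30 video frames of 136-d facial features
    print(model(audio, face).shape)  # torch.Size([2]) -- one continuous pain estimate per clip
```

Concatenating the projected audio and face tokens before a shared transformer is one simple way to let the attention layers learn complementary cross-modal relationships; per-modality encoders with explicit cross-attention blocks would be an equally plausible reading of the described architecture.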