Abstract
OBJECTIVES: This study aims to automatically classify physical examinations performed during general practitioner (GP) consultations using a deep learning fusion model. The model distinguishes between two interaction types: Human-Object Activities (HOA), such as blood pressure measurement, and Human-Human Activities (HHA), such as gland palpation.

MATERIALS AND METHODS: A multi-component ensemble transfer learning framework was developed that integrates spatial and temporal feature analysis. The model comprises: (1) a CNN-LSTM module for spatial feature extraction and sequential modelling, (2) an ensemble of EfficientNet-B7, DenseNet-121, and Inception-v3 to capture diverse spatial representations, and (3) a fusion module that concatenates the outputs of both streams, refined by an attention mechanism that prioritises salient features. Transfer learning was applied to fine-tune the pre-trained networks on GP consultation video data. Model performance was evaluated using five-fold stratified video-level cross-validation, reporting mean ± SD for precision, recall, F1-score, specificity, Cohen's κ, and PR-AUC.

RESULTS: The fusion model achieved robust overall performance, with a precision of 92.1 ± 1.4%, recall of 89.9 ± 1.8%, F1-score of 90.9 ± 1.5%, specificity of 93.1 ± 1.3%, Cohen's κ of 0.90 ± 0.02, and PR-AUC of 0.935 ± 0.02. It consistently outperformed ten state-of-the-art baselines, and ablation analysis showed F1-score improvements of 17% over the CNN-LSTM module alone and 16% over the ensemble alone, confirming the benefit of combining spatial and temporal analysis.

CONCLUSION: The proposed fusion framework accurately recognises physical examinations in GP consultations and supports future telehealth and diagnostic research.
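The abstract does not include implementation details, but the fusion step it describes (concatenating the CNN-LSTM temporal stream with the ensemble spatial stream, then reweighting with attention) can be illustrated with a minimal, hypothetical sketch. Everything below is an assumption for illustration: the function names, the per-dimension attention logits, and the toy 3-dimensional feature vectors are not from the paper, and a real implementation would learn the attention scores rather than fix them.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of attention logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(temporal_feats, spatial_feats, attn_logits):
    # Concatenate the two feature streams, then reweight each
    # dimension by a softmax over (hypothetical) attention logits,
    # so salient features dominate the fused representation.
    fused = temporal_feats + spatial_feats
    weights = softmax(attn_logits)
    return [w * f for w, f in zip(weights, fused)]

# Toy example: 3-dim temporal vector, 3-dim spatial vector.
temporal = [0.2, 0.8, 0.5]   # stand-in for CNN-LSTM stream output
spatial = [0.9, 0.1, 0.4]    # stand-in for pooled ensemble output
logits = [1.0, 0.5, 2.0, 0.3, 0.7, 1.5]  # stand-in attention logits

fused = attention_fuse(temporal, spatial, logits)
print(len(fused))  # 6: one weighted value per concatenated feature
```

In the paper's model the attention weights would be produced by a trained layer over the concatenated features; the sketch only shows the data flow of concatenation followed by attention-based reweighting.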