Abstract
PURPOSE: This study addresses key challenges in 3D human pose estimation (HPE) and energy expenditure estimation (EEE): handling complex activities, improving generalization, and enhancing both tasks jointly within a unified framework.

METHODS: We propose Pose2Met, a unified end-to-end framework that jointly addresses 3D HPE and EEE. At its core is STAPFormer, a Transformer model with a SpatioTemporal Aggregated Pose (STAP) representation for efficient and accurate motion modeling. Building on this representation, Pose2Met introduces a unified pose-metabolism learning strategy that jointly optimizes pose dynamics and metabolic patterns within a single learning paradigm. This enables the model to predict both 3D pose and energy expenditure directly from 2D pose inputs, achieving performance comparable to the traditional 2D-3D-expenditure pipeline while significantly improving computational efficiency and robustness in practical applications.

RESULTS: Experiments show that STAPFormer achieves an MPJPE of 38.2 mm on Human3.6M, outperforming MixSTE and STCFormer. For EEE on Vid2Burn-ADL, it achieves an MAE of 22.1 kcal with pose-based input, comparable to video-based methods. Under the unified learning framework, 2D pose-based EEE further approaches the accuracy of 3D pose-based prediction, demonstrating enhanced robustness and generalization.

CONCLUSION: The results highlight the importance of high-quality motion representations for both HPE and EEE. Pose2Met shows strong potential for intelligent fitness and healthcare applications and offers a promising direction for bridging the gap between pose estimation and expenditure estimation.