Abstract
Human motion prediction and action recognition are critical tasks in computer vision and human-computer interaction, supporting applications in surveillance, robotics, and behavioral analysis. However, effectively capturing the fine-grained semantics and dynamic spatiotemporal dependencies of human skeleton movements remains challenging due to the complexity of coordinated joint- and part-level interactions over time. To address these issues, we propose a spatiotemporal skeleton modeling framework that integrates a Part-Joint Attention (PJA) mechanism with a Dynamic Graph Convolutional Network (Dynamic GCN). The proposed framework first employs a multi-granularity sequence encoding module to extract joint-level motion details and part-level semantics, enabling rich feature representations. The PJA module adaptively highlights critical joints and body parts across temporal sequences, enhancing the model's focus on salient regions while maintaining temporal coherence. Additionally, the Dynamic GCN dynamically constructs and updates inter-joint spatial relationships based on temporal feature similarities, facilitating effective spatiotemporal reasoning. Extensive experiments on the Human3.6M dataset demonstrate that our method consistently outperforms strong baselines across various prediction horizons. Specifically, it achieves a Mean Per Joint Position Error (MPJPE) of 10.2 mm at 80 ms and 57.5 mm at 400 ms, a 9–12% relative improvement over the best baseline across diverse actions. These results indicate that the proposed method accurately captures both subtle and large-scale human motions while maintaining temporal stability. This work advances interpretable and precise skeleton-based motion modeling and can benefit broader domains such as real-time human-robot interaction, intelligent surveillance, and behavior recognition in practical environments.