Abstract
To address the limited diversity of data augmentation and the neglect of joint dependencies in self-supervised human action recognition, this paper proposes a hybrid framework that integrates topology-masked motion modeling with contrastive learning. The proposed motion topology-masking technique jointly encodes skeletal topology and motion dynamics, preventing the model from over-focusing on the temporally salient regions of prominent motions. We employ a multi-stage hybrid augmentation strategy that combines conventional and extreme augmentation methods to generate diverse, enriched positive pairs for contrastive learning. Additionally, we introduce a trajectory-guided feature dropping module, which selectively discards critical features based on trajectory attention maps, preventing the model from focusing excessively on local joint trajectories. This approach effectively leverages large-scale unlabeled skeleton data through self-supervised learning, significantly reducing reliance on costly annotated datasets. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate that the proposed model achieves superior performance both in occluded scenarios under complex environments and under low-supervision conditions. It effectively mitigates visual interference and annotation scarcity while substantially improving action recognition accuracy.