Abstract
Digital-image technology has broadened the creative space of dance, yet accurately capturing the semantic correspondence between low-level motion data and high-level dance key-points remains challenging, especially when labeled data are scarce. We aim to establish a lightweight, semi-supervised pipeline that extracts discriminative motion features from depth sequences and maps them to the 3-D key-points of dancers in real time. To achieve pixel-level alignment between dance-movement targets and high-dimensional sensory data, we propose a novel LSTM-CNN (Long Short-Term Memory-Convolutional Neural Network) framework. Temporal-context features are first extracted by the LSTM; multi-dimensional spatial features are then captured by three convolutional layers and one max-pooling layer, and the fused representation is finally regressed to the 3-D body key-points. To alleviate the class imbalance caused by complex postures, an online hard-example mining (OHEM) strategy and a Dice/cross-entropy loss weighted 3:1 are embedded into the semi-supervised learning scheme, enabling the network to converge with only 20% of the samples labeled. Experiments on the public MSR-Action3D dataset (567 sequences, 20 actions) yielded an average recognition rate of 96.9%, surpassing the best comparison method (MSST) by 1.1%. On our self-built dataset (99 sequences, 11 actions), accuracy reached 97.99% while training time was reduced by 35% compared with the previously best Multi_perspective_MHPCs approach. On both datasets, the RMSE between predicted and ground-truth key-points remained low (≤ 0.032), confirming spatial precision. These results demonstrate that the proposed model can reliably track subtle dance gestures under limited annotation, offering an efficient, low-cost solution for digital choreography, motion-style transfer, and interactive stage performance.
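To make the pipeline summarized above concrete, the following is a minimal PyTorch-style sketch of an LSTM followed by three convolutional layers, one max-pooling layer, a 3-D key-point regression head, and a 3:1 weighted Dice/cross-entropy loss. All layer sizes, channel counts, the number of key-points, and the per-joint classification form of the loss are illustrative assumptions, not the exact configuration reported in this paper.

```python
# Illustrative sketch only; hyperparameters and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMCNNKeypointNet(nn.Module):
    """LSTM for temporal context, then 3 conv layers + 1 max-pool, then 3-D key-point regression."""

    def __init__(self, in_dim=60, hidden=128, n_keypoints=20):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)      # temporal-context features
        self.conv = nn.Sequential(                                  # multi-dimensional spatial features
            nn.Conv1d(hidden, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),                                        # single max-pooling layer
        )
        self.head = nn.Linear(128, n_keypoints * 3)                 # regress (x, y, z) per key-point

    def forward(self, x):                      # x: (batch, time, in_dim) depth-sequence features
        h, _ = self.lstm(x)                    # (batch, time, hidden)
        h = self.conv(h.transpose(1, 2))       # (batch, 128, time // 2)
        h = h.mean(dim=-1)                     # fuse over time
        return self.head(h).view(x.size(0), self.n_keypoints, 3)


def dice_ce_loss(logits, target, w_dice=3.0, w_ce=1.0, eps=1e-6):
    """3:1 weighted Dice + cross-entropy on per-joint class logits (batch, classes, joints)."""
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.size(1)).movedim(-1, 1).float()
    inter = (probs * one_hot).sum(dim=(0, 2))
    union = probs.sum(dim=(0, 2)) + one_hot.sum(dim=(0, 2))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()
    ce = F.cross_entropy(logits, target)
    return w_dice * dice + w_ce * ce
```

In such a setup, OHEM would keep only the highest-loss samples of each mini-batch before back-propagation, which is one common way to counter the posture imbalance the abstract refers to; the exact mining rule used here is described in the main text.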