Abstract
We present a wearable IMU-based human pose estimation framework that couples knowledge distillation with an involution-based student and principled structural re-parameterization for on-device deployment. A high-capacity Transformer serves as the teacher, learning rich spatio-temporal representations, while the student adopts input-adaptive involution operators. At deployment time, structural re-parameterization collapses the training graph by folding batch normalization and fusing in-branch cascades and cross-branch parallel paths, yielding a single inference-time module mathematically equivalent to two 1D convolution passes. This design decouples training-time expressiveness from inference-time efficiency and makes the model hardware-friendly for low-power wearables. Extensive experiments on two public benchmarks, DIP-IMU and IMUPoser, show that our approach preserves near state-of-the-art accuracy while achieving sub-millisecond latency. Concretely, the proposed model attains 81 mm MPJPE on DIP-IMU and 94 mm on IMUPoser, with per-frame latencies of 0.012 ms and 0.011 ms, respectively, delivering one to two orders of magnitude speedup over heavy Transformer baselines while matching the best accuracies within 1.25% relative difference. The consistent gains across both datasets indicate strong robustness and cross-subject generalization, underscoring the method's suitability for real-time wearable applications.