Abstract
Pedestrian navigation activity recognition (PNAR) plays a pivotal role in pedestrian positioning and navigation, providing strong technical support for tasks such as pedestrian dead reckoning and multi-source information fusion positioning. This paper proposes a PNAR method that combines a two-stream convolutional transformer architecture with self-supervised contrastive pretraining to address the challenge of learning robust, transferable, and generalizable representations from sensor data. The spatial stream captures dependencies across multi-modal sensor channels, while the temporal stream leverages an attention mechanism to uncover temporal relationships; together, the two streams effectively process multi-modal sensor data and model complex activities. Contrastive pretraining exploits unlabeled data to learn invariant and transferable representations, significantly enhancing generalization across datasets. The proposed method was evaluated on four public datasets, achieving 99.08% accuracy and a 99.22% F1-score and outperforming existing PNAR methods, including CNN-LSTM + Attention and Transformer-based PNAR models. Furthermore, cross-dataset experiments on data with different sensor configurations and activity labels validate the model's superior generalization ability.