Abstract
This study proposes a real-time augmented reality gesture interaction algorithm based on the Swin Transformer and a masked autoencoder, addressing the challenges that traditional Transformer models face in spatio-temporal feature extraction and real-time performance. During data preprocessing, the study uses a synthetic data annotation method to automatically generate 3D gesture images and annotate joint information, significantly improving annotation efficiency. Using weighted Euclidean distance and structural similarity optimization, the paper proposes an image denoising model based on maximum a posteriori probability that effectively reduces noise interference in gesture image analysis. The gesture detection and segmentation module combines EfficientNet and Transformer models: it fuses shallow and deep features through skip connections, realizes multi-scale feature extraction, and sharpens attention on the target region through a triplet attention module. Additionally, the paper introduces a local texture feature prior (RTHLBP) to improve gesture recognition and segmentation accuracy. In the gesture classification module, the paper proposes a ViT architecture based on a masked autoencoder that aligns features at different levels through a dynamic weight fusion strategy and uses the relative total variation map as a self-supervision signal, significantly improving classification performance. Experimental results demonstrate that the proposed model's accuracy, F1 score, and MIoU on the four GTEA sub-datasets surpass those of traditional CNN, Transformer, MobileNet, and DenseNet models, particularly on small datasets. The paper also optimizes the model's real-time performance through a multi-core parallel computing strategy; experiments show that as the number of DSP cores increases, computation time drops significantly while computational efficiency remains high.