Abstract
To address the degradation of positioning accuracy that moving objects cause in traditional visual-inertial navigation systems (VINS) in dynamic scenes, this paper proposes an improved algorithm based on the VINS-Fusion framework that combines multi-scale feature optimization with real-time dynamic feature elimination. First, at the feature-extraction front end, the SuperPoint encoder is restructured: dual-branch multi-scale feature fusion and 1 × 1 convolutional channel compression allow it to capture shallow texture details and deep semantic information simultaneously, strengthening the discriminability of static background features and reducing erroneous feature elimination near dynamic-static boundaries. Second, in the dynamic-processing module, the ASORT (Adaptive Simple Online and Realtime Tracking) algorithm is designed. ASORT combines an object-detection network, adaptive Kalman-filter trajectory prediction, and Hungarian-algorithm matching to identify moving objects in images in real time, remove their associated dynamic feature points from the optimized feature set, and pass only reliable static features to the back-end optimization, thereby minimizing pose-estimation errors caused by dynamic interference. Experiments on the KITTI dataset show that, compared with the original VINS-Fusion algorithm, the proposed method improves absolute trajectory accuracy by approximately 14.8% on average, with an average single-frame processing time of 23.9 ms. These results indicate that the proposed approach offers an efficient and robust solution for visual-inertial navigation in highly dynamic environments.
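The ASORT pipeline summarized above (trajectory prediction, Hungarian matching, and dynamic-point removal) can be illustrated with a minimal sketch. This is not the paper's implementation: the constant-velocity box shift stands in for the adaptive Kalman prediction, IoU is assumed as the matching cost, and all function names and thresholds are illustrative.

```python
# Illustrative ASORT-style association step: predict each track's box,
# match predictions to detections with the Hungarian algorithm on an IoU
# cost matrix, then discard feature points inside matched dynamic boxes.
import numpy as np
from scipy.optimize import linear_sum_assignment

def predict_cv(box, vel):
    """Constant-velocity prediction for a box (x1, y1, x2, y2);
    a stand-in for the adaptive Kalman filter's predict step."""
    return box + np.concatenate([vel, vel])  # shift both corners by (vx, vy)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(pred_boxes, det_boxes, iou_thresh=0.3):
    """Hungarian matching on a (1 - IoU) cost matrix; returns matched
    (track_index, detection_index) pairs above the IoU threshold."""
    cost = np.array([[1.0 - iou(p, d) for d in det_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]

def filter_static(points, dynamic_boxes):
    """Keep only feature points lying outside every dynamic bounding box,
    so the back end receives static features only."""
    keep = []
    for p in points:
        inside = any(b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]
                     for b in dynamic_boxes)
        if not inside:
            keep.append(p)
    return np.array(keep)
```

In a full system, the matched detections would also update each track's Kalman state before the next frame; the sketch only shows the per-frame predict-match-filter cycle.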