Abstract
To improve the performance of transformer-based trackers in complex scenes, we propose a visual object tracking method built on three components: a pyramid channel attention mechanism, a hierarchical cross-attention structure, and an attention-guided multi-layer perceptron. The pyramid channel attention mechanism dynamically reweights feature channels so that informative channels are emphasized across different scales. The hierarchical cross-attention structure facilitates effective feature interaction. The attention-guided multi-layer perceptron applies nonlinear transformations under attention guidance to improve the feature representation. Experiments on benchmark datasets demonstrate the superior performance of the proposed method.
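
As a rough illustration of the idea behind pyramid channel attention, the sketch below pools each feature channel at several grid resolutions, combines the pooled values into a single score, and rescales the channel by a sigmoid gate. This is a minimal sketch under stated assumptions, not the architecture proposed here: the function names `adaptive_avg_pool` and `pyramid_channel_attention`, the pyramid levels `(1, 2)`, and the fixed linear `weights` (which would be learned in practice) are all hypothetical choices made for illustration.

```python
import math


def adaptive_avg_pool(channel, bins):
    """Average-pool one H x W channel (a list of rows) into a flattened bins x bins grid.

    Assumes H >= bins and W >= bins so that every bin is non-empty.
    """
    h, w = len(channel), len(channel[0])
    pooled = []
    for i in range(bins):
        r0, r1 = i * h // bins, (i + 1) * h // bins
        for j in range(bins):
            c0, c1 = j * w // bins, (j + 1) * w // bins
            cells = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(sum(cells) / len(cells))
    return pooled


def pyramid_channel_attention(feature, weights, levels=(1, 2)):
    """Rescale each channel by a gate computed from multi-scale pooled descriptors.

    feature: list of channels, each an H x W list of rows.
    weights: hypothetical linear weights, one per pooled value
             (sum of bins * bins over all levels); learned in a real model.
    Returns the reweighted feature and the per-channel gates.
    """
    reweighted, gates = [], []
    for channel in feature:
        # Concatenate the pooled values from every pyramid level.
        descriptor = [v for bins in levels for v in adaptive_avg_pool(channel, bins)]
        score = sum(wt * d for wt, d in zip(weights, descriptor))
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
        gates.append(gate)
        reweighted.append([[gate * v for v in row] for row in channel])
    return reweighted, gates


# Usage: two 2 x 2 channels; levels (1, 2) give 1 + 4 = 5 pooled values per channel.
feature = [
    [[0.0, 0.0], [0.0, 0.0]],  # uninformative channel
    [[1.0, 1.0], [1.0, 1.0]],  # strongly activated channel
]
weights = [1.0] * 5  # assumed fixed weights, for illustration only
reweighted, gates = pyramid_channel_attention(feature, weights)
```

With these assumed weights, the all-zero channel receives a neutral gate of 0.5, while the strongly activated channel receives a gate close to 1, so its responses pass through nearly unchanged.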