Abstract
Tracking small, fast-moving unmanned aerial vehicles (UAVs) in thermal infrared (TIR) imagery is challenging because of low-resolution targets, dynamic background clutter, and frequent occlusions. To address this, we introduce MemLoTrack, a one-stream Vision Transformer tracker that integrates a memory mechanism into a parameter-efficient LoRA framework. MemLoTrack extends a baseline tracker (LoRAT) with two components: (i) a gated First-In, First-Out (FIFO) memory bank (MB) for temporal context aggregation and (ii) a lightweight Memory Attention Layer (MAL) for information retrieval. Central to our method is a selective memory update policy, which commits a frame to the memory bank only when it satisfies both a classification confidence threshold (τ) and a Kalman filter-based motion consistency check. This gating mechanism prevents memory contamination caused by distractors, occlusions, and reappearance events. Training is efficient: only the LoRA adapters, the MAL, and the prediction head are updated, while the pretrained DINOv2 backbone remains frozen. Evaluated on the Anti-UAV410 benchmark, MemLoTrack (L(mem) = 7, τ = 0.8) achieves an AUC of 63.6 and a State Accuracy (SA) of 64.0, improving on the LoRAT baseline by +1.4 AUC and +1.5 SA. Compared with the state-of-the-art method FocusTrack, MemLoTrack attains higher AUC (63.6 vs. 62.8) and SA (64.0 vs. 63.9), at the cost of lower precision (P/P-Norm) scores. Furthermore, MemLoTrack runs at 153 FPS on a single RTX 4070 Ti SUPER, demonstrating that parameter-efficient fine-tuning combined with a selective memory mechanism is an effective and deployable strategy for real-time Anti-UAV tracking in demanding TIR environments.
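The selective memory update policy can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `GatedMemoryBank`, the `max_center_dist` pixel threshold, the use of `>=` for the confidence test, and the reduction of the motion consistency check to a distance between the observed box center and a center predicted by some motion model (for example, a Kalman filter) are all assumptions made for illustration. Only the memory length of 7 and τ = 0.8 come from the text.

```python
from collections import deque
import math


class GatedMemoryBank:
    """FIFO memory that stores a frame only when both gates pass (illustrative)."""

    def __init__(self, capacity=7, tau=0.8, max_center_dist=20.0):
        self.tau = tau                          # classification confidence threshold
        self.max_center_dist = max_center_dist  # hypothetical motion gate, in pixels
        # deque with maxlen drops the oldest entry automatically (FIFO behaviour)
        self.frames = deque(maxlen=capacity)

    def update(self, feature, confidence, observed_center, predicted_center):
        """Commit `feature` only if confidence and motion consistency both hold."""
        # Assumed motion check: observed center must lie near the predicted center.
        dist = math.dist(observed_center, predicted_center)
        if confidence >= self.tau and dist <= self.max_center_dist:
            self.frames.append(feature)
            return True
        return False  # frame rejected, memory left unchanged


bank = GatedMemoryBank(capacity=2, tau=0.8, max_center_dist=5.0)
bank.update("frame_a", 0.90, (10.0, 10.0), (11.0, 10.0))  # accepted
bank.update("frame_b", 0.50, (10.0, 10.0), (10.0, 10.0))  # rejected: low confidence
bank.update("frame_c", 0.95, (50.0, 50.0), (10.0, 10.0))  # rejected: inconsistent motion
```

Requiring both gates means a confident detection of a distractor far from the predicted trajectory is still kept out of memory, as is a plausible-looking location with a low classification score.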