Abstract
Action recognition tasks differ significantly in how much they rely on local versus global features. For long-video understanding in particular, dynamically balancing the contributions of the two has become a critical challenge for improving recognition accuracy. This paper proposes a Multi-Layer Bidirectional Distillation (MBD) model built on the two-stream architecture. It employs a 3D CNN and a video Transformer to capture the local and global spatio-temporal features of videos, respectively, with the aim of exploring how the two feature types complement each other and enabling them to reinforce one another across diverse recognition scenarios. The model quantifies the contribution of each feature type on a given recognition task to determine feature dominance, and categorizes videos into feature-dominant groups accordingly. This grouping gives knowledge transfer a clear direction, overcoming the limitations of traditional unidirectional knowledge distillation. Bidirectional knowledge distillation is then performed at the intermediate and final layers, training the model to learn the complementary relationships between features and compensating for the insufficient representational capacity of non-dominant features. During inference, an adaptive fusion strategy based on feature dominance fuses the features through dynamic weighted summation, suppressing noise from non-dominant features while maximizing the discriminative advantages of dominant features. MBD is evaluated in systematic comparative experiments on four classic action recognition benchmarks (UCF101, HMDB51, Kinetics-400, and Something-Something V2). The results demonstrate that MBD not only excels in short-video recognition but also performs strongly in analyzing complex actions in long-video scenarios.
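
The dominance-guided mechanism described above can be illustrated with a minimal sketch. It is not the MBD implementation: the abstract does not specify how feature dominance is quantified, so the sketch assumes a hypothetical dominance score (the peak softmax confidence of each stream), and it covers only final-layer outputs, not the intermediate-layer distillation. All function names and the temperature value are illustrative assumptions.

```python
import numpy as np

EPS = 1e-12  # guards the logarithm against zero probabilities


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dominance_weights(local_logits, global_logits):
    """Per-sample weights for the local and global streams.

    ASSUMPTION: dominance is approximated by each stream's peak softmax
    confidence, normalized so the two weights sum to 1.
    """
    conf_local = softmax(local_logits).max(axis=-1)
    conf_global = softmax(global_logits).max(axis=-1)
    w_local = conf_local / (conf_local + conf_global)
    return w_local, 1.0 - w_local


def adaptive_fusion(local_logits, global_logits):
    """Dynamic weighted summation of the two streams' class probabilities."""
    w_local, w_global = dominance_weights(local_logits, global_logits)
    return (w_local[:, None] * softmax(local_logits)
            + w_global[:, None] * softmax(global_logits))


def directed_kd_loss(local_logits, global_logits, temperature=2.0):
    """Final-layer distillation whose direction follows dominance.

    For each sample the dominant stream acts as teacher and the other as
    student; the loss is the mean KL(teacher || student) over the batch.
    ASSUMPTION: hard per-sample direction and temperature-softened targets.
    """
    w_local, _ = dominance_weights(local_logits, global_logits)
    p_local = softmax(local_logits / temperature)
    p_global = softmax(global_logits / temperature)
    local_is_teacher = (w_local >= 0.5)[:, None]
    teacher = np.where(local_is_teacher, p_local, p_global)
    student = np.where(local_is_teacher, p_global, p_local)
    kl = (teacher * (np.log(teacher + EPS) - np.log(student + EPS))).sum(axis=-1)
    return kl.mean()


# Toy batch: sample 0 is local-dominant, sample 1 is global-dominant.
local_logits = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
global_logits = np.array([[0.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
fused = adaptive_fusion(local_logits, global_logits)
loss = directed_kd_loss(local_logits, global_logits)
```

In this toy batch the more confident stream receives the larger fusion weight for each sample, and the distillation term transfers knowledge from that stream to the less confident one, which mirrors the per-group transfer direction the abstract describes.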