Abstract
Action quality assessment (AQA) plays a pivotal role in intelligent sports analysis, supporting both athlete training and refereeing decisions. However, existing datasets and methods focus on short-duration actions and lack comprehensive spatiotemporal modeling for complex, long-duration sequences such as those in trampoline gymnastics. To bridge this gap, we introduce Trampoline-AQA, a novel dataset comprising 206 video clips from major competitions held between 2018 and 2024, with dual-modality data (RGB and optical flow) and rich annotations. Building on this dataset, we propose a framework comprising a Temporal Feature Enhancer (TFE) and a Forward-looking Causal Cross-modal Attention (FCCA) module, which yields more accurate and robust scoring for long-duration, high-speed routines, particularly under motion ambiguity. Our approach achieves a Spearman rank correlation of 0.938 on Trampoline-AQA and 0.882 on UNLV-Dive, demonstrating superior performance and generalization capability.
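For readers unfamiliar with the evaluation metric, the Spearman correlation reported above measures how well the ranking of predicted scores agrees with the ranking of judges' scores. The following is a minimal sketch, not the authors' evaluation code; the function names and the five example scores are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted, judged):
    """Pearson correlation computed on the rank-transformed scores."""
    rx, ry = average_ranks(predicted), average_ranks(judged)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five routines (illustrative values only).
predicted = [52.1, 48.3, 55.0, 50.2, 47.9]
judged = [51.8, 49.0, 54.6, 48.5, 47.5]
rho = spearman(predicted, judged)
```

Because the metric depends only on rank order, a model can score well even if its predicted values are offset or rescaled relative to the judges' scores, as long as it orders the routines the same way.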