Abstract
Sign language recognition technology serves as a crucial bridge between deaf and hearing individuals and plays a substantial role in promoting social inclusivity. Conventional sign language recognition methods that rely on static images cannot capture the dynamic characteristics and temporal information inherent in sign language, which limits their practical applicability in real-world scenarios. To address this limitation, we propose SSTA-ResT, a framework that integrates ResNet, soft spatiotemporal attention (SSTA), and Transformer encoders. The framework uses ResNet to extract robust spatial feature representations, employs the lightweight SSTA module for dual-path complementary representation enhancement that strengthens spatiotemporal associations, and leverages the Transformer encoder to capture long-range temporal dependencies. Experimental results on the LSA64 Argentine Sign Language dataset show that the proposed method achieves an accuracy of 96.25%, a precision of 97.18%, and an F1 score of 0.9671, surpassing existing methods on all metrics while maintaining a relatively low parameter count of 11.66 M. These results demonstrate the framework's effectiveness and practicality for sign language video recognition.
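The three-stage pipeline described above (per-frame spatial features, attention-based enhancement, temporal encoding) can be illustrated with a minimal numpy sketch. The dual-path weighting and the random projections below are illustrative assumptions standing in for learned components, not the paper's actual SSTA module or ResNet/Transformer weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: T video frames, D-dimensional per-frame features
# (in the paper these features would come from a ResNet backbone).
T, D = 8, 16
feats = rng.standard_normal((T, D))

# Soft spatiotemporal attention, sketched as two complementary paths:
# one path softly weights time steps, the other weights feature
# channels, and both re-weighted views are added back as residuals.
w_t = softmax(feats.mean(axis=1))  # (T,) temporal attention weights
w_c = softmax(feats.mean(axis=0))  # (D,) channel attention weights
enhanced = feats + feats * w_t[:, None] + feats * w_c[None, :]

# Single-head self-attention over time (Transformer-encoder style),
# capturing long-range temporal dependencies between frames.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = enhanced @ Wq, enhanced @ Wk, enhanced @ Wv
attn = softmax(Q @ K.T / np.sqrt(D), axis=-1)  # (T, T) frame-to-frame
out = attn @ V                                 # (T, D) mixed features

print(out.shape)
```

In the full model, `out` would be pooled and passed to a classifier over the 64 LSA64 sign classes; here the sketch only traces the tensor shapes through the three stages.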