Abstract
Communication between deaf or mute individuals and hearing persons is often hindered by the lack of a mutually understood sign or vocal language. To bridge this gap, Indian Sign Language Recognition (ISLR) systems are essential. This paper proposes a real-time ISLR framework based on the YOLOv10-ST model, which integrates the Swin Transformer into the YOLOv10 architecture for enhanced feature extraction. The model also incorporates Mish activation to improve gradient flow and detection accuracy. A custom dataset comprising 15,000 static images (1,000 per sign for 15 signs) and 35 dynamic videos (covering 7 sign classes) was used for training and evaluation. Experimental results demonstrate high performance: the model achieves 97.50% precision, 98.10% recall, and 96.58% F1-score for image-based sign recognition, and 95.24% precision, 96.00% recall, and 95.87% F1-score for video-based gestures. The model also attains a mean Average Precision (mAP) of 97.62% and real-time inference speeds of 48.7 FPS. Ablation studies validate the contributions of the Swin Transformer and Mish activation, and paired t-tests confirm that the improvements are statistically significant. These findings demonstrate that the YOLOv10-ST model efficiently recognizes static and dynamic ISL in real time with minimal computational overhead.
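For reference, the Mish activation mentioned in the abstract has a simple closed form, mish(x) = x · tanh(softplus(x)); its smoothness and non-zero gradient for negative inputs are what motivate its use over ReLU here. A minimal standalone sketch (not the paper's implementation):

```python
import math

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)).

    softplus(x) = ln(1 + e^x); log1p keeps the computation
    numerically stable for small x. Unlike ReLU, Mish is smooth
    and passes small negative values, which helps gradient flow.
    """
    return x * math.tanh(math.log1p(math.exp(x)))
```

In practice a framework-provided version (e.g. a built-in Mish layer) would be used inside the detection network rather than this scalar form.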