Abstract
This work investigates the benefits of incorporating 3D hand skeletal information into a multi-stream deep learning framework for sign language recognition from RGB videos. Because most sign language datasets consist of standard RGB video without depth information, we employ a robust architecture originally developed for 3D human pose estimation to infer the 3D coordinates of hand joints directly from RGB data. These estimates are then combined with additional sign language feature streams, such as convolutional neural network representations of the hands and head pose estimates, in an attention-based encoder-decoder that recognizes the signs. We evaluate the proposed method on the AUTSL and WLASL isolated sign corpora and observe substantial improvements from incorporating 3D hand pose information. Our method achieves 90.5% accuracy on AUTSL and 88.2% on WLASL, with F1-scores above 0.89, outperforming several state-of-the-art approaches.