Abstract
Bangladeshi Sign Language (BdSL) is a primary means of communication for many Deaf people in Bangladesh, yet communication with non-signers remains difficult. Most existing systems for BdSL focus on alphabets or a small set of static signs and do not address accurate word-level translation in real time. This work targets that gap. We aim to build a practical sign-to-text pipeline that is accurate, efficient, and reproducible. Concretely, we set four objectives: first, standardize data curation and preprocessing for 102 everyday BdSL words using pose-based features; second, design a compact Transformer that models temporal structure directly over skeletal landmarks; third, benchmark against strong baselines to quantify gains; and fourth, provide a clear training and evaluation recipe for replication. Our solution is a feature-first pipeline that converts video to MediaPipe hand and upper-body landmarks, forms 30-frame sequences with 258 features per frame, and classifies them with a lightweight Transformer encoder that uses positional encoding, residual blocks, and dropout, followed by global pooling and a dense softmax layer. The experimental setup uses a 60:40 train-test split, a batch size of 32, 30 epochs with the Adam optimizer, and accuracy as the primary metric, with identical preprocessing applied to a stacked LSTM baseline and an SSD MobileNet V2 image baseline. The proposed model reaches approximately 98.1% test accuracy and outperforms both baselines while keeping latency low on a standard laptop. Because the method relies on compact landmarks rather than raw images, it is privacy-friendly and suitable for on-device use. The approach is directly applicable to assistive communication in classrooms, clinics, and service counters, and it is simple to extend by adding new BdSL words to the training set.
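For concreteness, the sketch below shows one way to realize the classifier described above, assuming TensorFlow/Keras as the framework. Only the input shape (30 frames of 258 landmark features), the use of positional information, residual blocks, dropout, global pooling, the 102-class softmax, and the training recipe (Adam, batch size 32, 30 epochs) come from the abstract; the attention head count, key dimension, feed-forward width, dropout rates, and number of encoder blocks are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES, N_CLASSES = 30, 258, 102  # values stated in the abstract


class PositionalEmbedding(layers.Layer):
    """Adds a learned position embedding to each of the 30 frame vectors."""

    def __init__(self, seq_len, dim, **kwargs):
        super().__init__(**kwargs)
        self.seq_len = seq_len
        self.pos_emb = layers.Embedding(input_dim=seq_len, output_dim=dim)

    def call(self, x):
        positions = tf.range(self.seq_len)
        return x + self.pos_emb(positions)


def encoder_block(x, num_heads=4, key_dim=64, ff_dim=256, rate=0.1):
    # Self-attention sublayer with a residual connection and layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(rate)(attn))
    # Position-wise feed-forward sublayer, also residual.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + layers.Dropout(rate)(ff))


inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))  # MediaPipe landmark sequence
x = PositionalEmbedding(SEQ_LEN, N_FEATURES)(inputs)
x = encoder_block(x)
x = encoder_block(x)
x = layers.GlobalAveragePooling1D()(x)  # global pooling over the 30 frames
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)  # 102 BdSL words

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training as described: 60:40 split, batch size 32, 30 epochs with Adam.
# X has shape (n_samples, 30, 258); y holds integer class labels in [0, 102).
# model.fit(X_train, y_train, batch_size=32, epochs=30,
#           validation_data=(X_test, y_test))
```

Because the inputs are compact landmark vectors rather than raw frames, a model of this size trains and runs inference quickly on CPU-only hardware, which is consistent with the low-latency, on-device claim in the abstract.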