Abstract
Down syndrome (DS) is one of the most prevalent chromosomal disorders, characterized by distinctive craniofacial features and a range of developmental and medical challenges. Owing to limited clinical expertise and high infrastructure costs, access to genetic testing is restricted in resource-constrained clinical settings. There is therefore a need for a non-invasive and equitable DS screening tool that facilitates DS diagnosis across a wide range of populations. In this study, we develop and validate a robust, interpretable deep learning model for the early detection of DS using facial images of infants. A hybrid feature extraction architecture combining RegNet X-MobileNet V3 and vision transformer (ViT)-Linformer is developed for effective feature representation. We use adaptive attention-based feature fusion to strengthen the proposed model's focus on diagnostically relevant facial regions. An extremely randomized trees (ExtraTrees) classifier, fine-tuned using Bayesian optimization with hyperband (BOHB), is employed to classify the features. To assess the model's generalizability, stratified five-fold cross-validation is performed. Compared with recent DS classification approaches, the proposed model demonstrates outstanding performance, achieving an accuracy of 99.10%, precision of 98.80%, recall of 98.87%, F1-score of 98.83%, and specificity of 98.81% on unseen data. The findings underscore the strengths of the proposed model as a reliable screening tool for identifying DS at an early stage from facial images. This study lays the foundation for building equitable, scalable, and trustworthy digital solutions for effective pediatric care across the globe.
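For readers less familiar with the five reported metrics, the following minimal sketch shows how each is conventionally derived from the confusion counts of a binary DS versus non-DS classifier. The names `ConfusionCounts` and `screening_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the study's data or evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion counts for a binary screen (positive class = DS)."""
    tp: int  # DS images predicted as DS
    fp: int  # non-DS images predicted as DS
    tn: int  # non-DS images predicted as non-DS
    fn: int  # DS images predicted as non-DS


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def screening_metrics(c: ConfusionCounts) -> dict:
    """Compute accuracy, precision, recall, F1-score, and specificity."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)  # also called sensitivity
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


# Illustrative counts only (assumed values, not results from the study).
counts = ConfusionCounts(tp=90, fp=10, tn=80, fn=20)
metrics = screening_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a screening context, recall indicates how many DS cases are caught, whereas specificity indicates how many unaffected infants are correctly cleared, which is why both are reported alongside accuracy.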