Abstract
Early and accurate stroke prediction is critical to reducing the risk of death and disability, yet irrelevant and sparse information in clinical datasets often undermines model performance. This study proposes a novel machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. The approach consists of two phases. In the preprocessing phase, important features are selected using the Information Gain Ratio (IGR) and missing data are imputed with K-Nearest Neighbour (KNN), which improves both data integrity and computational efficiency. In the classification phase, a Deep Neural Network (DNN) is trained on the preprocessed data to predict stroke risk. The model was assessed with a 10-fold nested cross-validation scheme to provide unbiased internal validation and avoid data leakage. Its effectiveness was evaluated using seven statistical indices (false positive rate, precision, sensitivity, specificity, F1-score, accuracy and AUC-ROC) and compared with classical classification methods. The framework achieved an accuracy of 94.32%, precision of 95.96%, F1-score of 95.00%, specificity of 94.67% and sensitivity of 94.06%. However, the study is restricted to internal validation and a single optimizer (ALO), so further assessment on external datasets is needed. The findings indicate that the hybrid IGR-KNN-DNN framework offers strong predictive potential and computational efficiency for early stroke-risk assessment; additional validation would strengthen its clinical applicability.
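For readers less familiar with the evaluation indices named above, the following is a minimal sketch of how the threshold-based ones are conventionally derived from binary confusion-matrix counts. It is not the study's evaluation code: the function names `classification_metrics` and `f1_from` and the counts are hypothetical, all denominators are assumed to be non-zero, and AUC-ROC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard indices from binary confusion-matrix counts (assumes non-zero denominators)."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "f1": f1_from(precision, sensitivity),
    }


def f1_from(precision, sensitivity):
    """F1-score as the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts for illustration only; they are not taken from the study.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)

# Consistency check using the precision and sensitivity reported in the abstract.
reported_f1 = f1_from(0.9596, 0.9406)
```

Under these standard definitions, the reported precision (95.96%) and sensitivity (94.06%) give an F1-score of about 95.00%, which agrees with the reported value.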