Abstract
BACKGROUND: Physical activity is a key focus of public health, and subjective life expectancy is closely associated with individuals' physical and psychological well-being. This study aimed to identify the risk factors for subjective life expectancy among physically active and physically inactive middle-aged and older adults, and to provide an evidence base for developing differentiated health intervention strategies.
METHODS: Data were drawn from the China Health and Retirement Longitudinal Study (CHARLS) 2018 survey, and a total of 10,945 participants were included. Five machine learning models, namely Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were constructed separately for the active and inactive groups. To reduce bias caused by class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to generate synthetic samples for the minority class. The dataset was split into a training set (70%) and a testing set (30%), and hyperparameters were optimized by grid search with ten-fold cross-validation to improve the robustness and generalizability of the models. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1-score.
RESULTS: The active group (4,707 men and 4,885 women) had a mean age of 59.76 years, while the inactive group (662 men and 691 women) had a mean age of 63.00 years. The SVM model achieved the best performance in the inactive group (AUC: 0.797; accuracy: 0.722; sensitivity: 0.747), whereas the LightGBM model achieved the best performance in the active group (AUC: 0.775; accuracy: 0.745; specificity: 0.814). Feature importance analysis indicated that "age" was the most important variable in the SVM model, while "perceived health" was the most important variable in the LightGBM model.
CONCLUSION: Machine learning methods can effectively identify key risk factors influencing subjective life expectancy among middle-aged and older adults, and can guide targeted health management strategies tailored to populations with different levels of physical activity.
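For readers less familiar with the evaluation metrics named in the METHODS, the following is a minimal sketch of how accuracy, sensitivity, specificity, F1-score, and AUC can be computed for a binary outcome. It is an illustration only, not the study's code: the function names `binary_metrics` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions introduced here and are not CHARLS data.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and F1 from 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the negative class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg) and is
    meant for illustration, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can lead on AUC while another leads on a single thresholded metric.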