Abstract
BACKGROUND: Rheumatoid arthritis (RA) overlaps diagnostically with other autoimmune diseases that share similar pathological features, which leads to redundant testing and limited diagnostic specificity. Clinical indicators with high diagnostic and predictive value are therefore urgently needed to improve both the efficiency and the accuracy of diagnosis.
METHODS: To address this challenge, we propose a multidimensional embedded feature selection framework based on ensemble learning. The framework integrates Gradient Boosted Decision Trees (GBDT) and Logistic Regression (LR) models to extract potential diagnostic features from multi-source clinical datasets. GBDT captures complex nonlinear interactions among features, enhancing adaptability to heterogeneous data, while LR uses its sparsity-promoting characteristics to reduce dimensionality and highlight discriminative variables. To further improve interpretability, we apply the SHapley Additive exPlanations (SHAP) algorithm to quantify the contribution of each feature to the model's predictions and to identify novel diagnostic markers beyond traditional indicators.
RESULTS: Validated on real-world clinical data, the proposed framework achieved excellent diagnostic performance across multiple evaluation metrics, significantly enhancing the specificity and accuracy of RA diagnosis. Compared with conventional diagnostic methods, our model showed marked improvements in test accuracy and area under the receiver operating characteristic curve (AUC). SHAP not only reaffirmed the importance of rheumatoid factor (RF) and anti-cyclic citrullinated peptide (anti-CCP) antibodies but also revealed that systemic metabolic indicators, such as low high-density lipoprotein (HDL), elevated bile acids, and altered creatinine, carry independent diagnostic weight. This finding supports a shift toward viewing RA as a multi-system inflammatory disorder and may enable earlier clinical suspicion, even before classic articular manifestations appear.
CONCLUSION: The proposed multidimensional embedded feature selection framework showed strong diagnostic performance and interpretability in identifying key biomarkers for RA, reducing indicator redundancy and improving diagnostic precision. This pragmatic application of an established GBDT+LR framework, integrated with SHAP for interpretability and built on routine clinical data, offers potential clinical utility in RA diagnosis.
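ILLUSTRATION: SHAP attributions are built on Shapley values, which distribute the difference between a model's output for one patient and its output at a reference point across the individual features. The following minimal sketch computes exact Shapley values by enumerating feature coalitions. It is not the pipeline used in this study and does not call the `shap` library: the feature names, baseline values, patient values, and the hand-written `risk_score` function are all hypothetical stand-ins for a trained GBDT+LR model.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names and reference (baseline) values; not taken from the study.
FEATURES = ["rf", "anti_ccp", "hdl"]
BASELINE = {"rf": 10.0, "anti_ccp": 5.0, "hdl": 1.4}


def risk_score(x):
    """Toy score with one interaction term; a stand-in for a trained model."""
    return (0.02 * x["rf"] + 0.03 * x["anti_ccp"] - 0.5 * x["hdl"]
            + 0.001 * x["rf"] * x["anti_ccp"])


def value(coalition, x):
    """Score when features in `coalition` take the patient's values
    and all remaining features stay at their baseline values."""
    mixed = {f: (x[f] if f in coalition else BASELINE[f]) for f in FEATURES}
    return risk_score(mixed)


def shapley_values(x):
    """Exact Shapley values by enumerating every coalition.
    The cost grows exponentially, so this is feasible only for a few features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = value(set(subset) | {f}, x)
                without_f = value(set(subset), x)
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


# Hypothetical patient record.
patient = {"rf": 60.0, "anti_ccp": 40.0, "hdl": 0.9}
phi = shapley_values(patient)

# Rank features by the magnitude of their contribution.
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.3f}")
```

By construction, the contributions sum to the difference between the patient's score and the baseline score. A purely additive feature such as `hdl` in this toy score receives exactly its own term, while the interaction between `rf` and `anti_ccp` is split between those two features. With realistic feature counts, exact enumeration becomes infeasible and tree-specific or sampling-based SHAP estimators are used instead.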