Abstract
Background: Early diagnosis of rheumatoid arthritis (RA) remains challenging because existing serum biomarkers have limited diagnostic performance. This exploratory study aimed to identify novel serum metabolite and lipoprotein biomarkers for RA and to develop interpretable machine learning models for screening. Methods: Serum from 77 RA patients and 70 healthy controls was analyzed by ¹H-NMR metabolomics, quantifying 38 endogenous metabolites and 112 lipoprotein parameters. Seven key biomarkers were identified using multiple selection criteria and Least Absolute Shrinkage and Selection Operator (LASSO) regression. The dataset was split into training and testing sets (7:3 ratio), and four machine learning models were constructed. The Random Forest (RF) model was further interpreted using the SHapley Additive exPlanations (SHAP) method. Results: The selected biomarkers, including formic acid and high-density lipoprotein 4 phospholipids (H4PL), showed significant associations with RA. In the internal test set, the RF model demonstrated promising discriminatory ability. Additionally, a proof-of-concept regression model for predicting the Disease Activity Score in 28 joints (DAS-28) was developed, explaining about 55% of its variance (R² = 0.548) in this cohort. Conclusions: This exploratory, single-center study identifies a novel panel of potential biomarkers for RA and provides a preliminary, interpretable predictive tool. The findings, particularly the high performance of certain markers on internal validation, are hypothesis-generating and underscore the need for validation in larger, multi-center cohorts. The DAS-28 prediction model also warrants further investigation.
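As a reading aid for the Methods and Results, the following minimal sketch illustrates two quantities mentioned above: a 7:3 train/test split of a 77 + 70 cohort and the coefficient of determination R². It is not the study's code. The abstract does not say whether the split was stratified by group or how R² was computed, so the per-class split, the rounding rule, and the standard definition R² = 1 − SS_res/SS_tot are assumptions, and the function names `stratified_split` and `r_squared` are hypothetical.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train/test, class by class.

    Assumption: the split is stratified by group; the abstract only
    states a 7:3 ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: per-class training size is rounded to the nearest integer.
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Group sizes taken from the abstract: 77 RA patients, 70 healthy controls.
labels = ["RA"] * 77 + ["HC"] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Under this definition, an R² of 0.548 means the residual sum of squares is about 45% of the total sum of squares, which is the sense in which the model explains roughly 55% of the DAS-28 variance.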