Abstract
Accurate discrimination between iron deficiency anemia (IDA) and thalassemia trait (TT) is clinically essential for the effective management of patients with hypochromic microcytic anemia. Although numerous discrimination indices based on red blood cell (RBC) parameters have been proposed, their diagnostic accuracy remains suboptimal and highly population-specific. This study aimed to develop and validate a machine learning model to improve discriminative performance. We used a derivation cohort of 376 patients (IDA, n = 186; TT, n = 190) for model development and internal validation, and a separate validation cohort of 196 patients for external testing. Five machine learning algorithms were trained and evaluated: Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Random Forest (RF), and AdaBoost. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The LightGBM classifier achieved an AUC of 0.953 (accuracy 86.9%) in the internal validation set and 0.980 (accuracy 93.1%) in the testing set. In the external validation cohort, the model generalized well, attaining an AUC of 0.992 with an accuracy of 98.5% and a sensitivity of 96.7%. Feature importance analysis identified mean corpuscular hemoglobin concentration (MCHC) and red cell distribution width-standard deviation (RDW-SD) as the most discriminative predictors. We developed and externally validated a LightGBM-based classifier that accurately discriminates between IDA and TT, offering clinicians a reliable, non-invasive decision-support tool for the differential diagnosis of microcytic anemia. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00277-026-06894-5.
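For readers less familiar with the performance measures named above, the following is a minimal sketch of how AUC, accuracy, sensitivity, specificity, PPV, and NPV can be computed for a binary classifier. It is not the study's code: the function names, the 0.5 decision threshold, the choice of TT as the positive class, and the eight toy cases are all assumptions made purely for illustration and do not reproduce any reported result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def binary_metrics(y_true, y_pred, positive=1):
    """Threshold-based metrics from paired true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=ratio(tp, tp + fn),   # true positive rate
        specificity=ratio(tn, tn + fp),   # true negative rate
        ppv=ratio(tp, tp + fp),           # positive predictive value
        npv=ratio(tn, tn + fn),           # negative predictive value
    )


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = TT, 0 = IDA (labeling is an assumption).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 cutoff

toy_metrics = binary_metrics(toy_labels, toy_preds)
toy_auc = auc(toy_labels, toy_scores)
```

Note that AUC is computed from the continuous scores and does not depend on the cutoff, whereas the other five measures change with the chosen threshold, and PPV and NPV additionally depend on the proportion of each diagnosis in the cohort.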