Abstract
The application of machine learning in clinical medicine requires systematic evaluation across diverse modeling paradigms. We benchmarked 10 models spanning classic machine learning, tabular deep learning, and automated machine learning (AutoML) on eight real-world clinical risk prediction datasets. Using a 10-times-repeated 5-fold cross-validation protocol, we assessed discrimination, calibration, and clinical utility. Gradient-boosted decision trees, particularly CatBoost, and the tabular foundation model TabPFN were consistently the most robust, forming the top performance tier; AutoGluon was also strongly competitive. In contrast, most other tabular deep learning models showed marked instability across datasets. These findings indicate that advanced gradient boosting models and TabPFN are leading strategies for building high-performance clinical risk prediction models, with AutoML offering a reliable alternative. This study provides empirical guidance for clinicians and data scientists in selecting appropriate modeling strategies.