Abstract
The rapid growth of high-dimensional omics datasets in biomedical research has created an urgent need for computational frameworks that are both robust and adaptable to diverse data complexities. Although a wide range of specialized tools and algorithms is available, researchers often rely on trial-and-error approaches to select suitable analytical workflows, compromising both efficiency and reproducibility. In this study, we systematically benchmarked hundreds of algorithm-preprocessing combinations across three common biomedical data challenges: small sample sizes, missing values, and class imbalance. Our results show that tree-based models (e.g., gradient boosting decision trees, XGBoost, and random forests) consistently perform well in scenarios involving small samples and missing data, while partial least squares discriminant analysis (PLS-DA) is more effective at addressing imbalanced classes. Unsupervised clustering methods such as K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) remain robust under moderate missingness, but their performance declines when missingness exceeds 10%. To support data-driven decision-making, we developed SysML, a web-based platform that recommends data-adaptive workflows based on dataset-specific characteristics. Validated on multiple real-world biomedical datasets, SysML demonstrated improvements in both model performance and workflow efficiency. Our findings underscore that adaptive data preprocessing, rather than algorithm choice alone, is critical for achieving reliable and reproducible machine learning applications in biomedicine.