SysML: adaptive recommendation system for heterogeneous biomedical data preprocessing and modeling workflows



Abstract

The rapid growth of high-dimensional omics datasets in biomedical research has created an urgent need for computational frameworks that are both robust and adaptable to diverse data complexities. Although a wide range of specialized tools and algorithms is available, researchers often rely on trial and error to select suitable analytical workflows, which compromises both efficiency and reproducibility. In this study, we systematically benchmarked hundreds of algorithm-preprocessing combinations across three common biomedical data challenges: small sample sizes, missing values, and class imbalance. Our results show that tree-based models (e.g., Gradient Boosting Decision Tree, XGBoost, and Random Forest) consistently perform well in small-sample and missing-data scenarios, while partial least squares discriminant analysis (PLS-DA) is more effective at handling imbalanced classes. Unsupervised clustering methods such as K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) remain robust under moderate missingness, but their performance declines when missingness exceeds 10%. To support data-driven decision-making, we developed SysML, a web-based platform that recommends data-adaptive workflows based on dataset-specific characteristics. Validated on multiple real-world biomedical datasets, SysML demonstrated improvements in both model performance and workflow efficiency. Our findings underscore that adaptive data preprocessing, rather than algorithm choice alone, is critical for achieving reliable and reproducible machine learning applications in biomedicine.
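The idea of recommending a workflow from dataset characteristics can be illustrated with a small rule-based sketch. This is not SysML's actual implementation, which the abstract does not describe; the function names `profile` and `recommend` are hypothetical, and the `small_n` and `imbalance` thresholds are arbitrary placeholders chosen for illustration. Only the 10% missingness limit for clustering and the algorithm families named in each note come from the abstract.

```python
import math
from collections import Counter


def profile(rows, labels):
    """Summarize the three data challenges named in the abstract."""
    n_cells = sum(len(r) for r in rows)
    n_missing = sum(
        1
        for r in rows
        for v in r
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    counts = Counter(labels)
    return {
        "n_samples": len(rows),
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
        # Ratio of the largest class to the smallest class.
        "imbalance_ratio": (
            max(counts.values()) / min(counts.values()) if counts else 1.0
        ),
    }


def recommend(p, small_n=100, imbalance=3.0, cluster_missing_limit=0.10):
    """Map a profile to notes that mirror the abstract's findings.

    small_n and imbalance are hypothetical thresholds (assumptions);
    cluster_missing_limit reflects the 10% figure stated in the abstract.
    """
    notes = []
    if p["n_samples"] < small_n or p["missing_fraction"] > 0:
        notes.append("small-sample or missing data: consider tree-based models")
    if p["imbalance_ratio"] >= imbalance:
        notes.append("imbalanced classes: consider PLS-DA")
    if p["missing_fraction"] > cluster_missing_limit:
        notes.append("missingness above 10%: clustering may degrade")
    return notes


# Toy dataset: 4 samples, 1 of 8 cells missing, classes split 3 to 1.
rows = [[1.0, None], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
labels = ["a", "a", "a", "b"]
for note in recommend(profile(rows, labels)):
    print(note)
```

A real recommender would presumably rank full preprocessing-plus-model pipelines using benchmark results rather than fixed cutoffs; the sketch only shows how dataset-specific characteristics could drive the choice.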
