Abstract
Machine learning algorithms are commonly employed to classify and interpret high-dimensional data. The classification task is often split into two procedures, with different methods applied to achieve accurate results and interpretable outcomes: first, an effective subset of the high-dimensional features is extracted, and then the selected subset is used to train a classifier. Gradient Boosted Trees (GBT) are ensemble models that, owing to their robustness, ability to model complex nonlinear interactions, and feature interpretability, are well suited to complex applications. XGBoost (eXtreme Gradient Boosting) is a high-performance implementation of GBT that incorporates regularization, parallel computation, and efficient tree pruning, making it an efficient, interpretable, and scalable classifier with potential applications in medical data analysis. In this study, a Monte Carlo Gradient Boosted Trees (MCGBT) model is proposed for both feature reduction and classification. The proposed MCGBT method was applied to a lung cancer dataset for feature identification and classification. The dataset contains 107 radiomics, quantitative imaging biomarkers extracted from CT scans. A reduced set of 12 radiomics was identified, and patients were classified into cancer stages. A staging accuracy of 90.3% across 100 independent runs was achieved, on par with that obtained using the full set of 107 radiomics, enabling lean and deployable classifiers.
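The two-stage idea summarized above, repeated GBT fits to rank features, followed by retraining a classifier on the top-ranked subset, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses synthetic stand-in data in place of the radiomics dataset, scikit-learn's `GradientBoostingClassifier` in place of XGBoost, and an assumed Monte Carlo scheme of random train/test splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 107 "radiomics" features, binary stage label.
X, y = make_classification(n_samples=300, n_features=107,
                           n_informative=12, random_state=0)

# Monte Carlo loop (assumed scheme): fit a GBT on a random split each run
# and accumulate feature importances across runs.
n_runs = 20
importances = np.zeros(X.shape[1])
for run in range(n_runs):
    Xtr, _, ytr, _ = train_test_split(X, y, test_size=0.3, random_state=run)
    gbt = GradientBoostingClassifier(n_estimators=50, random_state=run)
    gbt.fit(Xtr, ytr)
    importances += gbt.feature_importances_

# Keep the 12 features with the highest accumulated importance.
top12 = np.argsort(importances)[::-1][:12]

# Retrain on the reduced feature set and evaluate on a held-out split.
Xtr, Xte, ytr, yte = train_test_split(X[:, top12], y, test_size=0.3,
                                      random_state=99)
clf = GradientBoostingClassifier(n_estimators=100, random_state=99)
clf.fit(Xtr, ytr)
acc = clf.score(Xte, yte)
```

In practice the reduced classifier's accuracy would be compared against one trained on all 107 features, as the study does, to confirm the smaller model is on par.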