Abstract
This study introduces a novel framework that simultaneously addresses the challenges of performance accuracy and result interpretability in transcriptomic-data-based classification. Background/objectives: In biological data classification, it is challenging to achieve both high performance accuracy and interpretability at the same time. This study presents a framework to address both challenges in transcriptomic-data-based classification. The goal is to select features, models, and a meta-voting classifier that optimizes both classification performance and interpretability. Methods: The framework consists of a four-step feature selection process: (1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability; (2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix; (3) the selection of minimal sets of genes, whose collective discerning power covers 95% of the pathway-based discerning power; and (4) the introduction of adversarial samples to identify and filter genes sensitive to such samples. Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed based on the optimized model results. Results: The framework applied to two cancer classification problems showed that in the binary classification, the prediction performance was comparable to the full-gene model, with F1-score differences of between -5% and 5%. In the ternary classification, the performance was significantly better, with F1-score differences ranging from -2% to 12%, while also maintaining excellent interpretability of the selected feature genes. Conclusions: This framework effectively integrates feature selection, adversarial sample handling, and model optimization, offering a valuable tool for a wide range of biological data classification problems. Its ability to balance performance accuracy and high interpretability makes it highly applicable in the field of computational biology.