A comparative analysis of gene expression profiling by statistical and machine learning approaches.

MOTIVATION: Many machine learning (ML) models developed to classify phenotype from gene expression data provide interpretations for their decisions, with the aim of understanding biological processes. For many models, including neural networks, these interpretations are lists of genes ranked by their importance for the predictions, with top-ranked genes likely linked to the phenotype. In this article, we discuss the limitations of such approaches using integrated gradients, an explainability method developed for neural networks, as an example.

RESULTS: Experiments are performed on RNA sequencing data from public cancer databases. A collection of ML models, including multilayer perceptrons and graph neural networks, is trained to classify samples by cancer type. Gene rankings from integrated gradients are compared with genes highlighted by statistical feature selection methods such as DESeq2 and by other learning methods that measure global feature contribution. Experiments show that a small set of top-ranked genes is sufficient to achieve good classification. However, similar performance is possible with lower-ranked genes, although larger sets are required. Moreover, significant differences in the top-ranked genes, especially between statistical and learning methods, prevent a comprehensive biological understanding. In conclusion, while these methods identify pathology-specific biomarkers, it remains uncertain whether the gene sets selected by explainability techniques are complete enough to understand biological processes.

AVAILABILITY AND IMPLEMENTATION: Python code and datasets are available at https://github.com/mbonto/XAI_in_genomics.
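
As an illustration of how a gene ranking of the kind discussed above can be derived from integrated gradients, the following is a minimal sketch, not the authors' implementation. It assumes a toy logistic model in place of a trained neural network, synthetic values in place of RNA sequencing data, and an all-zero baseline; the names `integrated_gradients`, `w`, `b`, `baseline`, and `n_genes` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients for the toy model f(x) = sigmoid(w @ x + b).

    The path integral from `baseline` to `x` is approximated with a
    midpoint Riemann sum over `steps` points.
    """
    alphas = (np.arange(steps) + 0.5) / steps              # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)     # shape (steps, n_genes)
    p = sigmoid(path @ w + b)                              # model output along the path
    grads = (p * (1.0 - p))[:, None] * w                   # df/dx at each path point
    return (x - baseline) * grads.mean(axis=0)             # one attribution per gene


# Synthetic stand-ins (assumptions): random weights and one random sample.
rng = np.random.default_rng(0)
n_genes = 6
w = rng.normal(size=n_genes)
b = 0.1
x = rng.normal(size=n_genes)
baseline = np.zeros(n_genes)

attributions = integrated_gradients(x, baseline, w, b)

# Rank genes by absolute attribution, most important first.
ranking = np.argsort(-np.abs(attributions))

# Completeness property: attributions should sum to f(x) - f(baseline),
# up to the numerical error of the Riemann sum.
gap = attributions.sum() - (sigmoid(x @ w + b) - sigmoid(baseline @ w + b))
print("ranking:", ranking)
print("completeness gap:", gap)
```

A ranking obtained this way is local to one sample and depends on the chosen baseline. Aggregating attributions over many samples, for example by averaging absolute values per class, is one possible way to obtain a global ranking, and is stated here as an assumption rather than as the procedure used in the article.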

Journal:	Bioinformatics Advances	Impact factor:	2.800
Year:	2025	Citation:	2024 Dec 18; 5(1):vbae199
DOI:	10.1093/bioadv/vbae199
