Abstract
Omics data analysis often yields extensive lists of genes or enriched gene sets, making it difficult to interpret the underlying cellular mechanisms. Existing gene set categorization methods typically rely on the Gene Ontology hierarchy and neglect the semantic similarity encoded in textual descriptions. We developed Slimformer, an embedding-based natural language processing model that learns contextual relationships between gene sets from their names, descriptions, and associated genes. A supervised classifier, trained on a manually curated gold standard, then assigns these embeddings to process categories. On 2856 annotated gene sets, Slimformer achieved a balanced accuracy of 82.4% and an F1-score of 0.867. Applied to gene expression data from human cells infected with Respiratory Syncytial Virus, Slimformer revealed strong downregulation of major cell cycle processes, a finding that is highly relevant to the viral pathomechanism and was overlooked by the other tools we tested. By integrating linguistic and functional information, Slimformer enhances the interpretability of omics data and provides a flexible framework for systematic gene set categorization.
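For readers unfamiliar with the two reported metrics, the following minimal Python sketch shows how balanced accuracy and an F1-score can be computed for a multi-class categorization task. It is an illustration only, not the evaluation code used for Slimformer: the category labels and predictions are toy values, the function names are hypothetical, and macro-averaging of the F1-score is an assumption, since the averaging scheme is not stated here.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare categories weigh as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro-averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for six gene sets (illustrative, not taken from the gold standard).
y_true = ["cell cycle"] * 4 + ["immune response"] * 2
y_pred = ["cell cycle", "cell cycle", "cell cycle", "immune response",
          "immune response", "cell cycle"]

print(balanced_accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Balanced accuracy is the natural choice when process categories differ strongly in size, because a classifier that favors the largest category gains nothing from doing so.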