Abstract
BACKGROUND: Identification of microbes with large impacts on their microbial communities, known as keystone microbes, is a topic of long-standing interest in microbiome research. However, many approaches to identifying keystone microbes are limited by the inherent nonlinearity and state-dependence of microbial dynamics. Machine learning approaches have been applied to address these shortcomings but often require more data than are available for a given microbial system. RESULTS: We propose a keystone identification approach called KeySDL, which reduces the amount of data required by incorporating assumptions about the type of microbial dynamics present in the experimental system. The data are modeled as originating from a Generalized Lotka-Volterra (GLV) model, a framework commonly used to simulate microbial systems. The parameters of this model are then estimated using Sparse Dictionary Learning (SDL). We also propose a self-consistency score to help evaluate whether the assumption of GLV dynamics is reasonable for a given dataset, whether that dataset is to be analyzed with KeySDL or with other analysis tools validated using GLV simulation. CONCLUSION: Compared to existing methods, this approach allows accurate prediction of keystone microbes from small numbers of samples and provides an output interpretable as reconstructed system dynamics. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13040-026-00527-3.
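To make the GLV assumption concrete, the sketch below works through a toy example in Python. It is an illustration under stated assumptions, not the KeySDL implementation. It assumes the standard GLV form dx_i/dt = x_i (r_i + sum_j A_ij x_j), noise-free steady-state samples, and growth rates r that are known and equal to 1 (only the ratio A_ij / r_i is identifiable from steady states alone). Ordinary least squares stands in for Sparse Dictionary Learning, and the function names and the Bray-Curtis impact score are hypothetical choices made for this example.

```python
import itertools
import numpy as np

def steady_state(A, r, present):
    """Steady state of the GLV model restricted to the species flagged in `present`."""
    idx = np.flatnonzero(present)
    x = np.zeros(len(r))
    # For each present species, x_i > 0 implies r_i + sum_j A_ij x_j = 0.
    x[idx] = np.linalg.solve(A[np.ix_(idx, idx)], -r[idx])
    return x

def fit_interactions(samples, r):
    """Estimate each row of A by least squares (a stand-in for SDL; r assumed known)."""
    n = samples.shape[1]
    A_hat = np.zeros((n, n))
    for i in range(n):
        X = samples[samples[:, i] > 0]          # samples in which species i is present
        target = -r[i] * np.ones(len(X))        # steady-state condition: X @ A_i = -r_i
        A_hat[i] = np.linalg.lstsq(X, target, rcond=None)[0]
    return A_hat

def keystoneness(A, r):
    """Hypothetical impact score: shift in the remaining community when one species is removed."""
    n = len(r)
    full = steady_state(A, r, np.ones(n, dtype=bool))
    scores = np.zeros(n)
    for i in range(n):
        present = np.ones(n, dtype=bool)
        present[i] = False
        reduced = steady_state(A, r, present)
        p = np.delete(full, i)
        q = np.delete(reduced, i)
        p, q = p / p.sum(), q / q.sum()         # compare relative abundances
        scores[i] = 0.5 * np.abs(p - q).sum()   # Bray-Curtis dissimilarity of compositions
    return scores

# Toy ground truth: weak interactions and strong self-limitation keep every steady state positive.
rng = np.random.default_rng(0)
n = 4
A = rng.uniform(-0.1, 0.1, size=(n, n))
np.fill_diagonal(A, -1.0)
r = np.ones(n)

# One steady-state sample per non-empty species subset.
subsets = [np.array(s, dtype=bool) for s in itertools.product([0, 1], repeat=n) if any(s)]
samples = np.array([steady_state(A, r, s) for s in subsets])

A_hat = fit_interactions(samples, r)
scores = keystoneness(A_hat, r)
print("max abs error in A:", np.abs(A_hat - A).max())
print("most impactful species index:", int(np.argmax(scores)))
```

Because the recovered matrix `A_hat` is itself a GLV model, species removals can be simulated directly from it, which is the sense in which a fitted model is interpretable as reconstructed system dynamics. With real, noisy, and sparser data, a sparsity-promoting estimator would replace the plain least-squares step used here.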