Abstract
Single-cell RNA sequencing (scRNA-seq) enables a deep understanding of cellular differentiation during plant development and reveals heterogeneity among the cells of a given tissue. However, computationally characterizing this heterogeneity is complicated by the high dimensionality, sparsity, and biological noise inherent to the raw data. Here, we introduce PhytoCluster, an unsupervised deep learning algorithm that clusters scRNA-seq data by extracting latent features. We benchmarked PhytoCluster on four simulated datasets and five real scRNA-seq datasets generated with varying protocols and at varying data quality levels. A comprehensive evaluation indicated that PhytoCluster outperforms the other methods tested in clustering accuracy, noise removal, and signal retention. Additionally, we evaluated the latent features extracted by PhytoCluster as input to four machine learning models. The results highlight the ability of PhytoCluster to extract meaningful information from plant scRNA-seq data: models trained on the latent features achieved accuracy comparable to that of models trained on the raw features. We believe that PhytoCluster will be a valuable tool for disentangling complex cellular heterogeneity based on scRNA-seq data.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s42994-025-00196-6.