Abstract
The single-cell sequencing revolution enables simultaneous molecular profiling of multiple modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues. Among the analysis steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to automatically extract biological knowledge, such as marker genes, promoting efficient and automated cell-type annotation. To evaluate the capability of modern LLMs in automating cell-type identification, we introduce an automated cell-type annotation method together with a comprehensive benchmark, Single-cell Omics Arena. Specifically, we compile 11 publicly available single-cell RNA sequencing (scRNA-seq) datasets and evaluate eight LLMs across 1226 cell-type annotation-related tasks. This effort establishes a foundation for automated cell-type annotation from scRNA-seq data using interpretable features such as gene names. Building upon this benchmark, we introduce domain-specific chain-of-thought prompting techniques to enhance the accuracy of cell-type annotation and facilitate the extraction of relevant biological insights. Finally, to accommodate non-interpretable features, we propose using a pretrained cross-modality translation module based on a variational autoencoder (VAE) to convert features such as epigenetic marks into interpretable representations, which extends LLM-based cell-type annotation to non-RNA-based sequencing technologies. In summary, our benchmark provides key insights into automated cell-type annotation from scRNA-seq data and demonstrates the potential of cross-modality translation for handling non-interpretable features.