Abstract
Transferring cell type annotations from reference dataset to query dataset is a fundamental problem in AI-based single-cell data analysis. However, single-cell measurement techniques lead to domain gaps between multiple batches or datasets. The existing deep learning methods lack consideration on batch integration when learning reference annotations, which is a challenge for cell type annotation on multiple query batches. For cell representation, batch integration can not only eliminate the gaps between batches or datasets but also improve the heterogeneity of cell clusters. In this study, we proposed PLKD, a cell type annotation method based on pattern learning and knowledge distillation. PLKD consists of Teacher (Transformer) and Student (MLP). Teacher groups all input genes (features) into different gene sets (patterns), and each pattern represents a specific biological function. This design enables model to focus on biologically relevant functions interaction rather than gene-level expression that is susceptible to gaps of batches. In addition, knowledge distillation makes lightweight Student resistant to noise, allowing Student to infer quickly and robustly. Furthermore, PLKD supports multi-modal cell type annotation, multi-modal integration and other tasks. Benchmark experiments demonstrate that PLKD is able to achieve accurate and robust cell type annotation.