Abstract
MOTIVATION: Detection of marker genes and other downstream analyses in single-cell sequencing experiments rely heavily on the results of unsupervised clustering of cells. However, in two-dimensional representations of clustering results, several cells appear as outliers or lie in the border area of a cluster, suggesting that these cells may also be outliers in the high-dimensional data space and do not adequately represent a particular cell type. RESULTS: We propose a novel and fast approach, scTrimClust, for identifying cells that may be interpreted as extreme specimens of their cell type. Identification is based on measuring each cell's distance to its nearest neighbours in the high-dimensional gene expression space and marking as extreme those cells whose minimum neighbour distance exceeds a defined quantile threshold for their cluster. Using two data examples, we study how cells with non-representative expression profiles can influence the results of the analysis. scTrimClust is also useful for comparing the influence of other parameters of an scRNA-seq analysis, e.g. normalization or the clustering approach, on the results. We also provide a software implementation of scTrimClust. AVAILABILITY AND IMPLEMENTATION: The scTrimClust approach is available in the R-package RepeatedHighDim (https://cran.r-project.org/web/packages/RepeatedHighDim/index.html).
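The identification step described under RESULTS can be illustrated with a minimal Python sketch. It is not the RepeatedHighDim implementation and does not reproduce its interface. The function name `flag_extreme_cells`, the use of Euclidean distance, the restriction of neighbours to cells of the same cluster, and the default quantile `q = 0.9` are all assumptions made for illustration only.

```python
import numpy as np


def flag_extreme_cells(expr, labels, q=0.9):
    """Flag cells whose nearest-neighbour distance is unusually large.

    Hypothetical helper, not part of RepeatedHighDim.
    expr   : array of shape (n_cells, n_features), e.g. normalized expression
    labels : array of shape (n_cells,), cluster label of each cell
    q      : per-cluster quantile used as threshold (assumed default)
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    extreme = np.zeros(labels.shape[0], dtype=bool)

    for cluster in np.unique(labels):
        idx = np.flatnonzero(labels == cluster)
        if idx.size < 2:
            # A single cell has no neighbour within its cluster; leave it unflagged.
            continue
        x = expr[idx]
        # Pairwise Euclidean distances within the cluster (assumed metric).
        diff = x[:, None, :] - x[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Exclude each cell's zero distance to itself.
        np.fill_diagonal(dist, np.inf)
        nn_dist = dist.min(axis=1)
        # Cells above the cluster-specific quantile are marked as extreme.
        threshold = np.quantile(nn_dist, q)
        extreme[idx] = nn_dist > threshold

    return extreme
```

The dense pairwise distance array grows quadratically with cluster size, so for large clusters a nearest-neighbour search structure would be the more practical choice. Cells flagged by such a function could then be removed before marker gene detection to compare results with and without trimming.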