Abstract
Single-cell RNA sequencing (scRNA-Seq) data from complex human tissues are frequently contaminated with blood cells introduced during sample preparation. They may also comprise cells of different genetic makeups. These issues demand rigorous preprocessing and filtering prior to downstream functional analysis, to avoid conclusions biased by cell types that are not of interest. To this end, we propose a new computational framework, Originator, which classifies single cells by genetic origin and separates immune cells arising from blood contamination from the expected tissue-resident immune cells. We demonstrate the accuracy of Originator at separating blood-derived from tissue-resident immune cells, as well as cells of different genetic origins, using a variety of artificially mixed and real datasets. We show that blood contamination is prevalent in scRNA-Seq data from many tissue types. Using scRNA-Seq data from pancreatic cancer and placentas as examples, we highlight the significant consequences of failing to adjust for these confounders.