Abstract
MOTIVATION: The first step when working with DNA sequence data from human-derived microbiomes is to remove human contamination, for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data that contain human sequences cannot easily be processed further or published. Second, human contamination may cause problems in downstream analyses such as metagenomic binning or genome assembly. For large-scale metagenomics projects, fast and accurate removal of human contamination is therefore critical.

RESULTS: We introduce Cleanifier, a fast and memory-frugal alignment-free tool for detecting and removing human contamination based on gapped k-mers (spaced seeds). Cleanifier uses a pangenome index of known human gapped k-mers; users can also create and use alternative references. Reads are classified and filtered according to their gapped k-mer content. Cleanifier supports two filtering modes: one that queries all gapped k-mers of a read and one that queries only a sample of them. A comparison with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method while achieving comparable accuracy. When a probabilistic Cuckoo filter is used to store the complete k-mer set, Cleanifier has memory requirements similar to those of methods that use a sampled minimizer index. At the same time, Cleanifier is more flexible, because it can apply different sampling methods to the same index.

AVAILABILITY AND IMPLEMENTATION: Cleanifier is available via GitLab (https://gitlab.com/rahmannlab/cleanifier), PyPI (https://pypi.org/project/cleanifier/), and Bioconda (https://anaconda.org/bioconda/cleanifier). The pre-computed human pangenome index is available at Zenodo (https://doi.org/10.5281/zenodo.15639519).
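
To make the classification idea concrete, the following is a minimal Python sketch of filtering reads by their gapped k-mer content. It is not Cleanifier's implementation: the mask `##_#_##`, the function names, the 0.5 threshold, and the "every `step`-th window" sampling scheme are all assumptions chosen for illustration. A plain Python set stands in for the index (no Cuckoo filter), and reverse complements are ignored.

```python
from typing import Iterable, Iterator, Set

# Hypothetical mask: '#' marks a significant position, '_' an ignored one.
MASK = "##_#_##"


def gapped_kmers(seq: str, mask: str = MASK, step: int = 1) -> Iterator[str]:
    """Yield the gapped k-mer of every `step`-th window of `seq`."""
    positions = [i for i, c in enumerate(mask) if c == "#"]
    for start in range(0, len(seq) - len(mask) + 1, step):
        window = seq[start:start + len(mask)]
        yield "".join(window[i] for i in positions)


def build_index(references: Iterable[str], mask: str = MASK) -> Set[str]:
    """Collect all gapped k-mers of the reference sequences in an exact set."""
    index: Set[str] = set()
    for ref in references:
        index.update(gapped_kmers(ref, mask))
    return index


def human_fraction(read: str, index: Set[str],
                   mask: str = MASK, step: int = 1) -> float:
    """Fraction of queried gapped k-mers of `read` that occur in `index`.

    step=1 queries all gapped k-mers; step>1 queries only a sample
    (an assumed sampling scheme, for illustration only).
    """
    kmers = list(gapped_kmers(read, mask, step))
    if not kmers:
        return 0.0  # read shorter than the mask: nothing to query
    hits = sum(k in index for k in kmers)
    return hits / len(kmers)


def is_human(read: str, index: Set[str], threshold: float = 0.5,
             mask: str = MASK, step: int = 1) -> bool:
    """Classify a read as human if enough of its gapped k-mers are known."""
    return human_fraction(read, index, mask, step) >= threshold


if __name__ == "__main__":
    index = build_index(["ACGTACGTTAGCCGATAGGCTA"])  # toy "human" reference
    reads = ["GTACGTTAGCCG", "TTTTTTTTTTTT"]
    kept = [r for r in reads if not is_human(r, index, step=3)]
    print(kept)
```

In this sketch, both filtering modes run against the same index and differ only in how many windows of the read are queried, which mirrors the flexibility described above: the sampling strategy is a query-time choice rather than a property of the index.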