Abstract
MOTIVATION: Ancestry information is essential to large cohort studies but is often unavailable or inconsistently measured. For studies involving genome sequencing, existing ancestry prediction methods are constrained by computational demands and complex input requirements. Efficient, scalable approaches are needed to infer ancestry directly from sequencing data while maintaining accuracy and reproducibility. RESULTS: We present ntRoot, a computationally lightweight method for inferring human super-population-level ancestry from whole-genome assemblies or from short- or long-read sequencing data. Using a reference-guided, alignment-free single nucleotide variant detection framework, ntRoot employs a succinct Bloom filter to efficiently query diverse genomic inputs against a variant reference panel with known genotypes and ancestry. Demonstrated on over 600 human genome samples, including complete genomes, draft assemblies, and 280 independently generated samples, ntRoot accurately predicts geographic labels and shows high concordance with traditional methods such as ADMIXTURE (R² = 0.9567) when estimating ancestry fractions. Analyses complete within 30 min for assemblies and 75 min for 30-fold coverage sequencing data, using 13-68 GB of memory. ntRoot provides both global and local ancestry inference, delivering high-resolution predictions across genomic loci. This approach fills a critical gap in cohort studies by enabling rapid, resource-efficient, and accurate ancestry inference at scale, advancing ancestry characterization in genomic research. AVAILABILITY: ntRoot is freely available on GitHub (https://github.com/bcgsc/ntroot).
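
To make the general idea of alignment-free, Bloom-filter-based variant querying concrete, the following is a minimal Python sketch. It is not ntRoot's implementation: the `BloomFilter` class, the `allele_supported` and `score_ancestry` helpers, the toy sequences, the allele frequencies, and the naive log-likelihood scoring are all assumptions made for illustration. The sketch loads the k-mers of a sample into a Bloom filter, checks whether the k-mers spanning each panel variant's reference or alternate allele are present, and tallies a per-population score from hypothetical alternate-allele frequencies.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter (illustrative only; not ntRoot's data structure)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allele_supported(bf, left, allele, right, k):
    """True if every k-mer overlapping the allele is in the filter.

    Assumes k >= 2 and flanks of at least k - 1 bases.
    """
    context = left[-(k - 1):] + allele + right[:k - 1]
    return all(km in bf for km in kmers(context, k))


def score_ancestry(bf, panel, k):
    """Naive per-population log-likelihood (an assumed scoring scheme)."""
    scores = {}
    for v in panel:
        has_alt = allele_supported(bf, v["left"], v["alt"], v["right"], k)
        has_ref = allele_supported(bf, v["left"], v["ref"], v["right"], k)
        for pop, freq in v["alt_freq"].items():
            freq = min(max(freq, 1e-3), 1 - 1e-3)  # avoid log(0)
            scores.setdefault(pop, 0.0)
            if has_alt:
                scores[pop] += math.log(freq)
            if has_ref:
                scores[pop] += math.log(1 - freq)
    return scores


# Hypothetical two-variant panel with made-up frequencies.
K = 5
panel = [
    {"left": "ACGTACGA", "ref": "T", "alt": "C", "right": "GGATCCAT",
     "alt_freq": {"AFR": 0.9, "EUR": 0.1}},
    {"left": "TTGACCTA", "ref": "A", "alt": "G", "right": "CATGGTCA",
     "alt_freq": {"AFR": 0.05, "EUR": 0.8}},
]

# Toy sample: alternate allele at variant 1, reference allele at variant 2.
reads = ["ACGTACGA" + "C" + "GGATCCAT", "TTGACCTA" + "A" + "CATGGTCA"]
sample_filter = BloomFilter()
for read in reads:
    for km in kmers(read, K):
        sample_filter.add(km)

scores = score_ancestry(sample_filter, panel, K)
best_population = max(scores, key=scores.get)
print(best_population, scores)
```

Because a Bloom filter stores only hashed membership bits, the memory cost does not depend on k-mer length, which is one reason such structures suit large sequencing inputs. The trade-off is a small false-positive rate, so requiring every k-mer that spans an allele to be present (as `allele_supported` does here) is one simple way to reduce spurious calls.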