Abstract
SUMMARY: Duplicate marking is a critical preprocessing step in gene sequence analysis that flags redundant reads arising from polymerase chain reaction (PCR) amplification and sequencing artifacts. Although Picard MarkDuplicates is widely regarded as the gold-standard tool, its single-threaded implementation and reliance on global sorting incur substantial computational and resource overhead, limiting its efficiency on large-scale datasets. Here, we introduce FastDup, a high-performance, scalable duplicate-marking tool built on a speculation-and-test mechanism. FastDup achieves up to a 20× throughput speedup with 32 threads and guarantees output identical to that of Picard MarkDuplicates.

AVAILABILITY AND IMPLEMENTATION: FastDup is a C++ program available under the MIT license from Zenodo (https://zenodo.org/records/15727829), Bioconda (https://anaconda.org/bioconda/fastdup), and GitHub (https://github.com/zzhofict/FastDup.git).