Abstract
SUMMARY: We present RabbitSketch, a highly optimized library of sketching algorithms, including MinHash, OrderMinHash, and HyperLogLog, that exploits the power of modern multi-core CPUs. It achieves speedups of 2.30× to 49.55× over existing implementations and offers flexible, easy-to-use interfaces for both Python and C++. As a result, the similarity analysis of 455 GB of genomic data can be completed in only 5 minutes using RabbitSketch with merely 20 lines of Python code. As a case study, we enhanced RabbitTClust by integrating RabbitSketch's Kssd algorithm, resulting in a 1.54× speedup with no loss in accuracy.

AVAILABILITY AND IMPLEMENTATION: RabbitSketch is available at https://github.com/RabbitBio/RabbitSketch with an archived version at Zenodo: https://doi.org/10.5281/zenodo.14903962. Detailed API documentation is available at https://rabbitsketch.readthedocs.io/en/latest.