genCRC32: collision-free CRC32-based hashing of DNA sequences

Abstract

MOTIVATION: Efficient, collision-free hashing of DNA sequences is essential for accuracy and performance in bioinformatics applications such as genome assembly, sequence alignment, and metagenomic classification. Traditional hashing methods often produce collisions, which reduce the precision and/or performance of downstream analyses. Hash functions that guarantee collision-free mappings for DNA sequences are therefore highly advantageous, particularly for k-mers up to length 16, the largest k for which all 4^k possible k-mers still fit into the 2^32 values of a 32-bit hash (4^16 = 2^32). In this study, we evaluate genCRC32 as a hashing primitive, reporting collision behavior, bucket balance, sensitivity to single-base changes, and speed to inform its potential use in downstream tools. Evaluation within specific software tools is outside the scope of this paper and is planned as future work.

RESULTS: We present genCRC32, a hashing method that combines a straightforward preprocessing step (gen32) with CRC32 hashing, and we identify eight CRC32 polynomials that ensure collision-free hashing for all DNA k-mers up to 16 nucleotides in length. In extensive empirical evaluations, genCRC32 produced zero collisions for these k-mers, achieving a one-to-one mapping without auxiliary data structures. Benchmark tests confirmed that the preprocessing step adds minimal computational overhead, keeping hashing performance comparable to established methods such as MurmurHash3 and xxHash32.

AVAILABILITY AND IMPLEMENTATION: The source code for genCRC32 is publicly available at: https://github.com/berybox/genCRC32. The implementation is provided in Go (version 1.24) and uses only standard libraries, ensuring portability and ease of integration into existing bioinformatics workflows.
