MELO-ED: learning locality-sensitive multi-embeddings for edit distance



Abstract

Edit distance is a fundamental metric for quantifying similarity between biological sequences, but its high computational cost limits large-scale applications. Previously, we proposed learned locality-sensitive bucketing (LSB) functions that achieved superior performance and efficiency compared to classical seeding methods for identifying similar and dissimilar sequences. However, each component of an LSB function is represented as a one-dimensional hash value that can only be compared for identity, which constrains the method's accuracy. Here, we introduce MELO-ED, a multi-embedding locality-sensitive framework that upgrades each hash value to a higher-dimensional embedding capable of efficiently approximating edit distance. MELO-ED employs a Siamese convolutional neural architecture that learns complementary embeddings capturing both global sequence context and fine-grained edit operations. By integrating locality-sensitive bucketing with multi-embedding representations, MELO-ED achieves near-perfect accuracy without increasing the number of buckets required. Leveraging mature indexing methods in the embedding space, MELO-ED transforms time-consuming edit distance computations into scalable similarity searches across massive genomic databases. Comprehensive evaluations on simulated DNA sequences and real barcode datasets demonstrate that MELO-ED outperforms both traditional alignment-free methods and contemporary machine learning approaches, including our previously developed learned LSB functions. These results establish MELO-ED as a state-of-the-art framework for fast and accurate classification of similar and dissimilar sequences. MELO-ED is available at https://github.com/Shao-Group/MELO-ED.
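To make the contrast in the abstract concrete, the sketch below shows three things: the classic dynamic program for edit distance (whose quadratic cost is the bottleneck the abstract refers to), an identity-only comparison of bucket hash values, and a distance-threshold comparison of embeddings. This is an illustration only and is not taken from the MELO-ED repository. The helper names `share_bucket` and `any_embedding_close`, the threshold `delta`, the use of Euclidean distance, and the component-wise matching rule are all assumptions made for the example; the actual decision rule in MELO-ED may differ.

```python
import math
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program.

    Runs in O(len(a) * len(b)) time, which is what makes all-pairs
    comparison expensive on large sequence collections.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def share_bucket(h1: Sequence[int], h2: Sequence[int]) -> bool:
    """Hypothetical LSB-style test: hash values can only be checked for
    identity, so two sequences match only if they share an exact bucket."""
    return bool(set(h1) & set(h2))


def any_embedding_close(e1: List[Sequence[float]],
                        e2: List[Sequence[float]],
                        delta: float) -> bool:
    """Hypothetical multi-embedding test (an assumption, not MELO-ED's rule):
    compare the k-th embedding of one sequence with the k-th embedding of
    the other and accept if any pair lies within distance delta."""
    return any(math.dist(u, v) <= delta for u, v in zip(e1, e2))


if __name__ == "__main__":
    print(edit_distance("ACGTACGT", "ACGTTCGT"))  # one substitution
    print(share_bucket([11, 42, 7], [3, 42, 90]))
    print(any_embedding_close([[0.0, 0.0], [5.0, 5.0]],
                              [[3.0, 4.0], [5.0, 5.5]], delta=1.0))
```

The point of the contrast is that an identity check yields only a yes/no answer per bucket, whereas an embedding comparison yields a graded distance that can be thresholded and served by standard nearest-neighbor indexes.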
