MOTIVATION: DNA sequence alignment, an important genomic task, involves assigning short DNA reads to their most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing, followed by an efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results in tasks involving classification of short DNA sequences. Strong performance on sequence classification does not, however, guarantee success at sequence alignment, where a genome-wide search is needed to align every read, a significantly longer-range task by comparison.

RESULTS: We bridge this gap by developing an "Embed-Search-Align" (ESA) framework, in which a novel Reference-Free DNA Embedding (RDE) Transformer model generates vector embeddings of reads and of reference fragments in a shared vector space; the read-fragment distance is then used as a surrogate for sequence similarity. ESA introduces: (i) a contrastive loss for self-supervised training of DNA sequence representations, facilitating rich reference-free, sequence-level embeddings, and (ii) a DNA vector store to enable search across fragments on a global scale. RDE is 99% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid), rivaling conventional algorithmic sequence alignment methods such as Bowtie and BWA-Mem. RDE far exceeds the performance of six recent DNA-Transformer model baselines, such as Nucleotide Transformer and Hyena-DNA, and shows task transfer across chromosomes and species.

AVAILABILITY AND IMPLEMENTATION: Please see https://anonymous.4open.science/r/dna2vec-7E4E/readme.md.
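
To make the embed-and-search idea concrete, here is a minimal sketch under loud assumptions. It is not the authors' implementation: the RDE Transformer is replaced by a toy k-mer count embedding, the DNA vector store by a brute-force NumPy matrix, and the function names (`embed`, `build_store`, `search`) and parameters (`K`, `frag_len`, `stride`) are hypothetical. It only illustrates how reads and reference fragments can share one vector space, with similarity in that space standing in for sequence similarity.

```python
import itertools
import numpy as np

K = 3  # k-mer length for the toy embedding (assumption, not from the paper)
KMERS = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}


def embed(seq: str) -> np.ndarray:
    """Toy stand-in for a learned encoder: L2-normalised k-mer count vector."""
    vec = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K])  # k-mers with other symbols are skipped
        if idx is not None:
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_store(reference: str, frag_len: int, stride: int):
    """Cut the reference into overlapping fragments and embed each one."""
    starts = list(range(0, max(len(reference) - frag_len, 0) + 1, stride))
    vectors = np.stack([embed(reference[s:s + frag_len]) for s in starts])
    return starts, vectors


def search(read: str, starts, vectors, top_k: int = 1):
    """Return (start, cosine similarity) for the top_k closest fragments."""
    scores = vectors @ embed(read)  # vectors are unit-norm, so this is cosine
    best = np.argsort(-scores)[:top_k]
    return [(starts[i], float(scores[i])) for i in best]


# Synthetic reference and a read copied from position 500 (illustration only).
rng = np.random.default_rng(0)
reference = "".join(rng.choice(list("ACGT"), size=2000))
starts, vectors = build_store(reference, frag_len=100, stride=50)
read = reference[500:600]
hits = search(read, starts, vectors, top_k=3)
print(hits)
```

Because the read here coincides exactly with the fragment starting at position 500, that fragment should rank first with a cosine similarity of 1. A learned embedding and an approximate nearest-neighbour index would take the place of `embed` and the brute-force matrix product at genome scale.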
Embed-Search-Align: DNA sequence alignment using Transformer models.
Authors: Holur Pavan, Enevoldsen K C, Rajesh Shreyas, Mboning Lajoyce, Georgiou Thalia, Bouchard Louis-S, Pellegrini Matteo, Roychowdhury Vwani
| Journal: | Bioinformatics | Impact factor: | 5.400 |
| Date: | 2025 | Volume and pages: | 2025 Mar 4; 41(3):btaf041 |
| doi: | 10.1093/bioinformatics/btaf041 | | |
