Abstract
MOTIVATION: Multiple sequence alignment (MSA) is a fundamental operation in bioinformatics, yet existing MSA tools struggle to keep up with the speed and volume of incoming data. Their runtimes and memory requirements become untenable when processing large numbers of long input sequences, and they fail to fully harness the parallelism provided by modern CPUs and GPUs.

RESULTS: We present Tall and Wide Alignments at High Throughput (TWILIGHT), a novel MSA tool optimized for speed, accuracy, scalability, and memory efficiency, with both CPU and GPU support. TWILIGHT incorporates parallelization and memory-efficiency strategies that enable it to build ultralarge alignments at high speed, even on memory-constrained devices. On challenging datasets, TWILIGHT outperformed all other tools in speed and accuracy. It scaled beyond the limits of existing tools, aligning 1 million RNASim sequences within 30 min while using <16 GB of memory. TWILIGHT is the first tool to align over 8 million publicly available SARS-CoV-2 sequences, setting a new standard for large-scale genomic alignment and data analysis.

AVAILABILITY AND IMPLEMENTATION: TWILIGHT's code is freely available under the MIT license at https://github.com/TurakhiaLab/TWILIGHT. The test datasets and experimental results, including our alignment of 8 million SARS-CoV-2 sequences, are available at https://zenodo.org/records/14722035.