Abstract
Multiple sequence alignment is a crucial step in many bioinformatic and computational biology analyses, from protein structure and function prediction to the inference of phylogenetic trees. However, alignments of highly divergent sequences often contain a significant amount of noise. This noise is normally reduced by filtering the alignment to remove columns that are poorly aligned or offer minimal useful information, either automatically using various software tools or through manual inspection. Manual approaches are labor-intensive and less reproducible, but they can draw on the researcher's specialist knowledge rather than relying on filtering criteria that might not be adequate for each alignment. AliFilter bridges these two approaches to alignment curation, using machine learning to automate manual alignment filtering. AliFilter uses a supervised learning approach to create a model from a small number of manually annotated alignments, then applies this model to reproduce the manual annotation on different datasets. Users can employ the program with a default model or create customized models for individual datasets or filtering criteria. AliFilter accurately reproduces the results of manual annotation (98% accuracy) while being resilient to mistakes in the training data. In a typical phylogenomic workflow, AliFilter reduced the runtime by 35% and produced results that were almost identical to those obtained from the full alignment, unlike the other filtering tools we tested. AliFilter is free and open-source software; it is written in C# and distributed under a GPLv3 license from https://github.com/arklumpus/AliFilter, where both the source code and standalone executables for Windows, macOS, and Linux are available.