Abstract
Molecular epidemiology and the reconstruction of HIV-1 transmission networks can provide insights into transmission dynamics and inform public health strategies. Long HIV sequences, such as near full-length (nFL) genomes, can improve the accuracy of phylogenetic inference. However, relatively short pol sequences are still widely used for inferring molecular HIV clusters. Whether a mix of long and short HIV-1 sequences can improve phylogenetic inference of molecular HIV clusters remains unknown. We propose a flexible approach, called T-shaped alignments, that incorporates both nFL HIV-1 genomes and partial pol sequences, and we investigate whether this approach improves phylogenetic reconstruction of molecular clusters. Under the assumption that clustering from 100% long sequences is the most accurate, we obtained 1196 subtype B nFL HIV-1 sequences from the Los Alamos National Laboratory Database and a single-study subset, varied the proportion of long and short sequences in our T-shaped alignments, systematically masked all non-pol regions with missing characters in proportional increments, and compared tree similarity and cluster inference among datasets. With the full dataset, we found that when more than 50% of the available sequences were nFL, the T-shaped alignment gradually yielded results closer to those of the 100% nFL alignment, with more and larger clusters identified. Below the 50% threshold, however, accuracy did not increase. Stringent bootstrap thresholds narrowed the gaps in cluster accuracy but also decreased the number of clusters found and the mean cluster size. For the subset dataset, we found that introducing nFL sequences into the T-shaped alignment improved clustering accuracy either beyond a 30% threshold or immediately, depending on the bootstrap threshold chosen. Our new approach and results suggest that using T-shaped alignments to mix HIV-1 sequences of different lengths can improve phylogenetic and clustering accuracy, with the required proportion of nFL sequences depending on the goals of the analysis.
The T-shaped alignment provides a straightforward method for utilizing all available sequences to improve phylogenetic analysis.
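The masking step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes an alignment already loaded as a mapping from sequence name to aligned string, and the function name `make_t_shaped`, the column coordinates `pol_start` and `pol_end`, the random choice of which sequences stay long, and the use of `-` as the missing character are all hypothetical choices made for illustration.

```python
import random


def make_t_shaped(alignment, pol_start, pol_end, long_fraction,
                  missing="-", seed=0):
    """Build a T-shaped alignment from a full-length alignment.

    A fraction `long_fraction` of sequences is kept at full length; the
    rest keep only alignment columns [pol_start, pol_end) and have every
    other column replaced by `missing`. All names and coordinates here
    are illustrative assumptions, not values from the study.
    """
    if not 0.0 <= long_fraction <= 1.0:
        raise ValueError("long_fraction must be between 0 and 1")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same aligned length")
    width = lengths.pop() if lengths else 0
    if not 0 <= pol_start <= pol_end <= width:
        raise ValueError("pol coordinates must lie within the alignment")

    names = sorted(alignment)
    n_long = round(long_fraction * len(names))
    # Seeded sampler so that the same subset is chosen on every run.
    rng = random.Random(seed)
    long_names = set(rng.sample(names, n_long))

    result = {}
    for name in names:
        seq = alignment[name]
        if name in long_names:
            # Long sequence: left untouched.
            result[name] = seq
        else:
            # Short sequence: mask everything outside the pol columns.
            result[name] = (missing * pol_start
                            + seq[pol_start:pol_end]
                            + missing * (width - pol_end))
    return result


# Toy usage: four 10-column sequences, "pol" at columns 3..6, half kept long.
toy = {
    "seq1": "ACGTACGTAC",
    "seq2": "TTTTTTTTTT",
    "seq3": "GGGGGGGGGG",
    "seq4": "CCCCCCCCCC",
}
mixed = make_t_shaped(toy, pol_start=3, pol_end=7, long_fraction=0.5)
for name, seq in mixed.items():
    print(name, seq)
```

Calling the function repeatedly with increasing `long_fraction` values would produce a series of alignments with a growing share of long sequences, comparable in spirit to the proportional increments described above. Whether a given tree-building program treats the chosen missing character as missing data should be checked before use.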