Abstract
Ancestral sequence reconstruction is typically performed using homogeneous evolutionary models, which assume that the same substitution propensities affect all sites and lineages. These assumptions are routinely violated: heterogeneous structural and functional constraints favor different amino acids at different sites, and these constraints often change among lineages as epistatic substitutions accrue at other sites. To evaluate how violations of the homogeneity assumption affect ancestral sequence reconstruction under realistic conditions, we developed site-specific substitution models and parameterized them using data from deep mutational scanning experiments on three protein families; we then used these models to perform ancestral sequence reconstruction on the empirical alignments and on alignments simulated under heterogeneous conditions derived from the experiments. Extensive among-site and -lineage heterogeneity is present in these datasets, but the sequences reconstructed from empirical alignments are almost identical when heterogeneous or homogeneous models are used for ancestral sequence reconstruction. Using models fit to deep mutational scanning data from distantly related proteins in which mutational effects are very different also has a minimal impact on ancestral sequence reconstruction. The rare differences occur primarily where phylogenetic signal is weak: at fast-evolving sites and nodes connected by long branches. When ancestral sequence reconstruction is performed on simulated data, errors in the reconstructed sequences become more likely as branch lengths increase, but incorporating heterogeneity into the model does not improve accuracy. These data establish that ancestral sequence reconstruction is robust to unincorporated realistic forms of evolutionary heterogeneity, because the primary determinant of ancestral sequence reconstruction is phylogenetic signal, not the substitution model.
The best way to improve accuracy is therefore not to develop more elaborate models but to apply ancestral sequence reconstruction to densely sampled alignments that maximize phylogenetic signal at the nodes of interest.