Abstract
Computational prediction of neoantigens that elicit T cell responses is central to the development of personalized cancer vaccines. Many current predictors represent MHC class I alleles using selected subsets of residues, known as pseudosequences, yet the extent to which pseudosequence choice and encoding strategy influence predictive performance has not been systematically examined. This study addresses that gap by evaluating a range of MHC representations within the BigMHC EL framework. We compared pseudosequence definitions based on protein structure and evolutionary diversity, a randomly sampled pseudosequence baseline, pseudosequences of varying lengths, embeddings generated using the ESM-2 protein language model, and a graph-based annotation embedding derived from allele groupings. Models using biologically informed pseudosequences consistently outperformed the random baseline, underscoring the importance of residue selection. Pseudosequences defined by protein structure and by evolutionary diversity performed similarly, likely reflecting overlap in residues near the peptide-binding groove. We also found that pseudosequences of approximately 30 to 35 residues yielded the strongest performance. Lastly, ESM-2 and annotation-based embeddings outperformed the random baseline but did not surpass curated pseudosequences under the current setup. Together, these findings indicate that curated pseudosequences remain efficient representations of MHC alleles in neoantigen prediction models, while alternative encodings can approximate but not yet replace residue-level sequence information.
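To make the notion of a pseudosequence concrete, the following minimal Python sketch shows how a curated pseudosequence and a randomly sampled baseline of the same length might be derived from one allele sequence and then one-hot encoded. It is illustrative only and is not the BigMHC EL implementation: the function names, the toy sequence, and the position lists are hypothetical and do not correspond to real HLA residues or to the positions evaluated in this study.

```python
import random

# The 20 standard amino acids, used as the one-hot alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_pseudosequence(sequence, positions):
    """Return the residues found at the given 0-based positions, in order."""
    return "".join(sequence[i] for i in positions)


def random_positions(sequence_length, k, seed=0):
    """Sample k distinct positions uniformly at random (random baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(sequence_length), k))


def one_hot(pseudosequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in pseudosequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


# Toy 40-residue sequence (NOT a real MHC allele sequence).
toy_sequence = AMINO_ACIDS * 2

# Hypothetical "curated" positions, standing in for groove-proximal residues.
curated_positions = [1, 4, 8, 15, 22, 30]

curated = extract_pseudosequence(toy_sequence, curated_positions)
baseline = extract_pseudosequence(
    toy_sequence, random_positions(len(toy_sequence), len(curated_positions))
)

encoded = one_hot(curated)
print(curated, baseline, len(encoded), len(encoded[0]))
```

In this sketch the curated and random pseudosequences differ only in which positions are selected, so any gap in downstream performance between them can be attributed to residue selection rather than to input length or encoding.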