Abstract
BACKGROUND: Current helminth genome annotations contain thousands of predicted fusion genes, encoding novel protein domain architectures that are unique to these species. To investigate whether these predictions represent true fusion genes, we analyzed 20,313 two-domain proteins annotated in current helminth genomes, of which 10,297 are apparently unique to helminths, and used RNA-seq data from 20 helminth species to examine their plausibility. For comparison, we analyzed a set of 400 high-confidence, evolutionarily conserved domain fusions that are present in both helminth and non-helminth species.

RESULTS: Our analysis suggests that, in contrast to genuine fusion genes, the majority of helminth-specific fusion genes in the 20 species investigated are likely gene prediction artifacts, based on several criteria: (1) they show a lack of correlation between the RNA-seq-derived expression levels of the first and second “fused” domains, as well as of the interdomain region; (2) they have significantly longer interdomain regions; (3) they show significantly less continuity of coverage in their interdomain regions, consistent with breakpoints in RNA-seq coverage; and (4) they are generally not supported in de novo transcriptome assemblies.

CONCLUSIONS: Proteins containing novel domain combinations have been included in widely used sequence and protein databases, including WormBase ParaSite and InterPro, but the analyses presented here suggest that many helminth-specific domain fusion proteins are erroneously annotated. These findings emphasize the importance of using RNA-seq data to validate gene predictions in helminth genomes, especially those with unique structures not observed in other species. Given the increasing need to identify helminth-specific proteins as therapeutic targets, accurate proteome annotation in widely used genomic databases is essential.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-026-12589-y.
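Criteria (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the half-open coordinate convention, the `min_depth` threshold, and the toy coverage vectors are all assumptions made for illustration. It assumes per-base RNA-seq read depth is already available for each predicted transcript as a list of integers.

```python
from math import sqrt

def mean_coverage(coverage, start, end):
    """Mean per-base read depth over the half-open interval [start, end)."""
    window = coverage[start:end]
    return sum(window) / len(window) if window else 0.0

def interdomain_continuity(coverage, start, end, min_depth=1):
    """Fraction of interdomain bases covered by at least min_depth reads.

    min_depth is a hypothetical threshold, not a value taken from the paper.
    """
    window = coverage[start:end]
    if not window:
        return 1.0  # adjacent domains: no gap in which coverage could break
    return sum(1 for depth in window if depth >= min_depth) / len(window)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def summarize(coverage, domain1, domain2):
    """Summarize coverage for one two-domain prediction.

    domain1 and domain2 are (start, end) half-open intervals on the transcript,
    with domain1 upstream of domain2.
    """
    linker = (domain1[1], domain2[0])
    return {
        "domain1": mean_coverage(coverage, *domain1),
        "linker": mean_coverage(coverage, *linker),
        "domain2": mean_coverage(coverage, *domain2),
        "continuity": interdomain_continuity(coverage, *linker),
    }

# Toy data: one evenly covered transcript and one with a coverage breakpoint.
conserved_like = [10] * 5 + [9] * 3 + [11] * 5
artifact_like = [10] * 5 + [0] * 3 + [2] * 5

conserved_summary = summarize(conserved_like, (0, 5), (8, 13))
artifact_summary = summarize(artifact_like, (0, 5), (8, 13))
```

Across a set of predictions, `pearson` could then be applied to the lists of `domain1` and `domain2` values. Under the reasoning in the abstract, a low correlation together with low `continuity` would point toward two separate genes merged by the gene predictor rather than a single fused transcript.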