Abstract
BACKGROUND: Accurate annotation of open reading frames (ORFs) is fundamental to understanding gene function and post-transcriptional regulation. A critical but often overlooked aspect of transcriptome annotation is the selection of authentic translation start sites. Many genome annotation pipelines assign the longest possible ORF to each alternatively spliced transcript, using internal methionine codons as putative start sites. However, this computational approach ignores the biological reality that ribosomes select start codons based on sequence context, not ORF length. METHODS: Here, we demonstrate that this practice leads to systematic misannotation of nonsense-mediated decay (NMD) targets in the Arabidopsis thaliana Araport11 reference transcriptome. Using the TranSuite software to identify authentic start codons, we reanalyzed transcriptomic data from an NMD-deficient mutant. RESULTS: We found that correct ORF annotation more than doubles the number of identifiable NMD targets carrying a premature termination codon followed by downstream exon junctions, from 203 to 426 transcripts. Furthermore, we show that incorrect ORF annotations can lead to erroneous protein structure predictions, potentially introducing computational artefacts into protein databases. CONCLUSIONS: Our findings underscore the importance of biologically informed ORF annotation for accurate assessment of post-transcriptional regulation and proteome prediction, with implications for all eukaryotic genome annotation projects.