Abstract
New protein-coding genes can arise de novo from ancestrally noncoding regions when open reading frames (ORFs) outside existing genes are exposed to selection via pervasive translation. These ORFs are usually born short, and their elongation is considered a key step in de novo gene birth. However, mechanisms of de novo gene elongation remain understudied. Here, we reconstructed the evolutionary history of c16riboseqorf143 (orf143), one of the longest unannotated human translated ORFs. orf143 is encoded in the oncogenic long noncoding RNA (lncRNA) VPS9D1-AS1 (MYU). Evolutionary reconstruction showed that orf143 originated de novo in the common ancestor of simians through a point mutation that introduced a start codon. A subsequent stop-codon-disrupting mutation extended translation into a downstream region that, in humans, includes multiple binding sites and a tandem repeat (TR) array previously reported to mediate the oncogenicity of VPS9D1-AS1. The TR array has expanded frequently in human populations. The overlaps between orf143 and the oncogenic binding sites in VPS9D1-AS1 raise the possibility that orf143 translation is tumor-suppressive, since ribosomes may compete with oncogenic binding events via steric hindrance. Consistent with this possibility, we observed an enrichment of somatic mutations in the ORF regions of VPS9D1-AS1 in cancer patients and a positive association between in-ORF mutations and adenomas/adenocarcinomas. Some of these mutations truncated the ORF, potentially impairing ribosome binding to VPS9D1-AS1. This study reveals stop-codon disruption and TR array expansion as the mechanisms of orf143 elongation and illustrates how elongation of de novo ORFs may provide a selective advantage.
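
The two sequence events described above (a point mutation creating a start codon, then a mutation disrupting the stop codon) can be illustrated with a minimal sketch. The sequences below are invented toy examples, not the real orf143 or VPS9D1-AS1 sequence, and the helper names `orf_length_codons` and `mutate` are hypothetical; the sketch only assumes the standard stop codons TAA, TAG, and TGA.

```python
# Toy illustration of ORF birth and elongation. All sequences are made up.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length_codons(seq, start_codon="ATG"):
    """Count sense codons from the first start codon up to the first
    in-frame stop codon (or the end of the sequence). Returns 0 if the
    sequence contains no start codon."""
    start = seq.find(start_codon)
    if start == -1:
        return 0
    n = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return n
        n += 1
    return n


def mutate(seq, pos, base):
    """Return a copy of seq with a single-base substitution at index pos."""
    return seq[:pos] + base + seq[pos + 1:]


# Hypothetical ancestral noncoding sequence: no ATG anywhere.
ancestral = "ACGGCTAAATAGGGCCCTTAA"

# Step 1: point mutation ACG -> ATG introduces a start codon (ORF birth).
born = mutate(ancestral, 1, "T")

# Step 2: mutation TAG -> TGG disrupts the first stop codon, so translation
# reads through to the next in-frame stop (ORF elongation).
elongated = mutate(born, 10, "G")

for name, s in [("ancestral", ancestral), ("born", born), ("elongated", elongated)]:
    print(name, orf_length_codons(s))
```

In this toy case the ORF grows from zero to three and then six sense codons. Tandem repeat expansion, the second elongation mechanism named above, would lengthen the ORF by inserting in-frame repeat units rather than by removing a stop codon, and is not modeled here.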