Abstract
Codon optimization is widely used to improve heterologous gene expression in Escherichia coli. However, many existing methods focus primarily on maximizing the codon adaptation index (CAI) and neglect the broader biological context. In this study, we present ColiFormer, a transformer-based codon optimization framework fine-tuned on 3,676 high-expression E. coli genes curated from the NCBI database. Built on the CodonTransformer BigBird architecture, ColiFormer combines self-attention with an augmented Lagrangian optimization method to balance multiple biological objectives simultaneously, including CAI, GC content, tRNA adaptation index (tAI), RNA stability, and minimization of negative cis-regulatory elements. In in silico evaluations on 37,053 native E. coli genes and 80 recombinant protein targets commonly used in industrial studies, ColiFormer showed significant improvements in CAI and tAI values, kept GC content within biologically optimal ranges, and reduced inhibitory cis-regulatory motifs compared with established codon optimization approaches, while maintaining competitive runtime performance. These results are computational predictions derived from standard in silico metrics; future experimental work is needed to validate them in vivo. ColiFormer has been released as an open-source tool alongside the benchmark datasets used in this study.
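For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how CAI and GC content are conventionally computed. It is not ColiFormer's implementation. The codon table `CODON_TO_AA`, the counts in `REFERENCE_COUNTS`, and all function names are hypothetical and chosen for illustration; a real reference table would cover all 61 sense codons and be derived from a curated set of highly expressed genes.

```python
import math

# Hypothetical toy reference data (illustrative numbers, not real E. coli usage).
CODON_TO_AA = {
    "CTG": "L", "CTC": "L",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
}
REFERENCE_COUNTS = {
    "CTG": 100, "CTC": 25,
    "AAA": 80, "AAG": 20,
    "GAA": 60, "GAG": 30,
}


def relative_adaptiveness(counts, codon_to_aa):
    """Return w(codon) = count(codon) / max count among its synonymous codons.

    Assumes every listed codon has a nonzero count, so every weight is > 0.
    """
    best = {}
    for codon, n in counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[codon_to_aa[codon]] for codon, n in counts.items()}


def cai(seq, weights):
    """Geometric mean of w over the codons of seq that appear in weights."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


weights = relative_adaptiveness(REFERENCE_COUNTS, CODON_TO_AA)
preferred = "CTGAAAGAA"  # Leu-Lys-Glu using the most frequent codon each time
rare = "CTCAAGGAG"       # same peptide using the less frequent codons
print(cai(preferred, weights), gc_content(preferred))
print(cai(rare, weights), gc_content(rare))
```

In this toy example both sequences encode the same peptide, yet the one built from the less frequent codons scores a lower CAI and has a higher GC content. A multi-objective optimizer has to weigh exactly this kind of interaction between metrics.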