GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies

Abstract

Genome foundation models hold transformative potential for precision medicine, drug discovery, and understanding complex biological systems. However, existing models are often inefficient, constrained by suboptimal tokenization and architectural design, and biased toward reference genomes, limiting their representation of low-abundance, uncultured microbes in the rare biosphere. To address these challenges, we developed GenomeOcean, a 4-billion-parameter generative genome foundation model trained on over 600 Gbp of high-quality contigs derived from 220 TB of metagenomic datasets collected from diverse habitats across Earth's ecosystems. A key innovation of GenomeOcean is training directly on large-scale co-assemblies of metagenomic samples, enabling enhanced representation of rare microbial species and improving generalizability beyond genome-centric approaches. We implemented a byte-pair encoding (BPE) tokenization strategy for genome sequence generation, alongside architectural optimizations, achieving up to 150× faster sequence generation while maintaining high biological fidelity. GenomeOcean excels in representing microbial species and generating protein-coding genes constrained by evolutionary principles. Additionally, its fine-tuned model demonstrates the ability to discover novel biosynthetic gene clusters (BGCs) in natural genomes and perform zero-shot synthesis of biochemically plausible, complete BGCs. GenomeOcean sets a new benchmark for metagenomic research, natural product discovery, and synthetic biology, offering a robust foundation for advancing these fields.
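To make the byte-pair encoding idea mentioned in the abstract concrete, here is a minimal toy sketch of BPE applied to DNA strings. It is not GenomeOcean's tokenizer: the abstract does not state the vocabulary size, training corpus for the tokenizer, or implementation, so the function names (`learn_bpe`, `apply_merge`, `encode`), the tie-breaking rule, and the example sequences are all illustrative assumptions. The sketch only shows the general mechanism, in which frequent adjacent token pairs are merged so that one token can span several nucleotides.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings (toy version)."""
    # Start from single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; break ties lexicographically (an assumption
        # made here only to keep the toy example deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def encode(seq, merges):
    """Tokenize `seq` by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    merges = learn_bpe(["ATGATGATG"], num_merges=2)
    print(merges)
    print(encode("ATGATGC", merges))
```

Because each merged token covers more than one base, the same sequence is represented with fewer tokens than a single-nucleotide encoding would need, which in turn means fewer steps for an autoregressive model to generate it.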
