Abstract
MOTIVATION: Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells' representations with patterns learned from pretraining on vast cell atlases.

RESULTS: This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how recent developments in foundation models for language, such as interpretability probes and in-context reasoning, can guide efforts to construct cell atlases and train virtual cell models.

AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/williamgilpin/celltoken.