Beyond performance: how design choices shape chemical language models

Abstract

Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We assess models by fine-tuning them on downstream tasks and by examining the structure of their latent spaces with probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, the structure and interpretability of the internal representations differed substantially, indicating that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. These results provide guidance for the development of more chemically grounded and interpretable CLMs.
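
To make the notion of atomwise tokenization concrete, the following is a minimal sketch of one common way to split a SMILES string into atom-level tokens. The abstract does not specify the tokenizer used in the study, so the regular expression and the function name `atomwise_tokenize` are assumptions for illustration only, following a pattern widely used in the SMILES modeling literature.

```python
import re

# Assumed pattern (not taken from the paper): bracket atoms, two-letter
# halogens, organic-subset atoms, aromatic atoms, bonds, branches, and
# ring-closure digits each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atomwise_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if some characters are not covered by the pattern,
    so that unsupported input is not silently dropped.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin: every atom, bond, branch, and ring-closure symbol is one token.
    print(atomwise_tokenize("CC(=O)Oc1ccccc1C(=O)O"))
    # 'Cl' is kept as one token instead of being split into 'C' and 'l'.
    print(atomwise_tokenize("CCl"))
    # A bracket atom such as ammonium stays intact.
    print(atomwise_tokenize("[NH4+]"))
```

In this sketch each token corresponds to a single chemically meaningful symbol, in contrast to subword schemes that may merge or split atoms arbitrarily. That one-to-one correspondence is one plausible reason atom-level tokens are easier to interpret, although the abstract itself does not state the mechanism.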
