Abstract
The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity and a lack of chemical information within individual tokens. To address these limitations while maintaining its simplicity, we propose a molecular representation method through the hybridization of standard SMILES tokens with Atom-In-SMILES (AIS) tokens, which incorporate local chemical environment information into a single token. This hybrid representation, termed SMI + AIS, combines SMILES and AIS tokens, allowing AIS tokens to differentiate chemical elements based on their chemical context without introducing additional tokens for less frequent elements. Using the SMI + AIS representation, we evaluated its performance by comparing the predefined metric of generated structures in chemical structure generation based on latent space optimization. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility in the enhancement of machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation provides a more effective and informative approach to encapsulate chemical context and presents potential for performance enhancement in other machine learning tasks in chemistry.