Abstract
Phenotype-driven drug discovery leverages cellular responses to guide the design of therapeutic molecules. Recent advances in transcriptomics have produced extensive datasets describing how gene expression changes in response to chemical stimuli, presenting an opportunity to condition molecular generation directly on specific cellular phenotypes. However, translating transcriptomic perturbations into chemical structures remains challenging because of the complexity of gene interactions and the constraints of chemical feasibility. We developed GGIFragGPT, a novel generative model that integrates transcriptomic perturbation profiles with biologically informed gene-gene interaction embeddings to guide fragment-based molecular generation. The model uses an autoregressive transformer architecture to assemble chemically valid fragments sequentially, and its cross-attention mechanisms highlight the biologically relevant genes that guide the generation process. Comparative analysis confirmed that the proposed approach yields chemically feasible, novel, and diverse molecules. By leveraging transcriptomic profiles, GGIFragGPT generated compounds aligned with the biological context suggested by the transcriptomic data, as supported by a gene-level interpretability analysis that identified key target genes. Case studies demonstrated the model's ability to produce structurally plausible inhibitors, exemplified by targeted molecule generation against CDK7. This work demonstrates the potential of integrating biological insights into chemical generation and offers a promising approach for phenotype-driven therapeutic discovery.