Abstract
BACKGROUND: Chimeric antigen receptor (CAR)-T cell therapy has shown remarkable success in treating hematological malignancies. However, several challenges remain, including limited efficacy against solid tumors, T cell exhaustion, and poor T cell persistence, which have restricted its clinical benefit across indications. Sequence optimization of CAR constructs offers a promising strategy for enhancing the therapeutic efficacy of CAR-T cells. Recent advances in machine learning, particularly protein language models (PLMs), have enabled the prediction of mutational effects from sequence representations. However, applying PLMs to CARs is challenging because CARs are artificial proteins and no comprehensive CAR sequence database exists.

RESULTS: We developed a computational framework for predicting CAR-T cell activity by fine-tuning ESM-2 on CAR sequences generated through sequence augmentation. The CAR sequences were constructed by in silico recombination of the homologous domains of CARs, enabling task-specific adaptation of the model. To evaluate prediction performance, we experimentally assessed the cytotoxicity of CAR-T cells expressing mutated CAR variants and compared the results with model predictions. Fine-tuning ESM-2 significantly improved the prediction of CAR-T cell activity. Furthermore, training parameters such as sequence diversity, number of training steps, and model size substantially influenced prediction performance.

CONCLUSIONS: Our findings highlight the potential of combining sequence augmentation with fine-tuning of PLMs to advance data-driven CAR-T cell design.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-026-06401-7.
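
The in silico recombination step described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the domain names, the domain order, the toy five-residue strings, and the `recombine` helper are all hypothetical placeholders chosen for illustration. The idea is only that each CAR is treated as an ordered series of domains, and that swapping in homologs of each domain yields a combinatorial set of full-length sequences that could serve as fine-tuning data.

```python
from itertools import product
import random

# Hypothetical homolog pools. These short strings are placeholders,
# not real CAR domain sequences.
DOMAIN_POOLS = {
    "hinge": ["AAAPT", "AAVPT", "SAAPT"],
    "transmembrane": ["LLVIV", "LIVIV"],
    "costimulatory": ["KRGRK", "KRSRK", "RRGRK"],
    "signaling": ["RVKFS", "RVRFS"],
}
# Assumed N-to-C terminal order of the domains in the construct.
DOMAIN_ORDER = ["hinge", "transmembrane", "costimulatory", "signaling"]


def recombine(pools, order, prefix="", max_sequences=None, seed=0):
    """Build full-length sequences from every combination of domain homologs.

    `prefix` is an optional fixed N-terminal segment (for example an
    antigen-binding region) prepended unchanged to every sequence.
    If `max_sequences` is given, a reproducible random subset is returned.
    """
    combos = product(*(pools[name] for name in order))
    sequences = sorted({prefix + "".join(parts) for parts in combos})
    if max_sequences is not None and max_sequences < len(sequences):
        sequences = random.Random(seed).sample(sequences, max_sequences)
    return sequences


augmented = recombine(DOMAIN_POOLS, DOMAIN_ORDER)
subset = recombine(DOMAIN_POOLS, DOMAIN_ORDER, max_sequences=10)
print(len(augmented), len(subset))
```

The number of generated sequences is the product of the pool sizes, so the size of each homolog pool is one simple way to control the sequence diversity of the training set in a sketch like this.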