Abstract
Current diagnostic approaches for colorectal cancer (CRC) have limitations, and improved methods are urgently needed. Systematic characterization of the human T cell receptor (TCR) repertoire, coupled with advanced computational methods, offers an opportunity to develop more accurate and less invasive diagnostic strategies for this major malignancy. The objective of this work is to establish a TCR repertoire-based diagnostic model for CRC using machine learning and to identify the features that contribute most to accurate diagnosis. In a comparative analysis of several machine learning algorithms, the Transformer model performed best. The trained model achieved an area under the receiver operating characteristic curve (AUC) of 0.973 for predicting disease status on the internal test set and an AUC of 0.814 on the independent test set. Notably, we identified a panel of 50 TCR CDR3 sequences that yielded a diagnostic AUC of 0.869. Together, these results indicate that this TCR repertoire-based model has potential for clinical application in CRC diagnosis and treatment response monitoring. Similar diagnostic models could be established for other immune-related diseases from disease-specific TCR repertoire data.