Abstract
INTRODUCTION: Crohn's disease (CD) and ulcerative colitis (UC) have overlapping symptoms but differ in pathology and treatment. Distinguishing between them currently requires invasive procedures such as colonoscopy and histopathology. Fecal proteins, which are stable and in direct contact with the site of inflammation, offer a noninvasive alternative. This study used high-throughput data-independent acquisition mass spectrometry and machine learning to develop an accurate biomarker signature from complex stool samples.

METHODS: Stool samples from 69 patients with active disease were analyzed. Analysis of the stool proteome identified and quantified approximately 1,250 proteins. The samples were divided into training and testing groups. After data processing, several feature selection algorithms were applied to the training group to determine which proteins differed significantly between the CD and UC groups. In addition, 6 machine learning algorithms were evaluated to identify the best-performing classifier.

RESULTS: Sixteen proteins were selected by the feature selection algorithms, and 6 models were trained on them. The Naive Bayes model was selected based on the performance metrics of each algorithm on the training data set. For performance validation, the final predictive model was applied to 16 blind prospective samples as the test data set. Notably, the model achieved an area under the curve of 0.96 on both the training and test data sets, highlighting its robustness and stability.

DISCUSSION: This study demonstrates the potential of combining multiple stool protein biomarkers, measured by high-throughput data-independent acquisition mass spectrometry and analyzed with machine learning tools, to develop a predictive model that efficiently distinguishes CD from UC.
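
For readers less familiar with the reported metric, the area under the curve can be read as the probability that a randomly chosen CD sample receives a higher classifier score than a randomly chosen UC sample. The sketch below is a minimal illustration of that definition, not the study's analysis code; the function name `roc_auc`, the label coding (1 = CD, 0 = UC), and the scores are hypothetical and are not study data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for illustration only (assumed coding: 1 = CD, 0 = UC).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(auc)  # 15 of 16 pairs are ranked correctly -> 0.9375
```

An area under the curve of 0.96, as reported for this model, would under this reading mean that about 96% of such CD/UC pairs are ranked correctly.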