Abstract
Gastrointestinal cancers (GCs) are particularly malignant because they tend to progress silently until advanced stages, owing to the absence of early specific symptoms. The heterogeneous properties of GCs demand highly precise and sensitive diagnostic techniques that can integrate in-depth structural information with surface-level data to distinguish cancer severity grades, thereby significantly lowering mortality rates. Deep learning (DL) algorithms play a crucial role in classifying these GC severity grades. However, such algorithms lack interpretability and exhibit high false-alarm rates when detecting the underlying intricate relationships among medical images. Additionally, existing systems lack language-level transparency, preventing them from generating user-oriented narrative diagnostic explanations consistent with medical standards. To address these challenges, this study introduces a novel explainable large language model (X-LLM)-based DL framework that overcomes the drawbacks of existing DL algorithms. The proposed framework employs ensemble transformer architectures that fuse clinical features by integrating endoscopy and computed tomography (CT) images, enhancing performance in detecting the different severity grades of GCs. The proposed system comprises five components: (1) heterogeneous image collection; (2) image pre-processing; (3) ensemble networks; (4) interpretability analysis; and (5) a user-interaction module. Extensive experiments are conducted on two datasets: Kvasir endoscopy images and TCIA CT (TCGA-STAD) scans. The severity annotations for both datasets were provided by experienced medical doctors, including endoscopists. Several evaluation metrics, including accuracy, precision, and recall, are measured and benchmarked against other learning networks.
The experimental findings demonstrate that the proposed framework outperforms existing models, achieving accuracy, precision, recall, and F1-score values of 0.99, 0.997, 0.99, and 0.99, respectively. Furthermore, different LLMs, including GPT-4, GPT-3.5, LLaMA, and Gemini, are integrated, and their interaction modes are analyzed using SHAP measurements. The proposed framework thus shows strong potential for enhancing diagnostic performance while supporting user-interactive clinical treatment outcomes.