Abstract
Colorectal cancer (CRC) is one of the leading causes of cancer-related death and poses a significant threat to global health. Although deep learning models have been used to diagnose CRC accurately, they still struggle to capture the global correlations of spatial features, especially for complex textures and morphologically similar tissue patterns. To overcome these challenges, we propose the Residual Next Transformer Network (RNTNet), a hybrid model that combines a residual network with a transformer encoder featuring mixed attention. RNTNet extracts spatial features from CRC images using ResNeXt, which employs grouped convolutions and skip connections to capture fine-grained features. A vision transformer (ViT) encoder containing a mixed-attention block is then designed with multiscale feature aggregation to apply global attention to these spatial features. In addition, a Grad-CAM module visualizes the model's decision process, supporting oncologists with a second opinion. Two publicly available datasets, Kather and KvasirV1, were used for model training and testing. The model achieved classification accuracies of 97.96% and 98.20% on the KvasirV1 and Kather datasets, respectively. Model efficacy is further confirmed by ROC curve analysis, with AUC values of 0.9895 on KvasirV1 and 0.9937 on Kather. Comparative experiments show that RNTNet improves accuracy and efficiency over state-of-the-art methods.