Abstract
INTRODUCTION: In recent years, oral cancer has become one of the most common malignant tumors. Early diagnosis of oral cancer from histopathological images can reduce the severity of the disease and lower the death rate. Several Deep Learning algorithms are available in the literature, ranging from Convolutional Neural Network (CNN) models to Vision Transformer (ViT) models, to distinguish normal tissue from tissue with Oral Squamous Cell Carcinoma. METHODS: This study proposes a Convolutional Block Attention-aided Transformer Network (CBA-TransNet) that combines ResNet50 with ViT. ResNet50 acts as a backbone that extracts local features from histopathological images through convolutional layers, while ViT captures global context and long-range dependencies through its self-attention mechanism. To further enhance the extracted features, the Convolutional Block Attention Mechanism (CBAM) is applied after the Feed-Forward Network layer in the ViT encoder block. The CBAM combines channel and spatial attention, which helps the transformer focus more effectively on the relevant regions of the images. RESULTS: A publicly accessible dataset of 5192 histopathological images is used for the experiments. Experimental results and analysis show that the proposed hybrid model achieved an accuracy of 98.97% when compared against the pre-trained ResNet50 baseline, ViT, CNN, and state-of-the-art approaches. DISCUSSION: Experimental outcomes show that the proposed CBA-TransNet flexibly combines convolutional and transformer-based architectures with attention mechanisms such as CBAM to extract both local and global features. This hybrid architecture allows the model to concentrate on diagnostically significant areas, resulting in better classification.
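
To make the METHODS description concrete, the following is a minimal NumPy sketch of how channel and spatial attention could be applied to the token output of a Feed-Forward Network layer. It is an illustration under stated assumptions, not the CBA-TransNet implementation: the function names, the random weights, the reduction ratio of 16, the 7x7 spatial kernel, the 14x14 patch grid, and the reshaping of patch tokens (without a class token) into a feature map are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Rescale each channel of x (C, H, W) with a gate in (0, 1).

    A shared two-layer MLP (w1, w2) is applied to the average-pooled and
    max-pooled channel descriptors, and the two results are summed.
    """
    avg = x.mean(axis=(1, 2))                      # (C,)
    mx = x.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU bottleneck
    gate = sigmoid(mlp(avg) + mlp(mx))             # (C,)
    return x * gate[:, None, None]

def spatial_attention(x, kernel):
    """Rescale each spatial position of x (C, H, W) with a gate in (0, 1).

    The channel-wise mean and max maps are stacked and passed through a
    single zero-padded k x k convolution (kernel has shape (2, k, k)).
    """
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    attn = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            attn[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(attn)[None, :, :]

def cbam_on_tokens(tokens, grid_hw, w1, w2, kernel):
    """Apply channel then spatial attention to patch tokens of shape (N, D).

    Assumption: tokens are patch tokens only, ordered row-major on an
    H x W grid, so they can be viewed as a (D, H, W) feature map.
    """
    h, w = grid_hw
    n, d = tokens.shape
    assert n == h * w, "token count must match the patch grid"
    fmap = tokens.T.reshape(d, h, w)
    fmap = channel_attention(fmap, w1, w2)
    fmap = spatial_attention(fmap, kernel)
    return fmap.reshape(d, n).T                    # back to (N, D)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D, H, W, r = 64, 14, 14, 16
tokens = rng.standard_normal((H * W, D))           # stand-in for FFN output
w1 = rng.standard_normal((D // r, D)) * 0.1
w2 = rng.standard_normal((D, D // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
out = cbam_on_tokens(tokens, (H, W), w1, w2, kernel)
```

Because both gates are sigmoid outputs, the sketch only attenuates token features and never amplifies them. Whether the refined tokens are added back through a residual connection is a further design choice that the abstract leaves open.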