Abstract
OBJECTIVE: Cervical cancer is the fourth most prevalent malignancy in women worldwide. The Pap smear test, a widely used and effective medical procedure, enables early detection and screening of cervical cancer. Smear analysis is performed by expert physicians and is laborious, time-consuming, and prone to error. The main objective of this work is to classify healthy and malignant cervical cells using our proposed CASPNet model.

METHODS: This study proposes a novel technique that combines feature extraction by multi-head self-attention blocks and cross-stage partial (CSP) network blocks with feature fusion by a spatial pyramid pooling fast (SPPF) layer to identify healthy and cancerous cervical cells. A comprehensive ablation study shows that the proposed CASPNet architecture achieves superior test accuracy with comparable computational efficiency.

RESULTS: Our proposed model, CASPNet (Contextual Attention and Spatial Pooling Network), achieved an accuracy of 97.07% on the widely used SIPAKMED benchmark dataset.

CONCLUSION: Compared with CNN models, the self-attention blocks of vision transformers are generally better at capturing global contextual information within an input image and achieve higher accuracy in classification tasks. The architecture's CSP blocks balance efficiency and speed, making them well suited for local feature extraction in classification tasks with constrained resources. Because objects in cervical-cell images vary in size, the SPPF layer captures contextual information at different receptive fields and performs multi-scale feature extraction. By incorporating all of these benefits, the proposed CASPNet model interprets the images more precisely and reliably.
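The multi-scale behaviour of the SPPF component described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the function names, the kernel size k=5, and the chain of three stride-1 max pools follow the common SPPF design, in which each successive pooling stage enlarges the effective receptive field before the outputs are concatenated.

```python
import numpy as np

def max_pool2d(x, k=5):
    """Stride-1 max pooling with 'same' padding, so spatial size is preserved.
    Illustrative helper, not from the paper."""
    pad = k // 2
    xp = np.pad(x, pad, mode="constant", constant_values=-np.inf)
    h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].max()
    return out

def sppf(feature_map, k=5):
    """SPPF sketch: chain three stride-1 max pools and stack the results
    with the input. Chaining k=5 pools gives effective receptive fields of
    5, 9, and 13, so the stacked output mixes features at multiple scales,
    which is how SPPF captures context for objects of varying sizes."""
    p1 = max_pool2d(feature_map, k)
    p2 = max_pool2d(p1, k)        # effective 9x9 receptive field
    p3 = max_pool2d(p2, k)        # effective 13x13 receptive field
    return np.stack([feature_map, p1, p2, p3], axis=0)
```

In a real network the stacked maps would be channel-wise concatenated feature tensors followed by a convolution; here a single-channel map is used to keep the multi-scale idea visible.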