Abstract
The objective of this study is to develop a highly accurate and interpretable deep learning (DL) model for multi-class fruit classification using convolutional and transformer architectures. Classification performance is strengthened by ensuring that the adopted technique is explainable and interpretable. The dataset was obtained from Kaggle and contains images of five fruit classes: banana, grape, lemon, mango, and strawberry. The data were split 70:15:15 into training, validation, and test sets, and all images were pre-processed to ensure consistent size and quality. Four pretrained models, namely RegNetY-B3-GE, DarkNet53-SCSE, BEiT, and PVTv2, were considered for performance assessment. We propose CoAT-AgriLite, a lightweight hybrid (convolution plus attention-based) model for fruit classification that captures both local features and global context. Transfer learning and data augmentation techniques were employed during training to improve performance. To ensure the interpretability of model decisions, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to highlight the discriminative regions of the input images that drive model predictions. Among all evaluated models, the proposed model achieved the highest classification accuracy of 99.37% on the test set. Comparative results demonstrate that the proposed model outperforms the other pretrained models in precision, recall, and F1-score, confirming its robustness and effectiveness in real-world agricultural classification tasks. The experimental findings validate that the proposed model not only achieves superior classification accuracy but also provides interpretability through Grad-CAM visualizations. This hybrid framework offers a promising solution for intelligent and transparent fruit classification, with potential applications in precision agriculture and automated sorting systems.