Abstract
Background/Objectives: Glaucoma is an eye disease, usually associated with increased intraocular pressure (IOP), that damages the optic nerve head (ONH) and leads to vision loss and irreversible blindness. It is the second leading cause of blindness worldwide, and the number of people affected is rising each year and is expected to reach 111.8 million by 2040. This trend is alarming given the shortage of ophthalmology specialists relative to the population. This study proposes an explainable end-to-end pipeline for automated glaucoma diagnosis from fundus images and evaluates the performance of Vision Transformers (ViTs) relative to traditional CNN-based models. Methods: The proposed system uses three datasets: REFUGE, ORIGA, and G1020. The pipeline first localizes the optic disc with a YOLOv11 object detector. The optic disc (OD) and optic cup (OC) are then segmented using U-Net with ResNet50, VGG16, and MobileNetV2 backbones, as well as MaskFormer with a Swin-Base backbone. Finally, glaucoma is classified based on the vertical cup-to-disc ratio (vCDR) derived from the segmented OD and OC. Results: MaskFormer outperforms all other models on every segmentation metric, achieving IoU OD, IoU OC, DSC OD, and DSC OC scores of 88.29%, 91.09%, 93.83%, and 93.71%, respectively. For classification, it achieves an accuracy of 84.03% and an F1-score of 84.56%. Conclusions: By relying on the vCDR, an interpretable clinical feature, the proposed framework enhances transparency and aligns well with the principles of explainable AI, thus offering a trustworthy solution for glaucoma screening. Our findings show that Vision Transformers offer a promising approach for achieving high segmentation performance with explainable, biomarker-driven diagnosis.
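To make the final stage of the pipeline concrete, the following minimal sketch shows how a vCDR could be computed from binary OD and OC masks, together with the IoU and DSC metrics reported above. It is an illustration under stated assumptions, not the study's implementation: the function names, the nested-list mask representation, and in particular the 0.6 decision threshold are hypothetical placeholders, since the abstract does not state the cutoff actually used.

```python
from __future__ import annotations

from typing import Sequence

# Assumption: a mask is a grid of 0/1 values, one inner sequence per image row.
Mask = Sequence[Sequence[int]]


def vertical_extent(mask: Mask) -> int:
    """Height in pixels spanned by the foreground (first to last non-empty row)."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def vcdr(cup: Mask, disc: Mask) -> float:
    """Vertical cup-to-disc ratio: cup height divided by disc height."""
    disc_height = vertical_extent(disc)
    if disc_height == 0:
        raise ValueError("disc mask is empty; vCDR is undefined")
    return vertical_extent(cup) / disc_height


def is_glaucoma_suspect(cup: Mask, disc: Mask, threshold: float = 0.6) -> bool:
    """Flag an eye when vCDR meets the threshold.

    The default of 0.6 is an illustrative placeholder, not a value from the study.
    """
    return vcdr(cup, disc) >= threshold


def iou_and_dice(pred: Mask, truth: Mask) -> tuple[float, float]:
    """Intersection over union and Dice similarity coefficient for two masks."""
    pred_area = sum(bool(p) for row in pred for p in row)
    truth_area = sum(bool(t) for row in truth for t in row)
    inter = sum(
        bool(p) and bool(t)
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
    )
    union = pred_area + truth_area - inter
    if union == 0:
        # Both masks are empty: treat as perfect agreement.
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + truth_area)
```

Because the decision depends on a single geometric ratio, a clinician can check each prediction directly against the segmented cup and disc boundaries, which is the sense in which the classification is interpretable.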