Abstract
Small bowel Crohn's disease (SBCD) and other gastrointestinal (GI) lesions are difficult to detect because their mucosal abnormalities are subtle and can closely resemble other disorders. Although the Kvasir and Esophageal Endoscopy datasets offer high-quality images of various parts of the GI tract, manual interpretation by clinicians remains labor-intensive, time-consuming, and prone to subjective variability. To address this, we propose a generalizable ensemble deep learning framework for GI lesion detection that identifies pathological patterns such as ulcers, polyps, and esophagitis, which visually resemble SBCD-associated abnormalities. A classical convolutional neural network (CNN) extracts shallow, high-dimensional features and may therefore miss the edges and complex patterns of GI lesions. To mitigate this limitation, the framework combines the strengths of EfficientNetB5, MobileNetV2, and multi-head self-attention (MHSA). EfficientNetB5 extracts detailed hierarchical features that help distinguish fine-grained mucosal structures, while MobileNetV2 enhances spatial representation with low computational overhead. The MHSA module further improves the model's ability to capture global correlations among spatial features. We evaluated the model on two publicly available DBE datasets and compared the results with four state-of-the-art methods. Our model achieved classification accuracies of 99.25% and 98.86% on the Kvasir and Kaither datasets, respectively.
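To make the data flow of the ensemble concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this work. It assumes that the two backbone feature maps are fused by channel-wise concatenation, flattened into spatial tokens, passed through MHSA with a residual connection, and then average-pooled into a linear classifier. Random arrays stand in for the EfficientNetB5 and MobileNetV2 outputs, and the names `fuse_and_classify`, `C_A`, `C_B`, and `NUM_CLASSES`, as well as all shapes and the number of heads, are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    # tokens: (n, d) array, one row per spatial position.
    n, d = tokens.shape
    head_dim = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    # Scaled dot-product scores between every pair of spatial positions.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)           # (num_heads, n, n)
    context = weights @ v                        # (num_heads, n, head_dim)
    merged = context.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights


def fuse_and_classify(feat_a, feat_b, params, num_heads=4):
    # Assumed fusion: concatenate the two feature maps along channels.
    fused = np.concatenate([feat_a, feat_b], axis=-1)   # (H, W, C_A + C_B)
    tokens = fused.reshape(-1, fused.shape[-1])         # (H * W, D)
    attended, weights = multi_head_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"], params["w_o"],
        num_heads,
    )
    # Residual connection, then global average pooling over positions.
    pooled = (tokens + attended).mean(axis=0)
    return softmax(pooled @ params["w_cls"]), weights


# Hypothetical sizes; real backbone outputs have far more channels.
rng = np.random.default_rng(0)
H = W = 7
C_A, C_B, NUM_CLASSES = 48, 16, 8
D = C_A + C_B

feat_a = rng.standard_normal((H, W, C_A))   # stand-in for EfficientNetB5 features
feat_b = rng.standard_normal((H, W, C_B))   # stand-in for MobileNetV2 features

params = {
    name: rng.standard_normal((D, D)) / np.sqrt(D)
    for name in ("w_q", "w_k", "w_v", "w_o")
}
params["w_cls"] = rng.standard_normal((D, NUM_CLASSES)) / np.sqrt(D)

probs, attn = fuse_and_classify(feat_a, feat_b, params)
print(probs.shape, attn.shape)
```

In this sketch each attention head produces a full position-by-position weight matrix, which is what lets every spatial location draw on features from every other location. In a trained model the projection and classifier weights would be learned rather than randomly initialized.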