Abstract
Accurate mosquito species recognition underpins vector surveillance and targeted control, yet field imagery suffers from device variability, clutter, and fine-grained inter-species similarity. Deep learning offers a scalable path, but prior systems often lack calibrated probabilities and degrade under domain shift. We propose a dual-head architecture that aligns an 8-class head with an auxiliary 8-to-2 Aedes head to sharpen difficult boundaries. Two heterogeneous branches, a CNN backbone and a Swin-T Transformer, capture complementary local texture and long-range morphology; we fuse them via calibrated logit stacking followed by temperature scaling. With test-time augmentation (TTA, 5–8 views), the pipeline jointly reduces variance, corrects bias, and improves posterior calibration. We evaluate on AMID v1 (8-class, whole-body images) and on an unseen, phone-style Aedes corpus held out strictly for testing to probe cross-dataset generalization. Against strong baselines (ResNet-50, EfficientNet-V2-S) and naïve probability averaging, our method attains near-ceiling in-domain performance (Macro-F1 ≈ 99.3–99.4%, Micro-Accuracy ≈ 99.4–99.5%) and exceeds 99% accuracy on the unseen Aedes set, while markedly improving calibration (ECE ≈ 0.6%). Confidence intervals (Wilson, 95%) and paired tests (McNemar) indicate that these gains, though incremental, are consistent and statistically reliable. Ablations show that five TTA views with calibrated stacking capture most of the benefit at practical latency. By coupling boundary-aware supervision with calibration-aware fusion, the proposed approach delivers predictions that are both more accurate and more trustworthy, stabilizing operating thresholds across sites and capture pipelines; the Swin-T branch contributes robustness to pose and device variation through its windowed self-attention.
This provides a deployment-ready baseline for public-health monitoring and a principled foundation for future extensions to open-set recognition, domain-aware calibration, and multimodal sensing.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1038/s41598-026-35453-1.
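
For readers unfamiliar with the calibration terms used in the abstract, the following is a minimal sketch of temperature scaling and expected calibration error (ECE). It is not the authors' implementation: the function names, the grid search over temperatures (gradient-based fitting on validation NLL is more common), the 10 equal-width bins, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = []
    for row, y in zip(logit_rows, labels):
        p = max(softmax(row, temperature)[y], 1e-12)  # guard against log(0)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Choose the temperature minimizing NLL (simple grid search; an assumption)."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]  # 0.25 up to 5.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

def expected_calibration_error(prob_rows, labels, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - confidence| over confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, summed confidence, correct
    for probs, y in zip(prob_rows, labels):
        conf = max(probs)
        pred = probs.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(pred == y)
    n = len(labels)
    return sum(abs(correct / cnt - conf_sum / cnt) * cnt / n
               for cnt, conf_sum, correct in bins if cnt)

# Hypothetical, deliberately overconfident toy logits (3 classes, 10 samples):
# every sample predicts class 0 with high confidence, but only 7 of 10 are class 0.
toy_logits = [[6.0, 0.0, 0.0] for _ in range(10)]
toy_labels = [0] * 7 + [1] * 3

# In practice the temperature is fitted on held-out validation data and ECE is
# reported on a separate test set; the toy reuses one set for brevity.
T = fit_temperature(toy_logits, toy_labels)
ece_before = expected_calibration_error([softmax(r) for r in toy_logits], toy_labels)
ece_after = expected_calibration_error([softmax(r, T) for r in toy_logits], toy_labels)
print(f"T={T:.2f}  ECE before={ece_before:.3f}  ECE after={ece_after:.3f}")
```

On overconfident logits the fitted temperature exceeds 1, which softens the probabilities without changing the predicted class, so accuracy is unaffected while ECE drops.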
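
The statistical checks named in the abstract, Wilson 95% intervals and McNemar's paired test, can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say whether the exact or the chi-squared form of McNemar's test was used (the exact binomial form is assumed here), and the counts in the usage lines are hypothetical, not results from the paper.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b = samples model A classified correctly and model B incorrectly,
    c = samples model B classified correctly and model A incorrectly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts for illustration: 995 correct out of 1000 test images.
low, high = wilson_interval(995, 1000)
print(f"accuracy 0.995, Wilson 95% CI = ({low:.4f}, {high:.4f})")

# Hypothetical discordant pairs: the baseline alone is right on 2 images,
# the proposed model alone is right on 12.
p_value = mcnemar_exact(2, 12)
print(f"McNemar exact p = {p_value:.4f}")
```

The Wilson interval is preferred over the normal approximation near 0% or 100% accuracy because it stays inside [0, 1] and remains informative when errors are rare, which is the regime the abstract reports.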