Abstract
Background: Accurate and timely diagnosis of skin lesions, including Melanoma (MEL), Basal Cell Carcinoma (BCC), Squamous Cell Carcinoma (SCC), Actinic Keratosis (ACK), Seborrheic Keratosis (SEK), and Nevus (NEV), is often hindered in clinical practice by severe class imbalance and high morphological similarity among pathologies. Although multimodal learning has shown potential for addressing these issues, existing approaches often fail to account for predictive uncertainty or to integrate heterogeneous clinical metadata effectively. This study therefore proposes DermaCalibra, a robust and explainable multimodal framework optimized for small-scale, imbalanced clinical datasets.
Methods: The proposed framework integrates three modules. First, the Attention-Based Multimodal Channel Recalibration (AMCR) module uses Bayesian uncertainty estimation via Monte Carlo dropout to adjust focal loss weights, prioritizing features from underrepresented classes. Second, the Metadata-Driven Dynamic Feature Modulation and Cross-Attention Fusion (MDFM-CAF) module, designed to resolve inter-class visual ambiguity, dynamically rescales dermoscopic feature maps using non-linear transformations of the clinical context. Lastly, the Gradient Feature Attribution (GFA) module provides pixel-level diagnostic heatmaps and metadata importance scores.
Results: Evaluated on the PAD-UFES-20 dataset, DermaCalibra achieves a balanced accuracy (BACC) of 84.2%, outperforming current state-of-the-art (SOTA) methods by 3.6%, and a Macro Area Under the Receiver Operating Characteristic Curve (Macro AUC) of 96.9%. Extensive external validation on unseen hospital and synthetic datasets confirms robust generalizability across diverse clinical settings without the need for retraining.
Conclusions: DermaCalibra effectively bridges the gap between deep learning complexity and clinical intuition through uncertainty-aware reasoning and transparent interpretability. The framework provides a reliable and scalable computer-aided diagnostic tool for early skin lesion detection, particularly in resource-limited clinical environments.
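Note on the BACC metric: balanced accuracy is conventionally computed as the mean of per-class recall, which is why it is preferred over plain accuracy on imbalanced data. The sketch below illustrates that standard definition only. The function names and the toy labels are hypothetical and are not taken from the DermaCalibra evaluation code or from PAD-UFES-20.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced toy labels (assumption for illustration only):
# 8 nevus samples, 2 melanoma samples, and a model that always predicts NEV.
y_true = ["NEV"] * 8 + ["MEL"] * 2
y_pred = ["NEV"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this toy case a classifier that ignores the minority class still reaches 80% accuracy, while its balanced accuracy falls to 50%, because the recall of 0 on MEL is averaged with the recall of 1 on NEV.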