Abstract
BACKGROUND: Dermoscopic lesion segmentation is crucial for dermatology, yet existing methods struggle to integrate global context with local detail under the efficiency constraints of clinical use. PURPOSE: We aim to develop a lightweight model that captures long-range spatial dependencies while preserving the fine-grained boundary details of dermoscopic lesions. The method targets a favorable accuracy-efficiency trade-off, so that segmentation performance improves without sacrificing the potential for practical clinical deployment. METHODS: We propose HCViT-Net, a lightweight hybrid model with an encoder-decoder architecture. It incorporates a multi-scale query transformer (MSQFormer) into each stage of its convolutional encoder to efficiently capture global, multi-scale context. In addition, a wavelet-guided attention refinement module (WARM) is placed on the highest-resolution skip connection to selectively enhance high-frequency boundary details and to bridge the semantic gap between the encoder and decoder. RESULTS: On the ISIC 2017 and ISIC 2018 datasets, the model achieved a mean intersection-over-union (mIoU) of 87.76% and 87.45%, respectively. With only 5.76M parameters and 7.51 GFLOPs, it performs competitively with existing methods at a substantially lower computational cost. CONCLUSIONS: HCViT-Net achieves a favorable accuracy-efficiency trade-off. It improves segmentation accuracy with a low computational footprint, showing strong potential for practical deployment in dermatology workflows.
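For readers unfamiliar with the metric reported in RESULTS, the following is a minimal sketch of how mIoU can be computed for a single pair of binary masks. It assumes that the mean is taken over the background and lesion classes; the abstract does not state the averaging convention used, so this is an illustration of the metric rather than the authors' evaluation code. The function name `binary_miou` and the nested-list mask format are hypothetical choices made for this example.

```python
def binary_miou(pred, target):
    """Mean IoU over background (0) and lesion (1) for two binary masks.

    `pred` and `target` are non-empty nested lists (rows of 0/1 values)
    with the same shape. Averaging over the two classes is an assumption.
    """
    # Flatten both masks into aligned (prediction, ground truth) pairs.
    pairs = [
        (bool(p), bool(t))
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]

    ious = []
    for cls in (False, True):  # False = background, True = lesion
        intersection = sum(1 for p, t in pairs if p == cls and t == cls)
        union = sum(1 for p, t in pairs if p == cls or t == cls)
        if union:  # skip a class that appears in neither mask
            ious.append(intersection / union)

    return sum(ious) / len(ious)


pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [0, 0]]
# Lesion IoU is 1/2 and background IoU is 2/3, so the mean is 7/12.
score = binary_miou(pred, target)
```

A dataset-level score would then average this quantity over all test images, or accumulate intersections and unions across images before dividing; the two conventions can give slightly different numbers.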