Abstract
Colposcopy is essential for the early detection of cervical cancer; however, its accuracy depends heavily on clinician experience and is often limited in low-resource settings. After acetic acid application, most high-grade lesions retain acetowhitening for 180 seconds, whereas nearly all low-grade or benign areas fade more rapidly. Leveraging this dynamic contrast, we propose TLS-Net, a deep network that processes time-series images captured at 60, 90, 150, and 180 seconds post-application. First, a Swin Transformer encoder extracts rich spatial features to localize lesion candidates. Next, a temporal attention module, incorporating a Convolutional Block Attention Module (CBAM), fuses information across time points to distinguish persistent acetowhite regions. Finally, a segmentation head delineates areas of high-grade squamous intraepithelial lesion or worse (HSIL+) within the detected regions. Trained and validated on 1,152 images from 288 patients, TLS-Net achieved a mean Dice score of 85.55% ± 1.33%, mean pixel accuracy of 85.61% ± 2.30%, and mean intersection-over-union of 76.65% ± 1.72% on the validation set, outperforming single-frame approaches. These results demonstrate the potential of TLS-Net for AI-assisted colposcopy in clinical practice.
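The core idea of fusing features across the four acquisition times can be sketched numerically. The following is a minimal NumPy illustration of attention-weighted temporal fusion, not the authors' implementation: the function name, toy feature shapes, and hand-set relevance scores are all illustrative assumptions; in TLS-Net the weights would be produced by the learned temporal attention module.

```python
import numpy as np

def temporal_attention_fuse(feats, scores):
    """Fuse per-timepoint feature maps with softmax attention weights.

    feats:  (T, C, H, W) feature maps from T time points
    scores: (T,) relevance scores (hand-set here; learned in practice)
    Returns a (C, H, W) fused map emphasizing highly weighted frames.
    """
    w = np.exp(scores - scores.max())
    w = w / w.sum()                        # softmax over the time axis
    return np.tensordot(w, feats, axes=1)  # weighted sum over T frames

# Toy example: four time points (60, 90, 150, 180 s), 8-channel 4x4 features
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 4, 4))
scores = np.array([0.1, 0.3, 0.9, 1.2])   # later frames weighted higher,
                                          # mimicking persistent acetowhitening
fused = temporal_attention_fuse(feats, scores)
print(fused.shape)  # (8, 4, 4)
```

Weighting later frames more heavily reflects the clinical observation above: regions that stay acetowhite through 180 seconds are the ones most indicative of HSIL+.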