Abstract
BACKGROUND: Coronary artery disease (CAD) remains the leading cause of mortality worldwide. Coronary digital subtraction angiography (DSA) is the gold standard for evaluating the location, extent, and severity of coronary lesions, and it serves as the primary basis for revascularization decisions. However, interpreting DSA in complex lesions is often time-consuming and subject to interobserver variability. This study aims to develop and validate a multisource cue fusion-based method for precise localization of coronary artery stenosis in DSA.
METHODS: A detection network was developed to localize stenosis in key frames of coronary DSA by integrating multisource cues. Specifically, vessel segmentation provides spatial contour cues, while adjacent frames provide dynamic cues. To combine these cues effectively, a cross-cue attention-based fusion module was designed, which enhances the target frame representation by capturing nonlocal spatial dependencies. Furthermore, a penalty based on distance from the coronary ostium was incorporated into the loss function to improve the model's sensitivity to localization errors in proximal stenosis.
RESULTS: The proposed method achieved a precision of 87.90%, a recall of 65.78%, an F1-score of 74.99%, a mean average precision at an intersection over union (IoU) threshold of 0.5 (mAP@0.5) of 78.46%, and a mAP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05 (mAP@0.5-0.95) of 60.37%, outperforming YOLOv5, YOLOv8, and YOLOv12 on all metrics. Ablation studies further showed that the total loss gave the best performance, with a mAP@0.5-0.95 of 60.37%, compared with 56.23% for the complete IoU loss and 58.01% for the distance-weighted loss.
CONCLUSIONS: The proposed method addresses the need for precise identification and localization of coronary artery stenosis in CAD assessment by integrating multiple sources of information in DSA. It combines structural information from the target frame with dynamic temporal information from adjacent frames, overcoming the limitations of single-frame analysis and improving the accuracy of stenosis localization.
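
The ostium distance penalty described in METHODS can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration: the exponential weighting, the parameters `alpha` and `sigma`, the function names, and the use of a plain IoU loss instead of the complete IoU loss mentioned in RESULTS. The sketch shows only the general idea that the same localization error costs more when the stenosis lies close to the coronary ostium.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ostium_weight(target_box, ostium_xy, alpha=1.0, sigma=100.0):
    """Hypothetical weight: larger when the target box is near the ostium.

    alpha (extra weight at zero distance) and sigma (decay length in pixels)
    are illustrative parameters, not values from the study.
    """
    cx = (target_box[0] + target_box[2]) / 2
    cy = (target_box[1] + target_box[3]) / 2
    dist = math.hypot(cx - ostium_xy[0], cy - ostium_xy[1])
    return 1.0 + alpha * math.exp(-dist / sigma)


def distance_weighted_iou_loss(pred_box, target_box, ostium_xy,
                               alpha=1.0, sigma=100.0):
    """Assumed form: (1 - IoU) scaled by the ostium proximity weight."""
    weight = ostium_weight(target_box, ostium_xy, alpha, sigma)
    return weight * (1.0 - iou(pred_box, target_box))


if __name__ == "__main__":
    ostium = (0.0, 0.0)
    # The same 5-pixel horizontal error, applied near and far from the ostium.
    proximal = distance_weighted_iou_loss((15, 10, 35, 30), (10, 10, 30, 30), ostium)
    distal = distance_weighted_iou_loss((315, 310, 335, 330), (310, 310, 330, 330), ostium)
    print(f"proximal loss: {proximal:.4f}, distal loss: {distal:.4f}")
```

Under these assumptions the two predictions have identical IoU with their targets, yet the proximal one receives the larger loss, which is the behavior the abstract attributes to the distance-based penalty.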