Abstract
BACKGROUND: Mathematical prediction models (MPMs) based on clinical and radiologist-assessed features have been developed to assist with lung cancer risk assessment for imaging-detected lung nodules. However, the MPMs were developed using different datasets, thresholds, and feature sets, making it difficult to compare their published performance metrics and to determine how stable their performance is prospectively. The aim of this study is to use a large lung cancer screening cohort with identified pulmonary nodules to compare the performance of four MPMs at a standardized sensitivity value, with a view to reducing the false positive rate of lung cancer screening exams.
METHODS: This retrospective study used lung nodules identified on low-dose computed tomography (LDCT) in the National Lung Screening Trial (NLST) to evaluate four MPMs [Mayo Clinic (MC), Veterans Affairs (VA), Peking University (PU), and Brock University (BU)]. For cross-comparison, a small NLST sub-cohort (n=270) was used to determine a calibrated decision threshold for each model, targeting a sensitivity of 95% for detecting lung cancer. Performance was evaluated using the area under the receiver operating characteristic curve (AUC-ROC), the area under the precision-recall curve (AUC-PR), sensitivity, and specificity. The calibrated thresholds were then applied to the remaining NLST cohort (n=1,083) to assess the stability of the performance metrics.
RESULTS: A total of 1,353 patients [mean ± standard deviation (SD) age, 62.3±5.2 years; 746 male] were included, of whom 122 (9.0%) had a malignant nodule. At the target sensitivity of 95%, the highest testing specificity (correctly identified benign cases) was seen in the BU and MC models (55% and 52%, respectively), compared with the VA (45%) and PU (16%) models. The AUC-ROCs for BU (83%), MC (83%), PU (76%), and VA (77%) suggest moderate-to-high performance, whereas the AUC-PRs (27-33%) more accurately reflect that all the models have sub-optimal precision.
CONCLUSIONS: Tuning the calibration thresholds of existing MPMs aids performance comparison and stability when the models are applied in the lung cancer screening setting. However, when targeting high sensitivity (95%), the achievable specificity of the MPMs is low (16-55%), which may limit their clinical utility.
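The threshold calibration described in the methods can be illustrated with a minimal sketch. It assumes that each model outputs a malignancy score per nodule and that a nodule is called positive when its score is at or above the threshold; the function names `calibrate_threshold` and `sensitivity_specificity` are hypothetical and are not taken from the study, whose exact procedure is not given in the abstract.

```python
import math


def calibrate_threshold(scores, labels, target_sensitivity=0.95):
    """Pick the highest threshold whose sensitivity meets the target.

    Assumption: a case is called positive when score >= threshold,
    and labels use 1 for malignant and 0 for benign.
    """
    malignant = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not malignant:
        raise ValueError("calibration set needs at least one malignant case")
    # Smallest number of malignant cases that must be flagged to reach the target.
    needed = math.ceil(target_sensitivity * len(malignant))
    # The score of the `needed`-th highest malignant case is the threshold.
    return malignant[needed - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) at a fixed threshold."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        positive_call = s >= threshold
        if y == 1:
            tp += positive_call
            fn += not positive_call
        else:
            fp += positive_call
            tn += not positive_call
    return tp / (tp + fn), tn / (tn + fp)
```

In this sketch the threshold would be fixed on the calibration sub-cohort with `calibrate_threshold` and then passed unchanged to `sensitivity_specificity` on the held-out cohort. The sensitivity observed there may differ from the target, which is one way to read the stability question the study raises.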