Abstract
BACKGROUND: Accurate prediction of radiation-induced toxicity remains a key challenge in head-and-neck radiotherapy. This retrospective study compared traditional normal-tissue complication probability (NTCP) models (Lyman-Kutcher-Burman and relative seriality) with machine learning (ML) approaches (artificial neural network [ANN] and extreme gradient boosting [XGBoost]) for organ-at-risk (OAR) toxicity prediction in a small cohort (n = 57).

MATERIALS AND METHODS: Fifty-seven patients treated with intensity-modulated radiotherapy, volumetric modulated arc therapy, or hybrid techniques were analyzed across 115 OARs (parotid glands [54], larynx [31], and spinal cord [30]). Post-treatment toxicities were graded using the Common Terminology Criteria for Adverse Events v5.0, with a median follow-up of 10 months. The traditional NTCP and ML models were implemented using stratified 5-fold cross-validation and assessed using discrimination (area under the receiver operating characteristic curve [AUC]), Brier score, calibration analysis, and SHapley Additive exPlanations values.

RESULTS: Grade ≥2 toxicity occurred in 63.0% (34/54) of parotid glands and 45.2% (14/31) of larynges, with no spinal cord events. ML models achieved superior discrimination for parotid glands (ANN: AUC = 0.866, 95% confidence interval [CI]: 0.81-0.93; XGBoost: AUC = 0.847, 95% CI: 0.78-0.91) and larynx (XGBoost: AUC = 0.853, 95% CI: 0.78-0.92) compared with traditional models (all AUC < 0.60, P < 0.001). Brier scores were 0.135-0.145 for ML models versus 0.276-0.295 for traditional approaches, though calibration slopes (1.37-1.84) indicated systematic under-prediction requiring attention in clinical implementation. Age (P = 0.002, Cohen's d = 0.908), total dose (P = 0.035), and treatment duration (P = 0.026) were significantly associated with parotid toxicity. Traditional model parameters required substantial adjustment from literature values (parotid tolerance dose for 50% complication: 10.0 vs. 28.4 Gy).

CONCLUSION: ML models captured nonlinear interactions between dosimetric and clinical variables more effectively than traditional NTCP models, yielding superior predictive accuracy. However, the findings are exploratory given the same-dataset validation and modest cohort size (n = 57), and external validation in larger, multi-institutional cohorts is essential before clinical implementation.
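
For readers unfamiliar with the traditional approach, the following is a minimal sketch of the standard Lyman-Kutcher-Burman formulation, NTCP = Φ((gEUD − TD50) / (m · TD50)), together with the Brier score used for comparison. It is not the study's implementation: the function names, the example dose-volume histogram, and the values of m and n are illustrative assumptions, and only the two parotid TD50 values (10.0 and 28.4 Gy) are taken from the abstract.

```python
import math


def geud(doses_gy, volumes, n):
    """Generalized equivalent uniform dose from a differential DVH.

    doses_gy: bin doses in Gy; volumes: matching bin volumes (any units);
    n: volume-effect parameter (n = 1 reduces gEUD to the mean dose).
    """
    total = sum(volumes)
    return sum((v / total) * d ** (1.0 / n) for d, v in zip(doses_gy, volumes)) ** n


def lkb_ntcp(doses_gy, volumes, td50, m, n):
    """LKB NTCP: standard normal CDF of (gEUD - TD50) / (m * TD50)."""
    t = (geud(doses_gy, volumes, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    # Hypothetical parotid DVH (not study data): four dose bins, fractional volumes.
    doses = [5.0, 15.0, 25.0, 35.0]
    vols = [0.4, 0.3, 0.2, 0.1]
    # m = 0.4 and n = 1.0 are placeholder values chosen for illustration only.
    for td50 in (10.0, 28.4):
        print(f"TD50 = {td50} Gy -> NTCP = {lkb_ntcp(doses, vols, td50, 0.4, 1.0):.3f}")
```

With the same dose distribution, the lower TD50 gives a much higher predicted complication probability, which illustrates why a shift in this single parameter changes the traditional model's output so strongly.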