Abstract
Background/Objectives: Noninvasive differentiation of parotid gland tumors remains challenging despite ultrasound being the primary imaging modality for salivary gland lesions. Because ultrasound interpretation is examiner-dependent, improving diagnostic consistency and transparency is crucial. We quantified interobserver variability in parotid ultrasound, modeled examiner-specific decision patterns using machine-learning surrogates, and tested whether surrogate complexity relates to examiner performance.

Methods: In this retrospective, single-center study, six examiners independently rated ultrasound images of 149 parotid tumors using predefined descriptors. Performance was summarized using accuracy and the area under the receiver operating characteristic curve (AUC), with 95% confidence intervals (CIs). AUCs were compared using DeLong tests (Holm-adjusted). Interobserver agreement was assessed using pairwise Cohen's κ and global Fleiss' κ. For each examiner, a decision-tree surrogate was trained on structured descriptors and clinical metadata to reproduce that examiner's labels and visualize decision pathways; surrogate performance was estimated by 5-fold cross-validation.

Results: Examiner accuracy ranged from 63.5% to 90.5% and AUC from 0.66 to 0.89 (best 0.89, 95% CI 0.83-0.95); the best performer exceeded the two lowest performers (p < 0.001). Agreement was higher for objective descriptors (size: κ = 0.57-0.97) than for subjective descriptors (echogenicity: κ = 0.11-0.79). Surrogate decision-tree accuracy versus histopathology ranged from 57.2% to 80.0% for unpruned and from 65.1% to 76.5% for pruned models, with high coverage (95.3-98.7%). Tree complexity showed no consistent association with examiner performance.

Conclusions: Parotid ultrasound shows substantial interobserver variability. Interpretable surrogates can approximate individual labeling behavior from structured descriptors and clinical metadata, making examiner-dependent decision patterns explicit.
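The analysis pipeline described in the Methods (pairwise Cohen's κ between examiners and a cross-validated decision-tree surrogate per examiner) can be sketched as follows. This is an illustrative sketch, not the authors' code: the ratings and feature matrix below are synthetic placeholders, and the depth limit stands in for the pruning step only by assumption.

```python
# Illustrative sketch of the abstract's two analyses, on synthetic data:
# (1) pairwise Cohen's kappa across examiners,
# (2) a decision-tree surrogate fitted to one examiner's labels,
#     evaluated with 5-fold cross-validation.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_cases = 149  # cohort size reported in the abstract

# Synthetic binary ratings (e.g., benign=0 / malignant=1) from six examiners
ratings = rng.integers(0, 2, size=(6, n_cases))

# Pairwise Cohen's kappa between all examiner pairs
kappas = {}
for i in range(6):
    for j in range(i + 1, 6):
        kappas[(i, j)] = cohen_kappa_score(ratings[i], ratings[j])

# Structured descriptors + clinical metadata as a synthetic feature matrix
X = rng.normal(size=(n_cases, 8))
y_examiner = ratings[0]  # one examiner's labels as the surrogate target

# "Pruned" surrogate: a depth limit keeps the decision pathways interpretable
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
cv_acc = cross_val_score(tree, X, y_examiner, cv=5, scoring="accuracy")
print(f"surrogate 5-fold CV accuracy: {cv_acc.mean():.2f}")
```

With real data, `X` would hold the predefined ultrasound descriptors and clinical metadata, and the fitted tree could be plotted (e.g., with `sklearn.tree.plot_tree`) to make each examiner's decision pathways explicit.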