Abstract
BACKGROUND: The interpretation of nuanced recommendations within complex clinical oncology guidelines, such as those for brain metastases, presents persistent challenges for medical experts, potentially impacting treatment consistency. While Large Language Models offer potential decision support, their comparative efficacy in this domain remains underexplored. This study evaluated the accuracy and convergence of medical experts versus leading Large Language Models in interpreting Strength of Recommendation and Quality of Evidence from the ASTRO and ASCO-SNO-ASTRO brain metastases guidelines. METHODS: Neurosurgeons, radiation oncologists, and four Large Language Models (ChatGPT-4o, Gemini 2.0, Microsoft Copilot Pro, DeepSeek R1) assessed the Strength of Recommendation and Quality of Evidence for guideline recommendations. Accuracy, near-answer rates, and Cohen's weighted kappa (κ) were calculated. RESULTS: Large Language Models, notably Gemini and DeepSeek, demonstrated significantly higher accuracy (up to 100% for ASTRO Strength of Recommendation vs. a maximum of 58.82% for experts) and near-perfect convergence (κ up to 1.000 vs. κ ≤ 0.504 for experts) in interpreting the ASTRO guideline. While all groups found Quality of Evidence and the more complex ASCO guideline more challenging, Large Language Models generally maintained an advantage in convergence, with DeepSeek achieving 61.53% accuracy and κ = 0.428 for ASCO Strength of Recommendation versus a maximum of 53.84% accuracy and highly variable convergence for experts. CONCLUSIONS: Large Language Models demonstrated significantly higher accuracy than human experts in the structured interpretation of guideline classifications, with near-perfect inter-model convergence. This supports their role as standardization tools for guideline parsing, freeing experts for patient-specific reasoning where clinical context, comorbidities, and preferences dominate decision-making.
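
The three metrics named in METHODS can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: ratings are coded as ordinal integers (0 to k-1), a "near answer" is taken to mean a rating exactly one ordinal step away from the guideline's classification, and kappa uses linear disagreement weights. The function names and the sample data are hypothetical and are not drawn from the study.

```python
def accuracy(ratings, reference):
    """Fraction of ratings that exactly match the reference classification."""
    return sum(r == t for r, t in zip(ratings, reference)) / len(reference)


def near_answer_rate(ratings, reference):
    """Fraction of ratings exactly one ordinal step from the reference.

    Assumption: this is one plausible reading of "near-answer rate";
    the abstract does not define the term.
    """
    return sum(abs(r - t) == 1 for r, t in zip(ratings, reference)) / len(reference)


def weighted_kappa(a, b, n_categories):
    """Cohen's weighted kappa with linear disagreement weights (assumed scheme)."""
    k, n = n_categories, len(a)
    # Joint proportion table: observed[i][j] = share of items rated i by a and j by b.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_obs = disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear disagreement weight in [0, 1]
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical data: guideline classifications and one rater's answers.
reference = [0, 1, 2, 2]
ratings = [0, 2, 2, 1]
print(accuracy(ratings, reference))
print(near_answer_rate(ratings, reference))
print(weighted_kappa(ratings, reference, n_categories=3))
```

Under this formulation kappa equals 1 only when every rating matches the reference, so it penalizes large ordinal misses more heavily than adjacent ones, which exact-match accuracy alone does not capture.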