Abstract
BACKGROUND: While large language models (LLMs) show promise in healthcare, their reliability in high-stakes perioperative management for elderly patients with multimorbidity remains critically underexplored. METHODS: This benchmarking study evaluated five general-purpose LLMs (ChatGPT, Gemini, DeepSeek, Claude, Kimi) and one domain-optimized model (New Youth Anesthesia Artificial Intelligence Assistant, NYAAI) using a novel three-dimensional framework assessing guideline compliance, clinical applicability, and safety redundancy. A simulated case of an 84-year-old male with femoral fracture and multimorbidity was developed. Two blinded anesthesiologists scored anonymized outputs on a 5-point Likert scale. Additionally, to account for the rapid evolution of AI models, a supplementary analysis was conducted to evaluate the robustness and sensitivity of current model versions. RESULTS: NYAAI achieved the highest total score (12/15), excelling in clinical applicability (5/5) through domain-optimized parameterization. However, it exhibited selective guideline adherence, omitting temperature management and delirium protocols. General-purpose models demonstrated moderate guideline compliance (ChatGPT: 4/5; Gemini: 3/5) but generated contextually inappropriate recommendations. Safety redundancy emerged as a universal failure: no model addressed extreme-event protocols (aortic rupture management). CONCLUSION: This study evaluated six LLMs for perioperative decision support in a geriatric patient with multimorbidity. The findings confirm that LLMs are useful as structured protocol generators, but they are not sufficient as autonomous clinical agents. Domain-optimized models enhance operational feasibility, yet they heighten the tension between safety redundancy and contextual adaptability. General-purpose models, despite their broad knowledge, are prone to generating inaccurate or hallucinated content.
To maximize efficiency without jeopardizing medical safety, LLMs should be positioned as extensions of expert systems rather than independent decision-makers. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12871-025-03605-x.