Abstract
Large language models (LLMs) are increasingly incorporated into preoperative and discharge education, yet evidence on their effectiveness is mixed and evaluation practices remain inconsistent. This systematic review assessed the effectiveness of LLM-based interventions and identified evidence gaps relevant to understanding how model characteristics may influence patient outcomes. We searched five databases from inception to April 18, 2025; twenty studies met the inclusion criteria. Outcomes were narratively synthesized, and interventions were appraised against a published four-dimension evaluation framework, with reporting patterns visualized in a heatmap. Many studies reported benefits for anxiety reduction and selected satisfaction domains, whereas pain, recovery, and the remaining satisfaction domains showed no significant differences from conventional materials. Reporting of the framework's sub-dimensions was uneven: trustworthiness and performance were rarely documented alongside clinical endpoints. These gaps underscore the need for future research that integrates model-centric and patient-centric evaluations to support responsible clinical deployment.