Abstract
Personalized out-of-hospital management could significantly improve the quality of life of breast cancer patients. We aimed to evaluate the accuracy, effectiveness, safety, personalization, and emotional care of Large Language Models (LLMs) in the out-of-hospital management of breast cancer. We established a data cleaning and classification pipeline to summarize three major scenarios of out-of-hospital management. Authentic electronic health record (EHR) datasets were generated from 10 patients, with identifying information masked, drawn from the Breast Cancer Database of the Affiliated Sir Run Run Shaw Hospital, Zhejiang University. We then matched the EHR datasets with the three out-of-hospital management scenarios to create 100 virtual patients (VPs), and generated conversations with the VPs using GPT-o3 and DeepSeek-R1 (DS-R1). Four human specialists rated the responses of the LLMs on five dimensions using a Likert scale. As of April 1, 2025, the four evaluating specialists had rated the conversations between the LLMs and the 100 VPs. The results demonstrate that both DS-R1 and GPT-o3 performed well, with scores concentrated primarily at 3 and 4 points. We found statistically significant differences between DS-R1 and GPT-o3 in accuracy, personalization, and emotional care (P < 0.01), whereas the P-values for effectiveness and safety were 0.231 and 0.086, respectively. Furthermore, DS-R1 generated more tokens (approximately 1.8 times as many) in the same time at lower economic cost, and also had a shorter response time than GPT-o3. GPT-o3 and DS-R1 demonstrated personalized, empathetic, and accurate performance in the out-of-hospital management of breast cancer patients. DS-R1 showed better overall performance than GPT-o3, especially in personalization, emotional care, and accuracy. Further research is warranted on developing domain-specific knowledge-embedded LLMs to reduce drawbacks such as hallucinated or verbose responses.