Abstract
BACKGROUND: ChatGPT has demonstrated strong performance across the complex, full clinical workflow. In recent years, several large language models (LLMs) from China have been introduced; however, their performance on such intricate tasks has yet to be thoroughly assessed, and it remains unclear whether it diverges from that of ChatGPT. This study evaluates the capacity of Chinese LLMs to provide continuous clinical decision support by assessing their performance on simulated patient cases.
METHODS: We selected 29 standard cases from the Merck Manual as simulated patients and provided their information to the LLMs. Each simulated case was accompanied by a series of sequential questions designed to simulate the processes of differential diagnosis, diagnostic workup, diagnosis, and management. The responses were recorded and scored. We then compared the performance of two Chinese LLMs with that of ChatGPT-4 across the entire clinical workflow of the simulated patients, and the best-performing model was further compared with 18 emergency fellow physicians. Additionally, we compared the performance of different versions of the LLMs.
RESULTS: There were no significant differences between ChatGPT-4 and Doubao in any of the four aspects (P > 0.05). However, ERNIE Bot 3.5 was inferior to ChatGPT-4 and Doubao in differential diagnosis, diagnostic questions, and management (P < 0.05). In diagnosis questions, the average accuracy for all three models exceeded 97%, with no significant differences observed (P > 0.05). There was no significant difference between the LLMs and emergency fellow physicians in diagnosis and differential diagnosis (P > 0.05), but in diagnostic questions and management, the LLMs were superior to the emergency fellow physicians (P < 0.05). ChatGPT-4 outperformed ChatGPT-3.5 in all four aspects (P < 0.05).
CONCLUSION: The Chinese large language model Doubao performs comparably to ChatGPT-4 across the full clinical workflow. LLMs outperformed emergency fellow physicians in diagnostic questions and management and are developing rapidly, offering significant potential for practical application in healthcare.