Abstract
WHAT IS ALREADY KNOWN ABOUT THIS TOPIC?

Large language models (LLMs) have demonstrated considerable potential in clinical applications. However, their performance in field epidemiology, particularly in Chinese-language contexts, remains largely unexplored.

WHAT IS ADDED BY THIS REPORT?

This study evaluated six leading LLMs (ChatGPT-o4-mini-high, ChatGPT-4o, DeepSeek-R1, DeepSeek-V3, Qwen3-235B-A22B, and Qwen2.5-Max) on examination questions from the Zhejiang Field Epidemiology Training Program. On multiple-choice questions, all models except DeepSeek-V3 scored below the 75th percentile of junior field epidemiologists, whereas on case-based questions the LLMs generally exceeded that benchmark. However, the models showed marked limitations on questions requiring specialized knowledge. Notably, LLMs may generate inaccurate or fabricated references, posing substantial risks for inexperienced practitioners.

WHAT ARE THE IMPLICATIONS FOR PUBLIC HEALTH PRACTICE?

LLMs show promising potential for supporting epidemiological investigations. Nevertheless, current LLMs cannot replace human expertise in field epidemiology, and their practical implementation faces considerable challenges, including ensuring output accuracy and reliability. Future efforts should prioritize optimizing performance through verified knowledge databases and establishing robust regulatory frameworks to enhance their effectiveness in public health applications.