Abstract
In recent years, large language models (LLMs) have achieved remarkable progress in natural language processing and shown potential for applications in medicine. However, their professional competence in specific medical subfields, such as immunology, still requires systematic evaluation. This study systematically evaluated 11 representative LLMs, including the DeepSeek, GPT, Llama, Gemma, and Qwen series, using the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology. The evaluation covered four dimensions: basic medical knowledge, related medical knowledge, immunology knowledge, and professional practice ability. The results reveal significant differences among the LLMs. DeepSeek-R1 and Qwen3 achieved the best performance, with accuracy exceeding 90%. However, performance on professional practice tasks remained relatively low, highlighting the models' limitations in complex clinical applications.