Abstract
BACKGROUND: Large language models (LLMs) are increasingly explored in nursing education, but their capabilities in specialized, high-stakes, culturally specific examinations, such as the Chinese National Nurse Licensure Examination (CNNLE), remain underevaluated. Rigorous evaluation is needed before they are adopted in nursing training and practice.

OBJECTIVE: This study aimed to evaluate the performance of 4 LLMs on the CNNLE in terms of accuracy, repeatability, confidence, and robustness.

METHODS: Four LLMs (Sider Fusion [Vidline Inc], GPT-4o [OpenAI], Gemini 2.0 Pro [Google DeepMind], and DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 CNNLE. Accuracy and repeatability were assessed using 2 prompting strategies. Confidence was evaluated via self-ratings on a 1-10 scale, and robustness was evaluated via repeated adversarial prompting.

RESULTS: DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy (ranging from 199/237 to 209/237; >83%) than GPT-4o and Sider Fusion (ranging from 151/237 to 166/237; <71%). However, all LLMs showed suboptimal repeatability (highest at 206/237; <87% consistency). Critically, confidence calibration was poor: most models reported high confidence that often did not match their actual accuracy (Sider Fusion: P=.01; GPT-4o: P=.03; and Gemini 2.0 Pro: P=.049). A stability-flexibility trade-off paradox was also observed.

CONCLUSIONS: While some LLMs show promising accuracy on the CNNLE, fundamental reliability limitations (poor confidence calibration and inconsistent repeatability) hinder their safe application in nursing education and practice. Future LLM development must prioritize trustworthiness and calibrated reliability over surface accuracy.
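The percentage bounds quoted in the results follow directly from the reported counts out of 237 questions. The short Python sketch below only reproduces that arithmetic; the dictionary labels and the helper name `to_percent` are illustrative choices, not part of the study, and no data beyond the counts stated in the abstract are used.

```python
# Counts reported in the abstract, each out of 237 CNNLE questions.
# The labels and the helper name are illustrative assumptions, not study terms.
TOTAL_QUESTIONS = 237

reported_counts = {
    "DeepSeek V3 / Gemini 2.0 Pro accuracy, lower bound": 199,
    "DeepSeek V3 / Gemini 2.0 Pro accuracy, upper bound": 209,
    "GPT-4o / Sider Fusion accuracy, lower bound": 151,
    "GPT-4o / Sider Fusion accuracy, upper bound": 166,
    "Highest repeatability across models": 206,
}


def to_percent(count, total=TOTAL_QUESTIONS):
    """Convert a count of questions into a percentage of the total, to 2 decimals."""
    return round(100 * count / total, 2)


percentages = {label: to_percent(count) for label, count in reported_counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

Under these counts, the higher-accuracy pair spans roughly 83.97% to 88.19%, the lower-accuracy pair roughly 63.71% to 70.04%, and the best repeatability is about 86.92%, consistent with the ">83%", "<71%", and "<87%" bounds stated above.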