Abstract
OBJECTIVES: To comprehensively evaluate the validity of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes for both prevalent diagnoses and less common diseases, and to assess the performance of a large language model (LLM)-based system in validating these codes. MATERIALS AND METHODS: This retrospective study analyzed hospital admissions from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. We developed and validated an LLM-based system using GPT-4o, refined through iterative prompt engineering, to assess ICD-10-CM code validity. We measured the positive predictive value (PPV) of ICD-10-CM codes overall, the PPV of principal and secondary diagnoses separately, and the performance of the LLM-based system in code validation. RESULTS: Among 865 079 assigned codes, the PPV was 84.6% (95% CI, 84.5%-84.6%). Principal diagnoses had a PPV of 93.9% (95% CI, 93.7%-94.1%), while secondary diagnoses had a PPV of 83.8% (95% CI, 83.7%-83.9%). The LLM system demonstrated high performance in validating ICD codes, achieving 93.6% accuracy, 95.4% sensitivity, and 85.2% specificity. Among correctly assigned secondary diagnoses, the majority (67.9%) represented historical or baseline conditions, while 32.1% reflected active conditions that deviated from baseline status; 22.3% of these emerged after hospital admission. PPV decreased at later diagnosis positions, with the largest decline occurring between principal and secondary diagnoses. DISCUSSION AND CONCLUSION: In this large-scale evaluation, ICD-10-CM codes exhibited generally high accuracy, though accuracy varied by diagnosis position and condition type. The validated LLM system performed comparably to physician review and offers a scalable means to improve coding accuracy. These findings support the potential for integrating LLM-based auditing into routine workflows to strengthen the quality of administrative and research data.
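For readers less familiar with the reported metrics, the following is a minimal sketch of how a PPV with a 95% confidence interval, and the accuracy, sensitivity, and specificity of a code validator, can be computed from counts. The abstract does not state which interval method was used; the Wilson score interval here is an assumption chosen for illustration. The function names and all counts are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 gives an approximate 95% interval. The interval method is an
    assumption; the abstract does not say which one the authors used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def validation_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity of a validator measured
    against a reference standard (for example, physician review)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only (not study data).
supported, assigned = 846, 1000      # assigned codes judged valid / all assigned codes
ppv = supported / assigned           # PPV = valid assigned codes / all assigned codes
low, high = wilson_interval(supported, assigned)
print(f"PPV = {ppv:.1%} (95% CI, {low:.1%}-{high:.1%})")

metrics = validation_metrics(tp=80, fp=5, tn=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With very large denominators, as in this study, the interval becomes narrow regardless of the method chosen, which is consistent with the tight intervals reported above.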