Abstract
BACKGROUND: Patient language proficiency plays a critical role in equitable, patient-centered care and language-related clinical research. However, language information recorded in structured fields of electronic health records (EHRs) is often incomplete or inaccurate, especially in multi-institutional settings with heterogeneous documentation practices.

OBJECTIVE: To develop and evaluate a named entity recognition (NER) pipeline that uses large language models (LLMs) to accurately extract detailed patient language status from unstructured clinical notes, enabling scalable and generalizable language information extraction.

METHODS: We defined four categories of language status (fluent use, partial ability, lack of understanding, and language mentions unrelated to the patient) and annotated two datasets, one from Yale New Haven Hospital (YNHH) and one from MIMIC-III. We evaluated the performance of proprietary and open-source models, including GPT-4o, LLaMA3, and BERT, in zero-shot and fine-tuned settings. Cross-site validation was conducted to assess generalizability across institutions.

RESULTS: Without fine-tuning, GPT-4o achieved F1 scores of 87% and 82% on the YNHH and MIMIC-III datasets, respectively. Fine-tuned open-source models such as BERT and LLaMA3 reached comparable or superior performance when trained on sufficient annotated data. Cross-institutional evaluations confirmed that LLMs, particularly LLaMA3, exhibited stronger generalizability than traditional models. Language mentions unrelated to patient fluency remained the most challenging category for all models.

CONCLUSION: Our NER framework enables automated extraction of nuanced language information from clinical narratives with high accuracy and generalizability. This work supports large-scale, language-focused research and has practical implications for improving patient-provider communication, interpreter service allocation, and equitable healthcare delivery.