Abstract
BACKGROUND: Cardiac magnetic resonance imaging (CMR) studies contain a wealth of information on a patient's cardiovascular status. The ability to extract this data from free-text reports could help automate clinical decision support tools and generate data for retrospective clinical knowledge discovery and clinical operations. Few studies have examined the automatic extraction of data from free-text CMR reports, and those that exist have key limitations, including small sample sizes and disease-specific data extraction. Existing studies also fail to extract attributes of cardiovascular conditions that reflect nuances in natural language, such as the uncertainty, severity, subtype, and anatomical location of the condition. The goal of this study was to build a named entity recognition model that automatically extracts a broad variety of common CMR findings and their associated attributes from CMR reports.
METHODS: We fine-tuned a Large Language Model Meta AI (LLaMA) model to identify 34 cardiovascular conditions and their associated attributes, including the certainty, severity, location, and subtype of the condition. The model was trained on 1778 MRI reports and tested on 397 reports in a held-out test set and on another 428 reports from a separate site in our hospital system with an independent radiology practice and scanners.
RESULTS: Our model shows robust performance in predicting the mention of the 31 cardiovascular conditions (average F1=0.85). It also showed strong performance in predicting attributes, including certainty (average F1=0.97) and severity (average F1=0.97). Performance on the external validation set was generally slightly lower than on the internal test set but remained strong (average F1=0.78 for mention, 0.97 for certainty, and 0.96 for severity).
CONCLUSION: CMR-LLaMA performs strongly in identifying a variety of concept mentions and shows moderate accuracy in extracting a selection of other associated attributes. NLP models can be used to automate the extraction of data from CMR reports, potentially assisting clinical and research workflows.