Abstract
INTRODUCTION: Accurate detection of intracranial hemorrhage (ICH) and its subtypes on CT scans is critical for timely treatment. While convolutional neural networks have achieved high diagnostic accuracy in this task, the ability of large language models (LLMs) to interpret medical images directly remains largely untested.

METHODS: We evaluated a general-purpose multimodal LLM for binary ICH detection and multi-class subtype classification using the publicly available PhysioNet ICH dataset (version 1.3.1, Massachusetts Institute of Technology (MIT), Cambridge, MA). Preprocessed axial slices were grouped into composite images and encoded in base64 for model input. Binary classification distinguished the presence from the absence of hemorrhage in 75 scans (36 positive, 39 negative). Subtype classification among positive cases covered five categories: intraventricular, intraparenchymal, subarachnoid, epidural, and subdural hemorrhage. Performance metrics included accuracy, precision, recall, F1 score, exact match accuracy, and Hamming score.

RESULTS: For binary detection, the model achieved an overall accuracy of 0.52, with low recall for hemorrhage-positive cases (0.14) and higher recall for hemorrhage-negative cases (0.87). Subtype performance varied: intraparenchymal hemorrhage reached the highest F1 score (0.57), while epidural hemorrhage showed perfect precision but poor recall (0.14). Exact match accuracy was 0.06 and the Hamming score was 0.54, reflecting partial but inconsistent predictive ability.

CONCLUSIONS: The LLM demonstrated limited sensitivity for hemorrhage detection and inconsistent subtype classification, underscoring the current constraints of zero-shot application to medical imaging. These findings highlight the need for domain-specific fine-tuning, larger and more diverse datasets, and integration with traditional computer vision methods before clinical deployment.
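
The reported binary figures can be cross-checked arithmetically, and the two multi-label metrics can be made concrete with a small example. The following Python sketch is illustrative only and is not the authors' evaluation code. The confusion-matrix counts are reconstructed by rounding the reported recalls against the stated class sizes (an assumption, since the abstract does not list raw counts). The abstract does not define the Hamming score; the sketch assumes one common definition, the mean per-case ratio of correctly predicted labels to the union of true and predicted labels, whereas another common definition is one minus the Hamming loss. The function names and the toy label sets are hypothetical.

```python
from typing import Dict, List, Set


def binary_summary(n_pos: int, n_neg: int,
                   recall_pos: float, recall_neg: float) -> Dict[str, float]:
    """Reconstruct approximate confusion-matrix counts from class sizes and
    per-class recall, then derive overall accuracy.

    Assumption: the counts are recovered by rounding; they are not reported
    in the abstract itself.
    """
    tp = round(recall_pos * n_pos)   # hemorrhage-positive scans correctly flagged
    tn = round(recall_neg * n_neg)   # hemorrhage-negative scans correctly cleared
    fn = n_pos - tp
    fp = n_neg - tn
    accuracy = (tp + tn) / (n_pos + n_neg)
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp, "accuracy": accuracy}


def exact_match_accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of cases whose predicted label set equals the true set exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_score(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean of |true AND pred| / |true OR pred| per case (assumed definition)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        # Two empty label sets agree completely, so score that case as 1.0.
        scores.append(1.0 if not union else len(t & p) / len(union))
    return sum(scores) / len(scores)


# Class sizes and recalls as stated in the abstract.
summary = binary_summary(n_pos=36, n_neg=39, recall_pos=0.14, recall_neg=0.87)
print(summary)

# Hypothetical toy labels, NOT study data: three positive cases.
toy_true = [{"intraparenchymal"}, {"subdural", "subarachnoid"}, {"epidural"}]
toy_pred = [{"intraparenchymal"}, {"subdural"}, {"intraparenchymal"}]
print(exact_match_accuracy(toy_true, toy_pred))
print(hamming_score(toy_true, toy_pred))
```

Under these assumptions the reconstructed counts (5 true positives and 34 true negatives out of 75 scans) agree with the reported overall accuracy of 0.52. The toy example shows why exact match accuracy can fall well below the Hamming score: a partially correct prediction earns no credit under exact match but partial credit under the Hamming score.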