Abstract
Introduction
Magnetic resonance imaging (MRI) of the knee is the gold standard for evaluating meniscal injuries. While specialized artificial intelligence (AI) models have demonstrated high diagnostic accuracy in detecting meniscal tears, the performance of general-purpose large language models (LLMs) with multimodal vision capabilities remains underexplored. Previous iterations, such as generative pre-trained transformer 4 (GPT-4) with vision (OpenAI, San Francisco, CA, USA), have shown limited success in direct musculoskeletal image interpretation. This study evaluates the diagnostic performance of the latest-generation LLM, generative pre-trained transformer 5 (GPT-5), in detecting meniscal tears on knee MRI.

Objectives
This study aimed to evaluate the diagnostic performance of GPT-5, a general-purpose multimodal LLM, in detecting meniscal tears on knee MRI in a zero-shot setting, using a publicly available dataset.

Materials and methods
One hundred knee MRI examinations (50 with meniscal tears, 50 without) were randomly selected from the MRNet validation dataset, with the dataset's annotations serving as ground truth. Sagittal T2-weighted and coronal T1-weighted series were reviewed for completeness and image quality and then converted to Portable Network Graphics (PNG) slices. GPT-5 (gpt-5-2025-08-07) analyzed each case in a zero-shot fashion using a fixed prompt requesting a binary ("yes/no") determination of meniscal tear presence, without any clinical context. Model predictions were compared with ground truth; accuracy, precision, recall, specificity, and F1-score were calculated with 95% confidence intervals (CIs).

Results
GPT-5 achieved an overall accuracy of 76% (95% CI: 0.668-0.833), with a sensitivity (recall) of 84% (95% CI: 0.715-0.917) and a specificity of 68% (95% CI: 0.542-0.792). Precision for detecting tears was 72.4%, and the F1-score was 0.778.
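The reported figures can be cross-checked from the confusion matrix they imply (n = 100 with a 50/50 split; sensitivity 84% gives 42 true positives, specificity 68% gives 34 true negatives). A minimal sketch, assuming the 95% CIs are Wilson score intervals (the reported bounds are consistent with that choice, though the abstract does not name the method):

```python
import math

# Confusion matrix implied by the abstract's figures
# (50 tears, 50 controls; sensitivity 0.84, specificity 0.68).
tp, fn = 42, 8   # 0.84 * 50 tears detected; 8 missed
tn, fp = 34, 16  # 0.68 * 50 controls correct; 16 false alarms

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

accuracy = (tp + tn) / (tp + tn + fp + fn)                    # 0.76
sensitivity = tp / (tp + fn)                                  # 0.84 (recall)
specificity = tn / (tn + fp)                                  # 0.68
precision = tp / (tp + fp)                                    # ~0.724
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.778

print(f"accuracy    {accuracy:.3f}  CI {wilson_ci(tp + tn, 100)}")
print(f"sensitivity {sensitivity:.3f}  CI {wilson_ci(tp, tp + fn)}")
print(f"specificity {specificity:.3f}  CI {wilson_ci(tn, tn + fp)}")
print(f"precision   {precision:.3f}  F1 {f1:.3f}")
```

Running this reproduces the abstract's point estimates and all three reported confidence intervals to three decimal places.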
Conclusion
In this pilot study, GPT-5 demonstrated potential for zero-shot interpretation of knee MRI for meniscal tear detection, outperforming previous multimodal LLMs. However, the results should be interpreted with caution given the study's limitations, and clinical utility is currently constrained by a high false-positive rate and a lack of visual explainability. Nevertheless, this pilot evaluation provides an initial proof of concept; with larger datasets, rigorous validation, improved calibration, and enhanced explainability, future multimodal LLMs may evolve into supportive, human-in-the-loop tools in musculoskeletal radiology.