Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis



Abstract

INTRODUCTION: While large language models (LLMs) are increasingly used in clinical decision support, their adherence to evidence-based guidelines, particularly for musculoskeletal pain management, remains understudied.

METHODS: Four LLMs (DeepSeek-R1, ChatGPT-4o, Gemini, Grok-3) were evaluated on their responses to questions about topical NSAID use for musculoskeletal pain using three measures: assessments of response quality (accuracy, over-conclusiveness, supplementary information, and incompleteness), standardized readability metrics (Flesch Reading Ease, Flesch-Kincaid Grade Level), and the PEMAT-P tool to quantify actionability.

RESULTS: The four LLMs differed significantly in accuracy (ANOVA p = 0.045), with Gemini scoring highest (8.33 ± 0.77) and DeepSeek-R1 lowest (7.72 ± 1.52), and in over-conclusiveness (ANOVA p = 0.025), with Grok-3 scoring lowest (4.56 ± 1.42) and ChatGPT-4o highest (6.72 ± 1.49). ChatGPT-4o provided the most supplementary content (6.94 ± 2.29, p = 0.106) and DeepSeek-R1 had the highest incompleteness (5.00 ± 2.52, p = 0.261), though neither between-model difference was statistically significant. All models exceeded recommended readability thresholds (9th-10th grade level), and none met the actionability standard (scores ≤ 33.5%).

CONCLUSIONS: LLMs demonstrate potential as clinical aids. The overall performance of Gemini and Grok-3 was relatively favorable, yet their readability and actionability remain unsatisfactory. Future development should integrate clinician feedback and real-world validation to ensure safety. Human oversight and targeted AI training are critical for safe implementation.

Key Points
• The study reveals significant differences in accuracy among LLMs, highlighting inconsistencies in clinical decision support.
• While all models generated readable text, the complexity remained high, potentially limiting accessibility for some patients.
• Glucocorticoid use for patients in remission was more strongly associated with impaired physical function in patients aged 75-84 than in patients aged 55-74 years.
• Over-conclusiveness and incomplete adherence to evidence-based guidelines underscore the necessity for human oversight and targeted AI training in clinical applications.
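
For readers unfamiliar with the two readability metrics named in the methods, the following is a minimal sketch of how Flesch Reading Ease and Flesch-Kincaid Grade Level are computed from sentence, word, and syllable counts using the standard published formulas. The abstract does not say which software the authors used; the function names `count_syllables` and `readability` are hypothetical, and the vowel-group syllable counter is a rough assumption, so scores will differ somewhat from dedicated tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard formulas: higher Reading Ease means easier text;
    # Grade Level approximates the US school grade needed to read it.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


if __name__ == "__main__":
    sample = "Apply the gel to the sore joint. Wash your hands after use."
    ease, grade = readability(sample)
    print(f"Reading Ease: {ease:.1f}, Grade Level: {grade:.1f}")
```

Both scores depend only on average sentence length and average syllables per word, which is why long sentences and polysyllabic clinical terms push model responses toward a higher grade level.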
