ChatGPT and low back pain - Evaluating AI-driven patient education in the context of interventional pain medicine


Abstract

BACKGROUND: ChatGPT and other large language models (LLMs) are not only being integrated into healthcare more readily but are also being used more frequently by patients to answer health-related questions. Given this increased utilization, it is essential to evaluate the consistency and reliability of artificial intelligence (AI) responses. Low back pain (LBP) remains one of the most common chief complaints in primary care and interventional pain management offices.

OBJECTIVE: This study assesses the readability, accuracy, and overall utility of ChatGPT's responses to patients' questions concerning low back pain. Our aim is to use clinician feedback to analyze ChatGPT's answers to these common low back pain-related questions, since AI will undoubtedly play a role in triaging patients before they see a physician.

METHODS: To assess AI responses, we generated a standardized list of 25 questions concerning low back pain, split into five categories: diagnosis, seeking a medical professional, treatment, self-treatment, and physical therapy. We explored the influence of prompt wording by phrasing each question at reading levels ranging from 4th grade to college/reference level. One board-certified interventional pain specialist, one interventional pain fellow, and one emergency medicine resident reviewed ChatGPT's generated answers for accuracy and clinical utility. Readability and comprehensibility were evaluated using the Flesch-Kincaid Grade Level scale. Statistical analysis was performed to assess differences in readability scores, word count, and response complexity.

RESULTS: How a question is phrased influences accuracy in statistically significant ways. Over-simplifying queries (e.g., to a 4th-grade level) degrades ChatGPT's ability to return clinically complete responses, whereas reference-level and neutral queries preserve accuracy without additional prompt engineering. Regardless of how a question is phrased, ChatGPT's default register trends toward technical language, and readability remains substantially misaligned with health literacy standards. Verbosity correlates with prompt type but not necessarily with accuracy: word count is an unreliable proxy for informational completeness or clinical correctness in AI outputs. Most errors stem from omission, not commission; importantly, ChatGPT does not frequently generate false claims.

CONCLUSION: This analysis complicates the assumption that "simpler is better" when prompting LLMs for clinical education. Whereas earlier work in structured conditions suggested that plain-language prompts improved accuracy, our findings indicate that a moderate reading level, not maximal simplicity, yields the most reliable outputs in complex domains such as pain. This study further supports the integration of LLMs into clinical workflows, possibly through electronic health record (EHR) software.
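The Flesch-Kincaid Grade Level used in the methods above is a closed-form function of sentence length and word length. The following is a minimal sketch of how such a score can be computed; the syllable-counting heuristic (vowel-group counting) is a simplification introduced here for illustration, not the syllabifier used in the study, and production readability tools typically rely on dictionary-based counts.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllable count by counting vowel groups.

    This is a crude heuristic assumed for illustration; real readability
    tooling would use a pronunciation dictionary for accuracy.
    """
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    # Drop a commonly silent trailing "e" (e.g. "scale"), but keep "-le"/"-ee".
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Because the formula weights syllables per word heavily (11.8) relative to sentence length (0.39), polysyllabic clinical vocabulary drives grade level up quickly, which is consistent with the finding that ChatGPT's technical register stays misaligned with health literacy targets.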
