Can large language models be trusted? Reliability and readability of responses to perinatal depression FAQs


Abstract

OBJECTIVE: Large language models (LLMs), a core technology of generative artificial intelligence (AI), are increasingly used in health education and promotion. Although they may expand access to medical information, concerns remain regarding the reliability and readability of AI-generated content for the public. This study evaluated the reliability and readability of answers generated by five LLMs to common questions about perinatal depression. The primary aims were to determine (1) the reliability of LLM responses to frequently asked questions about perinatal depression and (2) whether the readability of the generated content aligns with public health literacy levels.

METHODS: Twenty-seven frequently asked questions were derived from Google Trends and patient-facing resources from the American College of Obstetricians and Gynecologists (ACOG). Each question was submitted to ChatGPT-5, Gemini-2.5, Microsoft Copilot, Grok4, and DeepSeek. Two obstetricians independently rated the responses using five validated instruments (DISCERN, EQIP, JAMA, GQS, and HONCODE), and inter-rater agreement was quantified using the intraclass correlation coefficient (ICC). Readability was assessed using six indices: ARI, GFI, CLI, OLWF, LWGLF, and FRF. Differences among models were analyzed using the Friedman test.

RESULTS: Inter-rater agreement was high across the 27 perinatal depression questions, with ICC values ranging from 0.729 to 0.847. Significant between-model differences emerged for DISCERN, EQIP, and HONCODE (all p < 0.001), whereas no overall differences were found for JAMA and GQS. Grok4 scored highest on DISCERN (60.33 ± 5.48), DeepSeek highest on EQIP (53.04 ± 4.91), and Copilot highest on HONCODE (9.26 ± 1.85), highlighting distinct strengths across quality constructs. Readability posed a common limitation: all models exceeded the NIH-recommended sixth-grade level on grade-based indices (for example, ARI ranged from 13.49 ± 2.92 to 15.81 ± 3.25). Similarly, OLWF scores fell well below the sixth-grade benchmark of 94 (ranging from 61.44 ± 6.80 to 72.96 ± 10.39, where higher scores denote easier reading). Most models produced empathetic and informative content but fell short of fully addressing clinical safety standards.

CONCLUSION: Most LLMs demonstrated moderate to high reliability when responding to perinatal depression questions, supporting their potential as supplementary sources of health information. However, readability levels above recommended benchmarks suggest that current outputs may remain challenging for individuals with lower health literacy. While LLMs improve information accessibility, further improvements in readability, source attribution, and ethical transparency are needed to maximize public benefit and support equitable health communication. Future work should focus on defining and standardizing safety behaviors in high-risk mental health contexts to enable reliable clinical deployment.
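To illustrate the kind of grade-based readability scoring the study relies on, the Automated Readability Index (ARI) maps character, word, and sentence counts to an approximate US grade level. The sketch below is a minimal illustration only; its regex-based tokenization is a simplifying assumption and not the study's actual scoring pipeline.

```python
import re

def ari(text: str) -> float:
    """Automated Readability Index (approximate US grade level).

    Formula: 4.71*(characters/words) + 0.5*(words/sentences) - 21.43.
    Tokenization here is a crude heuristic for illustration.
    """
    # Split sentences on terminal punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Count words as runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Character count covers word characters only (no spaces/punctuation).
    chars = sum(len(w) for w in words)
    return (4.71 * (chars / len(words))
            + 0.5 * (len(words) / len(sentences))
            - 21.43)
```

On this scale the reported model scores (roughly 13 to 16) correspond to college-level text, well above the sixth-grade target; short, plain sentences score far lower than dense clinical prose.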
