Safety Audit of a Large Language Model for Lay Self-Triage Using Japanese Symptom Vignettes: Persistent Red-Flag Under-Triage Despite Improved Reproducibility Under Near-Deterministic Decoding



Abstract

INTRODUCTION: Large language models are increasingly discussed as tools for patient-facing symptom assessment, but safe self-triage depends on the concrete next action recommended to the user, not on generic urgency language alone. We audited whether a commercially deployed general-purpose model could map Japanese symptom vignettes to clinically acceptable self-triage actions across low-risk and urgent scenarios.

METHODS: From a bank of 60 synthetic lay personas and 30 Japanese symptom vignettes, we constructed three predefined slices of 24 persona-vignette pairs each: a mild slice (sev1_24), an intermediate non-red-flag slice (nonredflag24), and an urgent red-flag slice (redflag24). Outputs were restricted to a 10-category action codebook ranging from watchful waiting (A0) to ambulance activation (A9). The audited model was gpt-4o-mini, accessed through the OpenAI Responses API. Two prompt styles, narrative (P1) and structured (P2), were compared under near-deterministic decoding (temperature 0.0, top_p 1.0) and stochastic decoding (temperature 1.0, top_p 0.95); a response-schema condition was also evaluated. Acceptable action ranges were defined by an explicit operational reference policy informed by the Japanese emergency-care and triage literature.

RESULTS: A total of 3,342 valid outputs were analyzed. Hard under-triage and hard over-triage were absent in the mild and non-red-flag slices across all tested conditions. In contrast, the red-flag slice showed near-universal hard under-triage of the primary action: 100.0% for both prompts under near-deterministic decoding, 100.0% for P1 and 99.7% for P2 under stochastic prompt-only decoding, and 100.0% for both prompts under stochastic response-schema decoding. Near-deterministic decoding improved run-to-run modal agreement, but this reproducibility gain did not improve urgent-case safety. Response-schema enforcement often improved agreement relative to stochastic prompt-only execution, yet in some urgent conditions it increased the proportion of escalation recommendations that still fell below the hard minimum.

CONCLUSION: In this safety audit, gpt-4o-mini was conservative in low-risk cases but unsafe in urgent cases, because urgency was expressed mainly through timing and escalation fields rather than through an appropriately urgent primary action. Better reproducibility under near-deterministic decoding did not translate into safer self-triage. Medical audits of LLM self-triage systems should report primary-action safety, auxiliary urgency cues, decoding configuration, and schema mode separately.
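The two decoding conditions reported in the abstract can be made concrete as request parameters for the OpenAI Python SDK's Responses API. This is a hedged sketch, not the study's actual harness: the model name and the temperature/top_p values come from the abstract, while the `build_request` helper and the placeholder vignette text are assumptions for illustration.

```python
# Decoding conditions as stated in the abstract.
NEAR_DETERMINISTIC = {"temperature": 0.0, "top_p": 1.0}
STOCHASTIC = {"temperature": 1.0, "top_p": 0.95}

def build_request(vignette: str, decoding: dict) -> dict:
    """Assemble keyword arguments for a Responses API call
    (hypothetical helper; the prompt text is a placeholder)."""
    return {
        "model": "gpt-4o-mini",
        "input": vignette,
        **decoding,
    }

# Usage (requires network access and an API key; shown for illustration only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.responses.create(**build_request("症状の説明…", NEAR_DETERMINISTIC))
```

Keeping the decoding configuration in a named dictionary makes it straightforward to report it alongside results, which is one of the abstract's recommendations for audit transparency.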
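The safety scoring the abstract describes, comparing a model's primary action against an acceptable range from the reference policy, can be sketched as follows. Only A0 (watchful waiting) and A9 (ambulance activation) are named in the abstract; the intermediate categories, the `classify` helper, and the example acceptable ranges are illustrative assumptions, not the paper's actual codebook or policy.

```python
# Ordered 10-category action codebook, least to most urgent
# (only the endpoints A0 and A9 are defined in the abstract).
ACTIONS = [f"A{i}" for i in range(10)]

def rank(action: str) -> int:
    """Position of an action in the urgency ordering."""
    return ACTIONS.index(action)

def classify(primary_action: str, acceptable: tuple) -> str:
    """Score a primary action against an acceptable range
    (hard_min, hard_max) from an operational reference policy."""
    lo, hi = rank(acceptable[0]), rank(acceptable[1])
    r = rank(primary_action)
    if r < lo:
        return "hard under-triage"
    if r > hi:
        return "hard over-triage"
    return "acceptable"

# Illustrative red-flag case whose hard minimum is assumed to be A7:
print(classify("A3", ("A7", "A9")))  # hard under-triage
print(classify("A8", ("A7", "A9")))  # acceptable
```

Scoring the primary action alone, as here, is what distinguishes this audit design from one that credits urgency expressed only in auxiliary timing or escalation fields.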
