Abstract
INTRODUCTION: Large language models are increasingly discussed as tools for patient-facing symptom assessment, but safe self-triage depends on the concrete next action recommended to the user rather than on generic urgency language alone. We audited whether a commercially deployed general-purpose model could map Japanese symptom vignettes to clinically acceptable self-triage actions across low-risk and urgent scenarios.

METHODS: From a bank of 60 synthetic lay personas and 30 Japanese symptom vignettes, we constructed three predefined slices of 24 persona-vignette pairs each: a mild slice (sev1_24), an intermediate non-red-flag slice (nonredflag24), and an urgent red-flag slice (redflag24). Outputs were restricted to a 10-category action codebook ranging from watchful waiting (A0) to ambulance activation (A9). The audited model was gpt-4o-mini, accessed through the OpenAI Responses API. A narrative prompt (P1) and a structured prompt (P2) were compared under near-deterministic decoding (temperature 0.0, top_p 1.0) and stochastic decoding (temperature 1.0, top_p 0.95); a response-schema condition was also evaluated. Acceptable action ranges were defined by an explicit operational reference policy informed by the Japanese emergency-care and triage literature.

RESULTS: A total of 3,342 valid outputs were analyzed. Hard under-triage and hard over-triage were absent from the mild and non-red-flag slices across all tested conditions. In contrast, the red-flag slice showed near-universal hard under-triage of the primary action: 100.0% for both prompts under near-deterministic decoding, 100.0% for P1 and 99.7% for P2 under stochastic prompt-only decoding, and 100.0% for both prompts under stochastic response-schema decoding. Near-deterministic decoding improved run-to-run modal agreement, but this reproducibility gain did not improve urgent-case safety.
Response-schema enforcement often improved agreement relative to stochastic prompt-only execution, yet in some urgent conditions it increased the share of escalation recommendations that still fell below the hard minimum action.

CONCLUSION: In this safety audit, gpt-4o-mini was conservative in low-risk cases but unsafe in urgent cases, because urgency was expressed mainly through timing and escalation fields rather than through an appropriately urgent primary action. Better reproducibility under near-deterministic decoding did not translate into safer self-triage. Medical audits of LLM self-triage systems should report primary-action safety, auxiliary urgency cues, decoding configuration, and schema mode separately.