Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment


Abstract

INTRODUCTION: The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods of estimating uncertainty in ways that are useful to physician-users.

OBJECTIVE: To evaluate the ability of uncertainty proxies to quantify LLM confidence on diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.

METHODS: We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. The uncertainty proxies were evaluated against three datasets of open-ended patient scenarios.

RESULTS: SC outperformed the TLP and CE methods on discrimination. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79) but was poorly calibrated. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) consistently overestimated model confidence.

DISCUSSION AND CONCLUSIONS: Of the proxies evaluated, SC is the most effective method for estimating LLM uncertainty. SC by sentence embedding can estimate uncertainty effectively if the user has a set of reference cases with which to re-calibrate the results, while SC by GPT annotation is more effective if the user lacks reference cases and requires accurate raw calibration. Our results confirm that LLMs are consistently over-confident when verbalizing their confidence (CE).
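The abstract does not include implementation details, but the sample-consistency-by-embedding idea can be sketched as follows: sample the model several times on the same case, embed each response, and treat the mean pairwise similarity of the responses as a confidence score. This is a minimal sketch under stated assumptions: the embedding model ("all-MiniLM-L6-v2"), the number of samples, and mean pairwise cosine similarity as the aggregation are all illustrative choices, not the paper's exact procedure.

```python
# Illustrative sketch of a sample-consistency (SC) uncertainty proxy via
# sentence embeddings. NOT the paper's exact method: the embedding model
# ("all-MiniLM-L6-v2") and the aggregation rule are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def sc_embedding_confidence(responses: list[str]) -> float:
    """Estimate confidence as the mean pairwise cosine similarity among
    multiple sampled responses to the same clinical prompt."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Unit-normalized embeddings make dot products equal cosine similarity.
    emb = model.encode(responses, normalize_embeddings=True)
    sim = emb @ emb.T                     # pairwise cosine similarity matrix
    iu = np.triu_indices(len(responses), k=1)  # upper triangle, no diagonal
    return float(sim[iu].mean())          # high similarity -> high consistency

# Usage: sample the LLM several times at nonzero temperature, then score.
samples = [
    "Most likely diagnosis: community-acquired pneumonia.",
    "The presentation is consistent with community-acquired pneumonia.",
    "Likely bacterial pneumonia; obtain a chest X-ray and start empiric antibiotics.",
]
print(f"SC confidence proxy: {sc_embedding_confidence(samples):.2f}")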
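```

Likewise, the two evaluation properties named in the objective can be scored generically: discrimination as the ROC AUC of the confidence score against answer correctness, and calibration as the gap between stated confidence and observed accuracy. The sketch below uses a 10-bin expected calibration error (ECE) as the calibration summary; that is a common choice and an assumption here, not necessarily the paper's metric.

```python
# Generic sketch of scoring an uncertainty proxy for discrimination and
# calibration. The 10-bin expected calibration error (ECE) is an assumed
# calibration summary, not confirmed as the paper's metric.
import numpy as np
from sklearn.metrics import roc_auc_score

def discrimination_auc(confidence, correct):
    """ROC AUC: does higher confidence rank correct answers above wrong ones?"""
    return roc_auc_score(correct, confidence)

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    confidence, correct = np.asarray(confidence), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap      # weight by fraction of cases in bin
    return ece

# Usage with toy data: proxy confidence scores and binary correctness labels.
conf = [0.9, 0.8, 0.95, 0.4, 0.7, 0.3]
hit  = [1,   1,   1,    0,   1,   0]
print(f"ROC AUC: {discrimination_auc(conf, hit):.2f}")
print(f"ECE:     {expected_calibration_error(conf, hit):.2f}")
```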
