Prompt architecture induces methodological artifacts in large language models


Abstract

We examine how the seemingly arbitrary way a prompt is posed, which we term "prompt architecture," influences responses provided by large language models (LLMs). Five large-scale, full-factorial experiments performing standard (zero-shot) similarity evaluation tasks using GPT-3, GPT-4, and Llama 3.1 document how several features of prompt architecture (order, label, framing, and justification) interact to produce methodological artifacts, a form of statistical bias. We find robust evidence that these four elements unduly affect responses across all models, and although we observe differences between GPT-3 and GPT-4, the changes are not necessarily for the better. Specifically, LLMs demonstrate both response-order bias and label bias, and framing and justification moderate these biases. We then test different strategies intended to reduce methodological artifacts. Specifying to the LLM that the order and labels of items have been randomized does not alleviate either response-order or label bias, and the use of uncommon labels reduces (but does not eliminate) label bias but exacerbates response-order bias in GPT-4 (and does not reduce either bias in Llama 3.1). By contrast, aggregating across prompts generated using a full factorial design eliminates response-order and label bias. Overall, these findings highlight the inherent fallibility of any individual prompt when using LLMs, as any prompt contains characteristics that may subtly interact with a multitude of hidden associations embedded in rich language data.
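The full-factorial design described above — crossing order, label, framing, and justification, then aggregating responses across all cells — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the specific factor levels, wordings, and the `build_prompts` helper are assumptions chosen only to show the structure of the design.

```python
from itertools import product

# Hypothetical levels for the four prompt-architecture factors named in the
# abstract (order, label, framing, justification); the wordings are illustrative.
ORDERS = ["AB", "BA"]                 # which item is presented first
LABELS = [("1", "2"), ("A", "B")]     # label scheme attached to the two items
FRAMINGS = ["similar", "different"]   # ask about similarity vs. difference
JUSTIFY = [False, True]               # whether to request a justification

def build_prompts(item_a, item_b):
    """Generate one prompt per cell of the 2x2x2x2 full factorial design."""
    prompts = []
    for order, (la, lb), framing, justify in product(ORDERS, LABELS, FRAMINGS, JUSTIFY):
        first, second = (item_a, item_b) if order == "AB" else (item_b, item_a)
        text = (f"({la}) {first}\n({lb}) {second}\n"
                f"How {framing} are these two items on a 1-10 scale?")
        if justify:
            text += " Briefly justify your answer."
        prompts.append(text)
    return prompts

prompts = build_prompts("apple", "orange")
# 2 orders x 2 label schemes x 2 framings x 2 justification settings = 16 cells
```

Querying the model once per cell and averaging the numeric ratings is what "aggregating across prompts" means here: each bias-inducing feature appears equally often at every level, so order and label effects cancel in the mean.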
