Abstract
Recent studies [S. Palminteri, G. Lefebvre, E. J. Kilford, S. J. Blakemore, PLoS Comput. Biol. 13, e1005684 (2017); G. Lefebvre, M. Lebreton, F. Meyniel, S. Bourgeois-Gironde, S. Palminteri, Nat. Hum. Behav. 1, 0067 (2017)], among others, claim that human behavior in a two-armed Bernoulli bandit task is described by positivity and confirmation biases, thereby implying [S. Palminteri, M. Lebreton, Trends Cogn. Sci. 26, 607-621 (2022)] that "Humans do not integrate new information objectively." The claim rests on fitting human data with a Q-learning model that has different (and temporally constant) learning rates for positive and negative reward prediction errors. However, we find that even when the agent updates its beliefs via Bayesian inference, which is arguably objective, fitting the above model still yields both biases. This finding is surprising because Bayesian inference, when written as an effective Q-learning algorithm, has unbiased (and temporally decreasing) learning rates. In this article, we explain this observation by studying the stochastic dynamics of these learning systems using master equations. In particular, we show that confirmation bias and unbiased but temporally decreasing learning rates share the same behavioral signature: a lower action-switching probability than under temporally constant and unbiased learning rates. Our analysis underscores the need to model temporally varying learning rates in subjects before any claim is made that their choices are biased.
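The two update rules contrasted above can be illustrated with a minimal sketch; it is not the code used in this work. It assumes a Beta(a, b) prior over an arm's reward probability, for which the posterior mean obeys a Q-learning-style update with learning rate 1/(a + b + n + 1) after n pulls, identical for positive and negative prediction errors. The function names and the reward sequence are hypothetical.

```python
def q_update(q, reward, alpha_pos, alpha_neg):
    """One Q-learning step with a learning rate chosen by the sign of the
    reward prediction error (alpha_pos if positive, alpha_neg otherwise)."""
    delta = reward - q
    alpha = alpha_pos if delta > 0 else alpha_neg
    return q + alpha * delta


def bayes_learning_rate(n_obs, a=1.0, b=1.0):
    """Effective learning rate of the Beta-Bernoulli posterior mean after
    n_obs pulls of an arm. Assumption: Beta(a, b) prior."""
    return 1.0 / (a + b + n_obs + 1.0)


def bayes_update(q, reward, n_obs, a=1.0, b=1.0):
    """Posterior-mean update written as Q-learning with a symmetric,
    temporally decreasing learning rate."""
    rate = bayes_learning_rate(n_obs, a, b)
    return q_update(q, reward, rate, rate)


# Hypothetical reward sequence for a single arm.
rewards = [1, 0, 1, 1, 0]
a, b = 1.0, 1.0
q = a / (a + b)  # prior mean
for n, r in enumerate(rewards):
    q = bayes_update(q, r, n, a, b)

# Closed-form posterior mean, for comparison with the recursive estimate.
posterior_mean = (a + sum(rewards)) / (a + b + len(rewards))
```

Here `q` and `posterior_mean` agree up to floating-point error, which shows that the Bayesian learner is a Q-learner whose learning rate is unbiased but shrinks with the number of observations, whereas the fitted model holds `alpha_pos` and `alpha_neg` fixed over time.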