Abstract
Spatial learning emerges not only from static environmental cues but also from the social and semantic context embedded in our surroundings. This study investigates how human agents influence visual exploration and spatial knowledge acquisition in a controlled Virtual Reality (VR) environment, focusing on the role of contextual congruency. Participants freely explored a 1 km² virtual city while their eye movements were recorded. Agents were visually identical across conditions but placed in locations that were either congruent, incongruent, or neutral with respect to the surrounding environment. Using Bayesian hierarchical modeling, we found that incongruent agents elicited longer fixations and higher gaze transition entropy (GTE), a measure of scanning variability. Crucially, GTE emerged as the strongest predictor of spatial recall accuracy. A counterfactual mediation analysis indicated a small but reliable pathway via GTE and, for incongruent agents, a larger direct component not captured by GTE. These findings suggest that human-contextual incongruence promotes more flexible and distributed visual exploration, thereby enhancing spatial learning. By showing that human agents shape not only where we look but how we explore and encode space, this study contributes to a growing understanding of how social meaning guides attention and supports navigation.
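For readers unfamiliar with the measure, GTE is typically computed as the conditional Shannon entropy of the fixation transition matrix between areas of interest (AOIs): more uniform transition probabilities yield higher entropy, indicating more distributed scanning. The sketch below is an illustrative implementation under that standard definition, not the paper's exact pipeline; the function name and the AOI-label input format are assumptions for the example.

```python
import numpy as np

def gaze_transition_entropy(fixation_aois):
    """Conditional Shannon entropy (bits) of AOI-to-AOI fixation transitions.

    `fixation_aois` is a hypothetical input: a list of AOI labels,
    one per fixation, in temporal order.
    """
    aois = sorted(set(fixation_aois))
    idx = {a: i for i, a in enumerate(aois)}
    n = len(aois)

    # Count first-order transitions between consecutive fixations.
    counts = np.zeros((n, n))
    for a, b in zip(fixation_aois, fixation_aois[1:]):
        counts[idx[a], idx[b]] += 1

    row_totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Row-normalize to transition probabilities p_ij;
        # rows with no outgoing transitions stay zero.
        p = np.where(row_totals > 0, counts / row_totals, 0.0)
        logp = np.where(p > 0, np.log2(p), 0.0)  # 0 * log 0 := 0

    # Weight each source AOI by its relative transition frequency.
    pi = row_totals.ravel() / row_totals.sum()

    # H = -sum_i pi_i sum_j p_ij log2 p_ij
    return float(-(pi * (p * logp).sum(axis=1)).sum())
```

A strictly alternating scanpath such as A, B, A, B has deterministic transitions and GTE of 0 bits, whereas a scanpath that switches unpredictably among many AOIs approaches the maximum, log2 of the number of AOIs.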