Abstract
Listeners effortlessly extract multi-dimensional auditory objects, such as a localized talker, from complex acoustic scenes. However, the neural mechanisms that enable simultaneous encoding and linking of distinct sound features, such as a talker's voice and location, are not fully understood. Using invasive intracranial recordings in seven neurosurgical patients (four male, three female), we investigated how the human auditory cortex processes and integrates these features during naturalistic multi-talker scenes and how attentional mechanisms modulate such feature integration. We found that cortical sites exhibit a continuum of feature sensitivity, ranging from single-feature-sensitive sites (responsive primarily to voice spectral features or to location features) to dual-feature-sensitive sites (responsive to both features). At the population level, neural response patterns from both single- and dual-feature-sensitive sites jointly encoded the attended talker's voice and location. Notably, single-feature-sensitive sites encoded their primary feature with greater precision but also represented coarse information about the secondary feature. Sites selectively tracking a single, attended speech stream concurrently encoded both voice and location features, demonstrating a link between selective attention and feature integration. Additionally, attention selectively enhanced temporal coherence between voice- and location-sensitive sites, suggesting that temporal synchronization serves as a mechanism for linking these features. Our findings highlight two complementary neural mechanisms, joint population coding and temporal coherence, that enable the integration of voice and location features in the auditory cortex. These results provide new insights into the distributed, multi-dimensional nature of auditory object formation during active listening in complex environments.