Abstract
Recognizing speech in noise, such as in a busy restaurant, is an essential cognitive skill whose difficulty varies across environments and noise levels. Although there is growing evidence that the auditory system relies on statistical representations for perceiving and coding natural sounds, it is less clear how statistical cues and neural representations contribute to segregating speech in natural auditory scenes. Here we demonstrate that male and female human listeners rely on mid-level statistics to segregate and recognize speech in environmental noise. Using natural backgrounds and variants with perturbed spectrotemporal statistics, we show that speech recognition accuracy at a fixed noise level varies extensively across natural backgrounds (0-100%). Furthermore, for each background, the unique interference created by summary statistics can mask or unmask speech, thus hindering or improving speech recognition. To identify the neural coding strategy and statistical cues that influence accuracy, we developed generalized perceptual regression, a framework that links summary statistics from a neural model to word recognition accuracy. Whereas summary statistics from a peripheral cochlear model account for only 60% of perceptual variance, summary statistics from a mid-level auditory midbrain model accurately predict single-trial sensory judgments, accounting for >90% of the perceptual variance. Moreover, perceptual weights from the regression framework identify which statistics and tuned neural filters are influential and how they impact recognition. Thus, perception of speech in natural backgrounds relies on a mid-level auditory representation involving interference of multiple summary statistics that impact recognition beneficially or detrimentally across natural background sounds.