Abstract
We perceive continuous speech as a series of discrete words, despite the lack of clear acoustic boundaries. The superior temporal gyrus (STG) encodes phonetic elements like consonants and vowels, but it is unclear how whole words are encoded. Using high-density cortical recordings and spoken narratives, we investigated how the human brain represents auditory word forms. STG activity exhibits a distinctive reset at word boundaries, marked by a sharp drop in cortical activity. Between resets, the STG encodes acoustic-phonetic, prosodic, and lexical features, supporting the integration of phonological features into coherent word forms. This process tracks the relative elapsed time within words, independent of absolute duration, providing a flexible encoding of words of variable length. Similar dynamics were found in the deeper layers of a self-supervised artificial speech network. Finally, a bistable word perception task revealed trial-by-trial STG responses to perceived word boundaries. Together, these findings support a new dynamical model of auditory word forms.