Abstract
Speech perception is fundamental to human communication, but its neural basis is not well understood. Furthermore, while modern neural networks (NNs) can accurately recognize speech, whether they effectively model human speech processing remains unclear. Here, we introduce Wordsworth, a dataset designed to facilitate comparisons of speech representations between artificial and biological NNs. We synthesized 1,200 tokens for each of 84 monosyllabic words while controlling acoustic parameters such as amplitude, duration, and background noise, thereby encouraging reliance on phonetic features known to be important for speech perception. Human listening experiments showed that Wordsworth tokens are intelligible. Additional experiments using convolutional NNs showed (i) that Wordsworth tokens were recognizable and (ii) that error patterns could be at least partially explained by acoustic phonetics. The control with which tokens were created permits end users to manipulate them in whatever ways suit their purposes. Finally, we also created a subset of tokens specifically for human neuroscience experiments, with precise, known distributions of amplitude and of onset and offset times.