Abstract
Rapid advancements in artificial intelligence (AI) have enabled text-to-speech (TTS) systems to produce voices increasingly indistinguishable from human voices, posing significant societal risks, particularly through potential misuse in fraud and deception. To address this concern, this study combined behavioral assessments with neural measures recorded via electroencephalography (EEG) to examine whether short-term perceptual training enhances people's ability to distinguish AI-generated from human speech. Thirty participants (of either sex) listened to sentences produced by human speakers and to corresponding AI-generated clones, judging each sentence as either human or AI-generated before and after a brief (∼12 min) training session during which voices were explicitly labeled as "human" or "AI." Behaviorally, participants showed consistently poor discrimination both before and after training, with only minimal improvement. However, neural analyses revealed substantial training-induced changes. Specifically, temporal response function (TRF) analysis identified significant neural differentiation between the two speech types at early (∼55 ms, ∼210 ms) and later (∼455 ms) stages of auditory processing following training. Additional EEG analyses, including spectral power and decoding, were conducted to further investigate training effects but revealed limited differentiation. These findings highlight a dissociation between behavioral and neural sensitivity: while listeners struggle to behaviorally discriminate sophisticated AI-generated voices, their auditory systems rapidly adapt to subtle acoustic differences after short-term exposure. Understanding this neural-behavioral dissociation is crucial for developing effective perceptual training protocols and for informing policies that mitigate societal threats posed by increasingly realistic synthetic voices.
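For readers unfamiliar with the method, a TRF is a forward (encoding) model that maps a speech feature, typically the acoustic envelope, onto the EEG signal across a range of time lags; training-related differentiation then appears as diverging model weights for human versus AI speech at specific latencies (e.g., the ∼55, ∼210, and ∼455 ms stages reported above). The following is a minimal sketch of such an estimator using ridge regression; the function names, sampling rate, lag window, and regularization strength are illustrative assumptions, not the authors' actual analysis pipeline.

```python
# Minimal sketch of a forward (encoding) temporal response function (TRF)
# estimated with ridge regression. All parameters below (sampling rate,
# lag window, alpha) are illustrative assumptions.
import numpy as np

def lagged_design_matrix(stimulus, lags):
    """Stack time-shifted copies of the stimulus (e.g., a speech envelope)
    into a design matrix of shape (n_times, n_lags)."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]
        else:
            X[:lag, j] = stimulus[-lag:]
    return X

def fit_trf(stimulus, eeg, sfreq, tmin=-0.1, tmax=0.5, alpha=1e3):
    """Estimate TRF weights mapping the stimulus to each EEG channel.

    stimulus : (n_times,) speech envelope
    eeg      : (n_times, n_channels) EEG recording
    Returns weights of shape (n_lags, n_channels) and lag times in seconds.
    """
    lags = np.arange(int(tmin * sfreq), int(tmax * sfreq) + 1)
    X = lagged_design_matrix(stimulus, lags)
    # Ridge solution: W = (X'X + alpha*I)^-1 X'Y
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(XtX, X.T @ eeg)
    return W, lags / sfreq

# Toy usage with synthetic data (128 Hz, 60 s, 32 channels); real analyses
# would fit separate TRFs to human and AI speech and compare the weights.
rng = np.random.default_rng(0)
sfreq, n_times, n_channels = 128, 128 * 60, 32
envelope = rng.standard_normal(n_times)
eeg = rng.standard_normal((n_times, n_channels))
weights, lag_times = fit_trf(envelope, eeg, sfreq)
print(weights.shape)  # (n_lags, n_channels)
```

In practice, one would compare the resulting weight time courses (one per condition) across the lag axis, which is where latency-specific condition differences such as those reported in this study would be tested.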