Abstract
This study examines whether multimodal instruction that pairs text with artificial intelligence (AI)-generated images improves the learning of English nouns relative to text-only instruction. Rather than treating visual instruction as an end in itself, the approach uses generative image technology to create contextually relevant stimuli that align with cognitive principles of memory formation. A controlled experiment comparing two conditions (text-only vs. text + AI-generated images) was conducted with 40 English learners recruited from China. Participants completed immediate and delayed recall tests, definition selection, image-to-word matching (multimodal condition only), and semantic rating tasks. The multimodal group significantly outperformed the text-only group on all measures administered to both groups, with large effect sizes for memory retention and semantic understanding. However, because no condition with traditional (non-AI-generated) images was included, the design does not allow this advantage to be attributed to the AI-generated nature of the images. These findings indicate that multimodal presentation can support durable and meaningful vocabulary learning when visual materials are designed to reflect perceptual and contextual features that facilitate memory. The study highlights the pedagogical potential of combining multimodal materials with memory-informed instructional design in language education.