Abstract
The combination of artificial intelligence and education is a prominent current research trend. While observing daily teaching and learning at school, we considered the possibility of using multimodal learning, in particular audio-visual detection (AVD), to improve teaching and learning in Japanese-language classrooms. AVD can be used effectively to locate sounding objects (e.g., clapping, speaking, organizing things) from unknown sources in online or physical classrooms. This study proposes a novel deep-learning-based approach to AVD in Japanese-language classrooms that combines audio and visual information to detect sound sources at the object level. To evaluate the proposed method, we construct an AVD benchmark that provides object-level annotations of the sound sources in the videos. We demonstrate the feasibility of applying the proposed method in the classroom by designing evaluation metrics for AVD and comparing it with related work.