MVIB-Lip: Multi-View Information Bottleneck for Visual Speech Recognition via Time Series Modeling

MVIB-Lip:基于时间序列建模的多视角信息瓶颈视觉语音识别

阅读:1

Abstract

Lipreading, or visual speech recognition, is the task of interpreting utterances solely from visual cues of lip movements. While early approaches relied on Hidden Markov Models (HMMs) and handcrafted spatiotemporal descriptors, recent advances in deep learning have enabled end-to-end recognition using large-scale datasets. However, such methods often require millions of labeled or pretraining samples and struggle to generalize under low-resource or speaker-independent conditions. In this work, we revisit lipreading from a multi-view learning perspective. We introduce MVIB-Lip, a framework that integrates two complementary representations of lip movements: (i) raw landmark trajectories modeled as multivariate time series, and (ii) recurrence plot (RP) images that encode structural dynamics in a texture form. A Transformer encoder processes the temporal sequences, while a ResNet-18 extracts features from RPs; the two views are fused via a product-of-experts posterior regularized by the multi-view information bottleneck. Experiments on the OuluVS and a self-collected dataset demonstrate that MVIB-Lip consistently outperforms handcrafted baselines and improves generalization to speaker-independent recognition. Our results suggest that recurrence plots, when coupled with deep multi-view learning, offer a principled and data-efficient path forward for robust visual speech recognition.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。