Open Access. Powered by Scholars. Published by Universities.®
Graphics and Human Computer Interfaces Commons™
Numerical Analysis and Scientific Computing
Research Collection School Of Computing and Information Systems
Full-Text Articles in Graphics and Human Computer Interfaces
CATNet: Cross-Modal Fusion for Audio-Visual Speech Recognition, Xingmei Wang, Jianchen Mi, Boquan Li, Yixu Zhao, Jiaxiang Meng
Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, speech recognition performance has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, complex environments, especially noisy ones, limit the effectiveness of existing methods. To address the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between audio and visual modalities. Second, …