Practical Improvements to Automatic Visual Speech Recognition
MPhil Thesis Defence

Title: "Practical Improvements to Automatic Visual Speech Recognition"

By

Mr. Ho Long FUNG

Abstract

Visual speech recognition (a.k.a. lipreading) is the task of recognizing speech solely from the visual movement of the mouth. In this work, we propose multiple feasible and practical strategies and demonstrate significant improvements over established competitive baselines in both low-resource and resource-sufficient scenarios.

On one hand, a main challenge in practical automatic lipreading is handling the diverse facial viewpoints in the available video data. The recently proposed spatial transformer enhances the spatial invariance of convolutional neural networks to their input and, by virtue of the increased model robustness to viewpoint variations in the data, has seen varying degrees of success in a broad spectrum of areas, including face recognition, facial alignment and gesture recognition. We study the effectiveness of the learned spatial transformation in our model through quantitative and qualitative analysis with visualizations. By incorporating a spatial transformer (see the first sketch below), we attain an absolute accuracy gain of 0.92% over our data-augmented baseline on the resource-sufficient Lip Reading in the Wild (LRW) continuous word recognition task.

On the other hand, we explore the effectiveness of the convolutional neural network (CNN) and the long short-term memory (LSTM) recurrent neural network for lipreading in a low-resource scenario that has not been explored before. We propose an end-to-end deep learning model that fuses a conventional CNN and a bidirectional LSTM (BLSTM) together with maxout activation units and dropout (maxout-CNN-BLSTM; see the second sketch below). It attains a word accuracy of 87.6% on the low-resource OuluVS2 corpus, an absolute improvement of 3.1% over the then state-of-the-art autoencoder-BLSTM model. We emphasize that our lipreading system requires neither a separate feature extraction stage nor a pre-training phase with external data resources.

Date: Wednesday, 12 December 2018
Time: 2:30pm - 4:30pm
Venue: Room 2131C, Lift 19

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. Dit-Yan Yeung (Chairperson)
Dr. Raymond Wong

**** ALL are Welcome ****
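To make the first strategy concrete, here is a minimal PyTorch sketch of a spatial transformer placed in front of a lipreading CNN. The 64x64 grayscale input, the localization-network layer sizes, and the class name SpatialTransformer are illustrative assumptions, not the thesis architecture; only the overall scheme (a localization network regressing affine parameters, followed by differentiable grid sampling) follows the spatial transformer design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Warps each input frame with a learned affine transform."""
    def __init__(self):
        super().__init__()
        # Localization network: regresses the 6 affine parameters
        # from an (assumed) 1x64x64 mouth-region crop.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(), nn.Linear(32, 6),
        )
        # Start from the identity transform so early training sees
        # the unwarped input.
        self.fc[2].weight.data.zero_()
        self.fc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                      # x: (N, 1, 64, 64)
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

# The recognition CNN then consumes the warped frames, e.g.:
#   frames = torch.randn(8, 1, 64, 64)
#   warped = SpatialTransformer()(frames)

Initializing the final regression layer to the identity transform is the standard trick from the spatial transformer literature, so that the downstream network initially trains on untransformed frames.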
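Similarly, the second sketch below illustrates the maxout-CNN-BLSTM idea in PyTorch: a per-frame CNN front end, a maxout layer with dropout, and a BLSTM over the frame sequence. All layer sizes, the 32x32 input resolution, the 10-class output, and the names Maxout and MaxoutCNNBLSTM are placeholder assumptions rather than the thesis configuration.

import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: takes the max over k affine pieces."""
    def __init__(self, d_in, d_out, k=2):
        super().__init__()
        self.k, self.d_out = k, d_out
        self.lin = nn.Linear(d_in, d_out * k)

    def forward(self, x):
        y = self.lin(x).view(*x.shape[:-1], self.d_out, self.k)
        return y.max(-1).values

class MaxoutCNNBLSTM(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # Per-frame CNN front end, applied to every video frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.maxout = Maxout(32 * 8 * 8, 256, k=2)
        self.drop = nn.Dropout(0.5)
        self.blstm = nn.LSTM(256, 128, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * 128, n_classes)

    def forward(self, video):                  # video: (B, T, 1, 32, 32)
        b, t = video.shape[:2]
        f = self.cnn(video.reshape(b * t, 1, 32, 32))  # per-frame features
        f = self.drop(self.maxout(f)).view(b, t, -1)
        h, _ = self.blstm(f)                   # (B, T, 256)
        return self.out(h.mean(1))             # pool over time, classify

# Example: 4 clips of 20 frames each.
#   logits = MaxoutCNNBLSTM()(torch.randn(4, 20, 1, 32, 32))

Because the CNN features feed the BLSTM directly and the whole stack is trained jointly, a model of this shape needs no separate feature extraction stage, matching the end-to-end property claimed in the abstract.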