The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "MULTI-TASK LEARNING DEEP NEURAL NETWORKS FOR AUTOMATIC SPEECH 
RECOGNITION"

By

Mr. Dongpeng CHEN


Abstract

Multi-task learning (MTL) trains multiple related tasks together so that they 
improve one another by sharing an internal representation and exploiting extra 
information from each other. Related secondary task(s) act as regularizer(s) 
that improve the generalization of the primary task; the effect is more 
prominent when the amount of training data is relatively small. Recently, deep 
neural networks (DNNs) have been widely adopted for acoustic modeling in ASR, 
and the hidden layers of a DNN are an ideal internal representation for the 
shared knowledge. The main contribution of this thesis is three methods of 
applying MTL to DNN acoustic modeling that exploit extra information from 
related tasks, while following the guideline that the secondary tasks should 
not require additional language resources, which is a great benefit when such 
resources are limited.
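
As an illustration only (and not the implementation used in the thesis), an 
MTL-DNN can be sketched as a stack of shared hidden layers feeding two 
task-specific softmax output layers, with the cross-entropy losses of the 
primary and secondary tasks combined during training; all layer sizes, task 
names, and the auxiliary weight below are assumptions for the example.

    # Minimal MTL-DNN sketch (PyTorch-style, illustrative only): shared hidden
    # layers feed a primary and a secondary output layer; the secondary loss
    # acts as a regularizer on the shared representation.
    import torch
    import torch.nn as nn

    class MTLDNN(nn.Module):
        def __init__(self, feat_dim, hidden_dim, n_primary, n_secondary, n_hidden=4):
            super().__init__()
            layers, in_dim = [], feat_dim
            for _ in range(n_hidden):
                layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
                in_dim = hidden_dim
            self.shared = nn.Sequential(*layers)                     # shared representation
            self.primary_out = nn.Linear(hidden_dim, n_primary)      # e.g. tied triphone states
            self.secondary_out = nn.Linear(hidden_dim, n_secondary)  # e.g. a related secondary task

        def forward(self, x):
            h = self.shared(x)
            return self.primary_out(h), self.secondary_out(h)

    def mtl_loss(model, feats, primary_labels, secondary_labels, aux_weight=0.3):
        # Joint cross-entropy; aux_weight controls how strongly the secondary
        # task regularizes the shared hidden layers.
        p_logits, s_logits = model(feats)
        ce = nn.functional.cross_entropy
        return ce(p_logits, primary_labels) + aux_weight * ce(s_logits, secondary_labels)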

In the first method, phone and grapheme acoustic models are trained together 
within the same deep neural network. The extra information is the 
phone-to-grapheme mappings, as confirmed by analysis and visualization of an 
implicit phone-to-grapheme correlation matrix computed from the model 
parameters. The training convergence curves also show that MTL training 
generalizes better to unseen data than conventional single-task learning does. 
Moreover, two extensions are proposed to further improve the performance.
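
One plausible way to compute such an implicit correlation matrix (an assumption 
for illustration, not necessarily the exact definition used in the thesis) is 
to compare the phone and grapheme output-layer weight vectors, since both tasks 
read from the same last shared hidden layer; reusing the MTLDNN sketch above:

    # Cosine similarity between phone and grapheme output weight vectors.
    # A large entry (i, j) suggests phone state i and grapheme state j respond
    # to similar patterns in the shared hidden representation.
    import torch

    def phone_grapheme_correlation(model):
        W_p = model.primary_out.weight      # (n_phone_states, hidden_dim)
        W_g = model.secondary_out.weight    # (n_grapheme_states, hidden_dim)
        W_p = torch.nn.functional.normalize(W_p, dim=1)
        W_g = torch.nn.functional.normalize(W_g, dim=1)
        return W_p @ W_g.t()                # (n_phone_states, n_grapheme_states)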

State tying relieves, to some extent, the data scarcity problem in 
context-dependent acoustic modeling; however, it inevitably introduces 
quantization errors. The second MTL method in this thesis aims at robust 
modeling of a large set of distinct context-dependent acoustic units. More 
specifically, distinct triphone states are trained together with a smaller set 
of tied states, benefiting from a better inductive bias to reach a better 
optimum. In return, they embed more contextual information into the hidden 
layers of the MTL-DNN acoustic models.
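
The key point is that the extra task costs no additional language resources: 
the tied-state label of every frame is already determined by the 
triphone-state-to-tied-state map produced by state tying, so one output layer 
can model the distinct triphone states while the other models the tied states. 
A minimal sketch of this label derivation, under the same assumed MTLDNN setup 
as above:

    # Derive the second task's labels from the existing state-tying map
    # (illustrative sketch, not the thesis code).
    import torch

    def tied_state_labels(triphone_state_ids, triphone_to_tied):
        # triphone_to_tied: LongTensor of length n_triphone_states whose i-th
        # entry is the tied-state index that triphone state i was clustered into.
        return triphone_to_tied[triphone_state_ids]

    # Example (hypothetical sizes):
    # model = MTLDNN(feat_dim=440, hidden_dim=2048,
    #                n_primary=n_triphone_states, n_secondary=n_tied_states)
    # loss  = mtl_loss(model, feats, triphone_labels,
    #                  tied_state_labels(triphone_labels, triphone_to_tied))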

Our last method works in a multilingual setting where data from multiple 
languages are available. Multilingual acoustic modeling is improved by learning 
a universal phone set (UPS) modeling task together with the language-specific 
triphone modeling tasks, which helps implicitly map the phones of the various 
languages to one another.
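
In this setting the shared hidden layers can again be read as the common 
knowledge: each language keeps its own triphone output layer, while a single 
UPS output layer is trained on frames from every language. The sketch below 
follows the same assumed PyTorch-style conventions; the language names and 
sizes are purely illustrative.

    # Multilingual MTL-DNN sketch: language-specific triphone heads plus one
    # universal phone set (UPS) head over shared hidden layers.
    import torch
    import torch.nn as nn

    class MultilingualMTLDNN(nn.Module):
        def __init__(self, feat_dim, hidden_dim, n_states_per_lang, n_ups):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
                nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            )
            # e.g. n_states_per_lang = {"en": 3000, "de": 2500, "fr": 2800}
            self.lang_out = nn.ModuleDict(
                {lang: nn.Linear(hidden_dim, n) for lang, n in n_states_per_lang.items()}
            )
            self.ups_out = nn.Linear(hidden_dim, n_ups)   # shared UPS task

        def forward(self, x, lang):
            h = self.shared(x)
            return self.lang_out[lang](h), self.ups_out(h)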

The MTL methods are shown to be effective on a broad range of data sets. The 
contributions of this thesis include the three proposed MTL methods and the 
heuristic guidelines we impose for finding helpful secondary tasks. With these 
successful explorations, we hope to stimulate more interest in applying MTL to 
improve ASR, and our results show that it is promising for wider applications.


Date:			Wednesday, 19 August 2015

Time:			10:30am - 12:30pm

Venue:			Room 2130B
 			Lift 19

Chairman:		Prof. Patrick Yue (ECE)

Committee Members:	Prof. Brian Mak (Supervisor)
 			Prof. James Kwok
 			Prof. Raymond Wong
 			Prof. Chi-Ying Tsui (ECE)
 			Prof. Pak-Chung Ching (Elec. Engg., CUHK)


**** ALL are Welcome ****