Speech Imitation by Neural Speech Synthesis with On-the-Fly Data Augmentation

MPhil Thesis Defence


Title: "Speech Imitation by Neural Speech Synthesis with On-the-Fly Data 
Augmentation"

By

Mr. Man Hon CHUNG


Abstract

Recent deep learning text-to-speech (TTS) systems synthesize natural speech. 
Applying speaker adaptation can make a TTS system speak in the voice of the 
adapting speaker, but the speaking style of the synthesized utterances still 
closely follows that of the speaker's training utterances. In some 
applications, it is desirable to synthesize speech in a speaking manner that 
suits the scenario. A straightforward solution is to record speech data from a speaker 
under different role-playing scenarios. However, excluding professional voice 
talents, most people are not experienced in speaking in different expressive 
styles. Likewise, without being exposed to a multilingual environment from an 
early age, most people cannot speak a second language with its native accent. 
In this thesis, we propose a novel data augmentation method to create a stylish 
TTS model for a speaker. Specifically, augmented data are created by 
"forcing" a speaker to imitate the stylish speech of other speakers. Our 
proposed method consists of two steps. Firstly, all the data are used to train 
a basic multi-style multi-speaker TTS model. Secondly, augmented utterances are 
created on-the-fly from the latest TTS model during its training and are used 
to further train the TTS model. We select two applications to demonstrate the 
effectiveness of our proposed method: (1) synthesizing speech in three 
scenarios, namely newscasting, public speaking, and storytelling, for a 
speaker who provides only neutral speech data; (2) synthesizing "beautified" 
speech of a language spoken by a non-native speaker by reducing his/her 
accent, i.e., improving pronunciation and making the prosody more 
native-like. Our experiments show that for scenario-based TTS, the scenario 
speech synthesized by our proposed method is overwhelmingly preferred over 
that from a speaker-adapted TTS model. For accent-beautified TTS, our model 
reduces the foreign accent of the non-native speech while retaining higher 
voice similarity than a state-of-the-art accent conversion model.
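
The sketch below is a minimal, purely illustrative outline of the two-step 
procedure described in the abstract: first train a basic multi-style 
multi-speaker TTS model on all data, then continue training while mixing in 
utterances synthesized on the fly by the latest model, in which the target 
speaker is "forced" to imitate other speakers' stylish speech. All names 
(Utterance, MultiStyleTTS, train_step, synthesize, etc.) are hypothetical and 
are not taken from the thesis implementation.

    # Conceptual sketch only; the thesis implementation may differ substantially.
    import random
    from dataclasses import dataclass, field

    @dataclass
    class Utterance:
        text: str
        speaker: str          # speaker identity label
        style: str            # e.g. "neutral", "newscasting", "storytelling"
        audio: list = field(default_factory=list)  # placeholder for acoustic features

    class MultiStyleTTS:
        """Stand-in for a multi-style, multi-speaker neural TTS model."""

        def train_step(self, batch):
            # In practice: compute the acoustic-model loss and update parameters.
            pass

        def synthesize(self, text, speaker, style):
            # In practice: run the acoustic model + vocoder conditioned on the
            # target speaker's voice and the requested speaking style.
            return Utterance(text=text, speaker=speaker, style=style)

    def train_with_on_the_fly_augmentation(corpus, target_speaker, target_styles,
                                           base_epochs=10, aug_epochs=10):
        model = MultiStyleTTS()

        # Step 1: train a basic multi-style multi-speaker TTS model on all data.
        for _ in range(base_epochs):
            for utt in corpus:
                model.train_step([utt])

        # Step 2: continue training; augmented utterances are created on the fly
        # from the latest model (the target speaker imitating other speakers'
        # stylish utterances) and added to the training batches.
        for _ in range(aug_epochs):
            for utt in corpus:
                batch = [utt]
                if utt.speaker != target_speaker and utt.style in target_styles:
                    augmented = model.synthesize(utt.text, target_speaker, utt.style)
                    batch.append(augmented)
                model.train_step(batch)
        return model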


Date:  			Friday, 7 January 2022

Time:			10:00am - 12:00noon

Venue:			Room 4472
 			Lifts 25/26

Committee Members:	Dr. Brian Mak (Supervisor)
 			Prof. Shing-Chi Cheung (Chairperson)
 			Prof. James Kwok


**** ALL are Welcome ****