Speech Imitation by Neural Speech Synthesis with On-the-Fly Data Augmentation
MPhil Thesis Defence

Title: "Speech Imitation by Neural Speech Synthesis with On-the-Fly Data Augmentation"

By

Mr. Man Hon CHUNG

Abstract

Recent deep-learning text-to-speech (TTS) systems can synthesize natural-sounding speech. Speaker adaptation can make a TTS system sound like the adapting speaker, but the speaking style of the synthesized utterances still closely follows that of the speaker's training utterances. In some applications, it is desirable to synthesize speech in a manner that suits the scenario. A straightforward solution is to record speech data from a speaker under different role-playing scenarios. However, apart from professional voice talents, most people have little experience speaking in different expressive styles. Likewise, without exposure to a multilingual environment from an early age, most people cannot speak a second language with a native accent.

In this thesis, we propose a novel data augmentation method to create a stylish TTS model for a speaker. Specifically, augmented data are created by "forcing" a speaker to imitate the stylish speech of other speakers. Our proposed method consists of two steps. Firstly, all the data are used to train a basic multi-style multi-speaker TTS model. Secondly, augmented utterances are created on-the-fly from the latest TTS model during its training and are used to further train the TTS model (a pseudocode sketch of this loop is given at the end of this announcement).

We select two applications to demonstrate the effectiveness of our proposed method: (1) synthesizing speech in three scenarios (newscasting, public speaking, and storytelling) for a speaker who provides only neutral speech data; (2) synthesizing "beautified" speech for a non-native speaker of a language by reducing his/her accent, with better pronunciation and more native-like prosody. Our experiments show that for scenario-based TTS, the scenario speeches synthesized by our proposed method are overwhelmingly preferred over those from a speaker-adapted TTS model. For accent-beautified TTS, our model reduces the foreign accent of non-native speech while retaining higher voice similarity than a state-of-the-art accent conversion model.

Date: Friday, 7 January 2022
Time: 10:00am - 12:00noon
Venue: Room 4472 (Lifts 25/26)

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. Shing-Chi Cheung (Chairperson)
Prof. James Kwok

**** ALL are Welcome ****
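The two-step procedure described in the abstract can be pictured with the toy sketch below. This is a hypothetical illustration, not the thesis's implementation: the ToyTTS model, the randomly generated stand-in corpus, and the MSE loss are placeholders, and in a real TTS system free-running synthesis differs from the teacher-forced forward pass, so augmented targets are not trivially identical to the model's training-time predictions.

```python
import torch
import torch.nn as nn

N_SPEAKERS, N_STYLES, TEXT_DIM, MEL_DIM = 10, 4, 32, 80
TARGET_SPEAKER = 0  # the speaker who provided only neutral speech

class ToyTTS(nn.Module):
    """Stand-in for a multi-style multi-speaker TTS model."""
    def __init__(self):
        super().__init__()
        self.spk = nn.Embedding(N_SPEAKERS, 16)
        self.sty = nn.Embedding(N_STYLES, 16)
        self.net = nn.Linear(TEXT_DIM + 32, MEL_DIM)

    def forward(self, text, speaker, style):
        z = torch.cat([text, self.spk(speaker), self.sty(style)], dim=-1)
        return self.net(z)

def sample_batch(batch=8):
    """Placeholder for drawing (text, speech, speaker, style) from the corpus."""
    return (torch.randn(batch, TEXT_DIM), torch.randn(batch, MEL_DIM),
            torch.randint(0, N_SPEAKERS, (batch,)),
            torch.randint(0, N_STYLES, (batch,)))

model, loss_fn = ToyTTS(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step 1: train a basic multi-style multi-speaker TTS model on all the data.
for _ in range(100):
    text, speech, speaker, style = sample_batch()
    opt.zero_grad()
    loss_fn(model(text, speaker, style), speech).backward()
    opt.step()

# Step 2: continue training; each step also creates augmented utterances
# on the fly from the *latest* model, "forcing" the target speaker to
# imitate the stylish speech of other speakers (target voice + other style).
for _ in range(100):
    text, speech, speaker, style = sample_batch()
    loss = loss_fn(model(text, speaker, style), speech)  # real data, as in step 1

    aug_text, _, _, aug_style = sample_batch()
    aug_speaker = torch.full_like(speaker, TARGET_SPEAKER)
    with torch.no_grad():  # augmented target synthesized by the current model
        # (in this deterministic toy the target equals the prediction; a real
        # TTS model's free-running synthesis would differ from its
        # teacher-forced output, making this an informative training signal)
        aug_speech = model(aug_text, aug_speaker, aug_style)
    loss = loss + loss_fn(model(aug_text, aug_speaker, aug_style), aug_speech)

    opt.zero_grad()
    loss.backward()
    opt.step()
```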