Multi-lingual and Multi-speaker Neural Text-to-speech System
MPhil Thesis Defence

Title: "Multi-lingual and Multi-speaker Neural Text-to-speech System"

By

Mr. Zhaoyu LIU

Abstract

We investigate a novel multi-lingual, multi-speaker neural text-to-speech (TTS) synthesis approach for generating high-quality native or accented speech for native/foreign, seen/unseen speakers in English, Mandarin and Cantonese. Our proposed model extends a single-speaker Tacotron-based TTS model via transfer learning: the model is conditioned on pretrained speaker embeddings (x-vectors) extracted by a speaker verification system. We also replace the input character embedding with a concatenation of a phoneme embedding and a tone/stress embedding to produce more natural speech. The additional tone/stress embedding acts as an extension of the language embedding and provides extra control over accents across the languages. By manipulating the tone/stress input, our model can synthesize native or accented speech for foreign speakers. The WaveNet vocoder in the TTS model, trained on Cantonese speech, can synthesize English and Mandarin speech very well, which demonstrates that a WaveNet conditioned on mel-spectrograms is sufficient for multi-lingual speech synthesis. The mean opinion score (MOS) results show that the synthesized native speech of both unseen foreign and native speakers is intelligible and natural, and the speaker similarity of such speech is also good. The lower scores of foreign-accented speech suggest that it is distinguishable from native speech; the foreign accents we introduced can obscure the meaning of the synthesized speech as perceived by human raters.

Date: Monday, 16 March 2020
Time: 11:00am – 1:00pm
Zoom Meeting: https://hkust.zoom.us/j/927550771

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. Fangzhen Lin (Chairperson)
Prof. Nevin Zhang

**** ALL are Welcome ****
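The input representation described in the abstract — a phoneme embedding concatenated with a tone/stress embedding per symbol, with a pretrained x-vector conditioning the model — can be sketched as follows. This is a minimal illustration, not the thesis implementation: all names, inventory sizes and embedding dimensions here are assumptions chosen for the example.

```python
# Illustrative sketch (assumed names and dimensions throughout):
# each input symbol is represented by a phoneme embedding concatenated
# with a tone/stress embedding; a pretrained x-vector stands in for the
# speaker-verification speaker embedding that conditions the model.
import numpy as np

rng = np.random.default_rng(0)

N_PHONEMES, N_TONES = 80, 8           # assumed inventory sizes
D_PHONE, D_TONE, D_XVEC = 256, 16, 512  # assumed embedding dimensions

phoneme_table = rng.standard_normal((N_PHONEMES, D_PHONE))
tone_table = rng.standard_normal((N_TONES, D_TONE))

def encode_inputs(phoneme_ids, tone_ids):
    """Concatenate per-symbol phoneme and tone/stress embeddings."""
    return np.concatenate(
        [phoneme_table[phoneme_ids], tone_table[tone_ids]], axis=-1
    )

# A toy utterance of 5 symbols with tone labels; swapping the tone
# sequence is what lets the model shift between native and accented speech.
phoneme_ids = np.array([3, 17, 42, 5, 9])
tone_ids = np.array([1, 1, 3, 0, 4])

encoder_inputs = encode_inputs(phoneme_ids, tone_ids)
x_vector = rng.standard_normal(D_XVEC)  # placeholder for a pretrained x-vector

print(encoder_inputs.shape)  # (5, 272): each symbol is 256 + 16 dims
print(x_vector.shape)        # (512,)
```

Because the tone/stress embedding is a separate input stream, feeding a foreign speaker's x-vector together with native tone labels yields the native/accented control the abstract describes.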