Multi-lingual and Multi-speaker Neural Text-to-speech System

MPhil Thesis Defence


Title: "Multi-lingual and Multi-speaker Neural Text-to-speech System"

By

Mr. Zhaoyu LIU


Abstract

We investigate a novel multi-lingual, multi-speaker neural text-to-speech (TTS) 
synthesis approach for generating high-quality native or accented speech for 
both seen and unseen, native and foreign speakers in English, Mandarin and 
Cantonese. Our proposed model extends a single-speaker Tacotron-based TTS model 
via transfer learning, conditioning it on pretrained speaker embeddings 
(x-vectors) extracted by a speaker verification system. We also replace the 
input character embedding with a concatenation of a phoneme embedding and a 
tone/stress embedding to produce more natural speech. The additional 
tone/stress embedding acts as an extension of the language embedding, 
providing extra control over accents across the languages. By manipulating the 
tone/stress input, our model can synthesize native or accented speech for 
foreign speakers. The WaveNet vocoder in our TTS model, though trained only on 
Cantonese speech, synthesizes English and Mandarin speech very well, 
demonstrating that a WaveNet conditioned on mel-spectrograms is sufficient for 
multi-lingual speech synthesis. The mean opinion score (MOS) results show that 
the synthesized native speech of both unseen foreign and native speakers is 
intelligible and natural, with good speaker similarity. The lower scores of 
foreign-accented speech suggest that it is distinguishable from native speech. 
However, the foreign accents we introduce can confuse the meaning of the 
synthesized speech as perceived by human raters.


Date:			Monday, 16 March 2020

Time:			11:00am – 1:00pm

Zoom Meeting:		https://hkust.zoom.us/j/927550771

Committee Members:	Dr. Brian Mak (Supervisor)
			Prof. Fangzhen Lin (Chairperson)
			Prof. Nevin Zhang


**** ALL are Welcome ****