Non-parallel Many-to-many Voice Conversion by Knowledge Transfer from a Pre-trained Text-to-Speech Model

MPhil Thesis Defence


Title: "Non-parallel Many-to-many Voice Conversion by Knowledge Transfer from a 
Pre-trained Text-to-Speech Model"

By

Mr. Xinyuan YU


Abstract

Voice conversion (VC) is the task of converting a source speaker’s speech such
that the output speech sounds as if it were uttered by a different target
speaker. Earlier approaches focused on finding a direct mapping function
between a pair of source and target speakers, which requires pairs of
utterances with the same content to be available in the training set. However,
collecting such utterance pairs is often costly and time-consuming, so it is
more desirable to train VC models with unconstrained speech data; this setting
is known as non-parallel VC. Recently, various deep learning methods, such as
autoencoders, variational autoencoders, and generative adversarial networks,
have been proposed for non-parallel VC. However, few of them are both easy to
train and able to perform well. In this thesis, we present a simple but novel
encoder-decoder framework for training a non-parallel many-to-many VC model
that can convert speech between any pair of speakers, seen or unseen, in a
non-parallel speech corpus. We propose to transfer knowledge from the
state-of-the-art multi-speaker text-to-speech (TTS) model, Mellotron, to the
VC model by adopting Mellotron’s decoder as the VC decoder. The model is
trained on the LibriTTS dataset with simple loss terms. Subjective evaluation
shows that our proposed model generates natural-sounding speech and
outperforms the state-of-the-art non-parallel VC model, AutoVC.
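
To illustrate the knowledge-transfer idea, below is a minimal PyTorch-style
sketch, not the thesis implementation: a toy content encoder and mel decoder,
where the VC decoder is initialised from a pretrained TTS decoder of the same
architecture before VC training. All module names, dimensions, and the
checkpoint path are illustrative assumptions; Mellotron’s actual decoder (an
attention-based Tacotron 2 decoder with pitch and rhythm conditioning) is
considerably more involved.

    import torch
    import torch.nn as nn

    # Toy stand-ins; names and sizes are illustrative, not Mellotron's.

    class ContentEncoder(nn.Module):
        """Encodes a mel spectrogram into a speaker-independent code."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)

        def forward(self, mel):              # mel: (batch, frames, n_mels)
            content, _ = self.rnn(mel)       # (batch, frames, dim)
            return content

    class MelDecoder(nn.Module):
        """Reconstructs a mel spectrogram from content plus a speaker
        embedding; in the thesis this role is played by Mellotron's
        decoder."""
        def __init__(self, n_mels=80, dim=256, spk_dim=64):
            super().__init__()
            self.rnn = nn.GRU(dim + spk_dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, n_mels)

        def forward(self, content, spk_emb):  # spk_emb: (batch, spk_dim)
            spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
            out, _ = self.rnn(torch.cat([content, spk], dim=-1))
            return self.proj(out)             # (batch, frames, n_mels)

    # Knowledge transfer: initialise the VC decoder from a pretrained TTS
    # decoder of identical architecture, then train with a simple
    # reconstruction loss, as the abstract describes. Here the "pretrained"
    # decoder is freshly constructed for self-containment; in practice its
    # weights would come from a checkpoint (hypothetical path), e.g.
    # torch.load("mellotron_decoder.pt").
    tts_decoder = MelDecoder()
    encoder, vc_decoder = ContentEncoder(), MelDecoder()
    vc_decoder.load_state_dict(tts_decoder.state_dict())  # weight transfer

    mel = torch.randn(2, 120, 80)     # dummy batch: 2 utterances, 120 frames
    spk_emb = torch.randn(2, 64)      # dummy speaker embeddings
    recon = vc_decoder(encoder(mel), spk_emb)
    loss = nn.functional.l1_loss(recon, mel)  # simple reconstruction loss
    loss.backward()

At conversion time, the source utterance would be encoded the same way but
decoded with the target speaker’s embedding, yielding the source content in
the target voice.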


Date:  			Thursday, 27 August 2020

Time:			2:00pm - 4:00pm

Zoom meeting:		https://hkust.zoom.us/j/6789417302

Committee Members:	Dr. Brian Mak (Supervisor)
			Prof. Fangzhen Lin (Chairperson)
			Prof. Nevin Zhang


**** ALL are Welcome ****