Non-parallel Many-to-many Voice Conversion by Knowledge Transfer from a Pre-trained Text-to-Speech Model
MPhil Thesis Defence

Title: "Non-parallel Many-to-many Voice Conversion by Knowledge Transfer from a Pre-trained Text-to-Speech Model"

By

Mr. Xinyuan YU

Abstract

Voice conversion (VC) is the task of converting a source speaker’s speech so that the output sounds as if it were uttered by a different target speaker. Earlier approaches focus on finding a direct mapping function between a pair of source and target speakers, which requires pairs of utterances with the same content to be available in the training set. However, collecting such paired utterances is often costly and time-consuming, so it is more desirable to train VC models on unconstrained speech data; this setting is known as non-parallel VC. Recently, various deep learning methods, such as autoencoders, variational autoencoders, and generative adversarial networks, have been proposed for non-parallel VC, but most of them cannot be trained easily and perform well at the same time. In this thesis, we present a simple but novel framework for training a non-parallel many-to-many VC model based on the encoder-decoder framework, which can convert speech between any pair of (seen or unseen) speakers in a non-parallel speech corpus. We propose to transfer knowledge from the state-of-the-art multi-speaker text-to-speech (TTS) model Mellotron to the VC model by adopting Mellotron’s decoder as the VC decoder. The model is trained on the LibriTTS dataset with simple loss terms. Subjective evaluation shows that our proposed model generates natural-sounding speech and outperforms the state-of-the-art non-parallel VC model, AutoVC.

Date: Thursday, 27 August 2020
Time: 2:00pm - 4:00pm
Zoom meeting: https://hkust.zoom.us/j/6789417302

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. Fangzhen Lin (Chairperson)
Prof. Nevin Zhang

**** ALL are Welcome ****
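The abstract above describes a recognizable architectural pattern: an encoder produces (ideally) speaker-independent content features, a target-speaker embedding is attached, and a decoder taken from a pre-trained multi-speaker TTS model reconstructs the mel-spectrogram. The PyTorch sketch below illustrates that decoder-transfer idea only; it is a minimal, assumption-laden illustration rather than the thesis implementation, and every module name, dimension, speaker count, and checkpoint path in it is hypothetical.

# Illustrative sketch only -- NOT the thesis code. All names, dimensions,
# and the checkpoint path are hypothetical assumptions.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps a mel-spectrogram to (ideally) speaker-independent content features."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        feats, _ = self.rnn(mel)
        return feats                             # (B, T, 2*hidden)

class TTSDecoder(nn.Module):
    """Stand-in for a decoder pre-trained inside a multi-speaker TTS model
    (a Mellotron-style decoder in the thesis's setting)."""
    def __init__(self, in_dim=512 + 64, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, x):                        # x: (B, T, in_dim)
        out, _ = self.rnn(x)
        return self.proj(out)                    # predicted mel: (B, T, n_mels)

class VCModel(nn.Module):
    """Encoder-decoder VC model whose decoder is reused from a pre-trained TTS model."""
    def __init__(self, tts_decoder, n_speakers=2456, spk_dim=64):
        super().__init__()
        self.encoder = ContentEncoder()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)
        self.decoder = tts_decoder               # the knowledge-transfer step

    def forward(self, src_mel, tgt_spk_id):
        content = self.encoder(src_mel)                          # (B, T, 512)
        spk = self.spk_table(tgt_spk_id)                         # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)   # broadcast over frames
        return self.decoder(torch.cat([content, spk], dim=-1))

# Hypothetical training step: during training the "target" speaker is the
# source speaker itself, so a plain reconstruction loss suffices.
decoder = TTSDecoder()
# decoder.load_state_dict(torch.load("pretrained_tts_decoder.pt"))  # hypothetical checkpoint
model = VCModel(decoder)
mel = torch.randn(4, 120, 80)                    # batch of 4 utterances, 120 frames each
spk = torch.randint(0, 2456, (4,))
loss = nn.functional.l1_loss(model(mel, spk), mel)
loss.backward()

Under these assumptions, training needs only non-parallel speech and a simple reconstruction loss, since each utterance is reconstructed conditioned on its own speaker identity; at conversion time the speaker ID is simply swapped to that of the desired target speaker.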