Bayesian Self-Attentive Speaker Embeddings for Speaker Verification
PhD Thesis Proposal Defence

Title: "Bayesian Self-Attentive Speaker Embeddings for Speaker Verification"

by

Miss Yingke ZHU

Abstract:

Speaker verification (SV) is the process of verifying whether an utterance belongs to the claimed speaker, based on some reference utterances. A typical speaker verification system has three stages: training, enrollment and evaluation. The training stage aims to learn a low-dimensional embedding rich in speaker information, together with a scoring function for computing the similarity between embeddings. The enrollment stage estimates a speaker model for each known speaker from a limited number of utterances. Finally, the evaluation stage scores unknown utterances against the estimated speaker models. An unknown utterance is accepted as coming from the claimed speaker if its score exceeds a predefined threshold, and is rejected otherwise. Depending on the speech content allowed in the enrollment and evaluation stages, SV systems fall into two categories: text-dependent and text-independent. Text-dependent SV systems require the content of the input speech to be fixed, while text-independent SV systems do not.

Learning effective and discriminative speaker embeddings is a central theme in speaker verification. This thesis focuses on speaker embedding learning for text-independent SV tasks and presents two methods for learning better speaker embeddings.

The first is a self-attentive speaker embedding learning method. Conventionally, speaker embeddings are extracted from a speaker-classification neural network that averages the hidden vectors over all the spoken frames of a speaker, so the hidden vectors produced from all frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, where the weights are determined automatically by a self-attention mechanism. The effect of multiple attention heads, which capture different aspects of a speaker's input speech, is also investigated. (A minimal sketch of this kind of attentive pooling is given after the announcement.)

The second method generalizes multi-head attention within a Bayesian attention framework, in which standard deterministic multi-head attention can be viewed as a special case. In this framework, the parameters of the attention heads share a common distribution, so their updates are coupled rather than independent as in deterministic multi-head attention. The Bayesian attention framework helps alleviate the attention redundancy problem and provides a theoretical understanding of the benefits of multi-head attention. Based on this framework, we propose a Bayesian self-attentive speaker embedding learning algorithm.

The proposed embedding learning methods are evaluated systematically on different evaluation sets, and achieve significant and consistent improvements over state-of-the-art systems on all of them.

Date: Wednesday, 1 June 2022

Time: 10:00am - 12:00noon

Zoom Meeting: https://hkust.zoom.us/j/6179269271

Committee Members:
    Dr. Brian Mak (Supervisor)
    Prof. James Kwok (Chairperson)
    Dr. Yangqiu Song
    Prof. Nevin Zhang

**** ALL are Welcome ****
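
For readers unfamiliar with attentive pooling, the following is a minimal sketch of the idea described in the abstract: a speaker embedding computed as a weighted average of frame-level hidden vectors, with per-frame weights produced by a (multi-head) self-attention mechanism. It is written in PyTorch-style Python; the tanh scoring network, layer sizes and head count are illustrative assumptions, not the exact architecture used in the thesis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentivePooling(nn.Module):
        """Weighted average of frame-level hidden vectors; the weights are
        produced by a small self-attention scoring network (illustrative)."""
        def __init__(self, hidden_dim, num_heads=1, attn_dim=128):
            super().__init__()
            # Two-layer scoring network: one score per frame per attention head.
            self.w1 = nn.Linear(hidden_dim, attn_dim)
            self.w2 = nn.Linear(attn_dim, num_heads)

        def forward(self, h):
            # h: (batch, frames, hidden_dim) frame-level hidden vectors
            scores = self.w2(torch.tanh(self.w1(h)))          # (batch, frames, heads)
            weights = F.softmax(scores, dim=1)                 # normalise over frames
            # Weighted average per head, then concatenate the heads.
            pooled = torch.einsum('bfh,bfd->bhd', weights, h)  # (batch, heads, hidden_dim)
            return pooled.flatten(1)                           # (batch, heads * hidden_dim)

    # Example usage (hypothetical shapes): 300 frames of 512-dim hidden vectors
    frames = torch.randn(8, 300, 512)
    pool = SelfAttentivePooling(hidden_dim=512, num_heads=4)
    pooled = pool(frames)   # (8, 2048), fed to the speaker-embedding layer

With num_heads=1 this reduces to single-head attentive pooling, and uniform weights recover plain frame averaging. The Bayesian variant discussed in the abstract would, roughly speaking, place a shared distribution over the per-head scoring parameters instead of learning them as independent deterministic weights; the details are in the thesis proposal itself.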