Deep Speaker Representation Learning in Speaker Verification
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Deep Speaker Representation Learning in Speaker Verification"

By

Miss Yingke ZHU

Abstract

Speaker verification (SV) is the task of verifying whether an utterance belongs to the claimed speaker, based on some reference utterances. Learning effective and discriminative speaker embeddings is a central theme in speaker verification. In this thesis, we focus on speaker embedding learning for text-independent SV tasks and present three methods for learning better speaker embeddings.

The first is a self-attentive speaker embedding learning method. Speaker embeddings are usually extracted from a speaker-classification neural network that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads is also investigated to capture different aspects of a speaker's input speech. (An illustrative sketch of this attention-weighted pooling is given after the announcement.)

The second method generalizes multi-head attention in a Bayesian attention framework, in which standard deterministic multi-head attention can be viewed as a special case. In the Bayesian attention framework, the parameters of the attention heads share a common distribution, so their updates are related rather than independent. This framework helps alleviate the attention redundancy problem and provides a theoretical understanding of the benefits of multi-head attention. Based on the Bayesian attention framework, we propose a Bayesian self-attentive speaker embedding learning algorithm.

The third method introduces channel attention into the embedding learning framework and analyzes channel attention from the perspective of frequency analysis. Frequency-domain pooling methods are then proposed to enhance the channel attention and produce better speaker embeddings.

Systematic evaluation of the proposed embedding learning methods is performed on different evaluation sets. Significant and consistent improvements over state-of-the-art systems are achieved on all the evaluation datasets.

Date: Friday, 19 August 2022

Time: 10:00am - 12:00noon

Zoom Meeting: https://hkust.zoom.us/j/6789417302?pwd=Q0ZXOGJ4S1IxL3N3UlVDSHNTQStMQT09

Chairperson: Prof. Allen HUANG (ACCT)

Committee Members: Prof. Brian MAK (Supervisor)
                   Prof. James KWOK
                   Prof. Yangqiu SONG
                   Prof. Shenghui SONG (ISD)
                   Prof. Koichi SHINODA (Tokyo Institute of Technology)

**** ALL are Welcome ****
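
The sketch below is a minimal illustration (not code from the thesis) of the attention-weighted pooling described in the abstract: frame-level hidden vectors are averaged with weights produced by a small self-attention network, and the outputs of several heads are concatenated into the utterance-level embedding. It assumes PyTorch; the class name SelfAttentivePooling and the values of hidden_dim, attn_dim, and num_heads are illustrative choices, not taken from the thesis. Setting num_heads=1 recovers the single-head case; the Bayesian variant mentioned in the abstract would additionally tie the per-head parameters through a shared distribution, which is not shown here.

    # Minimal sketch of multi-head self-attentive pooling (assumed PyTorch;
    # hyperparameters are hypothetical, not from the thesis).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentivePooling(nn.Module):
        """Pool frame-level hidden vectors into a fixed-size speaker embedding,
        weighting frames with attention scores learned per head."""

        def __init__(self, hidden_dim: int = 512, attn_dim: int = 128, num_heads: int = 4):
            super().__init__()
            self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)  # shared projection
            self.w2 = nn.Linear(attn_dim, num_heads, bias=False)   # one score per head

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, frames, hidden_dim) frame-level hidden vectors
            scores = self.w2(torch.tanh(self.w1(h)))   # (batch, frames, heads)
            weights = F.softmax(scores, dim=1)         # normalise over frames
            # Weighted average of frames for each head: (batch, heads, hidden_dim)
            pooled = torch.einsum('btk,btd->bkd', weights, h)
            # Concatenate the heads to form the utterance-level embedding
            return pooled.flatten(start_dim=1)         # (batch, heads * hidden_dim)

    # Usage example: pool a batch of 2 utterances, each with 300 frames
    if __name__ == "__main__":
        pooling = SelfAttentivePooling()
        frames = torch.randn(2, 300, 512)
        embedding = pooling(frames)
        print(embedding.shape)  # torch.Size([2, 2048])

With a single head and uniform weights this reduces to the plain frame averaging that the abstract takes as the baseline; the attention weights let the network emphasise frames that are more informative about the speaker.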