Deep Speaker Representation Learning in Speaker Verification

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Deep Speaker Representation Learning in Speaker Verification"

By

Miss Yingke ZHU


Abstract

Speaker verification (SV) is the task of verifying whether an utterance 
belongs to the claimed speaker, based on reference utterances from that speaker.

Learning effective and discriminative speaker embeddings is a central 
theme in speaker verification. In this thesis, we focus on speaker 
embedding learning for text-independent SV tasks, and present three 
methods for learning better speaker embeddings.

The first method is self-attentive speaker embedding learning. 
Usually, speaker embeddings are extracted from a speaker-classification 
neural network that averages the hidden vectors over all the spoken frames 
of a speaker; the hidden vectors produced from all the frames are assumed 
to be equally important. We relax this assumption and compute the speaker 
embedding as a weighted average of a speaker's frame-level hidden vectors, 
whose weights are determined automatically by a self-attention mechanism. 
We also investigate multiple attention heads to capture different aspects 
of a speaker's input speech.
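The weighted-average idea above can be sketched as follows. This is a minimal illustrative implementation, not the thesis's actual architecture; the parameter names W and v and the tanh scoring network are common choices in self-attentive pooling and are assumptions here.

```python
import numpy as np

def self_attentive_pooling(H, W, v):
    """Weighted average of frame-level hidden vectors.

    H: (T, d) hidden vectors, one per spoken frame.
    W: (d, d_a), v: (d_a,) -- parameters of a small attention network
    (names assumed for illustration).
    """
    e = np.tanh(H @ W) @ v      # (T,) unnormalized attention scores
    a = np.exp(e - e.max())
    a = a / a.sum()             # softmax: one weight per frame, summing to 1
    return a @ H, a             # weighted average over frames, plus weights

# Example: 100 frames of 64-dimensional hidden vectors.
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 64))
W = rng.standard_normal((64, 32))
v = rng.standard_normal(32)
embedding, weights = self_attentive_pooling(H, W, v)
```

With uniform weights (a = 1/T for every frame) this reduces to the plain frame averaging described above; the attention network is what lets some frames contribute more than others.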

The second method generalizes multi-head attention within a Bayesian 
attention framework, in which standard deterministic multi-head attention 
can be viewed as a special case. In the Bayesian attention framework, the 
parameters of the attention heads share a common distribution, so their 
updates are related rather than independent. This helps alleviate the 
attention redundancy problem, and also provides a theoretical account of 
the benefits of applying multi-head attention. Based on the Bayesian 
attention framework, we propose a Bayesian self-attentive speaker 
embedding learning algorithm.
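The idea of heads sharing a common distribution can be illustrated with a deliberately simplified sketch: each head's scoring vector is drawn from one shared Gaussian (the names mu and sigma, and the Gaussian form itself, are assumptions for illustration, not the thesis's formulation).

```python
import numpy as np

def bayesian_multihead_pooling(H, mu, sigma, n_heads, rng):
    """Multi-head pooling with head parameters tied through a shared prior.

    H: (T, d) frame-level hidden vectors.
    mu, sigma: (d,) mean and scale of the shared head distribution
    (assumed names). Sampling every head from the same distribution is
    what couples the heads, in contrast to training each independently.
    """
    pooled = []
    for _ in range(n_heads):
        v = rng.normal(mu, sigma)     # sample this head's scoring vector
        e = H @ v                     # (T,) scores for this head
        a = np.exp(e - e.max())
        a = a / a.sum()               # per-head softmax over frames
        pooled.append(a @ H)          # per-head weighted average
    return np.concatenate(pooled)     # (n_heads * d,) embedding

rng = np.random.default_rng(1)
H = rng.standard_normal((50, 16))
emb = bayesian_multihead_pooling(H, np.zeros(16), np.ones(16), 4, rng)
```

The deterministic multi-head case corresponds to collapsing the distribution to a point (sigma = 0), which is the sense in which standard multi-head attention is a special case.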

The third method introduces channel attention to the embedding learning 
framework, and analyzes channel attention from the perspective of 
frequency analysis. Frequency-domain pooling methods are then proposed to 
strengthen the channel attention and produce better speaker embeddings.
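One way to read the frequency-analysis view is that plain global average pooling keeps only the DC (0th DCT) coefficient of each channel's time series, and frequency-domain pooling lets the attention see additional DCT components. The sketch below combines a squeeze-and-excitation-style gate with DCT-based pooling; all function and parameter names are assumptions for illustration, not the thesis's method.

```python
import numpy as np

def dct_pool(X, freqs):
    """Pool each channel with a set of DCT-II basis functions.

    X: (C, T) channel-by-time feature map; freqs: DCT frequency indices.
    freqs=(0,) reproduces global average pooling (the DC component).
    """
    C, T = X.shape
    t = np.arange(T)
    feats = [(X * np.cos(np.pi * k * (2 * t + 1) / (2 * T))).mean(axis=1)
             for k in freqs]                    # one (C,) vector per frequency
    return np.stack(feats, axis=1).sum(axis=1)  # (C,) pooled descriptor

def channel_attention(X, W1, W2, freqs=(0,)):
    """SE-style channel reweighting driven by frequency-domain pooling."""
    z = dct_pool(X, freqs)                   # squeeze step
    h = np.maximum(W1 @ z, 0.0)              # bottleneck projection + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # per-channel gates in (0, 1)
    return X * s[:, None]                    # reweight channels

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 40))             # 8 channels, 40 frames
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 4))
Y = channel_attention(X, W1, W2, freqs=(0, 1, 2))
```

Using freqs beyond 0 gives the gating network information about how each channel varies over time, not just its mean, which is the intuition behind enhancing channel attention with frequency-domain pooling.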

Systematic evaluation of the proposed embedding learning methods is 
performed on different evaluation sets. Significant and consistent 
improvements over state-of-the-art systems are achieved on all the 
evaluation datasets.


Date:			Friday, 19 August 2022

Time:			10:00am - 12:00noon

Zoom Meeting:
https://hkust.zoom.us/j/6789417302?pwd=Q0ZXOGJ4S1IxL3N3UlVDSHNTQStMQT09

Chairperson:		Prof. Allen HUANG (ACCT)

Committee Members:	Prof. Brian MAK (Supervisor)
 			Prof. James KWOK
 			Prof. Yangqiu SONG
 			Prof. Shenghui SONG (ISD)
 			Prof. Koichi SHINODA (Tokyo Institute of Technology)


**** ALL are Welcome ****