Bayesian Self-Attentive Speaker Embeddings for Speaker Verification

PhD Thesis Proposal Defence


Title: "Bayesian Self-Attentive Speaker Embeddings for Speaker Verification"

by

Miss Yingke ZHU


Abstract:

Speaker verification (SV) is the task of verifying whether an utterance
belongs to the claimed speaker, given reference utterances from that speaker.

The typical speaker verification process has three stages: training,
enrollment and evaluation. The training stage aims to learn a low-dimensional
embedding rich in speaker information, together with a scoring function for
computing the similarity between embeddings. The enrollment stage estimates a
speaker model for every known speaker from a limited number of utterances.
Finally, the evaluation stage scores unknown utterances against the estimated
speaker models. An unknown utterance is accepted as coming from the claimed
speaker if its evaluation score is above a predefined threshold, and is
rejected otherwise. Depending on the speech content allowed in the enrollment
and evaluation stages, SV systems fall into two categories: text-dependent and
text-independent. Text-dependent SV systems require the content of the input
speech to be fixed, while text-independent systems do not.
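
As a minimal, purely illustrative sketch of the evaluation-stage decision
(assuming cosine similarity as the scoring function and an averaged enrollment
embedding as the speaker model; the names and values below are hypothetical,
not the systems studied in this thesis):

import numpy as np

def cosine_score(test_embedding, speaker_model):
    # Cosine similarity between a test embedding and an enrolled speaker model.
    test_embedding = test_embedding / np.linalg.norm(test_embedding)
    speaker_model = speaker_model / np.linalg.norm(speaker_model)
    return float(np.dot(test_embedding, speaker_model))

def verify(test_embedding, speaker_model, threshold=0.5):
    # Accept the claimed identity only if the score exceeds the predefined threshold.
    return cosine_score(test_embedding, speaker_model) > threshold

# Toy usage: the speaker model is the average of a few enrollment embeddings.
rng = np.random.default_rng(0)
enrollment = rng.standard_normal((3, 128))   # 3 enrollment utterances, 128-dim embeddings
speaker_model = enrollment.mean(axis=0)
test_embedding = rng.standard_normal(128)
print(verify(test_embedding, speaker_model))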

Learning effective and discriminative speaker embeddings is a central theme in 
the speaker verification task. In this thesis, we focus on the speaker 
embedding learning issues in text-independent SV tasks, and present two methods 
to learn better speaker embeddings.

The first is a self-attentive speaker embedding learning method. Typically,
speaker embeddings are extracted from a speaker-classification neural network
that averages the hidden vectors over all the spoken frames of a speaker; the
hidden vectors produced from all the frames are thus assumed to be equally
important. We relax this assumption and compute the speaker embedding as a
weighted average of a speaker's frame-level hidden vectors, whose weights are
determined automatically by a self-attention mechanism. We also investigate
the effect of using multiple attention heads to capture different aspects of a
speaker's input speech.
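
A minimal NumPy sketch of single-head self-attentive pooling, assuming the
common formulation in which each frame t receives a scalar score
e_t = v^T tanh(W h_t) that is normalized by a softmax; the parameter names and
sizes below are illustrative assumptions, not the exact architecture proposed
here:

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def self_attentive_pooling(hidden, W, v):
    # hidden: (T, D) frame-level hidden vectors; W: (A, D); v: (A,)
    scores = np.tanh(hidden @ W.T) @ v   # (T,) one scalar score per frame
    weights = softmax(scores)            # frame weights, summing to 1
    embedding = weights @ hidden         # (D,) weighted average over frames
    return embedding, weights

# Toy usage with random inputs and parameters.
rng = np.random.default_rng(0)
T, D, A = 200, 512, 64                   # frames, hidden size, attention size
hidden = rng.standard_normal((T, D))
W, v = 0.1 * rng.standard_normal((A, D)), 0.1 * rng.standard_normal(A)
embedding, weights = self_attentive_pooling(hidden, W, v)
print(embedding.shape, round(weights.sum(), 3))

With multiple heads, each head would use its own (W, v) pair, and the per-head
pooled vectors are typically concatenated to form the final embedding.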

The second method generalizes multi-head attention within a Bayesian attention
framework, in which standard deterministic multi-head attention can be viewed
as a special case. In the Bayesian attention framework, the parameters of the
attention heads share a common distribution, so their updates are coupled
rather than independent as in deterministic multi-head attention. The Bayesian
attention framework helps alleviate the attention redundancy problem, and it
also provides a theoretical understanding of the benefits of multi-head
attention. Based on this framework, we propose a Bayesian self-attentive
speaker embedding learning algorithm.
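
Purely as a schematic illustration of the idea (the actual algorithm is the
subject of the thesis), one may imagine each head's attention vector being
sampled from a shared Gaussian with learnable mean and scale, so that the
heads are tied through that common distribution; deterministic multi-head
attention is recovered as the scale shrinks to zero with per-head means:

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def bayesian_multi_head_pooling(hidden, mu, log_sigma, n_heads, rng):
    # hidden: (T, D) frame-level hidden vectors
    # mu, log_sigma: (D,) parameters of the shared Gaussian over attention vectors
    sigma = np.exp(log_sigma)
    pooled = []
    for _ in range(n_heads):
        v = mu + sigma * rng.standard_normal(mu.shape)  # reparameterized sample per head
        weights = softmax(hidden @ v)                   # (T,) frame weights for this head
        pooled.append(weights @ hidden)                 # (D,) per-head weighted average
    return np.concatenate(pooled)                       # (n_heads * D,) embedding

# Toy usage (forward sampling only; in practice the shared distribution would be
# learned jointly with the network, e.g. variationally).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((200, 512))
mu, log_sigma = 0.05 * rng.standard_normal(512), np.full(512, -2.0)
print(bayesian_multi_head_pooling(hidden, mu, log_sigma, n_heads=4, rng=rng).shape)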

The proposed embedding learning methods are evaluated systematically on
several evaluation sets, achieving significant and consistent improvements
over state-of-the-art systems on all of them.


Date:			Wednesday, 1 June 2022

Time:                  	10:00 am - 12:00 noon

Zoom Meeting: 		https://hkust.zoom.us/j/6179269271

Committee Members:	Dr. Brian Mak (Supervisor)
  			Prof. James Kwok (Chairperson)
 			Dr. Yangqiu Song
 			Prof. Nevin Zhang


**** ALL are Welcome ****