Low-Resource Speech Recognition Using Pre-trained Speech Representation Models

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Low-Resource Speech Recognition Using Pre-trained Speech Representation
Models"

By

Mr. Chun Fung Ranzo HUANG


Abstract:

Difficulties in eliciting substantial spoken data from speaker populations of
interest and in producing the accompanying transcripts lead to low-resource
scenarios that hinder the development of robust automatic speech recognition
(ASR) systems. With the aid of a large volume of unlabeled audio data,
self-supervised speech representation learning may address this limitation by
first learning a model-based feature extractor via a proxy task, thereby
offering pre-trained representations that can be transferred to the ASR task
for fine-tuning. This dissertation reviews current self-supervised speech
representation learning methodologies and investigates the application of
wav2vec 2.0-based ASR to CU-MARVEL, a corpus under development, in order to
provide automatic transcripts that streamline its human transcription work.
The corpus comprises spontaneous responses from Cantonese-speaking older
adults in Hong Kong, a setting in which both the language and the population
are low-resource. We contribute a Cantonese wav2vec 2.0 model pre-trained
on audio data obtained from the web and segmented using end-to-end neural
diarization methods. We evaluate the usefulness of further pre-training on
in-domain data and of semi-supervised learning via pseudo-labeling for ASR
under the pre-training-and-fine-tuning paradigm. Given the availability of
cross-lingual wav2vec 2.0 models, we also compare the downstream performance
of the monolingual pre-trained model with that of the cross-lingual 300M
XLS-R model and assess whether a monolingual pre-trained model is necessary.
We benchmark our results against those obtained from parallel experiments on
the English LibriSpeech corpus. Our best-performing model for CU-MARVEL is
the 300M XLS-R further pre-trained in two stages: first adapting it to the
target language and then specializing it to the target domain. On
participants' speech, it reduces the character error rate (CER) by 23.1%
relative to the vanilla XLS-R baseline. This dissertation concludes by
suggesting directions for future research.


Date:                   Friday, 4 August 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        (Lifts 25/26)

Committee Members:      Dr. Brian Mak (Supervisor)
                        Prof. James Kwok (Chairperson)
                        Prof. Fangzhen Lin


**** ALL are Welcome ****