The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

MPhil Thesis Defence

Title: "Low-Resource Speech Recognition Using Pre-trained Speech Representation Models"

By

Mr. Chun Fung Ranzo HUANG

Abstract:

Difficulties in eliciting substantial spoken data from speaker populations of interest, and in producing the accompanying transcripts, result in low-resource scenarios that can hinder the development of robust automatic speech recognition (ASR) systems. With the aid of a large volume of unlabeled audio data, self-supervised speech representation learning can address this limitation by first learning a model-based feature extractor through a proxy task, thereby offering pre-trained representations that can be fine-tuned for the ASR task.

This dissertation reviews current self-supervised speech representation learning methodologies and investigates the application of wav2vec 2.0 to ASR on CU-MARVEL, a corpus under development, in order to provide automatic transcripts that streamline its human transcription work. The corpus comprises spontaneous responses from Cantonese-speaking older adults in Hong Kong, a setting in which both the language and the population are low-resource. We contribute a Cantonese wav2vec 2.0 model pre-trained on audio data collected from the web and segmented with end-to-end neural diarization methods. Under the pre-training-and-fine-tuning paradigm, we evaluate the usefulness of further pre-training on in-domain data and of semi-supervised learning via pseudo-labeling. Given the availability of cross-lingual wav2vec 2.0 models, we also compare the downstream performance of the monolingual pre-trained model with that of the cross-lingual 300M XLS-R model to assess whether a monolingual pre-trained model is necessary, and we benchmark our results against parallel experiments on the English LibriSpeech corpus. Our best-performing model for CU-MARVEL is the 300M XLS-R further pre-trained in two stages: first adapting to the target language and then to the target domain. On participants' speech, it reduces the character error rate (CER) of the vanilla XLS-R baseline by 23.1% relative. The dissertation concludes by suggesting directions for future research.

Date: Friday, 4 August 2023
Time: 2:00pm - 4:00pm
Venue: Room 3494 (lifts 25/26)

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. James Kwok (Chairperson)
Prof. Fangzhen Lin

**** ALL are Welcome ****
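
For readers who want a concrete handle on the pre-training-and-fine-tuning paradigm the abstract describes, below is a minimal Python sketch of the inference side of a CTC-fine-tuned wav2vec 2.0 model, together with the relative-reduction arithmetic behind the quoted 23.1% CER figure. It uses the HuggingFace transformers API; the checkpoint facebook/wav2vec2-base-960h (an English model fine-tuned on LibriSpeech, which the thesis uses for benchmarking) and the example error rates are illustrative assumptions, not the thesis's actual Cantonese models or results.

    # Minimal sketch: greedy CTC decoding with a fine-tuned wav2vec 2.0 model,
    # plus the relative-CER-reduction arithmetic quoted in the abstract.
    # "facebook/wav2vec2-base-960h" stands in for the thesis's Cantonese models,
    # which are not publicly named in this announcement.
    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    model.eval()

    # One second of 16 kHz audio; replace with a real waveform in practice.
    waveform = np.random.randn(16000).astype(np.float32)
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Greedy CTC decoding: argmax per frame; the tokenizer collapses
    # repeated tokens and blanks when decoding.
    pred_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(pred_ids)[0])

    def relative_reduction(baseline_cer: float, new_cer: float) -> float:
        """Relative error-rate reduction, as used for the 23.1% figure."""
        return (baseline_cer - new_cer) / baseline_cer

    # Hypothetical numbers: a baseline CER of 30.0% dropping to 23.07%
    # corresponds to a 23.1% relative reduction.
    print(f"{relative_reduction(0.300, 0.2307):.1%}")

In the thesis setting, the same decoding path would follow two further stages that the sketch omits: continued (further) pre-training of the XLS-R encoder on target-language and then target-domain audio, and CTC fine-tuning on the labeled portion of CU-MARVEL, optionally augmented with pseudo-labeled data.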