Embracing Multilingualism: Optimizing LLM Agents for Code-Switching Data Synthesis via Linguistic Principles and Tool Integration
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "Embracing Multilingualism: Optimizing LLM Agents for Code-Switching
Data Synthesis via Linguistic Principles and Tool Integration"
By
Mr. Peng XIE
Abstract:
Code-switching (CS) is the alternating use of two or more languages within a
conversation or utterance, often influenced by social context and speaker
identity. This linguistic phenomenon poses challenges for Automatic Speech
Recognition (ASR) systems, which are typically designed for a single
language and struggle to handle multilingual inputs. The growing global
demand for multilingual applications, including Code-Switching ASR (CSASR),
Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR),
highlights the inadequacy of existing monolingual datasets. Although some CS
datasets exist, most are limited to bilingual mixing within homogeneous
ethnic groups, leaving a critical need for a large-scale, diverse benchmark
akin to ImageNet in computer vision. To bridge this gap, we introduce
LinguaMaster, a multi-agent collaboration framework specifically designed
for efficient and scalable multilingual data synthesis. Leveraging this
framework, we curate SwitchLingua, the first large-scale multilingual and
multi-ethnic CS dataset, comprising (1) 420K CS textual samples across 12
languages, and (2) over 80 hours of audio recordings based on the textual
data, from 174 speakers representing 18 countries/regions and 63
racial/ethnic backgrounds. This dataset captures rich linguistic and cultural
diversity, offering a foundational resource for advancing multilingual and
multicultural research. Furthermore, because existing ASR evaluation
metrics lack sensitivity to CS scenarios, we propose the
Semantic-Aware Error Rate (SAER), a novel evaluation metric that
incorporates semantic information, providing a more accurate and
context-aware assessment of system performance. Benchmark experiments on
SwitchLingua with state-of-the-art ASR models reveal substantial performance
gaps, underscoring the dataset’s utility as a rigorous benchmark for CS
capability evaluation. In addition, SwitchLingua aims to encourage further
research to promote cultural inclusivity and linguistic diversity in speech
technology, fostering equitable progress in the ASR field.
Date: Thursday, 26 June 2025
Time: 2:00pm - 4:00pm
Venue: Room 3494 (Lifts 25/26)
Chairman: Dr. Dan XU
Committee Members: Dr. Yangqiu SONG (Supervisor)
Prof. Kani CHEN (Co-supervisor, MATH)
Dr. May FUNG