Embracing Multilingualism: Optimizing LLM Agents for Code-Switching Data Synthesis via Linguistic Principles and Tool Integration

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Embracing Multilingualism: Optimizing LLM Agents for Code-Switching 
Data Synthesis via Linguistic Principles and Tool Integration"

By

Mr. Peng XIE


Abstract:

Code-switching (CS) is the alternating use of two or more languages within a 
conversation or utterance, often influenced by social context and speaker 
identity. This linguistic phenomenon poses challenges for Automatic Speech 
Recognition (ASR) systems, which are typically designed for a single 
language and struggle to handle multilingual inputs. The growing global 
demand for multilingual applications, including Code-Switching ASR (CSASR), 
Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), 
highlights the inadequacy of existing monolingual datasets. Although some CS 
datasets exist, most are limited to bilingual mixing within homogeneous 
ethnic groups, leaving a critical need for a large-scale, diverse benchmark 
akin to ImageNet in computer vision. To bridge this gap, we introduce 
LinguaMaster, a multi-agent collaboration framework specifically designed 
for efficient and scalable multilingual data synthesis. Leveraging this 
framework, we curate SwitchLingua, the first large-scale multilingual and 
multi-ethnic CS dataset, including: (1) 420K CS textual samples across 12 
languages, and (2) over 80 hours of audio recordings, based on the textual 
data, from 174 speakers representing 18 countries/regions and 63 
racial/ethnic backgrounds. This dataset captures rich linguistic and cultural 
diversity, offering a foundational resource for advancing multilingual and 
multicultural research. Furthermore, because existing ASR evaluation metrics 
are insensitive to CS scenarios, we propose the Semantic-Aware Error Rate 
(SAER), a novel evaluation metric that incorporates semantic information to 
provide a more accurate and context-aware assessment of system performance. 
Benchmark experiments on 
SwitchLingua with state-of-the-art ASR models reveal substantial performance 
gaps, underscoring the dataset’s utility as a rigorous benchmark for CS 
capability evaluation. In addition, SwitchLingua aims to encourage further 
research that promotes cultural inclusivity and linguistic diversity in 
speech technology, fostering equitable progress in the ASR field.


Date:                   Thursday, 26 June 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Dr. Dan XU

Committee Members:      Dr. Yangqiu SONG (Supervisor)
                        Prof. Kani CHEN (Co-supervisor, MATH)
                        Dr. May FUNG