More about HKUST
Embracing Multilingualism: Optimizing LLM Agents for Code-Switching Data Synthesis via Linguistic Principles and Tool Integration
The Hong Kong University of Science and Technology Department of Computer Science and Engineering MPhil Thesis Defence Title: "Embracing Multilingualism: Optimizing LLM Agents for Code-Switching Data Synthesis via Linguistic Principles and Tool Integration" By Mr. Peng XIE Abstract: Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some CS datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate SwitchLingua, the first large-scale multilingual and multi-ethnic CS dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to CS scenarios, we propose the Semantic-Aware Error Rate (SAER), a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Benchmark experiments on SwitchLingua with state-of-the-art ASR models reveal substantial performance gaps, underscoring the dataset’s utility as a rigorous benchmark for CS capability evaluation. In addition, SwitchLingua aims to encourage further research to promote cultural inclusivity and linguistic diversity in speech technology, fostering equitable progress in the ASR field. Date: Thursday, 26 June 2025 Time: 2:00pm - 4:00pm Venue: Room 3494 Lifts 25/26 Chairman: Dr. Dan XU Committee Members: Dr. Yangqiu SONG (Supervisor) Prof. Kani CHEN (Co-supervisor, MATH) Dr. May FUNG