A Dual-Loop Multi-Agent Framework for Controllable and Adaptive Synthetic Data Generation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "A Dual-Loop Multi-Agent Framework for Controllable and Adaptive 
Synthetic Data Generation"

By

Mr. Tianyi ZHANG


Abstract:

Today, large language models (LLMs) are significantly expanding the practical 
applications of natural language processing (NLP), excelling in particular at 
retrieval-augmented generation (RAG), task adaptation, and few-shot learning. 
However, in data-scarce "niche verticals" such as legal case analysis, 
clinical medical reasoning, or specialized financial tasks, even 
high-performing general-purpose models encounter challenges: a lack of 
labeled data, domain-specific biases, and mismatched data distributions. 
These issues severely limit the models' adaptability and reasoning 
performance in specific domains. Although generating synthetic data is an 
effective way to address these domain-specific shortcomings, existing 
generation methods often lack precision and fail to adhere strictly to the 
rules and conventions of a given domain. As a result, the synthetic data 
frequently contains structural flaws or repetitive content, or is simply 
off-topic.

To address these challenges, this thesis proposes a dual-loop multi-agent 
framework that treats data generation as a closed-loop controlled process. 
Specifically, we combine domain-specific prior knowledge with samples drawn 
from real-world datasets in the domain, so that the LLM can construct a clear 
and intuitive "factor tree" that models the variables requiring control. Next, 
we instruct a "Creator" to synthesize specific data based on this "factor 
tree"; finally, we deploy a "Critic" to perform semantic quality control on 
the generated content, providing continuous feedback. Based on this feedback, 
the system automatically selects and refines prompts, and drives the 
iterative evolution of the "factor tree" structure. Running this process 
yields optimized prompts, and the "factor tree" identifies two core elements 
specific to the domain: "diversity" and "quality" (expression style). By 
combining these elements, we can mass-produce 
synthetic data that closely aligns with the actual data distribution in niche 
vertical domains. Empirical evaluations of convergence, controllability, and 
distributional fit show that the framework generates robust, domain-oriented 
synthetic data, supporting its value as a reliable tool for specialized 
tasks.
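
The closed-loop process described above can be illustrated with a minimal 
sketch. It is not the thesis implementation: every name below (LoopState, 
run_dual_loop, the toy_* functions) is hypothetical, the "factor tree" is 
flattened to a dictionary of factors and their allowed values, and the 
Creator and Critic are plain stand-in functions rather than LLM calls. The 
sketch also assumes that the inner loop refines the prompt and the outer 
loop evolves the "factor tree"; the abstract does not spell out this split.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: the "factor tree" is simplified to factor name -> allowed values.
FactorTree = Dict[str, List[str]]


@dataclass
class LoopState:
    tree: FactorTree
    prompt: str
    history: List[float] = field(default_factory=list)  # mean Critic score per inner round


def sample_factors(tree: FactorTree, rng: random.Random) -> Dict[str, str]:
    """Pick one value per factor to condition a single generated sample."""
    return {name: rng.choice(values) for name, values in tree.items()}


def run_dual_loop(
    state: LoopState,
    creator: Callable[[str, Dict[str, str]], str],
    critic: Callable[[str], float],
    refine_prompt: Callable[[str, float], str],
    evolve_tree: Callable[[FactorTree, float], FactorTree],
    outer_rounds: int = 2,
    inner_rounds: int = 3,
    batch_size: int = 4,
    seed: int = 0,
) -> LoopState:
    rng = random.Random(seed)
    for _ in range(outer_rounds):
        # Inner loop (assumed): generate, score, and refine the prompt.
        for _ in range(inner_rounds):
            batch = [
                creator(state.prompt, sample_factors(state.tree, rng))
                for _ in range(batch_size)
            ]
            scores = [critic(sample) for sample in batch]
            mean_score = sum(scores) / len(scores)
            state.history.append(mean_score)
            state.prompt = refine_prompt(state.prompt, mean_score)
        # Outer loop (assumed): revise the factor tree from the latest feedback.
        state.tree = evolve_tree(state.tree, state.history[-1])
    return state


# Toy stand-ins so the sketch runs without any model access.
def toy_creator(prompt: str, factors: Dict[str, str]) -> str:
    described = ", ".join(f"{k}={v}" for k, v in sorted(factors.items()))
    return f"{prompt} [{described}]"


def toy_critic(sample: str) -> float:
    # Placeholder score in [0, 1]; a real Critic would judge semantic quality.
    return min(1.0, len(sample) / 200)


def toy_refine(prompt: str, score: float) -> str:
    return prompt if score >= 0.9 else prompt + " Be specific."


def toy_evolve(tree: FactorTree, score: float) -> FactorTree:
    evolved = {name: list(values) for name, values in tree.items()}
    evolved.setdefault("tone", ["formal"])  # illustrative new factor
    return evolved


initial = LoopState(
    tree={"case_type": ["contract", "tort"], "jurisdiction": ["HK", "UK"]},
    prompt="Write a legal case summary.",
)
final = run_dual_loop(initial, toy_creator, toy_critic, toy_refine, toy_evolve)
print(final.prompt)
print(sorted(final.tree))
print(final.history)
```

Passing the Creator, Critic, prompt refiner, and tree updater as plain 
callables keeps the control flow of the two loops visible on its own; in 
practice each would wrap a model call and a far richer update rule.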


Date:                   Wednesday, 29 April 2026

Time:                   11:00am - 1:00pm

Venue:                  Room 2128A
                        Lift 22

Chairman:               Dr. Shuai WANG

Committee Members:      Prof. Bo LI (Supervisor)
                        Dr. Chaojian LI