A Dual-Loop Multi-Agent Framework for Controllable and Adaptive Synthetic Data Generation
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "A Dual-Loop Multi-Agent Framework for Controllable and Adaptive
Synthetic Data Generation"
By
Mr. Tianyi ZHANG
Abstract:
Today, large language models (LLMs) are significantly expanding the practical
applications of natural language processing (NLP), particularly excelling in
areas such as retrieval-augmented generation (RAG), task adaptation, and
few-shot learning. However, when entering data-scarce "niche verticals"—such
as legal case analysis, clinical medical reasoning or specialized financial
tasks—even high-performance general-purpose large models encounter
challenges: a lack of labeled data, domain-specific biases, and mismatched
data distributions. These issues severely limit the models' adaptability and
reasoning performance in specific domains. Although "generating synthetic
data" is an effective shortcut to address these domain-specific shortcomings,
existing generation methods often lack precision and fail to strictly adhere
to the specific rules and conventions of a given domain. As a result, the
synthetic data frequently contains structural flaws or repetitive content,
or is simply off-topic.
To address these challenges, this thesis proposes a dual-loop multi-agent
framework that treats data generation as a closed-loop controlled process.
Specifically, we combine domain-specific prior knowledge with samples drawn
from real-world datasets in the domain so that the LLM can construct a clear
and intuitive "factor tree" that models the variables requiring control. Next,
we instruct a "Creator" to synthesize specific data based on this "factor
tree"; finally, we deploy a "Critic" to perform semantic quality control on
the generated content, providing continuous feedback. Based on this feedback,
the system automatically selects and refines prompts, and drives the
iterative evolution of the "factor tree" structure. By running through this
process, we obtain optimized prompts and use the "factor tree" to identify
two core elements specific to the domain: "diversity" and "quality"
(expression style). By combining these elements, we can mass-produce
synthetic data that closely aligns with the actual data distribution in niche
vertical domains. Empirical evaluations on convergence, controllability, and
distributional fit affirm that this framework generates robust,
domain-oriented synthetic data, which validates its value as a reliable tool
for specialized tasks.
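
As a rough illustration of the closed-loop process described above, the
following minimal Python sketch may help. It is not the thesis implementation:
the names FactorTree, creator, critic, inner_loop and outer_loop are
hypothetical, the Creator and Critic are simple string-based stand-ins for LLM
calls, and reading the two loops as "prompt selection" and "factor tree
evolution" is an assumption based on the abstract.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass, field


@dataclass
class FactorTree:
    # Hypothetical, flattened stand-in for the "factor tree":
    # each controllable factor maps to its allowed values.
    factors: dict[str, list[str]] = field(default_factory=dict)

    def combinations(self):
        """Yield every assignment of one value per factor."""
        names = sorted(self.factors)
        for values in itertools.product(*(self.factors[n] for n in names)):
            yield dict(zip(names, values))


def creator(prompt: str, assignment: dict[str, str]) -> str:
    """Stand-in for the LLM "Creator": fills the prompt template."""
    return prompt.format(**assignment)


def critic(sample: str, assignment: dict[str, str]) -> float:
    """Stand-in for the LLM "Critic": fraction of factor values
    that actually appear in the generated sample."""
    hits = sum(value in sample for value in assignment.values())
    return hits / len(assignment)


def inner_loop(tree: FactorTree, prompts: list[str]) -> str:
    """Assumed prompt loop: keep the prompt with the best mean critic score."""
    def mean_score(prompt: str) -> float:
        scores = [critic(creator(prompt, a), a) for a in tree.combinations()]
        return sum(scores) / len(scores)

    return max(prompts, key=mean_score)


def outer_loop(tree, prompts, proposals, rounds=2):
    """Assumed structure loop: after each prompt-selection pass,
    grow the tree with one proposed (factor, value) pair, if any remain."""
    pending = list(proposals)  # copy so the caller's list is not mutated
    best = prompts[0]
    for _ in range(rounds):
        best = inner_loop(tree, prompts)
        if pending:
            factor, value = pending.pop(0)
            tree.factors.setdefault(factor, []).append(value)
    return best, tree
```

In this toy setting the critic only checks whether each factor value is
reflected in the output, so a prompt that ignores a factor scores lower and is
not selected. A real system would replace both stand-ins with model calls and
would derive the tree updates from the Critic's feedback rather than from a
fixed proposal list.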
Date: Wednesday, 29 April 2026
Time: 11:00am - 1:00pm
Venue: Room 2128A
Lift 22
Chairman: Dr. Shuai WANG
Committee Members: Prof. Bo LI (Supervisor)
Dr. Chaojian LI