The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Enhancing Large Language Model via Data-Centric View: Strategies for
Selection, Synthesis, and Evaluation"
By
Mr. Jipeng ZHANG
Abstract:
The development of Large Language Models (LLMs) increasingly depends on data
quality rather than quantity. However, challenges remain in evaluating model
capabilities, selecting compact yet informative training subsets, and generating
high-quality data for underrepresented domains. This thesis proposes
data-centric solutions across evaluation, selection, and generation to
systematically enhance LLM performance.
For evaluation, we address the limitations of likelihood-based metrics, which
suffer from exposure bias, by introducing a rank-based metric for
autoregressive models, built on NDCG, that correlates better with human and
GPT-4 judgments of fine-tuned models.
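
For reference, the standard NDCG definition that such a rank-based metric
builds on is (the abstract does not spell out the thesis's exact
autoregressive adaptation):

    \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
    \qquad
    \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},

where rel_i is the graded relevance of the item at rank i and IDCG@k is the
DCG@k of the ideal ordering, so NDCG@k lies in [0, 1].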
For data selection, we propose TAGCOS, a gradient-based coreset selection
method that identifies highly informative instruction-tuning subsets,
reducing data usage by 95% without sacrificing performance. In pretraining,
we develop Fox-1, a small language model trained with a curriculum-based data
scheduling strategy, achieving strong performance with efficient resource
usage.
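
As a rough illustration of the general idea behind gradient-based coreset
selection (a simplified Python sketch, not the TAGCOS algorithm itself; the
function name and toy inputs are hypothetical), one can greedily choose the
subset whose average gradient best approximates the mean gradient of the full
dataset:

    import numpy as np

    def greedy_coreset(grads, budget):
        """Greedily pick `budget` examples whose average gradient best
        approximates the mean gradient of the full dataset.

        grads: (n_examples, d) array of per-example gradient features.
        A simplified illustration of gradient-based coreset selection,
        not the TAGCOS algorithm itself.
        """
        target = grads.mean(axis=0)        # mean gradient of the full data
        chosen, current = [], np.zeros_like(target)
        for _ in range(budget):
            k = len(chosen) + 1
            # residual error if each candidate were added next
            errs = np.linalg.norm((current + grads) / k - target, axis=1)
            errs[chosen] = np.inf          # never pick the same example twice
            best = int(errs.argmin())
            chosen.append(best)
            current += grads[best]
        return chosen

    # toy usage: keep 50 of 1,000 examples (a 95% reduction, as above)
    rng = np.random.default_rng(0)
    subset = greedy_coreset(rng.normal(size=(1000, 16)), budget=50)
    print(len(subset), "examples selected")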
For data generation, we target long-tail domains where training data is
scarce. In code generation, Bridge-Assist Generation synthesizes low-resource
programming language data by leveraging LLM knowledge transfer, significantly
improving performance on multilingual code benchmarks. In text-to-SQL, ExeSQL
combines
execution-guided filtering and preference learning to adapt models across SQL
dialects. For vision-language alignment, we design an expert-assisted reward
model training pipeline that mitigates hallucinations through iterative data
refinement.
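
The execution-guided filtering step can be sketched minimally as follows
(assuming a SQLite target database; execution_filter is an illustrative name,
not ExeSQL's actual API). Candidate queries sampled from a model are run
against the database and those that fail to execute are dropped; in the
thesis's pipeline, surviving and failing candidates additionally feed
preference learning:

    import sqlite3

    def execution_filter(candidates, db_path):
        """Keep only generated SQL queries that execute without error.

        A minimal illustration of execution-guided filtering, not
        ExeSQL's actual implementation.
        """
        kept = []
        conn = sqlite3.connect(db_path)
        try:
            for sql in candidates:
                try:
                    conn.execute(sql).fetchall()   # run the candidate query
                    kept.append(sql)               # executable: keep it
                except sqlite3.Error:
                    pass                           # syntax/dialect error: drop it
        finally:
            conn.close()
        return kept

    # toy usage against an empty in-memory database
    print(execution_filter(["SELECT 1", "SELEC 1 FROM nowhere"], ":memory:"))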
Overall, this thesis demonstrates that targeted data evaluation, selection,
and generation can efficiently scale LLM capabilities across diverse tasks
and domains.
Date: Friday, 29 August 2025
Time: 10:00am - 12:00noon
Venue: Room 2611
Lifts 31/32
Chairman: Dr. Jiachuan YANG (CIVL)
Committee Members: Prof. Xiaofang ZHOU (Supervisor)
Dr. Qifeng CHEN
Dr. Dan XU
Dr. Sirui HAN (EMIA)
Prof. Jing JIANG (ANU)