Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and Generation
PhD Thesis Proposal Defence
Title: "Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and
Generation"
by
Mr. Jipeng ZHANG
Abstract:
The development of Large Language Models (LLMs) increasingly depends on data
quality rather than quantity. However, challenges remain in evaluating model
capabilities, selecting efficient training subsets, and generating
high-quality data for underrepresented domains. This thesis proposes
data-centric solutions across evaluation, selection, and generation to
systematically enhance LLM performance.
For evaluation, we address the limitations of likelihood-based metrics, which
suffer from exposure bias, by introducing a rank-based autoregressive metric,
NDCG, that better correlates with human and GPT-4 judgments on fine-tuned
models.
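As a rough illustration of the rank-based idea, the sketch below computes the standard Normalized Discounted Cumulative Gain (NDCG) for a ranked list of graded relevance scores. It shows only the textbook formula; how the thesis maps token-level model outputs onto ranks and relevance values is not stated in this abstract, so the function names and the example inputs here are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance lists, in the order a model ranked the items.
well_ranked = ndcg([3, 2, 1])    # most relevant item ranked first
poorly_ranked = ndcg([1, 2, 3])  # most relevant item ranked last
```

A list that is already in ideal order scores exactly 1, and any misordering lowers the score, which is what makes the measure sensitive to rank rather than to raw likelihood.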
For data selection, we propose TAGCOS, a gradient-based coreset selection
method that identifies highly informative instruction-tuning subsets,
reducing data usage by 95% without sacrificing performance. In pretraining,
we develop Fox-1, a small language model trained with a curriculum-based data
scheduling strategy, achieving strong performance with efficient resource
usage.
For data generation, we target long-tail domains where training data is
scarce. In code generation, Bridge-Assist Generation synthesizes low-resource
programming language data by leveraging LLM knowledge transfer, significantly
improving multilingual code benchmarks. In text-to-SQL, ExeSQL combines
execution-guided filtering and preference learning to adapt models across SQL
dialects. For vision-language alignment, we design an expert-assisted reward
model training pipeline that mitigates hallucinations through iterative data
refinement.
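To make execution-guided filtering concrete, here is a minimal sketch that keeps only candidate SQL queries which run without error and return the expected rows. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a target dialect. The function name, schema, and candidate queries are hypothetical, and this is not presented as the ExeSQL pipeline itself, whose details are not given in this abstract.

```python
import sqlite3


def execution_filter(candidates, setup_sql, expected_rows):
    """Keep queries that execute successfully and return the expected rows."""
    kept = []
    for query in candidates:
        # A fresh in-memory database per query keeps candidates isolated.
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(setup_sql)
            rows = conn.execute(query).fetchall()
        except sqlite3.Error:
            continue  # the query is not valid in this dialect, so discard it
        finally:
            conn.close()
        # Compare as sorted lists so row order does not matter.
        if sorted(rows) == sorted(expected_rows):
            kept.append(query)
    return kept


# Hypothetical schema and candidate queries for illustration only.
setup = """
CREATE TABLE users (name TEXT, age INTEGER);
INSERT INTO users VALUES ('Ann', 35), ('Bob', 25);
"""
candidates = [
    "SELECT name FROM users WHERE age > 30",  # valid and correct
    "SELECT TOP 1 name FROM users",           # T-SQL syntax, fails in SQLite
    "SELECT name FROM users",                 # valid but returns extra rows
]
kept = execution_filter(candidates, setup, [("Ann",)])
```

In this sketch only the first candidate survives: the second raises a syntax error in SQLite, and the third executes but returns a different result set.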
Overall, this thesis demonstrates that targeted data evaluation, selection,
and generation can efficiently scale LLM capabilities across diverse tasks
and domains.
Date: Friday, 27 June 2025
Time: 10:00 am - 12:00 noon
Venue: Room 3494 (Lifts 25/26)
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Dan Xu