Enhancing Large Language Models via a Data-Centric View: Strategies for Selection, Synthesis, and Evaluation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Enhancing Large Language Models via a Data-Centric View: Strategies for 
Selection, Synthesis, and Evaluation"

By

Mr. Jipeng ZHANG


Abstract:

The development of Large Language Models (LLMs) increasingly depends on data 
quality rather than quantity. However, challenges remain in evaluating model 
capabilities, selecting efficient training subsets, and generating 
high-quality data for underrepresented domains. This thesis proposes 
data-centric solutions across evaluation, selection, and generation to 
systematically enhance LLM performance.

For evaluation, we address the limitations of likelihood-based metrics, which 
suffer from exposure bias, by introducing a rank-based metric for autoregressive 
models, built on NDCG (Normalized Discounted Cumulative Gain), that correlates 
better with human and GPT-4 judgments on fine-tuned models.

For data selection, we propose TAGCOS, a gradient-based coreset selection 
method that identifies highly informative instruction-tuning subsets, 
reducing data usage by 95% without sacrificing performance. In pretraining, 
we develop Fox-1, a small language model trained with a curriculum-based data 
scheduling strategy, achieving strong performance with efficient resource 
usage.

For data generation, we target long-tail domains where training data is 
scarce. In code generation, Bridge-Assist Generation synthesizes training data 
for low-resource programming languages by leveraging LLM knowledge transfer, 
significantly improving performance on multilingual code benchmarks. In 
text-to-SQL, ExeSQL combines
execution-guided filtering and preference learning to adapt models across SQL 
dialects. For vision-language alignment, we design an expert-assisted reward 
model training pipeline that mitigates hallucinations through iterative data 
refinement.

Overall, this thesis demonstrates that targeted data evaluation, selection,
and generation can efficiently scale LLM capabilities across diverse tasks
and domains.


Date:                   Friday, 29 August 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 2611
                        Lifts 31/32

Chairman:               Dr. Jiachuan YANG (CIVL)

Committee Members:      Prof. Xiaofang ZHOU (Supervisor)
                        Dr. Qifeng CHEN
                        Dr. Dan XU
                        Dr. Sirui HAN (EMIA)
                        Prof. Jing JIANG (ANU)