Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and Generation

PhD Thesis Proposal Defence


Title: "Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and 
Generation"

by

Mr. Jipeng ZHANG


Abstract:


The development of Large Language Models (LLMs) increasingly depends on data 
quality rather than quantity. However, challenges remain in evaluating model 
capabilities, selecting efficient training subsets, and generating 
high-quality data for underrepresented domains. This thesis proposes 
data-centric solutions across evaluation, selection, and generation to 
systematically enhance LLM performance.

For evaluation, we address the limitations of likelihood-based metrics, which 
suffer from exposure bias, by introducing a rank-based autoregressive metric, 
NDCG, that better correlates with human and GPT-4 judgments on fine-tuned 
models.
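
As a rough illustration of what a rank-based score looks like, the sketch below computes NDCG at each decoding step with the reference token as the single relevant item, then averages over the sequence. This is an assumption made for illustration and not necessarily the exact formulation used in the thesis; the names token_ndcg and sequence_ndcg, and the list-of-scores input format, are hypothetical.

```python
import math


def token_ndcg(gold_rank):
    """NDCG for one step with a single relevant item (the reference token).

    gold_rank is 1-indexed: 1 means the model ranked the reference token first.
    With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(1 + rank).
    """
    if gold_rank < 1:
        raise ValueError("gold_rank must be >= 1")
    return 1.0 / math.log2(1 + gold_rank)


def sequence_ndcg(scores_per_step, gold_ids):
    """Average per-step NDCG over a sequence.

    scores_per_step: one list of vocabulary scores (e.g. logits) per step.
    gold_ids: index of the reference token at each step.
    """
    values = []
    for scores, gold in zip(scores_per_step, gold_ids):
        # Rank = 1 + number of tokens scored strictly higher than the reference.
        rank = 1 + sum(1 for s in scores if s > scores[gold])
        values.append(token_ndcg(rank))
    return sum(values) / len(values) if values else 0.0
```

Unlike a likelihood, this score depends only on where the reference token falls in the model's ranking, not on the probability mass assigned to it. The sketch takes the per-step scores as given and does not show how they are produced.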

For data selection, we propose TAGCOS, a gradient-based coreset selection 
method that identifies highly informative instruction-tuning subsets, 
reducing data usage by 95% without sacrificing performance. In pretraining, 
we develop Fox-1, a small language model trained with a curriculum-based data 
scheduling strategy, achieving strong performance with efficient resource 
usage.

For data generation, we target long-tail domains where training data is 
scarce. In code generation, Bridge-Assist Generation synthesizes data for 
low-resource programming languages by leveraging LLM knowledge transfer, 
significantly improving results on multilingual code benchmarks. In 
text-to-SQL, ExeSQL combines execution-guided filtering and preference 
learning to adapt models across SQL dialects. For vision-language alignment, 
we design an expert-assisted reward model training pipeline that mitigates 
hallucinations through iterative data refinement.
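
The idea of execution-guided filtering can be sketched as follows: each candidate query is run against a database, and only candidates that execute (and, optionally, return the expected rows) are kept. This is a minimal sketch under stated assumptions, not the ExeSQL pipeline itself: SQLite stands in for the target dialect, the function name execution_filter and its parameters are hypothetical, and the preference-learning stage is not shown.

```python
import sqlite3


def execution_filter(candidates, setup_sql, expected_rows=None):
    """Return the candidate queries that pass an execution check.

    candidates: list of SQL strings (e.g. model-generated queries).
    setup_sql: script that creates and populates the test schema.
    expected_rows: if given, a candidate must also return these rows
                   (compared order-insensitively).
    """
    kept = []
    for query in candidates:
        # Fresh in-memory database per candidate, so one query
        # cannot affect the evaluation of another.
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(setup_sql)
            rows = conn.execute(query).fetchall()
        except sqlite3.Error:
            continue  # query failed to run in this dialect: discard it
        finally:
            conn.close()
        if expected_rows is None or sorted(rows) == sorted(expected_rows):
            kept.append(query)
    return kept
```

For example, with a table t(id, name), a candidate written as `SELECT TOP 1 name FROM t` is rejected by SQLite, while `SELECT name FROM t LIMIT 1` passes.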

Overall, this thesis demonstrates that targeted data evaluation, selection, 
and generation can efficiently scale LLM capabilities across diverse tasks 
and domains.


Date:                   Friday, 27 June 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Xiaofang Zhou (Supervisor)
                        Dr. Qifeng Chen (Chairperson)
                        Dr. May Fung
                        Dr. Dan Xu