Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and Generation
PhD Thesis Proposal Defence

Title: "Enhancing LLMs from a Data-Centric View: Evaluation, Selection, and Generation"

by

Mr. Jipeng ZHANG

Abstract:

The development of Large Language Models (LLMs) increasingly depends on data quality rather than quantity. However, challenges remain in evaluating model capabilities, selecting efficient training subsets, and generating high-quality data for underrepresented domains. This thesis proposes data-centric solutions across evaluation, selection, and generation to systematically enhance LLM performance.

For evaluation, we address the limitations of likelihood-based metrics, which suffer from exposure bias, by introducing a rank-based autoregressive metric, NDCG, that better correlates with human and GPT-4 judgments on fine-tuned models.

For data selection, we propose TAGCOS, a gradient-based coreset selection method that identifies highly informative instruction-tuning subsets, reducing data usage by 95% without sacrificing performance. In pretraining, we develop Fox-1, a small language model trained with a curriculum-based data scheduling strategy, achieving strong performance with efficient resource usage.

For data generation, we target long-tail domains where training data is scarce. In code generation, Bridge-Assist Generation synthesizes low-resource programming language data by leveraging LLM knowledge transfer, significantly improving multilingual code benchmarks. In text-to-SQL, ExeSQL combines execution-guided filtering and preference learning to adapt models across SQL dialects. For vision-language alignment, we design an expert-assisted reward model training pipeline that mitigates hallucinations through iterative data refinement.

Overall, this thesis demonstrates that targeted data evaluation, selection, and generation can efficiently scale LLM capabilities across diverse tasks and domains.

Date: Friday, 27 June 2025
Time: 10:00am - 12:00noon
Venue: Room 3494, Lifts 25/26

Committee Members:
Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Dan Xu