PhD Qualifying Examination
Title: "Data-Efficient Optimization of Large Language Models: A Survey"
by
Mr. Tianhao TANG
Abstract:
Training and fine-tuning large language models (LLMs) are extremely resource-intensive, often requiring multiple passes over billions of tokens. This survey focuses on the data-efficient optimization of LLMs, treating the selection and weighting of training data as explicit optimization objectives. We organize recent work into two complementary paradigms: data pruning, which identifies a compact subset of training examples, and data reweighting, which adaptively adjusts sampling weights and the scheduling of data over time. For data pruning, we categorize existing methods into data-property-based and training-signal-based approaches, and further examine advanced extensions, including dynamic selection, proxy models, multi-criteria objectives, and task-specific settings. For reweighting, we survey methods spanning curriculum learning, instance-level reweighting, and domain-level mixture optimization, highlighting how they exploit signals similar to those used in pruning to achieve softer forms of data efficiency. Finally, we discuss connections between the two paradigms, identify open challenges, and outline future directions for data-efficient optimization.
Date: Friday, 12 December 2025
Time: 2:00pm - 4:00pm
Venue: Room 2129A (Lift 19)
Committee Members: Prof. Lei Chen (Supervisor)
Prof. Qiong Luo (Chairperson)
Dr. Dan Xu
Prof. Xiaofang Zhou