Data-Efficient Optimization of Large Language Models: A Survey

PhD Qualifying Examination


Title: "Data-Efficient Optimization of Large Language Models: A Survey"

by

Mr. Tianhao TANG


Abstract:

Training and fine-tuning large language models (LLMs) are extremely resource-
intensive, often requiring multiple passes over billions of tokens. This
survey focuses on the data-efficient optimization of LLMs, treating the
selection and weighting of training data as explicit optimization problems.
We organize recent work into two complementary paradigms: data pruning, which
identifies a compact subset of training examples, and data reweighting, which
adjusts sampling weights and the scheduling of data over time. For data
pruning, we categorize methods into data-property-based and
training-signal-based approaches. We further examine advanced extensions,
including dynamic selection, proxy models, multi-criteria objectives, and
task-specific settings. For data reweighting, we survey methods spanning
curriculum learning, instance-level reweighting, and domain-level mixture
optimization, highlighting how they exploit signals similar to those used in
pruning to achieve softer forms of data efficiency. Finally, we discuss
connections between the two paradigms, identify open challenges, and outline
future directions for data-efficient LLM optimization.


Date:                   Friday, 12 December 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 2129A
                        Lift 19

Committee Members:      Prof. Lei Chen (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Dr. Dan Xu
                        Prof. Xiaofang Zhou