PhD Qualifying Examination
Title: "A Survey of Data Selection Strategies for Large Language Model
Training"
by
Mr. Weijie SHI
Abstract:
Data selection determines what a large language model learns and how
efficiently it learns. As training corpora grow to trillions of tokens,
choosing suitable data has become critical, especially when
computational budgets are limited. We review data selection methods across
three training stages: pre-training, fine-tuning, and reinforcement learning.
For pre-training, instance-level selection proceeds in stages of increasing
cost and precision. Heuristic filtering and deduplication remove obvious
noise at scale. Quality assessment then evaluates whether data is inherently
well-formed without reference to any task, while relevance estimation
measures how useful data is for a specific target. These methods treat each
example independently, without determining how much web text, code, books,
and other sources to combine. Domain-level mixture optimization addresses
this by treating source proportions as learnable parameters, optimized via
proxy models or online feedback. In fine-tuning, datasets shrink from
billions of tokens to thousands of examples, and selection shifts from noise
removal to curating a training set that balances quality, difficulty, and
diversity. In reinforcement learning, the model generates its own training
data through rollouts, and data selection becomes inseparable from the
optimization process. Methods must ensure informative contrast within
training batches while filtering stale off-policy data when rollouts are
reused across updates. A comparison across stages reveals a shared cascade
pattern, where cheap filters narrow the pool before expensive methods refine
it. We conclude with open problems and practical guidelines to help
practitioners design selection pipelines across training stages.
Date: Thursday, 2 April 2026
Time: 3:00pm - 5:00pm
Venue: Room 2132C
Lift 22
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Prof. Qiong Luo (Chairperson)
Dr. Binghang Yuan