A Survey of Data Selection Strategies for Large Language Model Training

PhD Qualifying Examination


Title: "A Survey of Data Selection Strategies for Large Language Model 
Training"

by

Mr. Weijie SHI


Abstract:

Data selection determines what a large language model learns and how 
efficiently it learns. As training corpora grow to trillions of tokens, 
choosing suitable data has become important, especially when 
computational budgets are limited. We review data selection methods across 
three training stages: pre-training, fine-tuning, and reinforcement learning. 
For pre-training, instance-level selection proceeds in stages of increasing 
cost and precision. Heuristic filtering and deduplication remove obvious 
noise at scale. Quality assessment then evaluates whether data is inherently 
well-formed without reference to any task, while relevance estimation 
measures how useful data is for a specific target. These methods treat each 
example independently, without determining how much web text, code, books, 
and other sources to combine. Domain-level mixture optimization addresses 
this by treating source proportions as learnable parameters, optimized via 
proxy models or online feedback. Moving to fine-tuning, datasets shrink from 
billions of tokens to thousands of examples, and selection shifts from noise 
removal to curating a training set that balances quality, difficulty, and 
diversity. In reinforcement learning, the model generates its own training 
data through rollouts, and data selection becomes inseparable from the 
optimization process. Methods must ensure informative contrast within 
training batches while filtering stale off-policy data when rollouts are 
reused across updates. Comparing across stages reveals a shared cascade 
pattern, where cheap filters narrow the pool before expensive methods refine 
it. We conclude with open problems and practical guidelines to help 
practitioners design selection pipelines across training stages.


Date:                   Thursday, 2 April 2026

Time:                   3:00pm - 5:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Prof. Xiaofang Zhou (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Dr. Binghang Yuan