PhD Qualifying Examination


Title: "A Survey of Data Efficiency in Foundation Model Training"

by

Miss Tianyi BAI


Abstract:

Foundation models—especially large language models (LLMs) and 
vision-language models (VLMs)—have achieved remarkable progress by 
harnessing large-scale and heterogeneous data sources. Although much of the 
existing work emphasizes architectural innovations, growing evidence shows 
that the performance and efficiency of these models are strongly shaped by 
the way data is collected, curated, and utilized. Notably, high-quality and 
well-structured data can enable smaller models to perform comparably to 
larger ones, underscoring the pivotal role of data in model development.

This survey provides a comprehensive overview of LLMs and VLMs from a 
data-centric perspective, addressing three fundamental questions: (1) How 
can large-scale multimodal data be effectively collected, processed, and 
selected? (2) In what ways do data characteristics influence model 
performance across language and vision-language tasks? (3) How can data 
quality be evaluated without requiring full-scale model training?

We summarize data preparation pipelines for both the pretraining and 
post-training stages, review existing data evaluation methodologies, and 
highlight how careful data design can significantly improve downstream 
performance. Unlike model-centric surveys, our work frames data as a core 
driver of model success. We also identify current challenges and outline 
promising directions for developing more data-efficient LLMs and VLMs.


Date:                   Tuesday, 8 April 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 5562
                        Lifts 27/28

Committee Members:      Dr. Binhang Yuan (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Dr. Ling Pan
                        Prof. Ke Yi