PhD Qualifying Examination
Title: "A Survey of Data Efficiency in Foundation Model Training"
by
Miss Tianyi BAI
Abstract:
Foundation models—especially large language models (LLMs) and
vision-language models (VLMs)—have achieved remarkable progress by
harnessing large-scale and heterogeneous data sources. Although much of the
existing work emphasizes architectural innovations, growing evidence shows
that the performance and efficiency of these models are strongly shaped by
the way data is collected, curated, and utilized. Notably, high-quality and
well-structured data can enable smaller models to perform comparably to
larger ones, underscoring the pivotal role of data in model development.
This survey provides a comprehensive overview of LLMs and VLMs from a
data-centric perspective, addressing three fundamental questions: (1) How
can large-scale multimodal data be effectively collected, processed, and
selected? (2) In what ways do data characteristics influence model
performance across language and vision-language tasks? (3) How can data
quality be evaluated without requiring full-scale model training?
We summarize data preparation pipelines for both pretraining and
post-training stages, review existing data evaluation methodologies, and
highlight how careful data design can significantly enhance downstream
performance. Distinct from model-centric surveys, our work frames data as a
core driver of model success. We also identify current challenges and
outline promising directions for developing more data-efficient LLMs and
VLMs.
Date: Tuesday, 8 April 2025
Time: 4:00pm - 6:00pm
Venue: Room 5562
Lifts 27/28
Committee Members: Dr. Binhang Yuan (Supervisor)
Dr. Yangqiu Song (Chairperson)
Dr. Ling Pan
Prof. Ke Yi