A Survey of Data Efficiency in Foundation Model Training
PhD Qualifying Examination

Title: "A Survey of Data Efficiency in Foundation Model Training"

by

Miss Tianyi BAI

Abstract:

Foundation models, especially large language models (LLMs) and vision-language models (VLMs), have achieved remarkable progress by harnessing large-scale and heterogeneous data sources. Although much of the existing work emphasizes architectural innovations, growing evidence shows that the performance and efficiency of these models are strongly shaped by the way data is collected, curated, and utilized. Notably, high-quality and well-structured data can enable smaller models to perform comparably to larger ones, underscoring the pivotal role of data in model development. This survey provides a comprehensive overview of LLMs and VLMs from a data-centric perspective, addressing three fundamental questions: (1) How can large-scale multimodal data be effectively collected, processed, and selected? (2) In what ways do data characteristics influence model performance across language and vision-language tasks? (3) How can data quality be evaluated without requiring full-scale model training? We summarize data preparation pipelines for both pretraining and post-training stages, review existing data evaluation methodologies, and highlight how careful data design can significantly enhance downstream performance. Distinct from model-centric surveys, our work frames data as a core driver of model success. We also identify current challenges and outline promising directions for developing more data-efficient LLMs and VLMs.

Date: Tuesday, 8 April 2025
Time: 4:00pm - 6:00pm
Venue: Room 5562 (Lifts 27/28)

Committee Members:
Dr. Binhang Yuan (Supervisor)
Dr. Yangqiu Song (Chairperson)
Dr. Ling Pan
Prof. Ke Yi
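As a concrete illustration of the abstract's third question (evaluating data quality without full-scale model training), one widely used family of techniques scores candidate documents with a small proxy model and keeps only the best-scoring fraction. The sketch below shows perplexity-based filtering under stated assumptions: the choice of "gpt2" as the proxy model, the 512-token truncation, the keep_ratio of 0.5, and the helper names perplexity and select are all illustrative, not taken from the survey itself.

```python
# Minimal sketch of proxy-model data filtering (illustrative assumptions:
# "gpt2" as the proxy LM, 512-token truncation, keep_ratio=0.5).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small, cheap proxy LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the proxy model (lower = more fluent)."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=512).input_ids
    with torch.no_grad():
        # With labels=input_ids, the causal-LM loss is the mean per-token
        # cross-entropy; exponentiating it yields perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def select(candidates: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Keep the keep_ratio fraction of documents with lowest perplexity."""
    ranked = sorted(candidates, key=perplexity)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

if __name__ == "__main__":
    docs = [
        "The mitochondria is the powerhouse of the cell.",
        "buy now click here free free free !!!",
    ]
    print(select(docs, keep_ratio=0.5))   # keeps the more fluent document
```

In practice, perplexity scoring is rarely used alone; pipelines typically combine it with heuristic filters (length, repetition, language identification) and learned quality classifiers, the kind of data preparation design choices a data-centric survey such as this one compares.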