PhD Qualifying Examination


Title: "Data Selection for Large Language Model Training"

by

Mr. Jipeng ZHANG


Abstract:

The impressive capabilities of large language models (LLMs) are largely 
attributed to self-supervised training on extensive, high-quality datasets. 
However, feeding all available data into LLMs without filtering can lead to 
suboptimal outcomes. Careful data selection is therefore essential to identify 
high-quality subsets that enhance model performance and reduce training costs.

Effective data selection involves accurately estimating data quality, choosing 
appropriate subsets, and designing scalable, automated data-filtering 
pipelines. Together, these steps reduce unnecessary computational cost and 
improve the overall efficiency of LLM training.
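
As a concrete illustration of quality estimation, the minimal Python sketch 
below shows one common heuristic, perplexity-based filtering: a small 
reference model (here GPT-2 via Hugging Face transformers) scores each 
candidate document, and only documents the model finds fluent, i.e. below a 
perplexity threshold, are kept. The model choice, the threshold value, and the 
select helper are illustrative assumptions, not methods prescribed by the 
paper.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative reference model; real pipelines may use a model trained
# on a trusted, high-quality corpus.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # The model's loss with labels = input_ids is the mean negative
    # log-likelihood per token; its exponential is the perplexity.
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

def select(documents, threshold=50.0):
    # Keep only documents below the perplexity threshold; the value 50.0
    # is a placeholder and would be tuned per corpus in practice.
    return [doc for doc in documents if perplexity(doc) <= threshold]

In large-scale pipelines this scoring would typically run in batches across 
GPUs; the per-document version above is kept simple to make the idea clear.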

We provide a comprehensive review of data selection methods for different 
training scenarios, such as pretraining and instruction tuning. The specific 
needs and challenges associated with each scenario are discussed in this 
paper, offering insights into the design and implementation of efficient data 
selection systems that enhance LLM performance.


Date:                   Tuesday, 16 July 2024

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Xiaofang Zhou (Supervisor)
                        Dr. Junxian He (Chairperson)
                        Dr. Qifeng Chen
                        Prof. Jia Zhu (ZJNU)