PhD Qualifying Examination
Title: "Data Selection for Large Language Model Training"
by
Mr. Jipeng ZHANG
Abstract:
The impressive capabilities of large language models (LLMs) are largely
attributed to the utilization of extensive, high-quality datasets for
self-supervised training. However, feeding all available data into LLMs without
filtering can lead to suboptimal outcomes. Therefore, careful data selection is
essential to identify high-quality subsets that enhance model performance and
reduce training costs.
Effective data selection involves accurately estimating data quality, choosing
appropriate subsets, and designing scalable, automated data filtering
pipelines. These processes optimize LLM training by reducing unnecessary
computational cost and improving overall efficiency.
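As an illustrative sketch of the selection idea described above: score each candidate document with a quality function, then keep only the highest-scoring fraction. The heuristic scorer below (alphabetic-character ratio weighted by length) is a hypothetical stand-in; real pipelines typically use model-based scorers such as perplexity or classifier probabilities.

```python
# Sketch of quality-based data selection: rank documents by a quality
# score and keep the top fraction. The scoring heuristic is illustrative
# only, not a method from the talk.

def quality_score(doc: str) -> float:
    # Hypothetical heuristic: favor text with a high ratio of alphabetic
    # characters (a crude proxy for clean prose), scaled by word count.
    if not doc:
        return 0.0
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    return alpha_ratio * min(len(doc.split()), 100) / 100

def select_top_fraction(docs: list[str], keep: float = 0.5) -> list[str]:
    # Rank all documents by score and retain the best `keep` fraction.
    ranked = sorted(docs, key=quality_score, reverse=True)
    k = max(1, int(len(ranked) * keep))
    return ranked[:k]

docs = [
    "A clean, well-formed paragraph about language model training.",
    "%%%### 1234 @@@ garbled !!!",
    "Another coherent sentence with useful content for pretraining.",
    "x",
]
selected = select_top_fraction(docs, keep=0.5)
```

In practice the scorer, the keep ratio, and the ranking strategy (top-k, thresholding, or diversity-aware sampling) are the main design choices a selection pipeline must make.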
We provide a comprehensive review of data selection methods for different
training scenarios, such as pretraining and instruction tuning. The specific
needs and challenges associated with each scenario are discussed in
this paper, offering insights into the design and implementation of efficient
data selection systems to enhance LLM performance.
Date: Tuesday, 16 July 2024
Time: 2:00pm - 4:00pm
Venue: Room 3494 (Lifts 25/26)
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Dr. Junxian He (Chairperson)
Dr. Qifeng Chen
Prof. Jia Zhu (ZJNU)