Data Selection for Large Language Model Training
PhD Qualifying Examination

Title: "Data Selection for Large Language Model Training"

by

Mr. Jipeng ZHANG

Abstract:

The impressive capabilities of large language models (LLMs) are largely attributed to the utilization of extensive, high-quality datasets for self-supervised training. However, feeding all available data into LLMs without filtering can lead to suboptimal outcomes. Therefore, careful data selection is essential to identify high-quality subsets that enhance model performance and reduce training costs. Effective data selection involves accurately estimating data quality, choosing appropriate subsets, and designing scalable, automated data filtering pipelines. These processes help optimize the training of LLMs by reducing unnecessary computational costs and improving overall efficiency. We provide a comprehensive review of data selection methods for different training scenarios, such as pretraining and instruction tuning. The specific needs and challenges associated with each scenario are discussed, offering insights into the design and implementation of efficient data selection systems to enhance LLM performance.

Date: Tuesday, 16 July 2024
Time: 2:00pm - 4:00pm
Venue: Room 3494 (Lifts 25/26)

Committee Members:
Prof. Xiaofang Zhou (Supervisor)
Dr. Junxian He (Chairperson)
Dr. Qifeng Chen
Prof. Jia Zhu (ZJNU)
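The abstract describes data selection as estimating data quality, ranking, and keeping a high-quality subset. As a minimal sketch of that general pattern (not a method from the talk; the `quality_score` heuristic below is a stand-in assumption for a real scorer such as a perplexity or classifier-based model):

```python
# Illustrative quality-based data selection pipeline.
# The scoring function is a toy heuristic, assumed for demonstration only;
# production pipelines would use perplexity, classifier, or influence scores.

def quality_score(text: str) -> float:
    """Toy quality proxy: fraction of alphabetic/space characters,
    weighted by a capped length bonus."""
    if not text:
        return 0.0
    clean_frac = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    length_bonus = min(len(text.split()) / 20.0, 1.0)
    return clean_frac * length_bonus

def select_top_fraction(corpus: list[str], keep: float = 0.5) -> list[str]:
    """Score every document, then keep the top `keep` fraction."""
    ranked = sorted(corpus, key=quality_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep))
    return ranked[:n_keep]

corpus = [
    "The transformer architecture underpins most large language models today.",
    "$$$ CLICK HERE $$$ !!!",
    "Careful data selection reduces training cost and improves model quality.",
    "asdf;lkj 12345 @@@@",
]
kept = select_top_fraction(corpus, keep=0.5)  # retains the two clean sentences
```

The same score-rank-filter skeleton applies whether the scorer is a simple heuristic, a trained quality classifier, or a reference-model perplexity, which is what makes such pipelines scalable and automated.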