Distilling Large Language Models for Software Engineering Tasks with Bootstrap Instructing
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
Final Year Thesis Oral Defense

Title: "Distilling Large Language Models for Software Engineering Tasks with Bootstrap Instructing"

by

LI Yijia

Abstract:

Although pre-trained large language models (LLMs) have demonstrated remarkable ability on a wide range of general software engineering problems, they remain insufficient for domain-specific code due to a lack of training data. Prevalent methods of fine-tuning LLMs rely heavily on the manual creation of instruction data, which is time-consuming and labor-intensive. In this project, we propose a methodology that combines machine learning techniques such as knowledge distillation, self-instruct, and data augmentation to bootstrap the generation of task-specific training datasets, which are then used to improve the performance of local LLMs on downstream software engineering applications. Our pipeline generates instruction, input, and output samples from a limited seed set using GPT-4, then filters out invalid or near-duplicate pairs before using the remainder to fine-tune the original model. Applying our method to Magicoder-S-DS-6.7B yields a significant improvement in accuracy on the binary classification task of API misuse detection, outperforming state-of-the-art LLMs with larger parameter counts. This project provides a fast and effective way to align pre-trained language models with downstream software engineering tasks, facilitating the application of LLMs in software engineering research.

Date: 3 May 2024 (Friday)
Time: 14:00 - 14:40
Venue: Room 5501 (near lifts 25/26), HKUST

Advisor: Prof. CHEUNG Shing-Chi
2nd Reader: Dr. XU Dan
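
The sketch below illustrates, in broad strokes, the bootstrapping loop described in the abstract: sample a few in-context demonstrations from the current pool, ask the teacher model for a new (instruction, input, output) triple, and keep it only if it parses correctly and is not too similar to existing tasks. It is a minimal, hypothetical reconstruction; the teacher call (`query_teacher`), the prompt format, the token-overlap similarity filter, and the 0.7 threshold are all assumptions, not the project's actual implementation.

```python
"""Minimal sketch of a bootstrap instruction-generation pipeline.

The GPT-4 (teacher) call is stubbed out so the script stays runnable; in a
real pipeline `query_teacher` would call the teacher model's API, and the
resulting pool would be used to fine-tune the student model.
"""
import json
import random


def query_teacher(prompt: str) -> str:
    """Hypothetical placeholder for a GPT-4 API call; returns a canned JSON triple."""
    return json.dumps({
        "instruction": "Explain whether the following API call misuses close().",
        "input": "f = open('log.txt'); f.read()",
        "output": "Misuse: the file handle is never closed.",
    })


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens, used to drop near-duplicates."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bootstrap(seed_tasks, target_size=100, sim_threshold=0.7, max_attempts=1000):
    """Grow a task pool from a small seed set, filtering invalid or similar samples."""
    pool = list(seed_tasks)
    attempts = 0
    while len(pool) < target_size and attempts < max_attempts:
        attempts += 1

        # Prompt the teacher with a few in-context demonstrations from the pool.
        demos = random.sample(pool, k=min(3, len(pool)))
        prompt = ("Generate a new software-engineering task as JSON "
                  "(instruction, input, output), similar in style to:\n"
                  + "\n".join(json.dumps(d) for d in demos))
        raw = query_teacher(prompt)

        # Filter 1: discard malformed generations.
        try:
            task = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(task, dict) or not {"instruction", "input", "output"} <= task.keys():
            continue

        # Filter 2: discard near-duplicates of tasks already in the pool.
        if any(token_overlap(task["instruction"], t["instruction"]) > sim_threshold
               for t in pool):
            continue

        pool.append(task)

    return pool  # used downstream as fine-tuning data for the local model


if __name__ == "__main__":
    seeds = [{
        "instruction": "Classify whether this code misuses the File API.",
        "input": "open('a.txt').write('x')",
        "output": "Misuse: the file is opened without being closed.",
    }]
    dataset = bootstrap(seeds, target_size=5)
    print(f"Generated {len(dataset)} tasks")
```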