Innovative Approaches to Data Curation and Retrieval-Augmented Generation: From Annotation and Preparation to Retrieval
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Innovative Approaches to Data Curation and Retrieval-Augmented Generation: From Annotation and Preparation to Retrieval"

By

Mr. Yushi SUN

Abstract:

This thesis investigates the challenges and advancements in data curation and Retrieval-Augmented Generation (RAG), addressing critical issues that impact the effectiveness of Large Language Models (LLMs). In recent years, RAG has garnered significant attention for its potential to reduce the hallucination problems inherent in LLMs by integrating relevant external data sources. However, the success of RAG relies heavily on the quality of the curated data, which must undergo rigorous annotation and preparation processes.

The first work addresses the complexities of data annotation by presenting a novel cross-domain task allocation scheme to optimize annotator training and selection in crowdsourced scenarios. The second work shifts focus to data preparation, introducing an innovative framework for column semantic type annotation that considers inter-table context, thereby enhancing the effectiveness of data preparation. Finally, the third work explores the implementation of knowledge-based RAG approaches in the context of LLMs. We investigate strategies to effectively manage knowledge overloading and the complexities of multi-hop queries, developing a retrieval content summarization method tailored for knowledge base question answering.

Collectively, these contributions aim to enhance the reliability and effectiveness of RAG for more robust AI systems capable of delivering accurate and contextually relevant responses. Through these contributions, this thesis not only elucidates the multifaceted nature of data curation and RAG but also offers practical solutions to the pressing challenges faced in the field. The findings underscore the importance of effective data management practices, ultimately paving the way for enhanced data accessibility and utility in an increasingly data-driven world. We thoroughly evaluate the effectiveness of our proposed advancements against existing state-of-the-art approaches. We conclude the thesis by raising promising future research directions related to data curation and RAG.

Date: Friday, 13 June 2025
Time: 3:00pm - 5:00pm
Venue: Room 3494, Lifts 25/26

Chairman: Dr. Stanley Chun Kwan LAU (OCES)

Committee Members:
Prof. Lei CHEN (Supervisor)
Dr. May FUNG
Prof. Ke YI
Dr. Can YANG (MATH)
Prof. Jianliang XU (HKBU)