Innovative Approaches to Data Curation and Retrieval-Augmented Generation: From Annotation and Preparation to Retrieval

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Innovative Approaches to Data Curation and Retrieval-Augmented 
Generation: From Annotation and Preparation to Retrieval"

By

Mr. Yushi SUN


Abstract:

This thesis investigates the challenges and advancements in data curation and 
Retrieval-Augmented Generation (RAG), addressing critical issues that impact 
the effectiveness of Large Language Models (LLMs).

In recent years, RAG has garnered significant attention for its potential to 
reduce hallucination problems inherent in LLMs by integrating relevant 
external data sources. However, the success of RAG relies heavily on the 
quality of the curated data, which must undergo rigorous annotation and 
preparation processes.

The first work addresses the complexities of data annotation by presenting a 
novel cross-domain task allocation scheme to optimize annotator training and 
selection in crowdsourced scenarios. The second work shifts focus to data 
preparation, introducing an innovative framework for column semantic type 
annotation that considers inter-table context, thereby enhancing the 
effectiveness of data preparation. Finally, the third work explores the 
implementation of knowledge-based RAG approaches in the context of LLMs. We 
investigate strategies to effectively manage knowledge overloading and the 
complexities of multi-hop queries, developing a retrieval content 
summarization method tailored for knowledge base question answering. 
Collectively, these contributions aim to enhance the reliability and 
effectiveness of RAG, enabling more robust AI systems that deliver 
accurate and contextually relevant responses.

Through these contributions, this thesis not only elucidates the multifaceted 
nature of data curation and RAG but also offers practical solutions to the 
pressing challenges faced in the field. The findings underscore the 
importance of effective data management practices, ultimately paving the way 
for enhanced data accessibility and utility in an increasingly data-driven 
world.

We thoroughly evaluate our proposed methods against existing 
state-of-the-art approaches and conclude the thesis by outlining promising 
future research directions in data curation and RAG.


Date:                   Friday, 13 June 2025

Time:                   3:00pm - 5:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Dr. Stanley Chun Kwan LAU (OCES)

Committee Members:      Prof. Lei CHEN (Supervisor)
                        Dr. May FUNG
                        Prof. Ke YI
                        Dr. Can YANG (MATH)
                        Prof. Jianliang XU (HKBU)