Innovative Approaches to Data Curation and Retrieval-Augmented Generation: From Annotation and Preparation to Retrieval
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Innovative Approaches to Data Curation and Retrieval-Augmented
Generation: From Annotation and Preparation to Retrieval"
By
Mr. Yushi SUN
Abstract:
This thesis investigates the challenges and advancements in data curation and
Retrieval-Augmented Generation (RAG), addressing critical issues that impact
the effectiveness of Large Language Models (LLMs).
In recent years, RAG has garnered significant attention for its potential to
reduce hallucination problems inherent in LLMs by integrating relevant
external data sources. However, the success of RAG relies heavily on the
quality of the curated data, which must undergo rigorous annotation and
preparation processes.
The first work addresses the complexities of data annotation by presenting a
novel cross-domain task allocation scheme to optimize annotator training and
selection in crowdsourced scenarios. The second work shifts focus to data
preparation, introducing an innovative framework for column semantic type
annotation that considers inter-table context, thereby enhancing the
effectiveness of data preparation. Finally, the third work explores the
implementation of knowledge-based RAG approaches in the context of LLMs. We
investigate strategies to effectively manage knowledge overloading and the
complexities of multi-hop queries, developing a retrieval content
summarization method tailored for knowledge base question answering.
Collectively, these contributions aim to enhance the reliability and
effectiveness of RAG for more robust AI systems capable of delivering
accurate and contextually relevant responses.
Through these contributions, this thesis not only elucidates the multifaceted
nature of data curation and RAG but also offers practical solutions to the
pressing challenges faced in the field. The findings underscore the
importance of effective data management practices, ultimately paving the way
for enhanced data accessibility and utility in an increasingly data-driven
world.
We thoroughly evaluate the effectiveness of our proposed methods against
existing state-of-the-art approaches. Finally, we conclude the thesis by
outlining promising future research directions in data curation and RAG.
Date: Friday, 13 June 2025
Time: 3:00pm - 5:00pm
Venue: Room 3494 (Lifts 25/26)
Chairman: Dr. Stanley Chun Kwan LAU (OCES)
Committee Members: Prof. Lei CHEN (Supervisor)
                   Dr. May FUNG
                   Prof. Ke YI
                   Dr. Can YANG (MATH)
                   Prof. Jianliang XU (HKBU)