A Journey of Effective Data Curation: from Data Annotation to Data Integration and Organization
PhD Thesis Proposal Defence
Title: "A Journey of Effective Data Curation: from Data Annotation to Data
Integration and Organization"
by
Mr. Yushi SUN
Abstract:
In the age of big data, effective data curation plays a pivotal role in
ensuring data integrity and usability across various domains. This thesis,
titled "A Journey of Effective Data Curation: from Data Annotation to Data
Integration and Organization," systematically addresses the essential
processes of data curation, including annotation, integration, and
organization, while highlighting the challenges inherent in each phase.
The first work tackles the complexities of data annotation, presenting a
novel cross-domain task allocation scheme designed to optimize the training
and selection of annotators in crowdsourced data annotation scenarios. The
second work shifts focus to data integration, introducing an innovative
framework for column semantic type annotation that considers inter-table
context, thereby enhancing the effectiveness of data integration from
disparate sources. Finally, the third work explores the evolving paradigm of
data organization in the context of Large Language Models (LLMs). It
investigates the potential of LLMs to internalize taxonomy structures,
proposing a hybrid prototype that marries traditional hierarchical taxonomy
structures with advanced neural-language-model-based approaches.
Through these contributions, this thesis not only elucidates the
multifaceted nature of data curation but also offers practical solutions to
the pressing challenges faced in the field. The findings underscore the
importance of effective data management practices, ultimately paving the way
for enhanced data accessibility and utility in an increasingly data-driven
world.
We thoroughly evaluate the effectiveness of the proposed methods against
existing state-of-the-art approaches and conclude the thesis by outlining
promising future directions for data curation research.
Date: Monday, 10 February 2025
Time: 3:00pm - 5:00pm
Venue: Room 2408
Lifts 17/18
Committee Members: Prof. Lei Chen (Supervisor)
Prof. Raymond Wong (Chairperson)
Dr. Junxian He
Prof. Qiong Luo