A Journey of Effective Data Curation: from Data Annotation to Data Integration and Organization

PhD Thesis Proposal Defence


Title: "A Journey of Effective Data Curation: from Data Annotation to Data 
Integration and Organization"

by

Mr. Yushi SUN


Abstract:

In the age of big data, effective data curation plays a pivotal role in 
ensuring data integrity and usability across various domains. This thesis 
systematically addresses the essential processes of data curation, namely 
annotation, integration, and organization, and highlights the challenges 
inherent in each phase.

The first work tackles the complexities of data annotation, presenting a 
novel cross-domain task allocation scheme designed to optimize the training 
and selection of annotators in crowdsourced data annotation scenarios. The 
second work shifts focus to data integration, introducing an innovative 
framework for column semantic type annotation that considers inter-table 
context, thereby enhancing the effectiveness of data integration from 
disparate sources. Finally, the third work explores the evolving paradigm of 
data organization in the context of Large Language Models (LLMs). It 
investigates the potential of LLMs to internalize taxonomy structures, 
proposing a hybrid prototype that marries traditional hierarchical taxonomy 
structures with advanced neural-language-model-based approaches.

Through these contributions, this thesis elucidates the multifaceted nature 
of data curation and offers practical solutions to pressing challenges in 
the field. The findings underscore the importance of effective data 
management practices for improving data accessibility and utility.

We thoroughly evaluate the effectiveness of our proposed approaches against 
existing state-of-the-art methods. We conclude the thesis by outlining 
promising future directions for data curation research.


Date:                   Monday, 10 February 2025

Time:                   3:00pm - 5:00pm

Venue:                  Room 2408
                        Lifts 17/18

Committee Members:      Prof. Lei Chen (Supervisor)
                        Prof. Raymond Wong (Chairperson)
                        Dr. Junxian He
                        Prof. Qiong Luo