A Journey of Effective Data Curation: from Data Annotation to Data Integration and Organization
PhD Thesis Proposal Defence

Title: "A Journey of Effective Data Curation: from Data Annotation to Data Integration and Organization"

by

Mr. Yushi SUN

Abstract:

In the age of big data, effective data curation plays a pivotal role in ensuring data integrity and usability across various domains. This thesis systematically addresses the essential processes of data curation, including annotation, integration, and organization, while highlighting the challenges inherent in each phase.

The first work tackles the complexities of data annotation, presenting a novel cross-domain task allocation scheme designed to optimize the training and selection of annotators in crowdsourced data annotation scenarios. The second work shifts focus to data integration, introducing a framework for column semantic type annotation that considers inter-table context, thereby enhancing the effectiveness of data integration from disparate sources. The third work explores the evolving paradigm of data organization in the context of Large Language Models (LLMs). It investigates the potential of LLMs to internalize taxonomy structures and proposes a hybrid prototype that combines traditional hierarchical taxonomy structures with neural-language-model-based approaches.

Through these contributions, this thesis elucidates the multifaceted nature of data curation and offers practical solutions to pressing challenges in the field. The findings underscore the importance of effective data management practices for improving data accessibility and utility in an increasingly data-driven world. We thoroughly evaluate the effectiveness of the proposed approaches against existing state-of-the-art methods. Finally, we conclude the thesis by outlining promising future research directions in data curation.
Date: Monday, 10 February 2025

Time: 3:00pm - 5:00pm

Venue: Room 2408, Lifts 17/18

Committee Members:
Prof. Lei Chen (Supervisor)
Prof. Raymond Wong (Chairperson)
Dr. Junxian He
Prof. Qiong Luo