Visual Analytics for Data-centric Machine Learning
PhD Thesis Proposal Defence
Title: "Visual Analytics for Data-centric Machine Learning"
by
Mr. Zhihua JIN
Abstract:
Machine learning has achieved great success in applications such as image
classification, natural language processing, and graph analysis. Data plays a
critical role in the continued improvement of these models. The rise of
data-centric machine learning has further prioritized improving the quality
and understanding of data. Automated methods have been developed to support
this process, but ensuring alignment between model behaviors and human values
remains a challenge. Human intervention becomes necessary to adjust the models
or pipeline when automated methods are not effective. Challenges also exist in
effectively involving humans throughout the lifecycle of data-centric machine
learning, such as linking complex model inputs and outputs, systematically
understanding dataset issues, and monitoring deployed models to identify
potential problems. Visualization can be an effective means of involving humans
in addressing these challenges.
In this thesis proposal, we present three novel visual analytics systems to
address the challenges encountered at different stages of data-centric machine
learning, including model development, model evaluation, and model deployment.
In our first work, we propose GNNLens, a tool that facilitates the
understanding and analysis of Graph Neural Networks (GNNs) in the model
development stage. GNNLens enables model developers and users to diagnose
prediction errors by incorporating proxy models and metrics. Through
interactive visualizations and detailed node-level analysis, GNNLens helps
identify error patterns and formulate hypotheses regarding these patterns. In
our second work, we present ShortcutLens, which focuses on exploring shortcuts
in Natural Language Understanding (NLU) benchmark datasets used in the model
evaluation stage. This system empowers experts to comprehensively explore and
understand shortcuts within the datasets. ShortcutLens provides statistical
insights and hierarchical templates, facilitating the identification and
examination of different types of shortcuts. This approach improves the
understanding of dataset issues and inspires the creation of more challenging
and relevant benchmark datasets. Finally, we discuss ongoing work on visual
analytics for the discovery of jailbreak prompts from large-scale human-LLM
conversational datasets collected during the model deployment stage.
Date: Thursday, 6 June 2024
Time: 10:00am - 12:00noon
Venue: Room 5501
Lifts 25/26
Committee Members: Prof. Huamin Qu (Supervisor)
Dr. Yangqiu Song (Chairperson)
Prof. Albert Chung
Dr. Junxian He