Visual Analytics for Data-Centric Machine Learning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Visual Analytics for Data-Centric Machine Learning"
By
Mr. Zhihua JIN
Abstract:
Machine learning has achieved great success in applications such as image
classification, natural language processing, and graph analysis. Data plays a
critical role in continuing to improve these models throughout their lifecycle.
The data-centric machine learning paradigm goes further and prioritizes
improving data quality and data understanding. Automated methods have been
developed to support this process, but ensuring alignment between model
behaviors and human values remains a challenge. When automated methods are not
effective, human intervention becomes necessary to adjust the models or the
pipeline. Involving humans effectively throughout the lifecycle of data-centric
machine learning is itself challenging: it requires linking complex model
inputs and outputs, systematically understanding dataset issues, and monitoring
deployed models to identify potential problems. Visualization can be an
effective means of involving humans in addressing these challenges.
In this thesis, we propose three novel visual analytics systems to address the
challenges encountered at different stages of data-centric machine learning,
including model development, model evaluation, and model deployment. In our
first work, we propose GNNLens, a tool that facilitates the understanding and
analysis of Graph Neural Networks (GNNs) in the model development stage.
GNNLens enables model developers and users to diagnose prediction errors by
incorporating proxy models and metrics. Through interactive visualizations and
detailed node-level analysis, GNNLens helps identify error patterns and
formulate hypotheses regarding these patterns. In our second work, we present
ShortcutLens, which focuses on exploring shortcuts in Natural Language
Understanding (NLU) benchmark datasets used in the model evaluation stage. This
system empowers experts to comprehensively explore and understand shortcuts
within the datasets. ShortcutLens provides statistical insights and
hierarchical templates, facilitating the identification and examination of
different types of shortcuts. This approach improves the understanding of
dataset issues and inspires the creation of more challenging and relevant
benchmark datasets. In our third work, we introduce JailbreakHunter, a tool
that supports the identification of jailbreak prompts for Large Language Models
(LLMs) in large-scale human-LLM conversational datasets collected during the
model deployment stage. JailbreakHunter employs visual analytics to enable
group-level, conversation-level, and turn-level analyses for identifying
potential security vulnerabilities, specifically jailbreak prompts. By
integrating visualizations and interactive features, JailbreakHunter assists
LLM researchers in effectively analyzing and mitigating jailbreak prompts
within such datasets. We demonstrate the effectiveness and usability of the
proposed systems through case studies and interviews with domain experts.
Date: Wednesday, 21 August 2024
Time: 2:00pm - 4:00pm
Venue: Room 5501
Lifts 25/26
Chairman: Prof. Ross MURCH (ECE)
Committee Members: Prof. Huamin QU (Supervisor)
Prof. Qiong LUO
Dr. Xiaojuan MA
Dr. Wenhan LUO (AMC)
Dr. Jaegul CHOO (KAIST)