Visual Analytics for Data-Centric Machine Learning

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Visual Analytics for Data-Centric Machine Learning"

By

Mr. Zhihua JIN


Abstract:

Machine learning has achieved great success in applications such as image 
classification, natural language processing, and graph analysis. Data plays a 
critical role in continuing to improve these models. Data-centric machine 
learning goes further and prioritizes improving data quality and data 
understanding. Automated methods have been developed to support this process, 
but ensuring alignment between model behaviors and human values remains a 
challenge. Human intervention becomes necessary to adjust the models or the 
pipeline when automated methods are not 
effective. Challenges also exist in effectively involving humans throughout the 
lifecycle of data-centric machine learning, such as linking complex model 
inputs and outputs, systematically understanding dataset issues, and monitoring 
deployed models to identify potential problems. Visualization can be an 
effective means to involve humans in addressing the challenges throughout the 
lifecycle of data-centric machine learning.

In this thesis, we propose three novel visual analytics systems to address the 
challenges encountered at different stages of data-centric machine learning, 
including model development, model evaluation, and model deployment. In our 
first work, we propose GNNLens, a tool that facilitates the understanding and 
analysis of Graph Neural Networks (GNNs) in the model development stage. 
GNNLens enables model developers and users to diagnose prediction errors by 
incorporating proxy models and metrics. Through interactive visualizations and 
detailed node-level analysis, GNNLens helps identify error patterns and 
formulate hypotheses regarding these patterns. In our second work, we present 
ShortcutLens, which focuses on exploring shortcuts in Natural Language 
Understanding (NLU) benchmark datasets used in the model evaluation stage. This 
system empowers experts to comprehensively explore and understand shortcuts 
within the datasets. ShortcutLens provides statistical insights and 
hierarchical templates, facilitating the identification and examination of 
different types of shortcuts. This approach improves the understanding of 
dataset issues and inspires the creation of more challenging and relevant 
benchmark datasets. In our third work, we introduce JailbreakHunter, a tool 
that supports the identification of jailbreak prompts for Large Language Models 
(LLMs) in large-scale human-LLM conversational datasets collected during the 
model deployment stage. JailbreakHunter employs visual analytics to enable 
group-level, conversation-level, and turn-level analyses for identifying 
potential security vulnerabilities, specifically jailbreak prompts. By 
integrating visualizations and interactive features, JailbreakHunter assists 
LLM researchers in effectively analyzing and mitigating jailbreak prompts 
within such datasets. We demonstrate the effectiveness and usability of the 
proposed systems through case studies and interviews with domain experts.


Date:                   Wednesday, 21 August 2024

Time:                   2:00pm - 4:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               

Committee Members:      Prof. Huamin QU (Supervisor)
                        Prof. Qiong LUO
                        Dr. Xiaojuan MA
                        Dr. Wenhan LUO (EMIA)
                        Dr. Jaegul CHOO (KAIST)