Visual Analytics for Data-centric Machine Learning

PhD Thesis Proposal Defence


Title: "Visual Analytics for Data-centric Machine Learning"

by

Mr. Zhihua JIN


Abstract:

Machine learning has achieved great success in applications such as image 
classification, natural language processing, and graph analysis. Data plays a 
critical role in continuing to improve these models, and data-centric machine 
learning accordingly prioritizes improving the quality and understanding of 
data. Automated methods have been developed to support this process, but 
ensuring alignment between model behaviors and human values remains a 
challenge, and human intervention becomes necessary to adjust the models or 
pipeline when automated methods are not effective. Involving humans 
effectively throughout the lifecycle of data-centric machine learning is 
itself challenging: it requires linking complex model inputs and outputs, 
systematically understanding dataset issues, and monitoring deployed models 
to identify potential problems. Visualization can be an effective means of 
involving humans in addressing these challenges.

In this thesis proposal, we present three novel visual analytics systems that 
address challenges at different stages of data-centric machine learning: 
model development, model evaluation, and model deployment. 
In our first work, we propose GNNLens, a tool that facilitates the 
understanding and analysis of Graph Neural Networks (GNNs) in the model 
development stage. GNNLens enables model developers and users to diagnose 
prediction errors by incorporating proxy models and metrics. Through 
interactive visualizations and detailed node-level analysis, GNNLens helps 
identify error patterns and formulate hypotheses regarding these patterns. In 
our second work, we present ShortcutLens, which focuses on exploring shortcuts 
in Natural Language Understanding (NLU) benchmark datasets used in the model 
evaluation stage. This system empowers experts to comprehensively explore and 
understand shortcuts within the datasets. ShortcutLens provides statistical 
insights and hierarchical templates, facilitating the identification and 
examination of different types of shortcuts. This approach improves the 
understanding of dataset issues and inspires the creation of more challenging 
and relevant benchmark datasets. Finally, we discuss ongoing work on visual 
analytics for discovering jailbreak prompts in large-scale human-LLM 
conversational datasets collected during the model deployment stage.


Date:                   Thursday, 6 June 2024

Time:                   10:00am - 12:00noon

Venue:                  Room 5501
                        Lifts 25/26

Committee Members:      Prof. Huamin Qu (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Prof. Albert Chung
                        Dr. Junxian He