Visual Analytics for Data-Centric Machine Learning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Visual Analytics for Data-Centric Machine Learning"

By

Mr. Zhihua JIN

Abstract:

Machine learning has achieved great success in applications such as image classification, natural language processing, and graph analysis. Data plays a critical role throughout the machine learning lifecycle as models continue to be improved, and data-centric machine learning goes further by prioritizing the improvement of data quality and data understanding. Automated methods have been developed to support this process, but ensuring alignment between model behaviors and human values remains a challenge. Human intervention becomes necessary to adjust the models or the pipeline when automated methods are not effective. Challenges also exist in effectively involving humans throughout the lifecycle of data-centric machine learning, such as linking complex model inputs and outputs, systematically understanding dataset issues, and monitoring deployed models to identify potential problems. Visualization can be an effective means of involving humans in addressing these challenges.

In this thesis, we propose three novel visual analytics systems to address the challenges encountered at different stages of data-centric machine learning: model development, model evaluation, and model deployment.

In our first work, we propose GNNLens, a tool that facilitates the understanding and analysis of Graph Neural Networks (GNNs) in the model development stage. GNNLens enables model developers and users to diagnose prediction errors by incorporating proxy models and metrics. Through interactive visualizations and detailed node-level analysis, GNNLens helps identify error patterns and formulate hypotheses about them.

In our second work, we present ShortcutLens, which focuses on exploring shortcuts in Natural Language Understanding (NLU) benchmark datasets used in the model evaluation stage. This system empowers experts to comprehensively explore and understand shortcuts within the datasets. ShortcutLens provides statistical insights and hierarchical templates, facilitating the identification and examination of different types of shortcuts. This approach improves the understanding of dataset issues and inspires the creation of more challenging and relevant benchmark datasets.

In our third work, we introduce JailbreakHunter, a tool that supports the identification of jailbreak prompts for Large Language Models (LLMs) in large-scale human-LLM conversational datasets collected during the model deployment stage. JailbreakHunter employs visual analytics to enable group-level, conversation-level, and turn-level analyses for identifying potential security vulnerabilities, specifically jailbreak prompts. By integrating visualizations and interactive features, JailbreakHunter assists LLM researchers in effectively analyzing and mitigating jailbreak prompts within such datasets.

We demonstrate the effectiveness and usability of the proposed systems through case studies and interviews with domain experts.

Date: Wednesday, 21 August 2024
Time: 2:00pm - 4:00pm
Venue: Room 5501, Lifts 25/26

Chairman: Prof. Ross MURCH (ECE)

Committee Members:
    Prof. Huamin QU (Supervisor)
    Prof. Qiong LUO
    Dr. Xiaojuan MA
    Dr. Wenhan LUO (EMIA)
    Dr. Jaegul CHOO (KAIST)