Visual Analytics for Data-centric Machine Learning
PhD Thesis Proposal Defence

Title: "Visual Analytics for Data-centric Machine Learning"

by

Mr. Zhihua JIN

Abstract:

Machine learning has achieved great success in applications such as image classification, natural language processing, and graph analysis. Data plays a critical role in the continued improvement of these models. The concept of data-centric machine learning has further promoted prioritizing data quality and understanding. Automated methods have been developed to support this process, but ensuring alignment between model behaviors and human values remains a challenge. Human intervention becomes necessary to adjust the models or pipeline when automated methods are not effective. Challenges also exist in effectively involving humans throughout the lifecycle of data-centric machine learning, such as linking complex model inputs and outputs, systematically understanding dataset issues, and monitoring deployed models to identify potential problems. Visualization can be an effective means of involving humans in addressing these challenges.

In this thesis proposal, we propose three novel visual analytics systems to address the challenges encountered at different stages of data-centric machine learning, including model development, model evaluation, and model deployment.

In our first work, we propose GNNLens, a tool that facilitates the understanding and analysis of Graph Neural Networks (GNNs) in the model development stage. GNNLens enables model developers and users to diagnose prediction errors by incorporating proxy models and metrics. Through interactive visualizations and detailed node-level analysis, GNNLens helps identify error patterns and formulate hypotheses regarding these patterns.

In our second work, we present ShortcutLens, which focuses on exploring shortcuts in Natural Language Understanding (NLU) benchmark datasets used in the model evaluation stage. This system empowers experts to comprehensively explore and understand shortcuts within the datasets. ShortcutLens provides statistical insights and hierarchical templates, facilitating the identification and examination of different types of shortcuts. This approach improves the understanding of dataset issues and inspires the creation of more challenging and relevant benchmark datasets.

Finally, we discuss the ongoing work on visual analytics for the discovery of jailbreak prompts from large-scale human-LLM conversational datasets collected during the model deployment stage.

Date: Thursday, 6 June 2024
Time: 10:00am - 12:00noon
Venue: Room 5501, Lifts 25/26

Committee Members:
Prof. Huamin Qu (Supervisor)
Dr. Yangqiu Song (Chairperson)
Prof. Albert Chung
Dr. Junxian He