Interactive Visual Analytics for Multimodal Video Data Exploration and Model Steering

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Interactive Visual Analytics for Multimodal Video Data Exploration 
and Model Steering"

By

Miss Jianben HE


Abstract:

Video, a versatile multimodal data format, has emerged as the predominant 
channel for information transmission and communication. Efficient video 
exploration and analysis are critical for numerous applications spanning 
business, security, and education. Taking advantage of powerful 
computational resources, numerous efforts have recently sought to use 
machine learning or deep learning models to automatically summarize and 
analyze videos. However, given the inherent complexity of video data, 
building models that effectively and comprehensively understand the 
spatial-temporal relationships and cross-modal information within videos 
presents significant challenges. Moreover, these automated 
methods provide poor interactions and result representations, hindering 
analysts from conducting fine-grained exploration for efficient high-level 
insight distillation. Visual analytics, which combines computational 
algorithm efficiency with human visual ability for pattern discovery and 
domain knowledge for decision-making, has brought new opportunities for 
model practitioners and data analysts to perform comprehensive video data 
exploration and model steering.

In this thesis, we concentrate on designing and developing 
interactive visual analytics systems to 1) facilitate efficient and scalable 
video data labeling for model building, 2) support comprehensive diagnosis 
and effective prompt engineering to steer model reasoning performance on 
multimodal video content, and 3) leverage advanced model results for 
digesting multimodal video content and distilling insights in complex online 
education settings. In the first work, we present VideoPro, which employs 
data programming concepts to enable efficient video data labeling and 
supplementation at scale during model training. To further support effective 
model adaptation for domain-specific tasks, in the second work, we introduce 
POEM, which streamlines the prompt engineering process to incorporate 
multimodal knowledge in both inductive and deductive manners. In the third, 
ongoing work, Engager, we further apply advanced model results to conduct a 
multi-granularity analysis of complex human engagement and interaction 
performance for pedagogical insight distillation in specific online 
education settings.


Date:                   Monday, 2 December 2024

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Dr. Jiachuan YANG (CIVL)

Committee Members:      Prof. Huamin QU (Supervisor)
                        Prof. Qian ZHANG (Co-supervisor)
                        Prof. Qiong LUO
                        Prof. Pedro SANDER
                        Prof. Shengdong ZHAO (CityU)