The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence



Mr. Xingbo WANG


People often communicate with each other through multimodal verbal and 
non-verbal behavior, including voice, words, facial expression, and body 
language. Interpreting human communication behavior has great value for many 
applications, such as business, healthcare, and education. For example, if 
students show signs of boredom or confusion during class, teachers can 
adjust the teaching methods to improve students’ engagement. With the rapid 
development of digital technology and social media, a huge amount of multimodal 
human communication data (e.g., opinion videos) is generated and collected. To 
facilitate the analysis of human communication data, researchers adopt 
computational approaches to quantify human behavior with multimodal features. 
However, it remains demanding and inefficient to manually extract insights 
(e.g., the social meanings of the features) from the large and complex feature 
space. 
Furthermore, it remains challenging to utilize the knowledge distilled from the 
computational features to enhance human communication skills. Meanwhile, 
interactive visual analytics combines computational algorithms with 
human-centered visualization to effectively support information 
representation, knowledge discovery, and skill acquisition. It therefore 
holds great potential to address the challenges above.

In this thesis, we design and build novel interactive visual analytics systems 
to 1) help users discover valuable behavioral patterns in multimodal human 
communication video and 2) further provide end-users with visual feedback and 
guidance to improve their communication skills. In the first work, we present 
DeHumor, a visual analytics system that visually decomposes humor speeches into 
quantifiable multimodal features and enables humor researchers and 
communication coaches to systematically explore humorous verbal content and 
vocal delivery. In the second work, we further characterize and investigate the 
intra- and inter-modal interactions between visual, acoustic, and language 
modalities, including dominance, complement, and conflict. Then, we develop 
M2Lens, a visual analytics system that helps model developers and users conduct 
multi-level and multi-faceted exploration of the influences of individual 
modalities and their interplay on model predictions for multimodal sentiment 
analysis. Besides understanding multimodal human communication behavior, in the 
third work, we present VoiceCoach, a visual analytics system that can evaluate 
speakers’ voice modulation skills regarding volume, pitch, speed, and pause, 
and recommend good examples of voice modulation from TED Talks for speakers 
to follow. Moreover, during practice, the system provides immediate visual 
feedback to speakers for self-awareness and performance improvement.

Date:			Tuesday, 16 August 2022

Time:			2:00pm - 4:00pm

Zoom Meeting:

Chairperson:		Prof. Mengze SHI (MARK)

Committee Members:	Prof. Huamin QU (Supervisor)
 			Prof. Minhao CHENG
 			Prof. Cunsheng DING
 			Prof. Jimmy FUNG (ENVR)
 			Prof. Hongbo FU (CityU)

**** ALL are Welcome ****