Designing Interactive Scaffolds for Steering VLM-Powered Visualization Generation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Designing Interactive Scaffolds for Steering VLM-Powered 
Visualization Generation"

By

Miss Liwenhan XIE


Abstract:

Visualization is a powerful means for everyday data consumption, such as 
business analytics, personal informatics, and data-driven storytelling. 
Recent advances in vision-language models (VLMs) have been rapidly 
transforming complex, data-intensive tasks, enabling the creation of 
high-fidelity visualizations from a single prompt. However, transactional VLM use
fundamentally compromises human agency in open-ended data tasks. Functional 
data processing risks hallucinations and distortion of the analytical intent, 
and communicative design often leads to ineffective regeneration cycles when 
tweaking nuanced visual specifications. Critically, this pervasive challenge 
stems from the VLM's black-box nature and the resulting coarse granularity of 
control available for precise intervention.

This thesis focuses on empowering proficient creators and data practitioners 
in subjective, iterative tasks where human agency is paramount, emphasizing 
the nuanced, exploratory process essential for effective human-VLM 
collaboration. My approach is to construct interactive
layers built on top of generative models that provide semantically meaningful 
and fine-grained intervention points, namely interactive scaffolds. The 
design of these interactive scaffolds enforces an "intent first, nuances on 
demand" paradigm, which is structured around three conceptual pillars: (1) 
anchoring human intent early and deeply in the VLM's workflow, (2) surfacing 
model-internal representations with semantically meaningful blocks for 
proactive steering, and (3) extracting semantic controls to enable granular 
refinement of the output.

I investigate the critical stages of the visualization pipeline powered by 
VLMs: data transformation, visual mapping, and visual augmentation. The first 
study introduces an approach (WaitGPT) that transforms VLM-generated data 
processing code and execution results into a flow diagram. This scaffold 
anchors analysis logic, providing intermediate entry points for users to 
verify and refine analytical intent at a finer, operation-level granularity. The 
second study presents a pipeline (DataWink) to create bespoke visualizations 
from a reference example by automatically extracting and encapsulating its data 
mapping scheme into an extensible template. This scaffold anchors the VLM to 
a high-fidelity structural intent, allowing flexible reuse and adaptation. 
The third study describes an animation tool (DataSway) to vivify static 
metaphoric visualizations. This scaffold enables the refinement of 
communicative intent by connecting high-level descriptions to fine-grained 
animation configurations.

Ultimately, these studies contribute to a deeper understanding of how the 
interactive scaffolds framework can mitigate inherent limitations of VLMs, 
promoting verifiable, user-centered data visualization practices with 
increased human agency.


Date:                   Thursday, 27 November 2025

Time:                   9:00am - 11:00am

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Shangyu DANG (LIFS)

Committee Members:      Prof. Huamin QU (Supervisor)
                        Dr. Xiaojuan MA
                        Dr. Arpit NARECHANIA
                        Prof. Yunya SONG (IEDA)
                        Dr. Fanny CHEVALIER (UofT)