Consistent Visual Editing in the Diffusion Era: Escalating from 2D Images to 3D Space and Temporal Dynamics
PhD Thesis Proposal Defence
Title: "Consistent Visual Editing in the Diffusion Era: Escalating from 2D
Images to 3D Space and Temporal Dynamics"
by
Mr. Qingyan BAI
Abstract:
Controllable visual editing aims to modify specific visual content with
surgical precision, while rigorously preserving semantic, spatial, and
temporal coherence across diverse conditions. Although diffusion models have
redefined the landscape of visual generation, achieving unprecedented
fidelity in synthesizing images, videos, and 3D content, the fundamental
bottleneck of consistency remains largely unsolved: every additional axis of
control—a second view of the same object, a new camera pose, a new frame in
time—introduces new failure modes that pure scaling cannot address.
This thesis tackles consistency through a unified dimensionality-escalation
framework, structured across three progressively complex pillars: 2D
cross-image correspondence, 3D spatial awareness, and spatiotemporal video
dynamics. Pillar I confronts 2D discrete space. We introduce Edicho, a
training-free diffusion paradigm for editing in-the-wild images that
leverages explicit pre-computed image correspondence to manipulate
self-attention blocks and Classifier-Free Guidance (CFG), delivering robust
semantic and structural consistency across unconstrained multi-image edits
while remaining plug-and-play with state-of-the-art editors such as
ControlNet and BrushNet. Pillar II elevates the condition to 3D continuous
space. We propose 3DPE, a real-time 3D-aware portrait editor that distills
geometry priors from a 3D-aware face generator and editing capabilities from
a text-to-image diffusion model into a lightweight feedforward module,
realizing strict 3D-aware multi-view consistency at ~0.04s per frame—over
100× faster than the closest competitor—while supporting fast adaptation to
user-specified edits in roughly five minutes of fine-tuning. Pillar III
confronts the temporal axis and the scalability of instruction-based video
editing. To overcome the extreme scarcity of high-quality dynamic data, we
propose Ditto, a scalable synthetic-data pipeline that fuses a leading image
editor with an in-context video generator, augmented by a temporal enhancer
and an autonomous Vision-Language-Model agent for instruction authoring and
quality filtering. With this pipeline we construct Ditto-1M, a
one-million-clip dataset, and train Editto with a modality curriculum that
progressively anneals visual scaffolding into pure language guidance,
achieving state-of-the-art temporal consistency on instruction-based video
editing. Together, these three pillars establish a unified, scalable, and
multi-dimensional paradigm for practical visual generation and editing in
the diffusion era.
Date: Tuesday, 26 May 2026
Time: 3:00pm - 5:00pm
Venue: Room 2129A (Lift 19)
Committee Members: Dr. Qifeng Chen (Supervisor)
                   Prof. Dit-Yan Yeung (Chairperson)
                   Dr. Yinghao Xu