Consistent Visual Editing in the Diffusion Era: Escalating from 2D Images to 3D Space and Temporal Dynamics

PhD Thesis Proposal Defence


Title: "Consistent Visual Editing in the Diffusion Era: Escalating from 2D 
Images to 3D Space and Temporal Dynamics"

by

Mr. Qingyan BAI


Abstract:

Controllable visual editing aims to modify specific visual content with 
surgical precision, while rigorously preserving semantic, spatial, and 
temporal coherence across diverse conditions. Although diffusion models have 
redefined the landscape of visual generation, achieving unprecedented 
fidelity in synthesizing images, videos, and 3D content, the fundamental 
bottleneck of consistency remains largely unsolved: every additional axis of 
control—a second view of the same object, a new camera pose, a new frame in 
time—introduces new failure modes that pure scaling cannot address.

This thesis tackles consistency through a unified dimensionality-escalation 
framework, structured across three progressively complex pillars: 2D 
cross-image correspondence, 3D spatial awareness, and spatiotemporal video 
dynamics. Pillar I confronts 2D discrete space. We introduce Edicho, a 
training-free diffusion paradigm for editing in-the-wild images that 
leverages explicit pre-computed image correspondence to manipulate 
self-attention blocks and Classifier-Free Guidance (CFG), delivering robust 
semantic and structural consistency across unconstrained multi-image edits 
while remaining plug-and-play with state-of-the-art editors such as 
ControlNet and BrushNet. Pillar II elevates the condition to 3D continuous 
space. We propose 3DPE, a real-time 3D-aware portrait editor that distills 
geometry priors from a 3D-aware face generator and editing capabilities from 
a text-to-image diffusion model into a lightweight feedforward module, 
realizing strict 3D-aware multi-view consistency at ~0.04s per frame—over 
100× faster than the closest competitor—while supporting fast adaptation to 
user-specified edits in roughly five minutes of fine-tuning. Pillar III 
confronts the temporal axis and the scalability of instruction-based video 
editing. To overcome the extreme scarcity of high-quality dynamic data, we 
propose Ditto, a scalable synthetic-data pipeline that fuses a leading image 
editor with an in-context video generator, augmented by a temporal enhancer 
and an autonomous Vision-Language-Model agent for instruction authoring and 
quality filtering. With this pipeline we construct Ditto-1M, a 
one-million-clip dataset, and train Editto with a modality curriculum that 
progressively anneals visual scaffolding into pure language guidance, 
achieving state-of-the-art temporal consistency on instruction-based video 
editing. Together, these three pillars establish a unified, scalable, and 
multi-dimensional paradigm for practical visual generation and editing in 
the diffusion era.


Date:                   Tuesday, 26 May 2026

Time:                   3:00pm - 5:00pm

Venue:                  Room 2129A
                        Lift 19

Committee Members:      Dr. Qifeng Chen (Supervisor)
                        Prof. Dit-Yan Yeung (Chairperson)
                        Dr. Yinghao Xu