TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "TALE: Training-free Cross-domain Image Composition via Adaptive 
Latent Manipulation and Energy-guided Optimization"

By

Mr. Trung Kien PHAM


Abstract:

In this thesis, we present TALE, a novel training-free framework that
harnesses text-driven diffusion models to tackle cross-domain image
composition, a task that aims to seamlessly incorporate user-provided
objects into a specific visual context regardless of domain disparity.
Previous methods often involve either training auxiliary
networks or finetuning diffusion models on customized datasets, which are 
expensive and may undermine the robust textual and visual priors of 
pretrained diffusion models. Some recent works attempt to break this
barrier by proposing training-free workarounds that rely on manipulating
attention maps to tame the denoising process implicitly. However,
composing via attention maps does not necessarily yield the desired
compositional outcomes. These approaches can only retain some semantic
information and usually fall short in preserving the identity
characteristics of input objects or exhibit limited background-object
style adaptation in generated images. In contrast, TALE operates directly
in the latent space to provide explicit and effective guidance for the
composition process and thereby resolve these problems. Specifically, we
equip
TALE with two mechanisms dubbed Adaptive Latent Manipulation and 
Energy-guided Latent Optimization. The former formulates noisy latents
conducive to initiating and steering the composition process by directly
leveraging background and foreground latents at corresponding timesteps.
The latter complements it by exploiting designated energy functions to
further optimize intermediate latents so that they conform to specific
conditions and yield the desired final results. Our experiments
demonstrate that
TALE surpasses prior baselines and attains state-of-the-art performance in 
image-guided composition across various photorealistic and artistic 
domains.
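
As a rough illustration of the two ideas named in the abstract, the sketch
below composes a foreground latent into a background latent inside a mask
and then takes one gradient step on an energy function. It is not the
thesis's actual formulation: the function names, the binary mask, the
quadratic energy, and the step size are all assumptions made for this
example, and random NumPy arrays stand in for real diffusion latents.

```python
import numpy as np

def compose_latents(z_bg, z_fg, mask):
    """Paste the foreground latent into the background latent inside the mask.

    Assumption: a plain binary-mask blend, used here only for illustration.
    """
    return mask * z_fg + (1.0 - mask) * z_bg

def energy(z, z_ref, mask):
    """Hypothetical energy: squared distance to a reference inside the mask."""
    return 0.5 * np.sum(mask * (z - z_ref) ** 2)

def energy_grad(z, z_ref, mask):
    """Analytic gradient of the hypothetical energy with respect to z."""
    return mask * (z - z_ref)

def energy_guided_step(z, z_ref, mask, step_size=0.1):
    """One gradient-descent step that lowers the energy of the latent."""
    return z - step_size * energy_grad(z, z_ref, mask)

rng = np.random.default_rng(0)
z_bg = rng.standard_normal((4, 8, 8))   # stand-in background latent
z_fg = rng.standard_normal((4, 8, 8))   # stand-in foreground latent
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                 # region where the object is placed

z_init = compose_latents(z_bg, z_fg, mask)

# Perturb the composed latent, then pull the masked region back toward
# the foreground reference with one energy-guided update.
z_noisy = z_init + 0.5 * rng.standard_normal(z_init.shape)
z_next = energy_guided_step(z_noisy, z_fg, mask)

e_before = energy(z_noisy, z_fg, mask)
e_after = energy(z_next, z_fg, mask)
```

In this toy setting the update only touches the masked region, so the
background part of the latent is left unchanged while the energy inside
the mask decreases.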


Date:                   Tuesday, 13 August 2024

Time:                   1:00pm - 3:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Long CHEN

Committee Members:      Dr. Qifeng CHEN (Supervisor)
                        Prof. James KWOK