High-quality image and video editing with Generative Models

PhD Thesis Proposal Defence


Title: "High-quality image and video editing with Generative Models"

by

Mr. Chenyang QI


Abstract:

Recent years have witnessed a growing demand for visual content, such as 2D
images and multi-frame videos, in fields like computational photography,
virtual reality, gaming, and the film industry. At the same time, various
generative models, including VQ-VAEs, GANs, and diffusion models, have been
proposed to facilitate visual content generation from noise or text. However,
it remains an open challenge to adapt these models to more practical
image-to-image generation, also known as image processing and editing. This
thesis explores the paradigm of image editing with generative models, with a
focus on foundation models obtained from large-scale pretraining.

Images from modern cameras can reach 6K resolution, but they also require
substantial storage space. We begin this thesis with real-time image
rescaling, where we compress a large 6K image into a JPEG thumbnail. Our
quantized auto-encoder enables real-time upscaling and reduces the file size
by optimizing an entropy loss. Although the thumbnail preserves the fidelity
of the original image, its quality can still be degraded by blur, noise, or
other artifacts.
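
As a rough illustration (a minimal sketch, not the actual thesis
implementation), such a quantized rescaling auto-encoder could be written in
PyTorch roughly as below; the module widths and the uniform-noise quantization
proxy are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RescalingAutoEncoder(nn.Module):
        def __init__(self, scale: int = 4, channels: int = 3, width: int = 64):
            super().__init__()
            # Encoder: high-resolution image -> low-resolution thumbnail
            self.encoder = nn.Sequential(
                nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, channels, scale, stride=scale),  # spatial downscaling
            )
            # Decoder: thumbnail -> high-resolution reconstruction (real-time upscaling)
            self.decoder = nn.Sequential(
                nn.Conv2d(channels, width * scale * scale, 3, padding=1), nn.ReLU(),
                nn.PixelShuffle(scale),
                nn.Conv2d(width, channels, 3, padding=1),
            )

        def quantize(self, thumb: torch.Tensor) -> torch.Tensor:
            # Additive-noise proxy for 8-bit quantization during training,
            # hard rounding at inference time.
            if self.training:
                return thumb + (torch.rand_like(thumb) - 0.5) / 255.0
            return torch.round(thumb * 255.0) / 255.0

        def forward(self, hr_image: torch.Tensor):
            thumb = self.quantize(self.encoder(hr_image))
            recon = self.decoder(thumb)
            return thumb, recon

    # Training would combine a reconstruction loss with an entropy (rate) term on
    # the quantized thumbnail, e.g.
    #   loss = mse(recon, hr) + lambda_rate * estimated_bits(thumb)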

We then move on to text-driven image restoration guided by semantic and
restoration instructions. To enable this multi-modal application at a lower
training cost, we propose fine-tuning a pre-trained latent diffusion model on
synthetically degraded images.
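
A hedged sketch of how such synthetic degradation pairs might be generated is
shown below; the degradation types, parameter ranges, and instruction
templates are illustrative assumptions rather than the thesis pipeline:

    import random
    import numpy as np
    import cv2

    def degrade(clean: np.ndarray):
        """Return (degraded_image, restoration_instruction) for a clean uint8 image."""
        kind = random.choice(["blur", "noise", "jpeg"])
        if kind == "blur":
            sigma = random.uniform(1.0, 3.0)
            degraded = cv2.GaussianBlur(clean, (0, 0), sigma)
            instruction = "remove the blur and sharpen the photo"
        elif kind == "noise":
            noise = np.random.normal(0, random.uniform(5, 25), clean.shape)
            degraded = np.clip(clean.astype(np.float32) + noise, 0, 255).astype(np.uint8)
            instruction = "denoise the photo"
        else:
            quality = random.randint(10, 40)
            _, buf = cv2.imencode(".jpg", clean, [cv2.IMWRITE_JPEG_QUALITY, quality])
            degraded = cv2.imdecode(buf, cv2.IMREAD_COLOR)
            instruction = "remove the compression artifacts"
        return degraded, instruction

    # Each (degraded, instruction, clean) triplet can then be used to fine-tune a
    # pre-trained latent diffusion model, conditioning on the degraded image and
    # the text instruction.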

In addition to computational photography, creative special effects are widely
used in games, movies, and short-video applications. These effects typically
require temporal motion consistency and semantic-level editing, such as
identity and style. Since no robust, open-source video generative model is
available, we focus on exploiting a text-to-image latent diffusion model in a
zero-shot way. Specifically, we transform the image generative model into a
video model and extract the spatial-temporal attention maps in the diffusion
U-Net during DDIM inversion as a representation of motion and geometry. We
then fuse these attention maps during DDIM denoising guided by the target
prompt. Our succinct framework supports shape, attribute, and global style
editing while maintaining impressive temporal consistency.
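
A minimal sketch of this zero-shot attention fusion idea, assuming a hook that
intercepts attention probabilities inside the U-Net (the class name and the
fixed blend weight are illustrative, not the exact thesis implementation):

    import torch

    class AttentionStore:
        """Caches attention maps per (timestep, layer) during DDIM inversion,
        then blends them back in during DDIM denoising with the target prompt."""
        def __init__(self):
            self.cache = {}
            self.mode = "store"  # "store" during inversion, "fuse" during denoising

        def __call__(self, attn_probs: torch.Tensor, step: int, layer: str) -> torch.Tensor:
            key = (step, layer)
            if self.mode == "store":
                self.cache[key] = attn_probs.detach()
                return attn_probs
            # Fuse the cached source attention (motion/geometry) with the attention
            # induced by the target prompt; alpha is an assumed blend weight.
            alpha = 0.8
            return alpha * self.cache[key] + (1 - alpha) * attn_probs

    # Usage, at a pseudocode level: run DDIM inversion on the source video with
    # mode="store", then switch to mode="fuse" and run DDIM denoising with the
    # edited text prompt, so the cached spatial-temporal attention constrains the
    # motion while the prompt changes appearance and style.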


Date:                   Friday, 19 January 2024

Time:                   4:00pm - 6:00pm

Venue:                  Room 4472
                        (Lifts 25/26)

Committee Members:      Dr. Qifeng Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Xiaomeng Li
                        Dr. Yingcong Chen (EMIA)


**** ALL are Welcome ****