High-quality Image and Video Editing with Generative Models

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "High-quality Image and Video Editing with Generative Models"

By

Mr. Chenyang QI


Abstract:

Recent years have witnessed a growing demand for visual content, such as 2D 
images and multi-frame videos, in fields like computational photography, 
virtual reality, gaming, and the film industry. In response to these demands, 
various generative models, including VQ-VAE, GAN, and diffusion models, have 
been proposed to generate visual content from noise or text. However, it 
remains an open challenge to adapt these models for more practical 
image-to-image generation, also known as image processing and editing. This 
thesis explores the paradigm of image editing with generative models, with a 
focus on foundation models obtained from large-scale pretraining.

We begin this thesis by exploring real-time image rescaling. Images from 
modern cameras can reach 6K resolution, but they also consume considerable 
storage space. Here, we propose a quantized auto-encoder that compresses a 
large 6K image into a JPEG thumbnail, reducing the file size by optimizing an 
entropy loss. An efficient decoder then upscales the low-resolution thumbnail 
back to a high-resolution image in real time.
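
As a rough illustration of this pipeline, the sketch below pairs a 
strided-convolution encoder with a PixelShuffle decoder and applies a 
straight-through 8-bit quantization so the thumbnail can be stored as an 
ordinary image file. The module names and layer sizes are illustrative 
assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class RescalingAutoEncoder(nn.Module):
    """Hypothetical sketch: encode to an 8-bit thumbnail, decode back."""
    def __init__(self, scale: int = 4, channels: int = 64):
        super().__init__()
        # Encoder: learn a downscaled (thumbnail) representation of the image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, stride=scale, padding=1),
        )
        # Decoder: upscale the quantized thumbnail back to full resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, hr_image: torch.Tensor):
        thumbnail = self.encoder(hr_image)
        # 8-bit quantization (JPEG-like storage); the straight-through
        # estimator keeps the rounding step differentiable during training.
        rounded = torch.round(thumbnail.clamp(0, 1) * 255) / 255
        quantized = thumbnail + (rounded - thumbnail).detach()
        restored = self.decoder(quantized)
        return quantized, restored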

Then, we move on to text-driven image restoration. Camera motion, digital 
circuit noise, and bad weather (e.g., rain and fog) can degrade photographers' 
images. We propose to restore images using a diffusion model guided by 
semantic and restoration instructions. To enable such a multi-modal 
application at a lower training cost, we fine-tune an adaptor for the 
pre-trained latent diffusion model using synthetically degraded images.
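
The sketch below illustrates the adaptor-style fine-tuning idea in PyTorch: 
the pre-trained UNet stays frozen and only a small adaptor that encodes the 
degraded input is optimized. The UNet call signature, adaptor architecture, 
and conditioning-by-addition scheme are placeholder assumptions rather than 
the exact design in the thesis.

import torch
import torch.nn as nn

class DegradationAdapter(nn.Module):
    """Encodes the degraded latent into residual conditioning features."""
    def __init__(self, in_channels: int = 4, hidden: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, in_channels, 3, padding=1),
        )

    def forward(self, degraded_latent: torch.Tensor) -> torch.Tensor:
        return self.net(degraded_latent)

def training_step(frozen_unet, adapter, alphas_cumprod, clean_latent,
                  degraded_latent, text_embedding):
    """One denoising-loss step; only the adaptor receives gradients."""
    noise = torch.randn_like(clean_latent)
    t = torch.randint(0, len(alphas_cumprod), (clean_latent.shape[0],))
    alpha = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latent = alpha.sqrt() * clean_latent + (1 - alpha).sqrt() * noise
    # Adaptor features are added to the noisy latent as extra conditioning;
    # frozen_unet(latent, t, text_embedding) is a placeholder interface.
    cond = adapter(degraded_latent)
    pred_noise = frozen_unet(noisy_latent + cond, t, text_embedding)
    return nn.functional.mse_loss(pred_noise, noise)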

Finally, we discuss a text-driven video editing method. In addition to 
image-level computational photography, creative special effects are widely 
used in games, movies, and short-video applications. These effects typically 
require temporal motion consistency and semantic-level editing, such as 
identity and style changes. Since no robust, open-source video generative 
model is available, we focus on exploiting the text-to-image latent diffusion 
model in a zero-shot way. Specifically, we transform the image generative 
model into a video model and extract the spatial-temporal attention maps of 
the diffusion UNet during DDIM inversion as a motion and geometry 
representation. Then, we fuse these attention maps during DDIM denoising with 
the target prompt. Our succinct framework enables shape, attribute, and global 
style editing while maintaining impressive temporal consistency.
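
The sketch below illustrates the attention-fusion idea: attention maps cached 
during DDIM inversion of the source video are re-injected when denoising with 
the target prompt. The class, cache layout, and replacement policy are 
illustrative assumptions, not the thesis code.

import torch

class AttentionStore:
    """Caches spatial-temporal attention maps per step and layer."""
    def __init__(self):
        self.maps = {}        # (step, layer_name) -> attention tensor
        self.mode = "store"   # "store" during inversion, "fuse" during editing

    def __call__(self, step: int, layer: str, attn: torch.Tensor):
        if self.mode == "store":
            # DDIM inversion of the source video: record the attention maps.
            self.maps[(step, layer)] = attn.detach()
            return attn
        # DDIM denoising with the target prompt: replace the attention with
        # the cached source map so motion and geometry follow the original.
        cached = self.maps.get((step, layer))
        return cached if cached is not None else attn

# Usage sketch: each attention layer of the diffusion UNet passes its
# softmax(QK^T / sqrt(d)) map through the store; the mode is switched to
# "fuse" before denoising toward the edited video.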


Date:                   Friday, 12 July 2024

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Chi Ying TSUI (ISD)

Committee Members:      Dr. Qifeng CHEN (Supervisor)
                        Prof. Chiew-Lan TAI
                        Dr. Dan XU
                        Prof. Ling SHI (ECE)
                        Prof. Yinqiang ZHENG (UTokyo)