The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "High-quality Image and Video Editing with Generative Models"
By
Mr. Chenyang QI
Abstract:
Recent years have witnessed a growing demand for visual content, such as 2D
images and multi-frame videos, in fields like computational photography,
virtual reality, gaming, and the film industry. In response to these demands,
various generative models, including VQ-VAE, GAN, and diffusion models, have
been proposed to facilitate visual content generation from noise or text.
However, it remains an open challenge to adapt these models for more practical
image-to-image generation, also known as image processing and editing. This
thesis explores the paradigm of image editing with generative models, with a
focus on foundation models obtained through large-scale pretraining.
We begin this thesis by exploring real-time image rescaling. Images from modern
cameras can reach 6K resolution, but such images consume substantial storage
space. To address this, we propose a quantized auto-encoder that compresses a
large 6K image into a JPEG thumbnail and reduces the file size by optimizing an
entropy loss. An efficient decoder then upscales the low-resolution thumbnail
back to a high-resolution image in real time.
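The rescaling idea can be illustrated with a minimal PyTorch sketch. The layer
sizes and module names below are illustrative assumptions, and the quantization
and entropy terms of the actual method are omitted: a strided encoder produces
a storable low-resolution thumbnail, and a pixel-shuffle decoder restores the
full resolution.

```python
# Minimal sketch of image rescaling with an auto-encoder (illustrative only).
import torch
import torch.nn as nn

class RescalingAutoEncoder(nn.Module):
    def __init__(self, scale=4, channels=64):
        super().__init__()
        # Encoder: two stride-2 convolutions downsample by `scale` (4x here)
        # to produce a 3-channel thumbnail that could be stored as a JPEG.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        # Decoder: pixel-shuffle upsampling restores the original resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(3, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, hr_image):
        thumbnail = self.encoder(hr_image)   # low-resolution thumbnail
        restored = self.decoder(thumbnail)   # real-time upscaling
        return thumbnail, restored

# Usage: reconstruct a full-resolution image from its thumbnail.
hr = torch.rand(1, 3, 512, 512)              # stand-in for a full-resolution photo
thumb, restored = RescalingAutoEncoder()(hr)
```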
Then, we move on to text-driven image restoration. Camera motion, digital
circuit noise, and bad weather (e.g., rain and fog) can degrade photographers'
images. We propose to restore images using a diffusion model guided by semantic
and restoration instructions. To enable this multi-modal application at a
reduced training cost, we fine-tune an adaptor for the pre-trained latent
diffusion model on synthetically degraded images.
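A hedged sketch of the adaptor idea is shown below. The module names, channel
sizes, and injection scheme are assumptions for illustration, not the thesis
code: a small trainable branch maps the degraded image's latent into
multi-scale residual features that a frozen latent-diffusion UNet could
consume, so only the adaptor needs to be trained.

```python
# Illustrative adaptor for restoration-guided latent diffusion (assumed shapes).
import torch
import torch.nn as nn

class RestorationAdaptor(nn.Module):
    def __init__(self, in_channels=4, feat_channels=(320, 640, 1280)):
        super().__init__()
        # One lightweight conv block per UNet feature resolution.
        self.blocks = nn.ModuleList()
        prev = in_channels
        for ch in feat_channels:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(ch, ch, 3, padding=1),
            ))
            prev = ch

    def forward(self, degraded_latent):
        # Produce multi-scale residual features for injection into the UNet.
        residuals, x = [], degraded_latent
        for block in self.blocks:
            x = block(x)
            residuals.append(x)
        return residuals

# Only the adaptor is optimized; the pretrained UNet stays frozen and the
# usual noise-prediction loss is computed on synthetic (degraded, clean) pairs.
adaptor = RestorationAdaptor()
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)
latent = torch.randn(1, 4, 64, 64)        # VAE latent of a degraded photo (illustrative)
features = adaptor(latent)                # three feature maps at decreasing resolution
```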
Finally, we present a text-driven video editing method. Beyond image-level
computational photography, creative special effects are widely used in games,
movies, and short-video applications. These effects typically require temporal
motion consistency together with semantic-level edits, such as changes of
identity and style. Since no robust, open-source video generative model is
available, we focus on exploiting a pretrained text-to-image latent diffusion
model in a zero-shot manner. Specifically, we transform the image generative
model into a video model and extract the spatial-temporal attention maps in the
diffusion UNet during DDIM inversion as a representation of motion and
geometry. We then fuse these attention maps into the DDIM denoising process
guided by the target prompt. Our succinct framework enables shape, attribute,
and global style editing while maintaining impressive temporal consistency.
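The attention-fusion step can be sketched as follows. This is a simplified
illustration under assumed tensor shapes, not the thesis pipeline, which
operates inside the diffusion UNet: attention maps stored during DDIM inversion
of the source video are blended into the maps produced while denoising with the
target prompt, so the edited content follows the source motion and layout.

```python
# Simplified attention fusion between inversion-time and editing-time maps.
import torch

def fuse_attention(source_attn, target_attn, blend_mask=None, alpha=1.0):
    """Blend inversion-time attention into editing-time attention.

    source_attn, target_attn: tensors of shape (heads, queries, keys)
      captured at the same UNet layer and diffusion timestep.
    blend_mask: optional per-query mask (queries,) in [0, 1]; 1 keeps the
      source attention (preserve motion/background), 0 keeps the target.
    """
    if blend_mask is None:
        return alpha * source_attn + (1.0 - alpha) * target_attn
    mask = blend_mask.view(1, -1, 1)       # broadcast over heads and keys
    return mask * source_attn + (1.0 - mask) * target_attn

# Usage: at each DDIM denoising step, replace a layer's attention map with the
# fused one before it weights the value tensor.
src = torch.softmax(torch.randn(8, 1024, 1024), dim=-1)   # stored at inversion
tgt = torch.softmax(torch.randn(8, 1024, 1024), dim=-1)   # computed while editing
fused = fuse_attention(src, tgt, alpha=0.8)
```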
Date: Friday, 12 July 2024
Time: 10:00am - 12:00noon
Venue: Room 3494
Lifts 25/26
Chairman: Prof. Chi Ying TSUI (ISD)
Committee Members: Dr. Qifeng CHEN (Supervisor)
Prof. Chiew-Lan TAI
Dr. Dan XU
Prof. Ling SHI (ECE)
Prof. Yinqiang ZHENG (UTokyo)