High-quality Image and Video Editing with Generative Models
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "High-quality Image and Video Editing with Generative Models"

By

Mr. Chenyang QI

Abstract:

Recent years have witnessed a growing demand for visual content, such as 2D images and multi-frame videos, in fields like computational photography, virtual reality, gaming, and the film industry. In response to these demands, various generative models, including VQ-VAE, GAN, and diffusion models, have been proposed to generate visual content from noise or text. However, it remains an open challenge to adapt these models to more practical image-to-image generation, also known as image processing and editing. This thesis explores the paradigm of image editing with generative models, with a focus on foundation models from large-scale pretraining.

We begin this thesis by exploring real-time image rescaling. Images from modern cameras can reach 6K resolution, but they also occupy substantial storage space. Here, we propose a quantized auto-encoder that compresses a large 6K image into a JPEG thumbnail, reducing the file size by optimizing an entropy loss. An efficient decoder can then upscale the low-resolution thumbnail back to a high-resolution image in real time.

Next, we move on to text-driven image restoration. Camera motion, digital circuit noise, and bad weather (e.g., rain and fog) can degrade photographers' images. We propose to restore images using a diffusion model guided by semantic and restoration instructions. To enable such multi-modal applications at a lower training cost, we fine-tune an adaptor for the pre-trained latent diffusion model using synthetically degraded images.

Finally, we discuss a text-driven video editing method. Beyond image-level computational photography, creative special effects are widely used in games, movies, and short-video applications. These effects typically require temporal motion consistency and semantic-level editing, such as changes of identity and style. Since no robust, open-source video generative model is available, we focus on exploiting a text-to-image latent diffusion model in a zero-shot manner. Specifically, we transform the image generative model into a video model and extract the spatial-temporal attention maps in the diffusion UNet during DDIM inversion as a representation of motion and geometry. We then fuse these attention maps during DDIM denoising guided by the target prompt. Our succinct framework enables shape, attribute, and global style editing while maintaining impressive temporal consistency.

Date: Friday, 12 July 2024
Time: 10:00am - 12:00noon
Venue: Room 3494 (Lifts 25/26)

Chairman: Prof. Chi Ying TSUI (ISD)

Committee Members: Dr. Qifeng CHEN (Supervisor)
                   Prof. Chiew-Lan TAI
                   Dr. Dan XU
                   Prof. Ling SHI (ECE)
                   Prof. Yinqiang ZHENG (UTokyo)
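Note: the zero-shot video editing approach in the abstract describes a two-pass procedure: attention maps are cached during DDIM inversion of the source video and then fused back in during DDIM denoising under the target prompt. The toy sketch below illustrates only that caching-and-fusion structure; it is not the thesis implementation, and all names (ToyAttention, the toy latents, the step loops) are hypothetical stand-ins, with the real UNet, prompts, and DDIM schedule omitted.

    # Minimal sketch of attention caching during inversion and fusion during
    # denoising (assumed structure, not the actual thesis code).
    import torch

    class ToyAttention(torch.nn.Module):
        """Self-attention that can cache its map (inversion pass) and
        reuse the cached map (editing pass)."""
        def __init__(self, dim):
            super().__init__()
            self.to_q = torch.nn.Linear(dim, dim)
            self.to_k = torch.nn.Linear(dim, dim)
            self.to_v = torch.nn.Linear(dim, dim)
            self.cached_map = None   # filled during DDIM inversion
            self.use_cached = False  # switched on during denoising/editing

        def forward(self, x):
            q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
            attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
            if self.use_cached and self.cached_map is not None:
                # Fuse: reuse the source video's attention so motion and
                # geometry are preserved while appearance follows the prompt.
                attn = self.cached_map
            else:
                self.cached_map = attn.detach()
            return attn @ v

    dim, num_steps = 32, 4
    attn_block = ToyAttention(dim)
    latents = torch.randn(1, 16, dim)  # (batch, frames*tokens, dim), toy latents

    # Pass 1: DDIM inversion of the source video, caching maps per step.
    cached_maps = []
    for t in range(num_steps):
        _ = attn_block(latents)        # stand-in for one inversion step
        cached_maps.append(attn_block.cached_map)

    # Pass 2: DDIM denoising with the target prompt, fusing the cached maps.
    attn_block.use_cached = True
    edited = torch.randn_like(latents)
    for t in reversed(range(num_steps)):
        attn_block.cached_map = cached_maps[t]
        edited = attn_block(edited)    # stand-in for one denoising step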