High-quality image and video editing with Generative Models
PhD Thesis Proposal Defence

Title: "High-quality image and video editing with Generative Models"

by

Mr. Chenyang QI

Abstract:

Recent years have witnessed a growing demand for visual content, such as 2D images and multi-frame videos, in fields like computational photography, virtual reality, gaming, and the film industry. At the same time, various generative models, including VQVAE, GAN, and diffusion models, have been proposed to facilitate visual content generation from noise or text. However, it remains an open challenge to adapt these models to more practical image-to-image generation, also known as image processing and editing. This thesis explores the paradigm of image editing with generative models, with a focus on foundation models obtained from large-scale pretraining.

Images from modern cameras can reach 6K resolution, but they also require substantial storage space. We begin this thesis with real-time image rescaling, where we compress a large 6K image into a JPEG thumbnail. Our quantized auto-encoder enables real-time upscaling and reduces the file size by optimizing an entropy loss. Although our thumbnail preserves the fidelity of the original image, its quality can still be degraded by blur, noise, or other effects. We then move on to text-driven image restoration using semantic and restoration instructions. To enable such a multi-modal application at a lower training cost, we propose to fine-tune a pre-trained latent diffusion model on synthetically degraded images.

In addition to computational photography, creative special effects are also widely used in games, movies, and short-video applications. These effects typically require temporal motion consistency and semantic-level editing, such as identity and style. Since no robust, open-source video generative model is available, we focus on exploiting the text-to-image latent diffusion model in a zero-shot manner. Specifically, we transform the image generative model into a video model and extract the spatial-temporal attention maps in the diffusion U-Net during DDIM inversion as a motion and geometry representation. We then fuse these attention maps during DDIM denoising guided by the target prompt (a minimal conceptual sketch of this fusion step follows the announcement). Our succinct framework supports shape, attribute, and global-style editing while maintaining impressive temporal consistency.

Date:     Friday, 19 January 2024

Time:     4:00pm - 6:00pm

Venue:    Room 4472 (lifts 25/26)

Committee Members:  Dr. Qifeng Chen (Supervisor)
                    Dr. Dan Xu (Chairperson)
                    Dr. Xiaomeng Li
                    Dr. Yingcong Chen (EMIA)

**** ALL are Welcome ****
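For readers unfamiliar with the attention-fusion step mentioned in the abstract, the following minimal PyTorch sketch illustrates the general idea only: record self-attention maps while running DDIM inversion on the source video, then blend them back in while denoising with the target prompt. All names here (AttentionStore, fused_attention, blend_mask, the layer key) are hypothetical illustrations under assumed tensor shapes, not the thesis' actual implementation.

```python
# Minimal, self-contained sketch of attention fusion for zero-shot video editing.
# Assumption: attention tensors have shape [batch, heads, tokens, tokens/dim],
# where "tokens" are flattened spatial-temporal positions.

import torch


class AttentionStore:
    """Records attention probability maps per (denoising step, layer name)."""

    def __init__(self):
        self.maps = {}  # (step, layer_name) -> attention tensor

    def save(self, step: int, layer: str, attn: torch.Tensor) -> None:
        self.maps[(step, layer)] = attn.detach()

    def load(self, step: int, layer: str) -> torch.Tensor:
        return self.maps[(step, layer)]


def fused_attention(
    q_tgt: torch.Tensor,     # target-branch queries  [B, heads, tokens, dim]
    k_tgt: torch.Tensor,     # target-branch keys     [B, heads, tokens, dim]
    v_tgt: torch.Tensor,     # target-branch values   [B, heads, tokens, dim]
    src_attn: torch.Tensor,  # stored source attention [B, heads, tokens, tokens]
    blend_mask: torch.Tensor,  # 1 where content should follow the target prompt
) -> torch.Tensor:
    """Compute the target attention, then fuse it with the stored source attention.

    Unedited regions (blend_mask == 0) keep the source attention, which preserves
    the original motion and geometry; edited regions use the target attention.
    """
    scale = q_tgt.shape[-1] ** -0.5
    tgt_attn = torch.softmax(q_tgt @ k_tgt.transpose(-2, -1) * scale, dim=-1)
    fused = blend_mask * tgt_attn + (1.0 - blend_mask) * src_attn
    return fused @ v_tgt


if __name__ == "__main__":
    # Toy shapes: 1 video batch, 4 heads, 16 spatial-temporal tokens, 8 channels.
    B, H, T, D = 1, 4, 16, 8
    store = AttentionStore()

    # DDIM inversion pass (source prompt): record attention at each step/layer.
    src_attn = torch.softmax(torch.randn(B, H, T, T), dim=-1)
    store.save(step=0, layer="up_block.0.attn1", attn=src_attn)

    # DDIM denoising pass (target prompt): fuse the stored attention back in.
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    mask = torch.zeros(B, 1, T, T)
    mask[..., :4] = 1.0  # pretend the first 4 tokens belong to the edited region
    out = fused_attention(q, k, v, store.load(0, "up_block.0.attn1"), mask)
    print(out.shape)  # torch.Size([1, 4, 16, 8])
```

In a full pipeline this fusion would be applied inside every spatial-temporal attention layer of the diffusion U-Net at each DDIM step; the sketch above only shows the per-layer blending rule in isolation.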