High-Quality Visual Content Creation with Foundation Generative Models
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "High-Quality Visual Content Creation with Foundation Generative Models"

By

Mr. Tengfei WANG

Abstract:

The demand for high-quality visual content, encompassing both 2D images and 3D models, keeps growing across applications such as virtual reality and video games. Creating such content is laborious, however, as it requires artistic expertise together with proficiency in 2D painting or 3D modeling pipelines. Recently, deep generative models have enabled the creation of visual content at unprecedented scale and speed. Nonetheless, generation under different control conditions (e.g., attribute editing, sketches, and text prompts) requires correspondingly large-scale training data, which is hard to acquire due to concerns over copyright, privacy, and collection costs. The limited availability of data and computing resources can thus hamper the quality of generated content. Inspired by the tremendous success of model pretraining in visual understanding and natural language processing, this thesis explores a new generative paradigm that leverages well-trained foundation generative models to boost visual content creation, covering both 2D image synthesis and 3D model rendering.

We begin with high-fidelity face image editing, where we embed real images into the latent space of well-trained generative adversarial networks (GANs). Our GAN inversion framework supports various attribute edits within a unified model while preserving image-specific details such as background and illumination.

Next, we move on to controllable generation of general images beyond faces. Rather than using GANs, which mainly work within specific domains (e.g., faces), we opt for diffusion models, which have shown impressive expressivity in synthesizing complex, general images. With pretraining, we propose a unified architecture that boosts a variety of image-to-image translation tasks.

Beyond 2D images, we also extend this pretraining philosophy to 3D content creation. We propose a 3D generative model that uses diffusion models to automatically generate 3D avatars represented as neural radiance fields. Building on this foundation generative model for avatars, we further demonstrate 3D avatar creation from a single image or a text prompt while allowing for text-based semantic editing.

Date: Wednesday, 19 July 2023
Time: 2:00pm - 4:00pm
Venue: Room 3494 (lifts 25/26)

Chairperson: Prof. Jiewen HONG (MARK)

Committee Members:
Prof. Qifeng CHEN (Supervisor)
Prof. Long CHEN
Prof. Long QUAN
Prof. Ling SHI (ECE)
Prof. Ping LUO (HKU)
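
For readers unfamiliar with the techniques named in the abstract, the following sketches illustrate them in miniature. The face-editing work builds on GAN inversion: finding a latent code whose generated image matches a given real photo, after which edits amount to moving that code in latent space. Below is a minimal, self-contained sketch of inversion by latent optimization; the `generator` here is a toy linear stand-in for a pretrained network such as StyleGAN, and all names are illustrative rather than the thesis's actual code.

```python
import torch

# Toy stand-in for a pretrained GAN generator G: latent w -> image.
# A fixed linear map keeps the sketch self-contained and runnable.
torch.manual_seed(0)
W_DIM, IMG_DIM = 64, 256
G_weight = torch.randn(IMG_DIM, W_DIM)

def generator(w):
    return torch.tanh(w @ G_weight.T)

# A "real" image to invert (here, synthesized from a hidden latent).
target = generator(torch.randn(1, W_DIM)).detach()

# GAN inversion: optimize a latent code w so that G(w) reconstructs the target.
w = torch.zeros(1, W_DIM, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(generator(w), target)
    loss.backward()
    opt.step()

print(f"final reconstruction error: {loss.item():.6f}")
# Attribute editing then moves w along a semantic direction:
#   edited = generator(w + alpha * direction)
```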
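
The second line of work relies on diffusion models, which generate an image by iteratively denoising pure Gaussian noise. The sketch below implements the standard DDPM reverse-sampling update on toy 2-D data; the zero-returning `noise_model` is an assumption made for self-containedness, standing in for a trained noise-prediction network, and is not the thesis's architecture.

```python
import torch

# Minimal DDPM-style reverse (sampling) loop on toy 2-D "images".
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def noise_model(x_t, t):
    # Stand-in for a trained network predicting the noise in x_t.
    # Returning zeros corresponds to pretending the data mean is 0.
    return torch.zeros_like(x_t)

x = torch.randn(4, 2)  # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_hat = noise_model(x, t)
    # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise

print(x)  # denoised samples
```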
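
Finally, the avatar work represents 3D content as neural radiance fields (NeRFs), which render images by compositing color and density samples along camera rays. The sketch below shows the standard volume-rendering quadrature for a single ray; `sigma` and `rgb` are random stand-ins for the outputs of a radiance-field MLP, not the thesis's model.

```python
import torch

# Volume rendering along one camera ray, as in neural radiance fields.
N = 64
near, far = 0.0, 4.0
t_vals = torch.linspace(near, far, N)        # sample depths along the ray
deltas = torch.full((N,), (far - near) / N)  # spacing between samples

sigma = torch.relu(torch.randn(N))           # per-sample density (MLP stand-in)
rgb = torch.sigmoid(torch.randn(N, 3))       # per-sample color   (MLP stand-in)

alpha = 1.0 - torch.exp(-sigma * deltas)     # opacity of each segment
# Transmittance: probability the ray reaches each sample unoccluded.
trans = torch.cumprod(
    torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
)
weights = trans * alpha
pixel = (weights[:, None] * rgb).sum(dim=0)  # composited pixel color

print(pixel)
```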