The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "High-Quality Visual Content Creation with Foundation Generative Models"

By

Mr. Tengfei WANG


Abstract:

The increasing demand for high-quality visual content, encompassing 2D images 
and 3D models, is evident across various applications, such as virtual reality 
and video games. However, creating such visual content can be laborious as it 
requires a combination of artistic expertise and proficiency in 2D painting or 
3D modeling pipelines. Recently, a plethora of deep generative models has 
enabled the creation of visual content on an unprecedented scale and at 
remarkable speed. Nonetheless, generation under different control conditions, 
e.g., editing attributes, sketches, and text prompts, requires corresponding 
large-scale training data, which is challenging to acquire due to concerns 
about copyright, privacy, and collection costs. The limited availability of 
data and computing resources can thus hamper the quality of generated content.

Inspired by the tremendous success of model pretraining in visual 
understanding and natural language processing, this thesis aims to explore a 
new generative paradigm that leverages well-trained foundation generative 
models to boost visual content creation, including both 2D image synthesis 
and 3D model rendering. We begin this thesis with high-fidelity face image 
editing, where we embed real images into the latent space of well-trained 
generative adversarial networks (GANs). Our GAN inversion framework allows 
for editing various attributes within a unified model while preserving 
image-specific details such as background and illumination. Next, we move on 
to the controllable generation of general images beyond faces. Rather than 
using GANs, which mainly work for specific domains (e.g., faces), we opt for 
diffusion models, which have shown impressive expressivity in synthesizing 
complex and general images. With pretraining, we propose a unified 
architecture that boosts various image-to-image translation tasks. Beyond 2D 
images, we also extend this pretraining philosophy to 3D content creation. We 
propose a 3D generative model that uses diffusion models to automatically 
generate 3D avatars represented as neural radiance fields. Building upon this 
foundation generative model for avatars, we further demonstrate 3D avatar 
creation from an image or a text prompt while allowing for text-based 
semantic editing.


Date:                   Wednesday, 19 July 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairperson:            Prof. Jiewen HONG (MARK)

Committee Members:      Prof. Qifeng CHEN (Supervisor)
                        Prof. Long CHEN
                        Prof. Long QUAN
                        Prof. Ling SHI (ECE)
                        Prof. Ping LUO (HKU)