High-Quality Visual Content Creation with Foundation Generative Models

PhD Thesis Proposal Defence


Title: "High-Quality Visual Content Creation with Foundation Generative Models"

by

Mr. Tengfei WANG


Abstract:

The increasing demand for high-quality visual content, encompassing 2D images 
and 3D models, is evident across various applications, such as virtual reality, 
video games, animation, and interactive design. However, creating such visual 
content can be laborious and time-consuming as it requires a combination of 
artistic expertise and proficiency in 2D painting or 3D modeling pipelines. 
Recent advancements in deep generative models have led to the emergence of 
Artificial Intelligence Generated Content (AIGC) technology, which enables the 
creation of high-quality visual content on an unprecedented scale and at 
remarkable speed.

Over the years, a plethora of generative models with task-specific designs has 
progressively advanced generation quality under different control conditions, 
e.g., attribute editing, semantic masks, sketches, and text prompts. 
Nonetheless, as these models grow in size, they demand increasing amounts of 
training data and computing resources. Acquiring such large-scale data is often 
challenging due to concerns over copyright, privacy, and collection costs. 
Consequently, the limited availability of data can hamper the quality of 
generated content.

Inspired by the tremendous success of model pretraining in visual understanding 
and natural language processing, this thesis aims to explore a new generative 
paradigm that leverages well-trained foundation generative models to boost 
visual content creation, covering both 2D image synthesis and 3D model 
rendering. The fundamental idea is to cultivate the knowledge in pretrained 
generative models, which have already captured the natural image manifold, and 
use it as a generative prior. With the powerful capacity of foundation 
generative models, we can unify various synthesis tasks and achieve 
unprecedented performance.

We begin this thesis with high-fidelity face image editing, where we embed real 
face images into the latent space of well-trained generative adversarial 
networks (GANs), enabling various attribute edits in the latent space within a 
unified model. To achieve this, we present a high-fidelity GAN inversion 
framework that supports fast attribute editing while preserving image-specific 
details, such as background, appearance, and illumination.
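To make the inversion-and-edit idea concrete, the sketch below (a toy 
illustration, not the thesis's actual framework) optimizes a latent code so 
that a pretrained generator reproduces a given face, then shifts that code 
along an attribute direction; the placeholder Generator, the plain MSE loss, 
and the random "smile" direction are all illustrative assumptions.

# Minimal sketch of GAN inversion followed by latent-space attribute editing.
# `Generator` is a stand-in for a pretrained GAN (e.g., a StyleGAN-like model);
# the actual framework additionally preserves image-specific details.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Placeholder generator: maps a latent code w to an RGB image."""
    def __init__(self, latent_dim=512, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 3 * img_size * img_size), nn.Tanh())
        self.img_size = img_size

    def forward(self, w):
        return self.net(w).view(-1, 3, self.img_size, self.img_size)

def invert(G, target, latent_dim=512, steps=200, lr=0.05):
    """Optimize a latent code w so that G(w) reconstructs the target image."""
    w = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(G(w), target)  # + perceptual terms in practice
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()

G = Generator().eval()
target = torch.rand(1, 3, 64, 64) * 2 - 1     # stand-in for a real face photo
w_inv = invert(G, target)                     # embed the image into the latent space
smile_dir = torch.randn(1, 512)               # hypothetical attribute direction
edited = G(w_inv + 1.5 * smile_dir)           # edit by moving along that direction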

Next, we move on to the controllable generation of general images beyond faces. 
Rather than using GANs, which mainly work within specific domains (e.g., 
faces), we opt for diffusion models, which have shown impressive expressivity 
in synthesizing complex, general images. With pretraining, we propose a unified 
architecture that boosts various kinds of image-to-image translation tasks.
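To illustrate how a diffusion model can be conditioned on a source image for 
image-to-image translation, the simplified training step below concatenates 
the conditioning image with the noisy target and trains a denoiser to predict 
the added noise; the toy Denoiser network, the noise schedule, and the random 
image pair are illustrative assumptions, not the proposed architecture.

# Simplified conditional-diffusion training step for image-to-image translation:
# the denoiser sees the noisy target concatenated with the source (condition)
# image and learns to predict the noise that was added.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy stand-in for a U-Net; input is 6 channels (noisy target + condition)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x, cond, t):
        return self.net(torch.cat([x, cond], dim=1))  # t is ignored in this toy model

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on random stand-in data (a source -> target image pair).
source = torch.rand(4, 3, 32, 32)
target = torch.rand(4, 3, 32, 32)
t = torch.randint(0, T, (4,))
noise = torch.randn_like(target)
a = alphas_cumprod[t].view(-1, 1, 1, 1)
noisy_target = a.sqrt() * target + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x_0)
loss = nn.functional.mse_loss(model(noisy_target, source, t), noise)
opt.zero_grad(); loss.backward(); opt.step()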

Besides 2D images, we also extend this pretraining philosophy to 3D content 
creation. We propose a 3D generative model that uses a diffusion model to 
automatically generate 3D avatars represented as neural radiance fields. The 
digital avatars generated by our model compare favorably to those produced by 
prior generative works. Building upon this foundation generative model for 
avatars, we also demonstrate 3D avatar creation from an image or a text prompt 
while supporting text-based semantic editing.
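This design couples two components: a diffusion model operating on a compact 
avatar representation and a neural-radiance-field style decoder that renders 
it. The sketch below caricatures that split with a toy reverse-diffusion loop 
that samples an "avatar code" and a placeholder decoder mapping 3D points plus 
the code to density and color; every module name, shape, and schedule here is 
an illustrative assumption rather than the thesis's actual architecture.

# Toy two-stage sampler: (1) reverse diffusion produces a compact avatar code,
# (2) a NeRF-style decoder maps 3D points + code to density and color, which a
# volume renderer would then integrate along camera rays into pixels.
import torch
import torch.nn as nn

class CodeDenoiser(nn.Module):
    """Predicts the noise on a latent avatar code (stand-in for the diffusion net)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, z, t):
        return self.net(z)  # a real model would also condition on the timestep t

class AvatarNeRF(nn.Module):
    """Maps (3D point, avatar code) -> (density, RGB); placeholder decoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + dim, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, pts, z):
        out = self.net(torch.cat([pts, z.expand(pts.shape[0], -1)], dim=-1))
        return out[:, :1].relu(), out[:, 1:].sigmoid()  # density, color

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

denoiser, nerf = CodeDenoiser(), AvatarNeRF()
z = torch.randn(1, 256)                    # start reverse diffusion from pure noise
for t in reversed(range(T)):               # simplified DDPM-style updates
    eps = denoiser(z, t)
    z = (z - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
    if t > 0:
        z = z + betas[t].sqrt() * torch.randn_like(z)

pts = torch.rand(1024, 3) * 2 - 1          # sample points along hypothetical camera rays
density, color = nerf(pts, z)              # would be alpha-composited into an image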


Date:			Tuesday, 9 May 2023

Time:			4:00pm - 6:00pm

Venue:			Room 4475 (Lifts 25/26)

Committee Members:	Dr. Qifeng Chen (Supervisor)
 			Prof. Pedro Sander (Chairperson)
 			Dr. Long Chen
 			Dr. Dan Xu


**** ALL are Welcome ****