High-Quality Visual Content Creation with Foundation Generative Models
PhD Thesis Proposal Defence

Title: "High-Quality Visual Content Creation with Foundation Generative Models"

by

Mr. Tengfei WANG

Abstract:

The increasing demand for high-quality visual content, encompassing 2D images and 3D models, is evident across various applications such as virtual reality, video games, animation, and interactive design. However, creating such visual content can be laborious and time-consuming, as it requires a combination of artistic expertise and proficiency in 2D painting or 3D modeling pipelines. Recent advancements in deep generative models have led to the emergence of Artificial Intelligence Generated Content (AIGC) technology, which enables the creation of high-quality visual content on an unprecedented scale and at remarkable speed. Over the years, a plethora of generative models with task-specific designs has progressively advanced generation quality under different control conditions, e.g., editing attributes, semantic masks, sketches, and text prompts. Nonetheless, as these models grow in size, they demand increasing amounts of training data and computing resources. Acquiring such large-scale data is often challenging due to concerns about copyright, privacy, and collection costs. Consequently, the limited availability of data can hamper the quality of generated content. Inspired by the tremendous success of model pretraining in visual understanding and natural language processing, this thesis aims to explore a new generative paradigm that leverages well-trained foundation generative models to boost visual content creation, covering both 2D image synthesis and 3D model rendering. The fundamental idea is to cultivate and exploit the knowledge in pretrained generative models, which have already captured the natural image manifold, as a generative prior. With the powerful capacity of foundation generative models, we can unify various synthesis tasks and achieve unprecedented performance.

We begin this thesis with high-fidelity face image editing, where we embed real face images into the latent space of well-trained generative adversarial networks (GANs), allowing for various attribute edits in the latent space within a unified model. To achieve this, we present a high-fidelity GAN inversion framework that enables fast attribute editing while preserving image-specific details such as background, appearance, and illumination. Next, we move on to the controllable generation of general images beyond faces. Rather than using GANs, which mainly work for specific domains (e.g., faces), we opt for diffusion models, which have shown impressive expressivity in synthesizing complex and general images. With pretraining, we propose a unified architecture that boosts various kinds of image-to-image translation tasks. Besides 2D images, we also extend this pretraining philosophy to 3D content creation. We propose a 3D generative model that uses a diffusion model to automatically generate 3D avatars represented as neural radiance fields. The digital avatars generated by our model compare favorably with those produced by prior generative works. Building upon this foundation generative model for avatars, we also demonstrate 3D avatar creation from an image or a text prompt while allowing for text-based semantic editing.

Date: Tuesday, 9 May 2023
Time: 4:00pm - 6:00pm
Venue: Room 4475 (lifts 25/26)

Committee Members:
Dr. Qifeng Chen (Supervisor)
Prof. Pedro Sander (Chairperson)
Dr. Long Chen
Dr. Dan Xu

**** ALL are Welcome ****