Talking Head Video Diffusion Generation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Talking Head Video Diffusion Generation"

by

WANG Yucheng

Abstract:

Generating natural and expressive talking head videos has proven to be a 
challenging task, as it involves heterogeneous, multi-source conditions. 
While GAN-based methods have achieved notable progress, several 
difficulties remain unaddressed, including precise alignment of lip 
movements with the audio signal, preservation of the original image's 
identity, and temporal consistency of the generated video. We therefore 
present LFDHead, a method that uses two pipelines, one at the spatial 
level and one at the temporal level, to improve the consistency of the 
generated results. In the Multimodal-to-Latent Diffusion Pipeline, we 
incorporate the encoded original image as a condition at each diffusion 
step, which enhances spatial consistency and preserves identity (see the 
first sketch below). In the Latent-Fusion Rendering Pipeline, we use 
latent fusion techniques to efficiently obtain temporal latent 
information from neighboring frames in the sequence and generate 
continuous frames (see the second sketch below). Extensive experiments 
demonstrate the generation consistency of the proposed method.
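
The abstract describes the conditioning mechanism only at a high level. 
Below is a minimal PyTorch sketch of how an encoded original image can be 
re-injected as a condition at every reverse diffusion step. The module 
names, feature sizes, and the deterministic DDIM-style update (eta = 0) 
are illustrative assumptions, not LFDHead's actual components.

    # Sketch: re-injecting the encoded original image as a condition at
    # every reverse diffusion step. All names and sizes are illustrative.
    import torch
    import torch.nn as nn

    class CondDenoiser(nn.Module):
        """Toy noise predictor conditioned on audio and identity features."""
        def __init__(self, latent_dim=64, cond_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + 1 + cond_dim, 256),
                nn.SiLU(),
                nn.Linear(256, latent_dim),
            )

        def forward(self, z_t, t, cond):
            # Concatenate noisy latent, normalized timestep, and conditions.
            t_feat = t.float().unsqueeze(-1) / 1000.0
            return self.net(torch.cat([z_t, t_feat, cond], dim=-1))

    @torch.no_grad()
    def reverse_step(model, z_t, t, audio_feat, id_feat, alphas_cumprod):
        # The identity feature joins the audio feature at *every* step,
        # so the sampler is repeatedly anchored to the original image.
        cond = torch.cat([audio_feat, id_feat], dim=-1)
        eps = model(z_t, t, cond)
        a_t = alphas_cumprod[t].unsqueeze(-1)
        a_prev = (alphas_cumprod[t - 1].unsqueeze(-1)
                  if int(t.min()) > 0 else torch.ones_like(a_t))
        # Predict the clean latent, then step toward it (DDIM, eta = 0).
        z0_pred = (z_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps

    # Toy usage with random stand-ins for the real encoders:
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    model = CondDenoiser()
    z = torch.randn(2, 64)          # noisy frame latents
    audio = torch.randn(2, 96)      # per-frame audio features
    identity = torch.randn(2, 32)   # encoded original image
    for step in reversed(range(T)):
        t = torch.full((2,), step, dtype=torch.long)
        z = reverse_step(model, z, t, audio, identity, alphas_cumprod)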

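Similarly, the second sketch illustrates temporal latent fusion, assuming 
a simple normalized Gaussian window over neighboring frame latents. The 
window radius and weighting rule are assumptions, since the abstract does 
not specify the exact fusion operator.

    # Sketch: each frame's latent is blended with its temporal neighbors'
    # latents using a normalized Gaussian window (illustrative choice).
    import torch

    def fuse_latents(latents, radius=2, sigma=1.0):
        """latents: (num_frames, latent_dim) tensor of per-frame latents."""
        n = latents.shape[0]
        fused = torch.empty_like(latents)
        for i in range(n):
            lo, hi = max(0, i - radius), min(n, i + radius + 1)
            offsets = torch.arange(lo, hi, dtype=torch.float32) - i
            weights = torch.exp(-0.5 * (offsets / sigma) ** 2)
            weights = weights / weights.sum()      # normalize the window
            fused[i] = (weights.unsqueeze(-1) * latents[lo:hi]).sum(dim=0)
        return fused

    frames = torch.randn(16, 64)    # stand-in per-frame latents
    smooth = fuse_latents(frames)   # temporally fused latents

Any normalized blending kernel plays the same role here: it pulls each 
frame's latent toward its temporal neighbors before rendering, which 
suppresses flicker between consecutive frames.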

Date            : 29 April 2024 (Monday)

Time            : 15:00 - 15:40

Venue           : Room 5501 (near lifts 25/26), HKUST

Advisor         : Dr. XU Dan

2nd Reader      : Dr. CHEN Long