PhD Thesis Proposal Defence


Title: "Deep Generative Models for Controllable Talking Head Video Synthesis"

by

Mr. Fating HONG


Abstract:

Talking head video generation is a pivotal task in computer vision, with 
persistent challenges in photorealism, controllability, and structural 
consistency. This thesis presents a comprehensive investigation of these 
challenges, proposing novel deep generative frameworks organized into three 
research paradigms.

The first paradigm focuses on highly expressive and photorealistic 
video-driven synthesis. We introduce two contributions that ensure 
3D-consistent synthesis and robust identity preservation: 1) DaGAN and 
DaGAN++ integrate explicit 3D geometry via self-supervised depth estimation, 
significantly improving structural and pose realism. 2) The Implicit Identity 
Representation Conditioned Memory Compensation Network (MCNet) retrieves 
identity-consistent priors from a global facial meta-memory bank to 
compensate for the occlusions and artifacts caused by large head poses.

The second paradigm addresses controllable, conflict-free, and view-robust 
audio-driven synthesis. 1) ACTalker is an audio-visually controlled diffusion 
model that uses a novel Parallel-Control Mamba (PCM) layer to efficiently 
resolve control conflicts between the audio signal (driving the mouth) and 
visual motion signals (driving expression), enabling harmonious multi-modal 
control. 2) A Free-viewpoint Animation diffusion network supports substantial 
viewpoint variations (including zoom) by using multiple reference images and 
a pose correlation module, generalizing 2D synthesis to cinematic 
perspectives while preserving identity.

Finally, we transition to the domain of explicit 3D head modeling for 
applications like VR. This ongoing work explores integrating 2D neural 
renderers with high-fidelity deformation models to create fully controllable, 
volumetric 3D avatars, overcoming the inherent limitations of 2D methods.

Extensive experiments on benchmark datasets demonstrate that the proposed 
methods achieve state-of-the-art performance across 2D video-driven and 
audio-driven generation tasks. Together, these contributions advance 
high-fidelity, temporally consistent, and highly controllable talking head 
synthesis and lay the groundwork for future 3D head modeling research.


Date:                   Wednesday, 10 December 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 2128A
                        Lift 19

Committee Members:      Dr. Dan Xu (Supervisor)
                        Dr. Hao Chen (Chairperson)
                        Dr. Qifeng Chen