PhD Thesis Proposal Defence
Title: "Deep Generative Models for Controllable Talking Head Video Synthesis"
by
Mr. Fating HONG
Abstract:
Talking head video generation is a pivotal task in computer vision, yet it
still faces challenges in photorealism, controllable generation, and
structural consistency. This thesis presents a comprehensive investigation
into these challenges, proposing novel deep generative frameworks organized
into three research paradigms.
The first paradigm focuses on highly expressive and photorealistic
video-driven synthesis. We introduce contributions that ensure 3D-consistent
synthesis and robust identity preservation: 1) DaGAN and DaGAN++ integrate
explicit 3D geometry via self-supervised depth estimation, significantly
improving structural and pose realism. 2) The Implicit Identity
Representation Conditioned Memory Compensation Network (MCNet) retrieves
identity-consistent priors from a global facial meta-memory bank to
compensate for the occlusions and artifacts caused by large poses.
The second paradigm addresses controllable, conflict-free, and view-robust
audio-driven synthesis. 1) ACTalker is a diffusion model controlled by both
audio and visual signals; its novel Parallel-Control Mamba (PCM) layer
efficiently resolves control conflicts between the audio signal (driving the
mouth) and visual motion signals (driving the expression), enabling
harmonious multi-modal control. 2) A Free-viewpoint Animation diffusion
network is introduced to support substantial viewpoint variations (including
zoom) by using multiple references and a pose correlation module,
generalizing 2D synthesis to cinematic perspectives while preserving
identity.
Finally, we transition to the domain of explicit 3D head modeling for
applications like VR. This ongoing work explores integrating 2D neural
renderers with high-fidelity deformation models to create fully controllable,
volumetric 3D avatars, overcoming the inherent limitations of 2D methods.
Extensive experiments on benchmark datasets demonstrate that the proposed
methods achieve state-of-the-art performance on both 2D video-driven and
audio-driven generation tasks. Together, these contributions represent
significant advances in high-fidelity, temporally consistent, and highly
controllable talking head synthesis, and lay the groundwork for future 3D
head modeling research.
Date: Wednesday, 10 December 2025
Time: 10:00am - 12:00noon
Venue: Room 2128A (Lift 19)
Committee Members: Dr. Dan Xu (Supervisor)
Dr. Hao Chen (Chairperson)
Dr. Qifeng Chen