Photorealistic Digital Humans: From Controllable Talking Heads to Free‑Viewpoint 4D Avatars

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Photorealistic Digital Humans: From Controllable Talking Heads to
Free‑Viewpoint 4D Avatars"

By

Mr. Fa-Ting HONG


Abstract:

Human-centric video generation, encompassing facial synthesis and full-body 
animation, is crucial for virtual reality and digital content creation. 
However, accurately capturing complex human dynamics remains challenging. Key 
bottlenecks include maintaining 3D structural consistency, handling severe 
occlusions caused by large pose variations, harmonizing conflicting multi-modal driving 
signals, and preserving rendering quality across extreme viewpoints and 
scales. Existing methods often struggle to balance these factors, resulting 
in structural artifacts, control conflicts, and degraded visual fidelity. 
This thesis addresses these challenges through a series of novel deep 
generative frameworks, progressing from localized talking head synthesis to 
free-viewpoint 4D human animation. First, to ensure 3D structural integrity 
in facial generation, it introduces DaGAN and DaGAN++, leveraging 
self-supervised dense geometry learning. To handle severe occlusions, it 
develops MCNet, which queries a global facial meta-memory via implicit 
identity representations to robustly recover missing regions. Second, it 
resolves multi-modal control conflicts with ACTalker, a video diffusion model 
utilizing a Parallel-Control Mamba architecture to harmoniously synchronize 
audio-driven lip movements with visually driven global expressions. Third, 
expanding to full-body dynamics under novel views, it proposes a 
free-viewpoint diffusion framework using pose-correlated adaptive token 
selection to efficiently aggregate appearance context from multiple 
references. Finally, to overcome the scale inconsistency of discrete 3D 
Gaussian Splatting, it introduces ContinueAvatar. By integrating a continuous 
feature splatting mechanism, this framework formulates rendering as a 
continuous spatial function to achieve arbitrary-resolution 4D avatars. 
Extensive empirical evaluations demonstrate that these contributions achieve 
state-of-the-art realism, controllability, and multi-scale robustness.


Date:                   Wednesday, 6 May 2026

Time:                   10:00am - 12:00noon

Venue:                  Room 2132C
                        Lift 22

Chairman:               Prof. Qinglu ZENG (OCES)

Committee Members:      Dr. Dan XU (Supervisor)
                        Dr. Long CHEN
                        Dr. Qifeng CHEN
                        Dr. Wenhan LUO (AMC)
                        Prof. Lu SHENG (Beihang University)