Nuisance Variability Reduction in Audiovisual Data for Lip-to-Speech Synthesis

PhD Thesis Proposal Defence


Title: "Nuisance Variability Reduction in Audiovisual Data for Lip-to-Speech 
Synthesis"

by

Mr. Zhe NIU


Abstract:

Lip-to-speech (LTS) synthesis aims to reconstruct intelligible and natural 
speech from silent video of a talking face. Recent neural LTS systems have 
made substantial progress in mapping visual articulations to acoustic 
features, yet they are typically trained on audiovisual data that contain 
large amounts of variability unrelated to the underlying linguistic content. 
Examples include changes in mouth position and scale, and audio-visual 
time offsets. From the perspective of learning the lip-to-speech mapping, 
these factors act as nuisance variables: they do not encode lexical 
information, yet the model must expend capacity to cope with them.

This thesis adopts a nuisance variability reduction view of LTS and argues 
that explicitly modeling and reducing such variability in the inputs is 
crucial for building efficient, robust, and high-quality systems. We focus 
on two dominant forms of nuisance variability in realistic LTS pipelines: 
spatial variability in the mouth region and temporal variability in 
audio-visual synchronization.

For spatial variability, we propose the Mouth Alignment Network (MAN), an 
end-to-end module that predicts affine transformations directly from raw 
frames and uses a spatial transformer to align each frame to a canonical 
face configuration, stabilizing the apparent mouth position, scale, and 
orientation. To assess and further improve alignment, we introduce the Mouth 
Scoring Network (MSN), a learned quality estimator trained on clean and 
synthetically perturbed mouth sequences, whose scores guide the training of MAN. For 
temporal variability, we propose the Synchronized Lip-to-Speech (SLTS) 
framework, which explicitly models and corrects audio-visual time 
offsets. SLTS combines an Automatic Synchronization Mechanism for offset 
estimation and compensation with a synchronization loss that penalizes 
internal temporal drift, and incorporates a time-alignment step for standard 
time-sensitive evaluation metrics, which otherwise fail on desynchronized audiovisual 
data. Overall, we show that treating misalignment as nuisance variability 
and reducing it at the input level provides a practical, model-agnostic way 
to improve lip-to-speech synthesis systems.
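
For illustration, the following is a minimal sketch of the kind of affine
alignment module the MAN paragraph describes: a small localization network
regresses six affine parameters from a raw frame, and a spatial transformer
(differentiable grid sampling) warps the frame toward a canonical
configuration. The module name, layer sizes, frame resolution, and identity
initialization are illustrative assumptions, not the thesis implementation.

    # Minimal sketch of an affine-alignment module (assumed design, PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AffineAligner(nn.Module):
        """Regresses a per-frame affine transform and warps the frame with a
        differentiable spatial transformer (affine_grid + grid_sample)."""

        def __init__(self):
            super().__init__()
            # Small CNN that predicts 6 affine parameters from a raw RGB frame.
            self.localizer = nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(32 * 4 * 4, 6),
            )
            # Start at the identity transform so training begins with "no warp".
            self.localizer[-1].weight.data.zero_()
            self.localizer[-1].bias.data.copy_(
                torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

        def forward(self, frames):                  # frames: (B, 3, H, W)
            theta = self.localizer(frames).view(-1, 2, 3)
            grid = F.affine_grid(theta, frames.size(), align_corners=False)
            return F.grid_sample(frames, grid, align_corners=False)

    aligned = AffineAligner()(torch.randn(2, 3, 96, 96))
    print(aligned.shape)                            # torch.Size([2, 3, 96, 96])

Because the warp is differentiable, such a module can be trained end-to-end
with the downstream synthesis loss (and, in the proposal, guided by MSN
scores), rather than relying on a separate face-alignment preprocessing step.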
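The offset estimation and compensation idea behind SLTS can likewise be
sketched as a search over candidate shifts between frame-level audio and
video embeddings, followed by shifting the audio stream before computing
time-sensitive losses or metrics. The function names, the cosine-similarity
scoring, and the +/-15-frame search window below are illustrative
assumptions, not the thesis implementation.

    # Minimal sketch of audio-visual offset estimation/compensation (assumed design).
    import torch
    import torch.nn.functional as F

    def estimate_offset(video_emb, audio_emb, max_shift=15):
        """video_emb, audio_emb: (T, D) frame-level embeddings at equal rates.
        Returns the shift (in frames) maximizing mean cosine similarity."""
        best_shift, best_score = 0, float("-inf")
        for s in range(-max_shift, max_shift + 1):
            if s >= 0:
                v, a = video_emb[s:], audio_emb[:len(audio_emb) - s]
            else:
                v, a = video_emb[:s], audio_emb[-s:]
            score = F.cosine_similarity(v, a, dim=-1).mean().item()
            if score > best_score:
                best_score, best_shift = score, s
        return best_shift

    def compensate(audio_emb, shift):
        """Shift the audio stream by the estimated offset so frame t of video
        aligns with frame t of audio (wrap-around ignored in this sketch)."""
        return torch.roll(audio_emb, shifts=shift, dims=0)

    T, D = 100, 64
    video = torch.randn(T, D)
    audio = torch.roll(video, shifts=-3, dims=0) + 0.05 * torch.randn(T, D)
    shift = estimate_offset(video, audio)
    print(shift)                       # expected: 3 (audio leads video by 3 frames)
    realigned = compensate(audio, shift)

The same realignment step can be applied before time-sensitive evaluation
metrics, which is the role of the time-alignment step mentioned in the
abstract.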


Date:                   Thursday, 11 December 2025

Time:                   2:15pm - 4:00pm

Venue:                  Room 2128B
                        Lift 19

Committee Members:      Dr. Brian Mak (Supervisor)
                        Prof. Nevin Zhang (Chairperson)
                        Dr. Dan Xu