PhD Thesis Proposal Defence
Title: "Nuisance Variability Reduction in Audiovisual Data for Lip-to-Speech
Synthesis"
by
Mr. Zhe NIU
Abstract:
Lip-to-speech (LTS) synthesis aims to reconstruct intelligible and natural
speech from silent video of a talking face. Recent neural LTS systems have
made substantial progress in mapping visual articulations to acoustic
features, yet they are typically trained on audiovisual data that contain
large amounts of variability unrelated to the underlying linguistic content.
Examples include changes in mouth position and scale, and audio-visual
time offsets. From the perspective of learning the lip-to-speech mapping,
these factors act as nuisance variables: they do not encode lexical
information, yet the model must expend capacity to cope with them.
This thesis adopts a nuisance variability reduction view of LTS and argues
that explicitly modeling and reducing such variability in the inputs is
crucial for building efficient, robust, and high-quality systems. We focus
on two dominant forms of nuisance variability in realistic LTS pipelines:
spatial variability in the mouth region and temporal variability in
audio-visual synchronization.
For spatial variability, we propose the Mouth Alignment Network (MAN), an
end-to-end module that predicts affine transformations directly from raw
frames and uses a spatial transformer to align each frame to a canonical
face configuration, stabilizing the apparent mouth position, scale, and
orientation. To assess and further improve alignment, we introduce the Mouth
Scoring Network (MSN), a learned quality estimator trained on clean and
synthetically perturbed mouth sequences to enhance the learning of MAN. For
temporal variability, we propose the Synchronized Lip-to-Speech (SLTS)
framework, which explicitly models and corrects audio-visual time
offsets. SLTS combines an Automatic Synchronization Mechanism for offset
estimation and compensation with a synchronization loss that penalizes
internal temporal drift, and incorporates a time-alignment step for standard
time-sensitive evaluation metrics, which otherwise fail on desynchronized audiovisual
data. Overall, we show that treating misalignment as nuisance variability
and reducing it at the input level provides a practical, model-agnostic way
to improve lip-to-speech synthesis systems.
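
As an informal illustration of the spatial-transformer alignment idea mentioned
in the abstract (not the thesis's MAN implementation), the minimal PyTorch
sketch below predicts per-frame affine parameters and warps each frame toward a
canonical configuration; the tiny regressor theta_net is a hypothetical
placeholder.

    # Minimal spatial-transformer-style alignment sketch (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AffineAligner(nn.Module):
        def __init__(self):
            super().__init__()
            # Hypothetical tiny regressor that predicts 6 affine parameters per frame.
            self.theta_net = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 6),
            )
            # Start from the identity transform so early training applies no warp.
            self.theta_net[-1].weight.data.zero_()
            self.theta_net[-1].bias.data.copy_(
                torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

        def forward(self, frames):                      # frames: (B, 3, H, W)
            theta = self.theta_net(frames).view(-1, 2, 3)
            grid = F.affine_grid(theta, frames.size(), align_corners=False)
            return F.grid_sample(frames, grid, align_corners=False)

Similarly, the sketch below shows one generic way to estimate a discrete
audio-visual offset by scoring candidate shifts between per-frame visual and
audio embeddings; the embeddings and shift range are assumptions made for
illustration, not the SLTS synchronization mechanism itself.

    # Offset estimation by exhaustive search over small shifts (illustrative only).
    import torch

    def estimate_offset(vis_emb, aud_emb, max_shift=5):
        """vis_emb, aud_emb: (T, D) L2-normalised per-frame embeddings."""
        scores = []
        for s in range(-max_shift, max_shift + 1):
            if s >= 0:
                v, a = vis_emb[s:], aud_emb[:vis_emb.size(0) - s]
            else:
                v, a = vis_emb[:s], aud_emb[-s:]
            scores.append((v * a).sum(dim=1).mean())    # mean cosine similarity at this shift
        return int(torch.stack(scores).argmax()) - max_shift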
Date: Thursday, 11 December 2025
Time: 2:15pm - 4:00pm
Venue: Room 2128B (Lift 19)
Committee Members: Dr. Brian Mak (Supervisor)
Prof. Nevin Zhang (Chairperson)
Dr. Dan Xu