Toward Unified Multimodal Models for Audio-Video Understanding and Generation: A Survey
PhD Qualifying Examination
Title: "Toward Unified Multimodal Models for Audio-Video Understanding and
Generation: A Survey"
by
Mr. Trung Kien PHAM
Abstract:
The convergence of audio and video modalities has become a central topic in
multimodal learning, driven by the demand for models that can both understand
and generate audio-visual content. While significant progress has been made
independently in audio-video generation (e.g., video-to-audio, audio-to-video,
and joint audio-video generation) and audio-video understanding (e.g.,
audio-visual question answering, segmentation, and source localization), these
two research streams have largely evolved in isolation. Meanwhile, recent
advances in unified multimodal models, which integrate autoregressive language
modeling with diffusion-based generation within a single framework, have
demonstrated remarkable success in the image-text domain but remain largely
unexplored for joint audio-video modalities. This survey provides a
comprehensive review of the rapidly evolving landscape of large audio-video
multimodal models, spanning three interconnected pillars: (1) audio-video
generation, covering cross-modal (video-to-audio and audio-to-video) synthesis
and joint audio-video generation with emphasis on modern Diffusion Transformer
(DiT) architectures and diffusion/flow-matching modeling; (2) audio-video
understanding, encompassing both task-specific approaches and multimodal large
language model (MLLM)-based methods for holistic audio-visual comprehension;
and (3) unified understanding and generation models, exploring emerging
paradigms that aim to bridge perception and creation within a single model. We
further discuss audio-video tokenization strategies, datasets and benchmarks,
and evaluation protocols, and identify key open challenges, including long-form
generation, real-time interaction, spatial realization, and the largely
untrodden path toward truly unified audio-video foundation models.
Date: Monday, 4 May 2026
Time: 3:00pm - 5:00pm
Venue: Room 2132C
Lift 22
Committee Members: Dr. Long Chen (Supervisor)
Dr. Qifeng Chen (Co-supervisor)
Prof. Pedro Sander (Chairperson)
Dr. Dan Xu