Toward Unified Multimodal Models for Audio-Video Understanding and Generation: A Survey

PhD Qualifying Examination


Title: "Toward Unified Multimodal Models for Audio-Video Understanding and 
Generation: A Survey"

by

Mr. Trung Kien PHAM


Abstract:

The convergence of audio and video modalities has become a central topic in 
multimodal learning, driven by the demand for models that can both understand 
and generate audio-visual content. While significant progress has been made 
independently in audio-video generation (e.g., video-to-audio, audio-to-video, 
and joint audio-video generation) and audio-video understanding (e.g., 
audio-visual question answering, segmentation, and source localization), these 
two research streams have largely evolved in isolation. Meanwhile, recent 
advances in unified multimodal models, which integrate autoregressive language 
modeling with diffusion-based generation within a single framework, have 
demonstrated remarkable success in the image-text domain but remain largely 
unexplored for joint audio-video modalities. This survey provides a 
comprehensive review of the rapidly evolving landscape of large audio-video 
multimodal models, spanning three interconnected pillars: (1) audio-video 
generation, covering cross-modal (video-to-audio and audio-to-video) synthesis 
and joint audio-video generation with emphasis on modern Diffusion Transformer 
(DiT) architectures and diffusion/flow-matching modeling; (2) audio-video 
understanding, encompassing both task-specific approaches and multimodal large 
language model (MLLM)-based methods for holistic audio-visual comprehension; 
and (3) unified understanding and generation models, exploring emerging 
paradigms that aim to bridge perception and creation within a single model. We 
further discuss audio-video tokenization strategies, datasets and benchmarks, 
and evaluation protocols, and identify key open challenges, including long-form 
generation, real-time interaction, spatial realization, and the largely 
untrodden path toward truly unified audio-video foundation models.


Date:                   Monday, 4 May 2026

Time:                   3:00pm - 5:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Dr. Long Chen (Supervisor)
                        Dr. Qifeng Chen (Co-supervisor)
                        Prof. Pedro Sander (Chairperson)
                        Dr. Dan Xu