PhD Qualifying Examination


Title: "Unifying Understanding and Generation in Vision Language Models: 
A Survey of Vision View"

by

Mr. Xiaocheng LU


Abstract:

Despite significant advances in vision-language models along two distinct
trajectories, visual understanding and visual generation, the divergence in
their underlying paradigms has hindered seamless unification. Unifying these
tasks matters because effective understanding can substantially enhance
generation quality. The success of models such as GPT-4o, which integrates
understanding and generation in a single framework, demonstrates the potential
of this approach to produce more coherent and controllable multimodal content.
This survey provides an overview of this rapidly converging field, classifying
unified models by their core visual generation mechanism: discrete visual
generation, which aligns with text-based autoregressive modeling for tighter
text integration, and continuous visual generation, which prioritizes image
fidelity and smoother synthesis. We analyze their structural designs, key
innovations, and open challenges, including tokenization strategies,
cross-modal attention, and data management. A comprehensive compilation of
relevant datasets and benchmarks is also provided.


Date:                   Wednesday, 26 November 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 5564
                        Lift 27/28

Committee Members:      Prof. Song Guo (Supervisor)
                        Dr. Wei Wang (Chairperson)
                        Dr. Binhang Yuan