PhD Thesis Proposal Defence


Title: "Towards Open-World Visual Perception: From Vision-Language 
Pretraining to Open Vocabulary Object Detection"

by

Mr. Lewei YAO


Abstract:

Visual perception - the ability to interpret and understand visual 
information - is fundamental to both human cognition and artificial 
intelligence, playing a crucial role in areas like autonomous driving, 
robotics, and augmented reality. While traditional visual perception systems 
perform effectively in controlled, predefined settings, they often struggle 
in open-world scenarios that require adaptability to unseen objects and 
complex contexts. To overcome this limitation, we investigate open-world 
visual perception, a paradigm designed to enable systems to recognize and 
understand a wide range of visual concepts without the need for predefined 
categories. This research focuses on two key areas: Vision-Language 
Pretraining (VLP) and Open Vocabulary Object Detection (OVD), examining their 
potential to extend the boundaries of visual perception in unstructured, 
real-world settings.

Our first contribution is FILIP, a VLP model with a fine-grained contrastive 
pretraining mechanism that aligns visual and textual tokens. FILIP's 
cross-modal late interaction enables token-wise matching between image patches 
and text words, delivering significant performance improvements across 
multiple downstream tasks and laying a solid foundation for more complex 
visual perception tasks such as open-vocabulary object detection.
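
To make the late-interaction idea concrete, the sketch below shows one way 
such token-wise matching can be computed: each image patch is matched to its 
most similar text token (and vice versa), and the per-token maxima are 
averaged into an image-text similarity used for in-batch contrastive 
learning. This is an illustrative PyTorch sketch under assumed tensor shapes; 
the function names and the simplified loss are ours, not FILIP's actual 
implementation.

import torch
import torch.nn.functional as F

def filip_late_interaction(image_tokens, text_tokens):
    # image_tokens: (B, P, D) patch embeddings; text_tokens: (B, T, D) word embeddings
    img = F.normalize(image_tokens, dim=-1)
    txt = F.normalize(text_tokens, dim=-1)
    # pairwise token similarities between every image and every text: (B, B, P, T)
    sim = torch.einsum('ipd,jtd->ijpt', img, txt)
    # image-to-text: each patch takes its best-matching word, then average over patches
    i2t = sim.max(dim=-1).values.mean(dim=-1)   # (B, B)
    # text-to-image: each word takes its best-matching patch, then average over words
    t2i = sim.max(dim=-2).values.mean(dim=-1)   # (B, B)
    return i2t, t2i

def contrastive_loss(i2t, t2i, temperature=0.07):
    # standard in-batch contrastive objective; matched pairs lie on the diagonal
    labels = torch.arange(i2t.size(0), device=i2t.device)
    loss_i = F.cross_entropy(i2t / temperature, labels)
    loss_t = F.cross_entropy(t2i.t() / temperature, labels)
    return 0.5 * (loss_i + loss_t)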

Building on FILIP, we introduce the DetCLIP series, a suite of models 
tailored for OVD that enhance object localization in open-domain contexts. 
The series - comprising DetCLIP, DetCLIPv2, and DetCLIPv3 - progressively 
refines and expands OVD capabilities. Specifically, DetCLIP introduces a 
parallel concept formulation and a curated concept dictionary, achieving 
strong zero-shot detection performance. Inspired by FILIP, DetCLIPv2 
integrates fine-grained word-region alignment and hybrid supervision from 
large-scale image-text pairs, improving both training efficiency and 
scalability. DetCLIPv3 further broadens the model's scope by incorporating 
generative captioning and hierarchical labeling, setting new benchmarks in 
both object detection and dense captioning.
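
As a rough illustration of the parallel concept formulation, the sketch below 
encodes each concept (a category name or phrase) independently with a text 
encoder and scores detected regions against all concepts by cosine 
similarity, so the concept set can be extended at inference time without 
retraining. This is an assumed, simplified sketch: text_encoder, the tensor 
shapes, and the temperature are placeholders rather than the actual DetCLIP 
implementation.

import torch
import torch.nn.functional as F

def parallel_concept_logits(region_feats, concept_texts, text_encoder, temperature=0.07):
    # region_feats: (N, D) features of candidate regions from the detector
    # concept_texts: list of C category names/phrases, each encoded independently
    concept_embs = torch.stack([text_encoder(t) for t in concept_texts])  # (C, D)
    regions = F.normalize(region_feats, dim=-1)
    concepts = F.normalize(concept_embs, dim=-1)
    # open-vocabulary classification logits: cosine similarity between regions and concepts
    return regions @ concepts.t() / temperature   # (N, C)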

Through these contributions, this thesis addresses core challenges in 
open-world visual perception by developing models that recognize and localize 
objects beyond predefined categories. Our work presents a clear progression 
in model design, from foundational VLP techniques to advanced OVD frameworks, 
establishing a solid foundation for future research in dynamic, real-world 
settings. We hope these advancements can bridge the gap between artificial 
intelligence and human-like perception, offering new insights and 
methodologies that drive the development of adaptable, robust systems capable 
of interpreting and interacting within diverse, unstructured visual 
environments.


Date:                   Thursday, 14 November 2024

Time:                   4:00pm - 6:00pm

Venue:                  Room CYT-G001
                        Lifts 35/36

Committee Members:      Prof. Bo Li (Supervisor)
                        Dr. Wei Wang (Co-supervisor)
                        Dr. Shuai Wang (Chairperson)
                        Prof. Ke Yi