Towards Open-World Visual Perception: From Vision-Language Pretraining to Open Vocabulary Object Detection

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Open-World Visual Perception: From Vision-Language 
Pretraining to Open Vocabulary Object Detection"

By

Mr. Lewei YAO


Abstract:

Visual perception--the ability to interpret and understand visual 
information--is fundamental to both human cognition and artificial 
intelligence, playing a crucial role in areas like autonomous driving, 
robotics, and augmented reality. While traditional visual perception systems 
perform effectively in controlled, predefined settings, they often struggle 
in open-world scenarios that require adaptability to unseen objects and 
complex contexts. To overcome this limitation, we investigate open-world 
visual perception, a paradigm designed to enable systems to recognize and 
understand a wide range of visual concepts without the need for predefined 
categories. This research focuses on two key areas: Vision-Language 
Pretraining (VLP) and Open Vocabulary Object Detection (OVD), examining their 
potential to extend the boundaries of visual perception in unstructured, 
real-world settings.

Our first contribution is FILIP, a VLP model whose fine-grained contrastive 
pretraining mechanism aligns individual visual and textual tokens. FILIP’s 
cross-modal late interaction approach enables 
token-wise matching between image patches and text words, achieving 
significant performance improvements across multiple tasks and laying a solid 
foundation for more complex visual perception tasks such as open-vocabulary 
object detection.
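
The token-wise matching described above can be illustrated with a minimal 
sketch. It is not the thesis implementation: the function names, the toy 
embeddings, and the choice of plain Python lists are assumptions made for 
illustration. It assumes each image patch keeps its best-matching text token 
(and each text token its best-matching patch), with the per-token maxima then 
averaged into one image-text similarity score per direction.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_similarity(image_tokens, text_tokens):
    """Return (image-to-text, text-to-image) scores for one image-text pair.

    Hypothetical helper: inputs are lists of non-zero embedding vectors,
    one per image patch and one per text token.
    """
    img = [l2_normalize(v) for v in image_tokens]
    txt = [l2_normalize(v) for v in text_tokens]
    # sim[i][j] is the cosine similarity between patch i and text token j.
    sim = [[sum(a * b for a, b in zip(p, w)) for w in txt] for p in img]
    # Image-to-text: each patch keeps its best text token, then average.
    i2t = sum(max(row) for row in sim) / len(img)
    # Text-to-image: each text token keeps its best patch, then average.
    t2i = sum(max(row[j] for row in sim) for j in range(len(txt))) / len(txt)
    return i2t, t2i


# Toy embeddings (made up): three patches and two text tokens in 2-D.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
i2t, t2i = late_interaction_similarity(patches, words)
```

In a contrastive setup, scores of this kind would be computed for every 
image-text pair in a batch and fed to a contrastive loss; that step is 
omitted here.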

Building on FILIP, we further introduce the DetCLIP series, a suite of models 
tailored for OVD tasks to enhance object localization in open-domain 
contexts. The DetCLIP series--comprising DetCLIP, DetCLIPv2, and 
DetCLIPv3--progressively refines and expands OVD capabilities. Specifically, 
DetCLIP introduces parallel concept formulation and a curated concept 
dictionary, achieving strong zero-shot detection. Inspired by FILIP, 
DetCLIPv2 integrates fine-grained word-region alignment and hybrid 
supervision from large-scale image-text pairs, enhancing both training 
efficiency and scalability. DetCLIPv3 further broadens the model’s 
application by incorporating generative captioning and hierarchical labeling, 
setting new benchmarks in both object detection and dense captioning tasks.

Through these contributions, this thesis addresses core challenges in 
open-world visual perception by developing models that recognize and localize 
objects beyond predefined categories. Our work presents a clear progression 
in model design, from foundational VLP techniques to advanced OVD frameworks, 
establishing a solid foundation for future research in dynamic, real-world 
settings. We hope these advancements can bridge the gap between artificial 
intelligence and human-like perception, offering new insights and 
methodologies that drive the development of adaptable, robust systems capable 
of interpreting and interacting with diverse and unstructured visual 
contexts.


Date:                   Thursday, 9 January 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Yong HUANG (CHEM)

Committee Members:      Dr. Dan XU (Supervisor)
                        Dr. Qifeng CHEN
                        Prof. James KWOK
                        Dr. Wenhan LUO (EMIA)
                        Dr. Xuming HE (ShanghaiTech)