PhD Thesis Proposal Defence
Title: "Towards Open-World Visual Perception: From Vision-Language
Pretraining to Open Vocabulary Object Detection"
by
Mr. Lewei YAO
Abstract:
Visual perception - the ability to interpret and understand visual
information - is fundamental to both human cognition and artificial
intelligence, playing a crucial role in areas like autonomous driving,
robotics, and augmented reality. While traditional visual perception systems
perform effectively in controlled, predefined settings, they often struggle
in open-world scenarios that require adaptability to unseen objects and
complex contexts. To overcome this limitation, we investigate open-world
visual perception, a paradigm designed to enable systems to recognize and
understand a wide range of visual concepts without the need for predefined
categories. This research focuses on two key areas: Vision-Language
Pretraining (VLP) and Open Vocabulary Object Detection (OVD), examining their
potential to extend the boundaries of visual perception in unstructured,
real-world settings.
Our first contribution is FILIP, a VLP model that introduces a fine-grained
contrastive pretraining mechanism for aligning visual and textual tokens.
FILIP’s cross-modal late interaction enables token-wise matching between
image patches and text words, yielding significant performance improvements
across multiple tasks and laying a solid foundation for more complex visual
perception tasks such as open-vocabulary object detection.
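
To make the late-interaction idea concrete, the following is a minimal,
purely illustrative sketch in NumPy: the array shapes, function names, and
random features are assumptions chosen for demonstration, not FILIP's actual
implementation. It computes a token-wise similarity in the spirit of FILIP,
where each image patch is matched to its most similar word (and each word to
its most similar patch) and the matches are averaged.

    import numpy as np

    def l2_normalize(x, axis=-1):
        # Unit-normalize token embeddings so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=axis, keepdims=True)

    def late_interaction_similarity(image_tokens, text_tokens):
        # image_tokens: (n_patches, d) patch embeddings of one image
        # text_tokens:  (n_words, d) word embeddings of one caption
        img = l2_normalize(image_tokens)
        txt = l2_normalize(text_tokens)
        sim = img @ txt.T              # (n_patches, n_words) token-level similarities
        i2t = sim.max(axis=1).mean()   # each patch takes its best-matching word, then average
        t2i = sim.max(axis=0).mean()   # each word takes its best-matching patch, then average
        return i2t, t2i

    # Toy usage with random features (the 64-dim width is illustrative only).
    rng = np.random.default_rng(0)
    print(late_interaction_similarity(rng.normal(size=(49, 64)),
                                      rng.normal(size=(12, 64))))

In a CLIP-style contrastive objective, per-pair scores of this kind stand in
for a single global image-text similarity, which is what yields the
finer-grained alignment between patches and words.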
Building on FILIP, we further introduce the DetCLIP series, a suite of models
tailored for OVD tasks to enhance object localization in open-domain
contexts. The DetCLIP series - comprising DetCLIP, DetCLIPv2, and DetCLIPv3 -
progressively refines and expands OVD capabilities. Specifically, DetCLIP
introduces parallel concept formulation and a curated concept dictionary,
achieving strong zero-shot detection performance. Inspired by FILIP, DetCLIPv2 integrates
fine-grained word-region alignment and hybrid supervision from large-scale
image-text pairs, enhancing both training efficiency and scalability.
DetCLIPv3 further broadens the model’s applicability by incorporating
generative captioning and hierarchical labeling, setting new benchmarks in
both object detection and dense captioning tasks.
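
As a rough illustration of the parallel concept formulation, the sketch below
encodes each category name independently and scores detected regions against
the resulting embeddings. The stand-in text encoder, dimensions, and names
are hypothetical and only meant to show how an open vocabulary can be swapped
in at inference time without retraining the detector.

    import numpy as np

    def encode_concept(name, dim=64):
        # Hypothetical stand-in for a text encoder: in a parallel formulation each
        # concept is encoded on its own rather than concatenated into one long prompt.
        rng = np.random.default_rng(abs(hash(name)) % (2**32))
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v)

    def classify_regions(region_feats, concept_names):
        # region_feats: (n_regions, d) embeddings of candidate boxes from the detector
        concepts = np.stack([encode_concept(n, region_feats.shape[1]) for n in concept_names])
        regions = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
        scores = regions @ concepts.T  # (n_regions, n_concepts) alignment scores
        return scores.argmax(axis=1), scores.max(axis=1)

    names = ["zebra", "traffic light", "espresso machine"]  # vocabulary chosen at test time
    rng = np.random.default_rng(1)
    labels, confidences = classify_regions(rng.normal(size=(5, 64)), names)
    print([names[i] for i in labels], confidences)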
Through these contributions, this thesis addresses core challenges in
open-world visual perception by developing models that recognize and localize
objects beyond predefined categories. Our work presents a clear progression
in model design, from foundational VLP techniques to advanced OVD frameworks,
providing a basis for future research in dynamic, real-world
settings. We hope these advancements can bridge the gap between artificial
intelligence and human-like perception, offering new insights and
methodologies that drive the development of adaptable, robust systems capable
of interpreting and interacting with diverse and unstructured visual
contexts.
Date: Thursday, 14 November 2024
Time: 4:00pm - 6:00pm
Venue: Room CYT-G001
Lifts 35/36
Committee Members: Prof. Bo Li (Supervisor)
Dr. Wei Wang (Co-supervisor)
Dr. Shuai Wang (Chairperson)
Prof. Ke Yi