Towards Open-World Visual Perception: From Vision-Language Pretraining to Open Vocabulary Object Detection
PhD Thesis Proposal Defence

Title: "Towards Open-World Visual Perception: From Vision-Language Pretraining to Open Vocabulary Object Detection"

by

Mr. Lewei YAO

Abstract:

Visual perception, the ability to interpret and understand visual information, is fundamental to both human cognition and artificial intelligence, playing a crucial role in areas such as autonomous driving, robotics, and augmented reality. While traditional visual perception systems perform effectively in controlled, predefined settings, they often struggle in open-world scenarios that require adaptability to unseen objects and complex contexts. To overcome this limitation, we investigate open-world visual perception, a paradigm designed to enable systems to recognize and understand a wide range of visual concepts without the need for predefined categories. This research focuses on two key areas, Vision-Language Pretraining (VLP) and Open Vocabulary Object Detection (OVD), examining their potential to extend the boundaries of visual perception in unstructured, real-world settings.

Our first contribution is FILIP, a VLP model that introduces a fine-grained contrastive pretraining mechanism to align visual and textual tokens. FILIP’s cross-modal late interaction approach enables token-wise matching between image patches and text words (an illustrative sketch follows the announcement details below), achieving significant performance improvements across multiple tasks and laying a solid foundation for more complex visual perception tasks such as open-vocabulary object detection.

Building on FILIP, we further introduce the DetCLIP series, a suite of models tailored for OVD that enhances object localization in open-domain contexts. The series, comprising DetCLIP, DetCLIPv2, and DetCLIPv3, progressively refines and expands OVD capabilities. Specifically, DetCLIP introduces parallel concept formulation and a curated concept dictionary, achieving strong zero-shot detection performance. Inspired by FILIP, DetCLIPv2 integrates fine-grained word-region alignment and hybrid supervision from large-scale image-text pairs, improving both training efficiency and scalability. DetCLIPv3 further broadens the model’s applicability by incorporating generative captioning and hierarchical labeling, setting new benchmarks in both object detection and dense captioning tasks.

Through these contributions, this thesis addresses core challenges in open-world visual perception by developing models that recognize and localize objects beyond predefined categories. Our work presents a clear progression in model design, from foundational VLP techniques to advanced OVD frameworks, establishing a solid foundation for future research in dynamic, real-world settings. We hope these advancements can bridge the gap between artificial intelligence and human-like perception, offering new insights and methodologies that drive the development of adaptable, robust systems capable of interpreting and interacting within diverse and unstructured visual contexts.

Date: Thursday, 14 November 2024
Time: 4:00pm - 6:00pm
Venue: Room CYT-G001 (Lifts 35/36)

Committee Members:
Prof. Bo Li (Supervisor)
Dr. Wei Wang (Co-supervisor)
Dr. Shuai Wang (Chairperson)
Prof. Ke Yi
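
The sketch below illustrates the kind of token-wise (late-interaction) matching described in the abstract for FILIP: each image patch is scored against every caption word, each patch keeps its best-matching word, and the scores are averaged into a single image-to-text similarity. It is a minimal sketch under assumed names, shapes, and a NumPy backend, not the thesis's implementation.

# Minimal sketch of FILIP-style token-wise late interaction (illustrative only;
# names, shapes, and the NumPy backend are assumptions, not the thesis code).
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def image_to_text_similarity(patch_tokens, word_tokens):
    # patch_tokens: (num_patches, dim) image patch embeddings.
    # word_tokens:  (num_words, dim) text token embeddings.
    sim = l2_normalize(patch_tokens) @ l2_normalize(word_tokens).T  # (num_patches, num_words)
    # Each patch keeps its best-matching word; the per-patch scores are then
    # averaged into one fine-grained image-to-text similarity.
    return sim.max(axis=1).mean()

# Toy usage with random arrays standing in for encoder outputs.
rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 256))  # e.g. a 7x7 grid of patch embeddings
words = rng.standard_normal((12, 256))    # e.g. a 12-token caption
print(image_to_text_similarity(patches, words))

The symmetric text-to-image score (each word keeping its best-matching patch) is computed analogously, and the two similarities can then feed a CLIP-style contrastive objective.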