The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "From query to prompt: Towards Open-World Perception"

By

Mr. Hao ZHANG

Abstract:

Most contemporary perception models are built on Transformer-based architectures, such as DETR for object detection and Mask2Former for image segmentation. Central to these frameworks is the idea of extracting objects from image features through queries, which makes query design critical. In this dissertation, we integrate locality priors into the global attention mechanism through novel query designs in DN-DETR and DINO:

1. formulating queries as anchor boxes;
2. predicting relative object offsets at each decoder layer;
3. an auxiliary denoising task that refines noised boxes back toward ground-truth bounding boxes (see the first sketch below); and
4. strategic query initialization coupled with a selection process.

These designs yield substantial improvements in both performance and training efficiency. As a result, DINO has become one of the strongest detection heads and has been adopted by many top-performing detection models.

In open-world perception, defining objects is a fundamental challenge. Visual prompts are often used to identify objects in open-world settings, and we find that these prompts serve a function similar to queries in closed-set perception. Building on this observation, we introduce Semantic-SAM, a model that integrates visual prompts into the positional component of queries. Trained on the large-scale SA-1B visual prompt dataset, Semantic-SAM achieves performance comparable to SAM. However, using visual prompts directly as queries restricts their format and precludes multi-round interactions, which require memory prompts. To overcome this, we develop SEEM, which incorporates visual prompts through a cross-attention mechanism with queries. SEEM achieved state-of-the-art results in interactive segmentation at the time of its introduction.

As language models advance, language prompts are playing an increasingly important role in computer vision. We introduce OpenSeeD, which uses contrastive learning to align language prompts with queries (see the second sketch below), achieving top performance in zero-shot segmentation. Using a similar contrastive approach, LLaVA-Grounding excels at referring expression comprehension (REC) and referring expression segmentation (RES), outperforming other multi-modal LLMs of comparable size. In addition, SEEM fuses queries with both language and visual prompts via cross-attention. Our proposed techniques, including contrastive matching of queries with prompts and their fusion via cross-attention, are now widely adopted in open-world perception.

In summary, this dissertation advances open-world perception by introducing effective query designs that improve object localization through the integration of locality priors, and by presenting strategies for matching and fusing prompt information with queries, significantly enriching perception research.

Date: Monday, 5 August 2024
Time: 1:00pm - 3:00pm
Venue: Room 5506 (Lifts 25/26)

Chairman: Prof. Bert SHI (ECE)

Committee Members:
Prof. Lionel NI (Supervisor)
Prof. Harry SHUM (Supervisor)
Dr. Qifeng CHEN
Dr. Dan XU
Prof. Ping TAN (ECE)
Dr. Hao SU (UCSD)
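
First sketch: the query-denoising idea summarized in the abstract can be made concrete with a minimal, hypothetical PyTorch sketch, not the thesis code. It jitters ground-truth boxes to produce noised anchor queries that the decoder is trained to reconstruct. The function name, the normalized (cx, cy, w, h) box format, and the noise scale are illustrative assumptions.

    import torch

    def make_denoising_queries(gt_boxes, box_noise_scale=0.4):
        # gt_boxes: (N, 4) ground-truth boxes in normalized (cx, cy, w, h) format.
        # Sample uniform noise in [-1, 1] and scale it by each box's width and
        # height, so small boxes receive proportionally small perturbations.
        noise = (torch.rand_like(gt_boxes) * 2.0 - 1.0) * box_noise_scale
        wh = gt_boxes[:, 2:].repeat(1, 2)  # (w, h, w, h) per box
        noised = gt_boxes + noise * wh * 0.5
        # The decoder receives these noised boxes as anchor queries and is
        # supervised to recover the original gt_boxes (the denoising task).
        return noised.clamp(0.0, 1.0)

Because each noised query has a known target box, its reconstruction loss bypasses bipartite matching, which is what stabilizes and accelerates training in denoising-based detectors.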
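Second sketch: the contrastive alignment of language prompts with queries can likewise be illustrated with an assumed sketch of a symmetric InfoNCE-style loss between decoder query embeddings and text-prompt embeddings. The function name, embedding shapes, and temperature value are assumptions for illustration, not the dissertation's implementation.

    import torch
    import torch.nn.functional as F

    def query_prompt_contrastive_loss(queries, prompts, temperature=0.07):
        # queries: (N, D) decoder query embeddings for N matched objects.
        # prompts: (N, D) language-prompt embeddings; prompts[i] is the
        # positive pair for queries[i], and all other rows act as negatives.
        q = F.normalize(queries, dim=-1)
        p = F.normalize(prompts, dim=-1)
        logits = q @ p.t() / temperature  # (N, N) scaled cosine similarities
        targets = torch.arange(q.size(0), device=q.device)
        # Symmetric cross-entropy over rows and columns pulls each query
        # toward its own prompt and pushes apart mismatched pairs.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

In a sketch like this, the same similarity matrix can be arg-maxed or thresholded at inference to assign open-vocabulary labels to queries, which is what enables zero-shot transfer to unseen categories.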