The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "From query to prompt: Towards Open-World Perception"
By
Mr. Hao ZHANG
Abstract:
The majority of contemporary perception models leverage Transformer-based
architectures, such as DETR for object detection and Mask2Former for image
segmentation. Central to these frameworks is the concept of extracting objects
from image features through the formulation of queries, underscoring the
significance of query design.
In this dissertation, we integrate locality priors into the global attention
mechanism via novel query designs in DN-DETR and DINO. These designs
encompass: 1. formulating queries as anchor boxes; 2. predicting relative
object locations at each decoder layer; 3. an auxiliary denoising task that
refines noised queries toward ground-truth bounding boxes; and 4. strategic
query initialization coupled with a selection process. These advances yield
substantial improvements in both performance and training efficiency; as a
result, DINO is among the strongest detection heads and has been adopted by
many top-performing detection models.
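To make the query design concrete, the following is a minimal PyTorch-style
sketch, not the released DN-DETR/DINO code; every name in it
(AnchorQueryDecoderLayer, make_denoising_queries, and so on) is invented for
exposition. It shows queries carrying explicit anchor boxes that each decoder
layer refines by a relative offset, together with a simplified
denoising-query constructor.

import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    # Stable logit, so box updates can be added in unbounded space.
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class AnchorQueryDecoderLayer(nn.Module):
    # Illustrative sketch: each query carries an anchor box (cx, cy, w, h)
    # that the layer refines by a small relative offset.
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)  # predicts a relative offset

    def forward(self, queries, memory, anchors):
        # Global cross-attention to image features; the anchor supplies
        # the locality prior that plain global attention lacks.
        queries = queries + self.attn(queries, memory, memory)[0]
        # Layer-wise relative refinement, applied in logit space.
        anchors = torch.sigmoid(inverse_sigmoid(anchors) + self.box_head(queries))
        # Detaching lets each layer's refinement be supervised independently,
        # as in iterative-refinement variants of DETR.
        return queries, anchors.detach()

def make_denoising_queries(gt_boxes, noise_scale=0.4):
    # Simplified DN-style auxiliary task: feed noised ground-truth boxes as
    # extra queries and train the decoder to reconstruct the clean boxes.
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * noise_scale
    noise = noise * gt_boxes[:, 2:].repeat(1, 2)  # scale noise by box size
    return (gt_boxes + noise).clamp(0, 1)

The anchor gives attention an explicit spatial prior, while the denoising
branch gives the decoder direct, unambiguous supervision on how to move a box
toward its target, which is what accelerates training.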
In the domain of open-world perception, defining objects is a fundamental
challenge. In computer vision, visual prompts are often used to identify
objects in open-world settings, and we find that these prompts serve a
function similar to queries in closed-set perception. Building on this
observation, we introduce Semantic-SAM, a model that integrates visual
prompts into the positional component of queries. Trained on the extensive
SA-1B visual-prompt dataset, Semantic-SAM achieves performance comparable to
SAM. However, using visual prompts directly as queries restricts the prompt
format and precludes multi-round interaction, which requires memory prompts.
To overcome this, we developed SEEM, which incorporates visual prompts
through a cross-attention mechanism with queries and achieved the best
interactive-segmentation results at the time of its introduction.
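The two prompt-integration strategies just described can be sketched
similarly; the snippet below is an illustrative approximation under assumed
names and shapes, not the Semantic-SAM or SEEM APIs.

import torch
import torch.nn as nn

class VisualPromptAsQuery(nn.Module):
    # Semantic-SAM-style idea (sketch): encode a visual prompt, here a box
    # or a clicked point padded to (cx, cy, w, h), as the positional part of
    # a query, and add a learned content embedding as the semantic part.
    def __init__(self, d_model=256):
        super().__init__()
        self.pos_proj = nn.Linear(4, d_model)    # prompt geometry -> position
        self.content = nn.Embedding(1, d_model)  # learned content embedding

    def forward(self, prompt_boxes):             # (B, N, 4)
        return self.pos_proj(prompt_boxes) + self.content.weight

class PromptCrossAttention(nn.Module):
    # SEEM-style fusion (sketch): queries attend to an arbitrary-length
    # sequence of prompt tokens (visual or memory prompts), removing the
    # format restriction and enabling multi-round interaction.
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, queries, prompt_tokens):
        return queries + self.attn(queries, prompt_tokens, prompt_tokens)[0]

The trade-off is visible in the code: baking the prompt into the query
positions is simple but fixes the prompt format, whereas cross-attention
accepts any number and kind of prompt tokens, including memory prompts
accumulated across interaction rounds.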
As language models advance, language prompts are gaining recognition in
computer vision. We introduced OpenSeeD, a method that uses contrastive
learning to align language prompts with queries, achieving top performance in
zero-shot segmentation. Employing a similar contrastive approach,
LLaVA-Grounding excelled in referring expression comprehension (REC) and
referring expression segmentation (RES), outperforming other multi-modal LLMs
of the same model size. Additionally, SEEM fuses queries with both language
and visual prompts via cross-attention. Our proposed techniques, contrastive
learning for matching queries with prompts and their fusion via
cross-attention, are now widely used in open-world perception.
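The contrastive matching between queries and language prompts amounts to a
CLIP-style objective; the sketch below is simplified from the matching-aware
losses used in the actual models, and all names in it are illustrative.

import torch
import torch.nn.functional as F

def query_text_contrastive_loss(query_emb, text_emb, targets, temperature=0.07):
    # query_emb: (N, D) decoder query embeddings
    # text_emb:  (C, D) embeddings of C class or phrase prompts
    # targets:   (N,)   index of the matching prompt for each query
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = q @ t.T / temperature  # (N, C) cosine similarities
    return F.cross_entropy(logits, targets)

At inference, the same dot product scores each query against the embeddings
of an arbitrary vocabulary, which is what enables zero-shot transfer to
classes unseen during training.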
In summary, this dissertation advances open-world perception by introducing
effective query designs that enhance object localization through the
integration of locality priors, and by presenting strategies for matching and
fusing prompt information with queries, thereby enriching perception
research.
Date: Monday, 5 August 2024
Time: 1:00pm - 3:00pm
Venue: Room 5506
Lifts 25/26
Chairman: Prof. Bert SHI (ECE)
Committee Members: Prof. Lionel NI (Supervisor)
Prof. Harry SHUM (Supervisor)
Dr. Qifeng CHEN
Dr. Dan XU
Prof. Ping TAN (ECE)
Dr. Hao SU (UCSD)