From query to prompt: Towards Open-World Perception

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "From query to prompt: Towards Open-World Perception"

By

Mr. Hao ZHANG


Abstract:

The majority of contemporary perception models leverage Transformer-based 
architectures, such as DETR for object detection and Mask2Former for image 
segmentation. Central to these frameworks is the concept of extracting objects 
from image features through the formulation of queries, underscoring the 
significance of query design.
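
As background, the query mechanism shared by DETR-style detectors and
Mask2Former-style segmenters can be sketched as follows: a fixed set of
learned queries cross-attends to image features, and each updated query is
decoded into one object. The module names and sizes below are illustrative
assumptions, not any particular released implementation:

    import torch
    import torch.nn as nn

    class QueryDecoder(nn.Module):
        def __init__(self, num_queries: int = 100, d_model: int = 256):
            super().__init__()
            # Learned object queries, one potential object per query.
            self.queries = nn.Embedding(num_queries, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.class_head = nn.Linear(d_model, 80 + 1)  # classes + "no object"
            self.box_head = nn.Linear(d_model, 4)         # (cx, cy, w, h)

        def forward(self, image_feats: torch.Tensor):
            """image_feats: (B, HW, D) flattened encoder features."""
            b = image_feats.size(0)
            q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
            # Queries attend to the image features and are refined layer by layer.
            q = self.decoder(tgt=q, memory=image_feats)
            return self.class_head(q), self.box_head(q).sigmoid()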

In this dissertation, we embark on an exploration by integrating locality 
priors into the global attention mechanism via innovative query designs in 
DN-DETR and DINO. These designs encompass: (1) the conceptualization of 
queries as anchor boxes; (2) the prediction of relative object locations at 
each decoder layer; (3) an auxiliary denoising task that trains the decoder 
to refine noised queries back toward ground-truth bounding boxes; and (4) the 
strategic initialization of queries coupled with a selection process. These 
advancements have yielded substantial improvements in both performance and 
training efficiency. As a result, DINO is among the strongest detection heads 
and has been adopted by many top-performing detection models.
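
The auxiliary denoising task (design 3 above) can be sketched as follows.
This is a minimal illustration of the idea, with a hypothetical helper name
and a simplified noise scheme rather than the released DN-DETR code:

    import torch

    def make_noised_queries(gt_boxes: torch.Tensor,
                            box_noise_scale: float = 0.4) -> torch.Tensor:
        """gt_boxes: (N, 4) in normalized (cx, cy, w, h) format."""
        cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
        # Shift centers by up to box_noise_scale * half the box size,
        # and jitter width/height by up to +/- box_noise_scale.
        center_noise = (torch.rand_like(cxcy) * 2 - 1) * wh / 2 * box_noise_scale
        size_noise = 1 + (torch.rand_like(wh) * 2 - 1) * box_noise_scale
        noised = torch.cat([cxcy + center_noise, wh * size_noise], dim=1)
        return noised.clamp(0.0, 1.0)

    # The noised boxes act as anchor-box queries; the auxiliary task trains
    # the decoder to regress them back to gt_boxes (an L1 + GIoU loss in
    # practice), teaching it to refine nearby anchors onto objects.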

In the domain of open-world perception, defining objects presents a 
fundamental challenge. In computer vision, visual prompts are often used to 
identify objects in open-world settings. We found that these prompts serve a 
function similar to that of queries in closed-set perception. Building on 
this observation, we introduced Semantic-SAM, a novel model that integrates 
visual prompts into the positional component of queries. Semantic-SAM, 
trained on the extensive SA-1B visual prompt dataset, achieves performance 
comparable to that of SAM. However, directly using visual prompts as queries 
restricts their format and precludes multi-round interactions, which require 
memory prompts. To overcome this, we developed SEEM, which incorporates 
visual prompts through a cross-attention mechanism with queries. SEEM 
achieved the best results in interactive segmentation at the time of its 
introduction.
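
A minimal sketch of this cross-attention fusion, in the spirit of SEEM, is
given below; the dimensions and module names are illustrative assumptions,
not the released implementation:

    import torch
    import torch.nn as nn

    class PromptCrossAttention(nn.Module):
        def __init__(self, d_model: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, queries: torch.Tensor, prompts: torch.Tensor):
            """queries: (B, Nq, D) object queries; prompts: (B, Np, D) encoded
            visual or memory prompt tokens (e.g. point/box/scribble features)."""
            attended, _ = self.attn(query=queries, key=prompts, value=prompts)
            return self.norm(queries + attended)  # residual update of the queries

    # Because prompts enter through attention rather than replacing the
    # queries, any number and format of prompts (including memory prompts
    # from earlier interaction rounds) can be handled uniformly:
    fuse = PromptCrossAttention()
    updated = fuse(torch.randn(2, 100, 256), torch.randn(2, 5, 256))  # (2, 100, 256)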

As language models have progressed, the importance of language prompts in 
computer vision has become increasingly recognized. We introduced OpenSeeD, a 
method that uses contrastive learning to align language prompts with queries, 
achieving top performance in zero-shot segmentation. Employing a similar 
contrastive approach, LLaVA-Grounding excelled in referring expression 
comprehension (REC) and referring expression segmentation (RES), 
outperforming other multi-modal LLMs of the same model size. Additionally, 
SEEM fuses queries with both language and visual prompts via cross-attention. 
Our proposed techniques, including contrastive learning for matching queries 
with prompts and their fusion via cross-attention, are now widely adopted in 
open-world perception.
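
The contrastive alignment of queries with language prompts can be sketched
as follows; the function name, projection setup, and temperature are
illustrative assumptions rather than any specific released code:

    import torch
    import torch.nn.functional as F

    def query_text_logits(query_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
        """query_emb: (Nq, D) decoder query embeddings; text_emb: (Nc, D)
        embeddings of class-name or phrase prompts from a text encoder.
        Returns (Nq, Nc) similarity logits: classification becomes retrieval
        of the best-matching language prompt instead of a fixed-class head."""
        q = F.normalize(query_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return q @ t.T / temperature

    # Training applies a standard classification loss over these logits with
    # the matched ground-truth category as the positive prompt; at test time,
    # unseen categories are supported simply by encoding new language prompts.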

In summary, this dissertation advances open-world perception by introducing 
effective query designs that enhance object localization through the 
integration of locality priors. It also presents innovative strategies for 
matching and integrating prompt information with queries, significantly 
enriching perception research.


Date:                   Monday, 5 August 2024

Time:                   1:00pm - 3:00pm

Venue:                  Room 5506
                        Lifts 25/26

Chairman:               Prof. Bert SHI (ECE)

Committee Members:      Prof. Lionel NI (Supervisor)
                        Prof. Harry SHUM (Supervisor)
                        Dr. Qifeng CHEN
                        Dr. Dan XU
                        Prof. Ping TAN (ECE)
                        Dr. Hao SU (UCSD)