The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "Vision-Aware Text Features in Referring Expression Segmentation: From
Object Understanding to Context Understanding"
By
Mr. Truong Hai NGUYEN
Abstract:
Referring Expression Segmentation is a challenging task that involves
generating pixel-wise segmentation masks based on natural language
descriptions. Existing methods have relied mostly on visual features to
generate the segmentation masks, treating text features as supporting
components. This under-utilization of text features can lead to suboptimal
results, especially in complex scenarios where text prompts are ambiguous or
context-dependent. To overcome these challenges, we present VATEX, a novel
framework that improves referring image segmentation by enhancing object and
context understanding with Vision-Aware Text Features. Our method uses CLIP
to derive a CLIP Prior that integrates an object-centric visual heatmap with
the text description, which serves as the initial query in a DETR-based
architecture for the segmentation task. Furthermore, observing that there
are multiple ways to describe an instance in an image, we enforce feature
similarity between text variations referring to the same visual input via two
components: a novel Contextual Multimodal Decoder that turns text embeddings
into vision-aware text features, and a Meaning Consistency Constraint that
further ensures a coherent and consistent interpretation of language
expressions, grounded in the context understanding obtained from the image.
Our method achieves significant performance improvements on five benchmark
datasets: RefCOCO, RefCOCO+, and G-Ref for image input, and Ref-DAVIS17 and
Ref-YouTube-VOS for video input.
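The idea of enforcing feature similarity between text variations can be illustrated with a small sketch: paraphrases referring to the same object should map to nearby embeddings, so a consistency penalty can average the pairwise cosine distances between them. The function below is a hypothetical, minimal NumPy illustration of this general idea, not the thesis's actual loss formulation; the name `meaning_consistency_loss` and the toy vectors are assumptions for the example.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def meaning_consistency_loss(text_features):
    # text_features: list of vision-aware text embeddings, one per
    # paraphrase of the same referred object (toy vectors here).
    # Penalize pairwise dissimilarity: mean of (1 - cosine) over pairs.
    loss, pairs = 0.0, 0
    for i in range(len(text_features)):
        for j in range(i + 1, len(text_features)):
            loss += 1.0 - cosine(text_features[i], text_features[j])
            pairs += 1
    return loss / max(pairs, 1)

# Identical paraphrase embeddings incur zero penalty.
a = np.array([1.0, 0.0, 0.0])
print(meaning_consistency_loss([a, a.copy()]))  # → 0.0
```

In practice such a constraint would operate on learned embeddings during training, pulling together the representations of different expressions for the same instance.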
Additionally, we apply VATEX to enhance the explainability of MarineVRS, our
video retrieval system designed for the marine environment. Unlike conventional
systems, MarineVRS includes an Explainability module that outputs segmentation
masks of the objects referred to by the input query. This feature allows users to
identify and isolate specific objects in video footage, leading to more
detailed analysis and understanding of marine species' behavior and movements.
Date: Monday, 12 August 2024
Time: 10:00am - 12:00noon
Venue: Room 5501
Lifts 25/26
Chairman: Prof. Pedro SANDER
Committee Members: Prof. Sai-Kit YEUNG (Supervisor)
Dr. Rob SCHARFF (ISD)