The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Vision-Aware Text Features in Referring Expression Segmentation: From 
Object Understanding to Context Understanding"

By

Mr. Truong Hai NGUYEN


Abstract:

Referring Expression Segmentation is a challenging task that involves 
generating pixel-wise segmentation masks based on natural language 
descriptions. Existing methods have relied mostly on visual features to 
generate the segmentation masks while treating text features as supporting 
components. This under-utilization of text features can lead to suboptimal 
results, especially in complex scenarios where text prompts are ambiguous or 
context-dependent. To overcome these challenges, we present VATEX, a novel 
framework that improves referring image segmentation by enhancing object and 
context understanding with Vision-Aware Text Features. Our method uses CLIP 
to derive a CLIP Prior that integrates an object-centric visual heatmap with 
the text description, which serves as the initial query in a DETR-based 
architecture for the segmentation task. Furthermore, observing that there 
are multiple ways to describe an instance in an image, we enforce feature 
similarity between text variations referring to the same visual input through 
two components: a novel Contextual Multimodal Decoder that turns text 
embeddings into vision-aware text features, and a Meaning Consistency 
Constraint that further ensures a coherent and consistent interpretation of 
language expressions given the context understanding obtained from the image. 
Our method achieves significant performance improvements on five benchmark 
datasets: RefCOCO, RefCOCO+, and G-Ref for image input, and Ref-DAVIS17 and 
Ref-YouTube-VOS for video input.

Additionally, we apply VATEX to enhance the explainability of MarineVRS, our 
video retrieval system designed for the marine environment. Unlike conventional 
systems, MarineVRS includes an Explainability module that outputs segmentation 
masks of the objects referred to by the input query. This feature allows users 
to identify and isolate specific objects in video footage, enabling more 
detailed analysis and understanding of marine species' behavior and movements.


Date:                   Monday, 12 August 2024

Time:                   10:00am - 12:00noon

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Pedro SANDER

Committee Members:      Prof. Sai-Kit YEUNG (Supervisor)
                        Dr. Rob SCHARFF (ISD)