Vision-Aware Text Features in Referring Expression Segmentation: From Object Understanding to Context Understanding
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

MPhil Thesis Defence

Title: "Vision-Aware Text Features in Referring Expression Segmentation: From Object Understanding to Context Understanding"

By

Mr. Truong Hai NGUYEN

Abstract:

Referring Expression Segmentation is a challenging task that involves generating pixel-wise segmentation masks from natural language descriptions. Existing methods rely mostly on visual features to generate the segmentation masks, treating text features as supporting components. This under-utilization of text features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present VATEX, a novel framework that improves referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which serves as the initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that an instance in an image can be described in multiple ways, we enforce feature similarity between text variations referring to the same visual input through two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint that further ensures a coherent and consistent interpretation of language expressions given the context understanding obtained from the image. Our method achieves significant performance improvements on five benchmark datasets: RefCOCO, RefCOCO+, and G-Ref (image input), and Ref-DAVIS17 and Ref-YouTube-VOS (video input). Additionally, we apply VATEX to enhance the explainability of MarineVRS, our video retrieval system designed for the marine environment.
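As a minimal illustration of the consistency idea described in the abstract, the sketch below shows one common way to enforce feature similarity between embeddings of different expressions that refer to the same visual instance: penalizing pairwise cosine dissimilarity. This is a hedged sketch of the general technique, not the thesis's actual Meaning Consistency Constraint; the function name and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def meaning_consistency_loss(text_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical consistency loss over vision-aware text features.

    text_feats: (num_variations, dim) -- embeddings of several expressions
    that all refer to the same object in the image (assumed shape).
    Returns 0 when all variations map to identical directions.
    """
    feats = F.normalize(text_feats, dim=-1)          # unit-normalize each embedding
    sim = feats @ feats.T                            # pairwise cosine similarities
    n = feats.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]  # drop self-similarity terms
    return (1.0 - off_diag).mean()                   # penalize disagreement
```

In training, such a term would typically be added to the segmentation loss with a small weight, pulling paraphrases of the same referring expression toward a shared, context-consistent representation.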
Unlike conventional systems, MarineVRS includes an Explainability module that outputs segmentation masks of the objects referred to by the input query. This feature allows users to identify and isolate specific objects in video footage, enabling more detailed analysis and understanding of marine species' behavior and movements.

Date: Monday, 12 August 2024
Time: 10:00am - 12:00noon
Venue: Room 5501, Lifts 25/26

Chairman: Prof. Pedro SANDER
Committee Members: Prof. Sai-Kit YEUNG (Supervisor)
Dr. Rob SCHARFF (ISD)