Think Before You Segment: Chain-of-Thought Reasoning Segmentation for Image and Video
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

MPhil Thesis Defence

Title: "Think Before You Segment: Chain-of-Thought Reasoning Segmentation for Image and Video"

By

Mr. Shiu-hong KAO

Abstract:

Reasoning segmentation is a challenging vision-language task that aims to output a segmentation mask in response to a complex, implicit, and even non-visual text query. Previous works combine multimodal Large Language Models (MLLMs) with segmentation models to approach this difficult problem. However, their segmentation quality often falls short in complex cases, particularly for out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity to their surroundings. In this thesis, we address these challenging cases by incorporating the chain-of-thought reasoning of MLLMs for both image and video.

For reasoning image segmentation, we introduce ThinkFirst, a training-free framework that prompts GPT-4o to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. ThinkFirst allows users to interact easily with the segmentation agent through multimodal inputs, such as simple text prompts and image scribbles, for successive refinement and communication. Building on ThinkFirst, we further propose ThinkDeeper, a progressive refinement process in which GPT-4o autonomously evaluates the correctness of the reasoning segmentation and performs self-correction.

For reasoning video segmentation, we first introduce two challenging tasks in practical settings: Offline Reasoning Video Instance Segmentation (Offline Reasoning VIS) and Online Reasoning Video Object Segmentation (Online Reasoning VOS), which differ in the availability of future frame observations. Our new, training-free ThinkVideo framework addresses both problems uniformly, whether online or offline. ThinkVideo likewise leverages chain-of-thought reasoning from GPT-4o or LLaVA to select the best keyframes for each instance, and connects them with a reasoning image segmentation model and a video processor to generate mask sequences. When tackling online videos, we use a greedy strategy to periodically update the keyframe.

We evaluate ThinkFirst, ThinkDeeper, and ThinkVideo on diverse datasets. Extensive experiments show that this zero-shot chain-of-thought approach significantly improves upon vanilla reasoning segmentation, both qualitatively and quantitatively, while being less sensitive to user-supplied prompts after Thinking First and Deeper.

Date: Monday, 19 May 2025
Time: 4:00pm - 6:00pm
Venue: Room 2128A (Lift 19)

Chairman: Dr. Dan XU
Committee Members: Prof. Chi-Keung TANG (Supervisor)
                   Prof. Dit-Yan YEUNG
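To make the two-stage "think before you segment" idea from the abstract concrete, the following is a minimal Python sketch, assuming the official OpenAI client for the chain-of-thought stage and a hypothetical seg_assistant.segment() interface standing in for the language-instructed segmentation assistant. The prompt wording and all function names are illustrative assumptions, not the thesis code.

    # A minimal, hypothetical sketch of the two-stage pipeline described
    # in the abstract: (1) an MLLM produces a chain-of-thought description
    # of the image with respect to the implicit query; (2) that summary,
    # rather than the raw query, instructs a segmentation assistant.
    from openai import OpenAI  # official OpenAI Python client

    client = OpenAI()

    def chain_of_thought_description(image_url: str, query: str) -> str:
        # Stage 1: ask GPT-4o to reason step by step about the image and
        # summarize which object to segment and how it looks.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": (f"Think step by step about this image with "
                              f"respect to the query: '{query}'. Then "
                              f"summarize which object should be segmented "
                              f"and describe its appearance, location, and "
                              f"boundaries.")},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        return response.choices[0].message.content

    def think_first_segment(image, image_url: str, query: str, seg_assistant):
        # Stage 2: the assistant (e.g. a LISA-style model; interface is an
        # assumption here) receives an explicit, grounded description
        # instead of the original implicit query.
        description = chain_of_thought_description(image_url, query)
        return seg_assistant.segment(image, instruction=description)

Because both stages are off-the-shelf calls, a pipeline of this shape requires no training, which is consistent with the training-free claim in the abstract.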