The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "Think Before You Segment: Chain-of-Thought Reasoning Segmentation
for Image and Video"
By
Mr. Shiu-hong KAO
Abstract:
Reasoning segmentation is a challenging vision-language task that aims to
output a segmentation mask in response to a complex, implicit, or even
non-visual text query. Previous works combine multimodal Large Language
Models (MLLMs) with segmentation models to approach this difficult problem.
However, their segmentation quality often falls short in complex cases,
particularly when dealing with out-of-domain objects with intricate
structures, blurry boundaries, occlusions, or high similarity to their
surroundings. In this thesis, we address these challenging cases by
incorporating the chain-of-thought reasoning of MLLMs for both image and
video. In the context of reasoning image segmentation, we introduce
ThinkFirst, a training-free framework that enables GPT-4o to generate a
detailed, chain-of-thought description of an image. This summarized
description is then passed to a language-instructed segmentation assistant
to aid the segmentation process. Our ThinkFirst framework allows users to
easily interact with the segmentation agent through multimodal inputs, such
as simple text and image scribbles, for successive refinement or
communication. Building on ThinkFirst, we further propose ThinkDeeper, a
progressive refinement process in which GPT-4o autonomously evaluates the
correctness of the reasoning segmentation output and performs
self-correction. For reasoning video segmentation, we first introduce two
challenging tasks in practical settings: Offline Reasoning Video Instance
Segmentation (Offline Reasoning VIS) and Online Reasoning Video Object
Segmentation (Online Reasoning VOS), which differ in the availability of
future frame observations. Our new, training-free ThinkVideo framework
uniformly addresses both problems, whether online or offline. ThinkVideo
again leverages the chain-of-thought reasoning of GPT-4o or LLaVA to select
the best instance keyframes, and connects them with the reasoning image
segmentation model and a video processor to generate mask sequences. For
online videos, a greedy strategy periodically updates the keyframe. We
evaluate the performance of ThinkFirst, ThinkDeeper, and ThinkVideo on
diverse datasets. Extensive experiments show that this zero-shot CoT
approach significantly improves over vanilla reasoning segmentation, both
qualitatively and quantitatively, while being less sensitive to
user-supplied prompts after Thinking First and Deeper.
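
The control flow described above can be summarized in a short Python sketch.
This is a minimal illustration only; all interfaces (describe_with_cot,
segment, select_keyframe, propagate) are hypothetical placeholders and not
the thesis's actual API.

# Minimal sketch of the ThinkFirst / ThinkVideo control flow. All method
# names on `mllm` and `seg_assistant` are hypothetical placeholders.

def think_first(image, query, mllm, seg_assistant):
    # GPT-4o (or another MLLM) writes a detailed chain-of-thought
    # description of the image, conditioned on the query.
    cot_description = mllm.describe_with_cot(image, query)
    # The summarized description is passed to a language-instructed
    # segmentation assistant, which produces the mask.
    return seg_assistant.segment(image, query, context=cot_description)

def think_video_online(frames, query, mllm, seg_assistant, period=10):
    # Online setting: only frames observed so far are available. A greedy
    # strategy refreshes the keyframe every `period` frames.
    masks, keyframe_mask = [], None
    for t, frame in enumerate(frames):
        if t % period == 0:
            # CoT reasoning selects the current best instance keyframe.
            keyframe = mllm.select_keyframe(frames[:t + 1], query)
            keyframe_mask = think_first(keyframe, query, mllm, seg_assistant)
        # A video processor propagates the keyframe mask to this frame.
        masks.append(seg_assistant.propagate(keyframe_mask, frame))
    return masks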
Date: Monday, 19 May 2025
Time: 4:00pm - 6:00pm
Venue: Room 2128A (Lift 19)
Chairman: Dr. Dan XU
Committee Members: Prof. Chi-Keung TANG (Supervisor)
Prof. Dit-Yan YEUNG