Think Before You Segment: Chain-of-Thought Reasoning Segmentation for Image and Video

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Think Before You Segment: Chain-of-Thought Reasoning Segmentation 
for Image and Video"

By

Mr. Shiu-hong KAO


Abstract:

Reasoning segmentation is a challenging vision-language task that aims to 
output a segmentation mask in response to a complex, implicit, or even 
non-visual text query. Previous works combine multimodal Large Language 
Models (MLLMs) with segmentation models to approach this difficult problem. 
However, their segmentation quality often falls short in complex cases, 
particularly for out-of-domain objects with intricate structures, blurry 
boundaries, occlusions, or high similarity to their surroundings. In this 
thesis, we address these challenging cases by incorporating MLLM 
chain-of-thought (CoT) reasoning for both image and video. For reasoning 
image segmentation, we introduce ThinkFirst, a training-free framework that 
prompts GPT-4o to generate a detailed, chain-of-thought description of an 
image. This summarized description is then passed to a language-instructed 
segmentation assistant to guide the segmentation process. ThinkFirst allows 
users to easily interact with the segmentation agent using multimodal 
inputs, such as simple text and image scribbles, for successive refinement 
or communication. Building on ThinkFirst, we further propose ThinkDeeper, a 
progressive refinement process in which GPT-4o autonomously evaluates the 
correctness of the reasoning segmentation and performs self-correction. For 
reasoning video segmentation, we first introduce two challenging tasks in 
practical settings, which differ in the availability of future frame 
observations: Offline Reasoning Video Instance Segmentation (Offline 
Reasoning VIS) and Online Reasoning Video Object Segmentation (Online 
Reasoning VOS). Our new, training-free ThinkVideo framework addresses both 
problems uniformly, whether online or offline. Like ThinkFirst, ThinkVideo 
leverages chain-of-thought reasoning from GPT-4o or LLaVA to select the 
best instance keyframes, and connects them with the reasoning image 
segmentation model and a video processor to generate mask sequences. For 
online videos, we use a greedy strategy to periodically update the 
keyframe. We evaluate ThinkFirst, ThinkDeeper, and ThinkVideo on diverse 
datasets. Extensive experiments show that this zero-shot CoT approach 
significantly improves vanilla reasoning segmentation, both qualitatively 
and quantitatively, while being less sensitive to user-supplied prompts 
after Thinking First and Deeper.
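
For illustration only, the following minimal Python sketch mirrors the 
ThinkFirst pipeline described above: GPT-4o is queried for a 
chain-of-thought description of the image, which is then handed, together 
with the original query, to a language-instructed segmentation assistant. 
It assumes the OpenAI Python client; the assistant object, its segment() 
method, and score_keyframe below are hypothetical placeholders, not the 
thesis's actual implementation.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chain_of_thought_description(image_path: str, query: str) -> str:
    """'Think First': ask GPT-4o for a detailed, step-by-step description
    of the image, focused on evidence relevant to the query."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Think step by step and describe this image in "
                         f"detail, focusing on: {query}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def think_first_segment(image_path: str, query: str, assistant):
    """Feed the summarized CoT description, plus the original query, to a
    language-instructed segmentation assistant (hypothetical interface)."""
    description = chain_of_thought_description(image_path, query)
    return assistant.segment(image_path, prompt=f"{description}\n{query}")

In the same illustrative spirit, a toy version of the greedy online 
keyframe update in ThinkVideo: every few frames the incoming frame is 
re-scored (here by a hypothetical CoT-based scorer) and replaces the 
current keyframe whenever it scores higher.

def online_keyframes(frames, score_keyframe, period=30):
    """Greedy keyframe selection for online video: periodically re-score
    the incoming frame and keep the best-scoring one seen so far."""
    keyframe, best = None, float("-inf")
    for t, frame in enumerate(frames):
        if keyframe is None or t % period == 0:
            score = score_keyframe(frame)  # hypothetical CoT-based scorer
            if score > best:
                keyframe, best = frame, score
        yield keyframe  # keyframe used when segmenting frame t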


Date:                   Monday, 19 May 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 2128A
                        Lift 19

Chairman:               Dr. Dan XU

Committee Members:      Prof. Chi-Keung TANG (Supervisor)
                        Prof. Dit-Yan YEUNG