
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Image and Video Instance Segmentation: Towards Better Quality,
Robustness and Annotation-efficiency"

By

Mr. Lei KE


Abstract:

Instance segmentation is a fundamental task in computer vision with many
real-world applications, such as image/video editing, robotic perception,
self-driving and medical imaging. Various image/video instance segmentation
approaches have been proposed with remarkable progress. However, three major
challenges still significantly degrade their performance in complex real-world
environments: 1) The mask quality of existing image/video instance segmentation
methods is unsatisfactory, with over-smoothed boundaries. 2) Segmentation
robustness on heavily occluded instances, as well as temporal robustness across
video frames, still leaves significant room for improvement. 3) Mask annotation
is tedious, especially for videos, which constrains the scale and category
diversity of existing benchmarks. In this thesis, we set out to address these
three challenges:

For high-quality image-based instance segmentation, we present Mask Transfiner
and propose the concept of Incoherent Regions. Instead of operating on regular
dense tensors, Mask Transfiner decomposes and represents the image regions as a
quadtree and corrects only the sparse error-prone areas. This allows Mask
Transfiner to predict highly accurate instance masks efficiently and at low
computational cost. For better video segmentation quality, we further design
VMT, the first high-quality video instance segmentation (VIS) method, which
builds on Mask Transfiner and is capable of leveraging high-resolution features
thanks to a highly efficient video transformer structure.
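
As a rough illustration of the incoherent-region idea (a minimal sketch, not the
thesis implementation; the function name and the down/up-sampling test are
assumptions made for this sketch), error-prone pixels can be detected as those
whose mask value flips after a coarse down/up-sampling round trip, so that
refinement only needs to touch a sparse set of points:

    import torch
    import torch.nn.functional as F

    def incoherent_regions(mask_logits: torch.Tensor, factor: int = 2) -> torch.Tensor:
        """mask_logits: (N, 1, H, W) coarse mask logits -> binary map of incoherent pixels."""
        mask = (mask_logits.sigmoid() > 0.5).float()
        coarse = F.interpolate(mask, scale_factor=1.0 / factor,
                               mode="bilinear", align_corners=False)
        coarse_up = F.interpolate(coarse, size=mask.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # A pixel whose binary value changes after the down/up round trip is
        # "incoherent", i.e. its prediction is unreliable and worth refining.
        return ((coarse_up > 0.5) != (mask > 0.5)).float()

In Mask Transfiner these sparse points are organized into a quadtree and
refined; the sketch above only illustrates the detection step.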

For enhancing segmentation robustness under heavy occlusions, we propose BCNet,
a simple bilayer decoupling network for explicit occluder-occludee modeling. We
extensively investigate the efficacy of the bilayer structure using FCN, GCN
and ViT network architectures. To further promote temporal robustness, we
present the Prototypical Cross-Attention Network (PCAN), which leverages rich
spatio-temporal information for online multiple object tracking and
segmentation. PCAN first distills a space-time memory into a set of prototypes
and then employs cross-attention to retrieve rich information from past
frames.
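
The prototype distillation and read-out can be pictured roughly as follows (a
minimal sketch, not PCAN's actual implementation; the soft k-means
initialization, feature shapes and function names are assumptions made here):

    import torch

    def distill_prototypes(memory: torch.Tensor, num_protos: int = 8,
                           iters: int = 3) -> torch.Tensor:
        """memory: (N, C) stacked past-frame features -> (num_protos, C) prototypes."""
        protos = memory[torch.randperm(memory.shape[0])[:num_protos]]  # random init
        for _ in range(iters):                                         # soft k-means (EM-style)
            assign = (memory @ protos.t()).softmax(dim=1)              # (N, K) soft assignment
            protos = (assign.t() @ memory) / (assign.sum(dim=0).unsqueeze(1) + 1e-6)
        return protos

    def read_prototypes(query: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
        """query: (M, C) current-frame features; cross-attend to the prototypes."""
        attn = (query @ protos.t() / protos.shape[1] ** 0.5).softmax(dim=1)
        return attn @ protos                                           # (M, C) retrieved features

Attending to a handful of prototypes rather than the full space-time memory is
what keeps the read-out cheap for online tracking and segmentation.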

For promoting annotation efficiency, as video mask labels are tedious and
expensive to annotate, we design MaskFreeVIS to remove the mask-annotation
requirement. We leverage the rich temporal mask consistency constraints in
videos by introducing the Temporal KNN-patch Loss (TK-Loss), which provides
strong mask supervision without any mask annotations. The TK-Loss finds
one-to-many matches across frames through an efficient patch-matching step
followed by a K-nearest-neighbor selection, and then enforces a consistency
loss on the found matches. MaskFreeVIS drastically narrows the gap between
fully supervised and weakly supervised VIS performance.
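
The core of the TK-Loss can be sketched as follows (illustrative only and much
simplified: a single instance and global patch matching are assumed here,
whereas the thesis uses an efficient windowed patch search; shapes and names
are placeholders):

    import torch
    import torch.nn.functional as F

    def tk_loss(img_t, img_t1, mask_t, mask_t1, patch=3, k=5):
        """img_*: (1, 3, H, W) adjacent frames; mask_*: (1, 1, H, W) predicted mask probabilities."""
        # Patch descriptors around every pixel of both frames.
        p_t  = F.unfold(img_t,  patch, padding=patch // 2).transpose(1, 2)  # (1, H*W, 3*p*p)
        p_t1 = F.unfold(img_t1, patch, padding=patch // 2).transpose(1, 2)
        dist = torch.cdist(p_t, p_t1).squeeze(0)                            # (H*W, H*W) patch distances
        knn = dist.topk(k, dim=1, largest=False).indices                    # K nearest matches per pixel
        m_t  = mask_t.flatten()                                             # (H*W,)
        m_t1 = mask_t1.flatten()
        # Consistency: matched locations should carry the same mask probability.
        return (m_t.unsqueeze(1) - m_t1[knn]).abs().mean()

In practice the pairwise-distance computation would be restricted to a local
search window around each pixel to keep the matching cost manageable.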


Date: 			Thursday, 25 May 2023

Time: 			10:00am - 12:00noon

Venue: 			Room 5566
			lifts 27/28

Chairperson: 		Prof. Song LIN (MARK)

Committee Members: 	Prof. Chi Keung TANG (Supervisor)
			Prof. Dan XU
			Prof. Dit Yan YEUNG
			Prof. Ping TAN (ECE)
			Prof. Jinwei GU (CUHK)


**** ALL are Welcome ****