Image and Video Instance Segmentation: Towards Better Quality, Robustness and Annotation-efficiency
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Image and Video Instance Segmentation: Towards Better Quality, Robustness and Annotation-efficiency"

By

Mr. Lei KE

Abstract:

Instance segmentation is a fundamental task in computer vision with many real-world applications, such as image/video editing, robotic perception, self-driving and medical imaging. Various image/video instance segmentation approaches have been proposed with remarkable progress. However, three major challenges still significantly degrade their performance in complex real-world environments: 1) the mask quality predicted by existing image/video instance segmentation methods is unsatisfactory, with over-smoothed boundaries; 2) segmentation robustness on heavily occluded instances, and temporal robustness across video frames, still leave significant room for improvement; 3) mask annotation is tedious, especially in videos, which constrains the scale and category diversity of existing benchmarks. In this thesis, we set out to solve these three challenges.

For high-quality image-based instance segmentation, we present Mask Transfiner and propose the concept of Incoherent Regions. Instead of operating on regular dense tensors, Mask Transfiner decomposes and represents the image regions as a quadtree and corrects only sparse error-prone areas. This allows Mask Transfiner to predict highly accurate instance masks efficiently and at low computational cost. For better video segmentation quality, building on Mask Transfiner, we further design VMT, the first high-quality video instance segmentation (VIS) method, capable of leveraging high-resolution features thanks to a highly efficient video transformer structure.

For enhancing segmentation robustness under heavy occlusions, we propose BCNet, a simple bilayer decoupling network for explicit occluder-occludee modeling. We extensively investigate the efficacy of the bilayer structure using FCN, GCN and ViT network architectures. To promote temporal robustness, we present the Prototypical Cross-Attention Network (PCAN), which leverages rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from past frames.

For promoting annotation efficiency, as video mask labels are tedious and expensive to annotate, we design MaskFreeVIS to remove the mask-annotation requirement. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), which provides strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. MaskFreeVIS drastically narrows the gap between fully and weakly supervised VIS performance.

Date: Thursday, 25 May 2023
Time: 10:00am - 12:00noon
Venue: Room 5566 (lifts 27/28)

Chairperson: Prof. Song LIN (MARK)
Committee Members: Prof. Chi Keung TANG (Supervisor)
                   Prof. Dan XU
                   Prof. Dit Yan YEUNG
                   Prof. Ping TAN (ECE)
                   Prof. Jinwei GU (CUHK)

**** ALL are Welcome ****
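
Note: the abstract describes the TK-Loss as an efficient patch-matching step followed by a K-nearest neighbor selection, with a consistency loss enforced on the matches. Below is a minimal, hypothetical PyTorch sketch of that general idea; the function name tk_consistency_loss, the tensor shapes, and the global (rather than locally windowed) matching are assumptions for illustration only, not the thesis implementation.

```python
# Illustrative sketch (not the authors' code) of a temporal KNN-patch
# consistency loss in the spirit of the TK-Loss described above:
# for each patch in frame t, find its K most similar patches in frame t+1
# and penalise disagreement between the predicted mask values at the
# matched locations. All tensor names and shapes are assumptions.

import torch
import torch.nn.functional as F


def tk_consistency_loss(feat_t, feat_t1, mask_t, mask_t1, patch=3, k=5):
    """feat_*: (C, H, W) frame features; mask_*: (H, W) predicted mask
    probabilities in [0, 1]. Returns a scalar consistency loss."""

    # Unfold both frames into flattened patch descriptors: (H*W, C*patch*patch).
    def to_patches(feat):
        p = F.unfold(feat.unsqueeze(0), kernel_size=patch, padding=patch // 2)
        return p.squeeze(0).t()

    p_t, p_t1 = to_patches(feat_t), to_patches(feat_t1)

    # Pairwise patch distances and one-to-many K-nearest-neighbor matches.
    dist = torch.cdist(p_t, p_t1)                   # (H*W, H*W)
    knn_idx = dist.topk(k, largest=False).indices   # (H*W, k)

    # Enforce agreement between each mask value and its K matched values.
    m_t = mask_t.flatten().unsqueeze(1)             # (H*W, 1)
    m_t1 = mask_t1.flatten()[knn_idx]               # (H*W, k)
    return (m_t - m_t1).abs().mean()
```

In practice, an approach like this would presumably restrict matching to a local spatial window around each patch and discard low-confidence matches, rather than computing all pairwise distances as done here for brevity.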