Mask-Free Video Instance Segmentation [CVPR 2023]
This is the official PyTorch implementation of MaskFreeVIS, built on the open-source Detectron2. We aim to remove the need for expensive video masks, and even image masks, when training VIS models. Our project website has more information, including visual video comparisons: vis.xyz/pub/maskfreevis.
Mask-Free Video Instance Segmentation
Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
CVPR 2023
The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance.
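To make the three TK-Loss steps named in the abstract more concrete (patch matching, K-nearest-neighbor selection, consistency loss), here is a minimal pure-Python sketch on tiny grayscale frames. It is not the repository's implementation: the function names, the default search radius, patch size, K, and distance threshold, and the exact form of the consistency term are all assumptions chosen for illustration.

```python
import math


def _clamp(v, lo, hi):
    return max(lo, min(hi, v))


def patch_distance(frame_a, frame_b, pa, pb, half=1):
    """Sum of squared differences between the patches centered at pa and pb.

    Coordinates outside the image are clamped to the border (an assumption).
    """
    h, w = len(frame_a), len(frame_a[0])
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            ya, xa = _clamp(pa[0] + dy, 0, h - 1), _clamp(pa[1] + dx, 0, w - 1)
            yb, xb = _clamp(pb[0] + dy, 0, h - 1), _clamp(pb[1] + dx, 0, w - 1)
            total += (frame_a[ya][xa] - frame_b[yb][xb]) ** 2
    return total


def knn_matches(frame_a, frame_b, p, radius=2, k=3, max_dist=0.05, half=1):
    """One-to-many matching: up to k closest patches in frame_b near location p."""
    h, w = len(frame_b), len(frame_b[0])
    candidates = []
    for y in range(max(0, p[0] - radius), min(h, p[0] + radius + 1)):
        for x in range(max(0, p[1] - radius), min(w, p[1] + radius + 1)):
            d = patch_distance(frame_a, frame_b, p, (y, x), half)
            if d < max_dist:  # discard poor matches before ranking
                candidates.append((d, (y, x)))
    candidates.sort(key=lambda c: c[0])
    return [q for _, q in candidates[:k]]


def consistency_loss(m_p, m_q, eps=1e-6):
    """Low when two mask probabilities agree (both near 1 or both near 0).

    This particular form is an assumed choice for the sketch.
    """
    agree = m_p * m_q + (1.0 - m_p) * (1.0 - m_q)
    return -math.log(max(agree, eps))


def tk_loss(frame_a, frame_b, mask_a, mask_b, **kw):
    """Average consistency loss over all matched location pairs."""
    total, count = 0.0, 0
    for y in range(len(frame_a)):
        for x in range(len(frame_a[0])):
            for qy, qx in knn_matches(frame_a, frame_b, (y, x), **kw):
                total += consistency_loss(mask_a[y][x], mask_b[qy][qx])
                count += 1
    return total / count if count else 0.0
```

Note that nothing in this sketch has trainable parameters and no mask labels are consulted: the frames alone decide which locations are matched, and the loss only asks the predicted masks to agree on those matches.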
Please see Getting Started with Detectron2 for full usage.
pip install -r requirements.txt
After preparing the required environment, run the following commands to compile the CUDA kernel for MSDeformAttn. CUDA_HOME must be defined and point to the directory of the installed CUDA toolkit.
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
To build on a system that does not have a GPU device but provides the drivers:
TORCH_CUDA_ARCH_LIST='8.0' FORCE_CUDA=1 python setup.py build install
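Since the build depends on CUDA_HOME, a quick pre-flight check can save a failed compile. The helper below is a hypothetical sketch, not part of this repository: `check_cuda_home` is an illustrative name, and looking for `bin/nvcc` assumes a typical Linux CUDA toolkit layout.

```python
import os
from pathlib import Path


def check_cuda_home(env=None):
    """Return a list of problems found; an empty list means the checks passed."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return ["CUDA_HOME is not set"]
    root = Path(cuda_home)
    if not root.is_dir():
        return [f"CUDA_HOME={cuda_home} is not a directory"]
    problems = []
    # Assumption: the toolkit ships its compiler at <CUDA_HOME>/bin/nvcc.
    if not (root / "bin" / "nvcc").exists():
        problems.append(f"no nvcc found under {root / 'bin'}")
    return problems
```

Calling `check_cuda_home()` with no arguments inspects the current environment; passing a dict makes it easy to try out different settings without changing your shell.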
Example conda environment setup:
conda create --name maskfreevis python=3.8 -y
conda activate maskfreevis
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python
# under your working directory
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
git clone https://github.com/SysCV/MaskFreeVIS.git
cd MaskFreeVIS
pip install -r requirements.txt
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
Please see the document here.
Using COCO image masks without YTVIS video masks during training:
Config Name | Backbone | AP | download | Training Script | COCO Init Weight |
---|---|---|---|---|---|
MaskFreeVIS | R50 | 46.6 | model | script | Init |
MaskFreeVIS | R101 | 49.1 | model | script | Init |
MaskFreeVIS | Swin-L | 56.0 | model | script | Init |
The two training settings below do not use pseudo COCO image masks for joint video training. For these settings, please change to the following folder:
cd mfvis_nococo
Config Name | Backbone | AP | download | Training Script | COCO Init Weight |
---|---|---|---|---|---|
MaskFreeVIS | R50 | 43.8 | model | script | Init |
MaskFreeVIS | R101 | 47.3 | model | script | Init |
Config Name | Backbone | AP | download | Training Script | COCO Box Init Weight |
---|---|---|---|---|---|
MaskFreeVIS | R50 | 42.5 | model | script | Init |
Please see our script folder.
First download the provided trained models from our model zoo tables and put them into the mfvis_models folder.
mkdir mfvis_models
Refer to our scripts folder for more commands:
Example evaluation scripts:
bash scripts/eval_8gpu_mask2former_r50_video.sh
bash scripts/eval_8gpu_mask2former_r101_video.sh
bash scripts/eval_8gpu_mask2former_swinl_video.sh
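If you evaluate several backbones from Python, a thin wrapper can pick the matching script. This is a hypothetical convenience sketch: `EVAL_SCRIPTS`, `eval_command`, and `run_eval` are illustrative names that do not exist in the repository, and only the three script paths are taken from the commands listed here. The script names suggest an 8-GPU setup, which the wrapper does not verify.

```python
import subprocess

# Script paths as listed in this README, keyed by a normalized backbone name.
EVAL_SCRIPTS = {
    "r50": "scripts/eval_8gpu_mask2former_r50_video.sh",
    "r101": "scripts/eval_8gpu_mask2former_r101_video.sh",
    "swinl": "scripts/eval_8gpu_mask2former_swinl_video.sh",
}


def eval_command(backbone):
    """Map a backbone name such as 'R50' or 'Swin-L' to its bash command."""
    key = backbone.lower().replace("-", "")
    if key not in EVAL_SCRIPTS:
        raise ValueError(f"unknown backbone {backbone!r}; expected one of {sorted(EVAL_SCRIPTS)}")
    return ["bash", EVAL_SCRIPTS[key]]


def run_eval(backbone):
    """Run the evaluation script from the repository root; raises on failure."""
    return subprocess.run(eval_command(backbone), check=True)
```

For example, `run_eval("Swin-L")` would launch the Swin-L evaluation script, assuming the current working directory is the repository root.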
Example visualization script:
bash scripts/visual_video.sh
If you find MaskFreeVIS useful in your research or refer to the provided baseline results, please star :star: this repository and consider citing :pencil::
@inproceedings{maskfreevis,
author={Ke, Lei and Danelljan, Martin and Ding, Henghui and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
title={Mask-Free Video Instance Segmentation},
booktitle = {CVPR},
year = {2023}
}