Code release for "STMask: Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation" (CVPR 2021)
This repository contains the implementation of our CVPR 2021 paper.
Clone this repository and enter it:
git clone https://github.com/MinghanLi/STMask.git
cd STMask
Set up the environment using one of the following methods:
conda env create -f environment.yml
# Cython needs to be installed before pycocotools
pip install cython
pip install opencv-python pillow pycocotools matplotlib
Install mmcv and mmdet
pip install mmcv-full==1.1.2 -f https://download.openmmlab.com/mmcv/dist/cu101/torch1.5.0/index.html
pip install "git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI"
git clone https://github.com/youtubevos/cocoapi
cd cocoapi/PythonAPI
# To compile and install locally
python setup.py build_ext --inplace
# To install library to Python site-packages
python setup.py build_ext install
Install spatial-correlation-sampler
pip install spatial-correlation-sampler
Compile DCNv2 code (see Installation)
git clone https://github.com/CharlesShang/DCNv2.git
cd DCNv2
python setup.py build develop
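After the installation steps above, a quick import check can catch a missing package before evaluation or training. The following is a minimal sketch, not part of the repository: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED` are assumptions mapped from the pip package names (for example, `opencv-python` is imported as `cv2` and `pillow` as `PIL`).

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED = [
    "Cython",
    "cv2",                          # opencv-python
    "PIL",                          # pillow
    "pycocotools",
    "matplotlib",
    "mmcv",                         # mmcv-full
    "spatial_correlation_sampler",  # spatial-correlation-sampler
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

`find_spec` only locates a top-level module without importing it, so the check does not verify that compiled extensions load correctly.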
Modify mmcv/ops/deform_conv.py to handle deformable convolution with different height and width (like 3 * 5) in FCB(ali) or FCB(ada)
vim /your_conda_env_path/mmcv/ops/deform_conv.py
ext_module.deform_conv_forward(
input,
weight,
offset,
output,
ctx.bufs_[0],
ctx.bufs_[1],
kW=weight.size(3),
kH=weight.size(2),
dW=ctx.stride[1],
dH=ctx.stride[0],
padW=ctx.padding[0],
padH=ctx.padding[1],
dilationW=ctx.dilation[1],
dilationH=ctx.dilation[0],
group=ctx.groups,
deformable_group=ctx.deform_groups,
im2col_step=cur_im2col_step)
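Why the height and width arguments must be paired with the right axis for a non-square kernel such as 3 * 5 can be seen with the standard convolution output-size formula. This is an arithmetic sketch only, not mmcv's code; the helper names `conv_output_size` and `conv_output_hw` are hypothetical, and all tuples are written as (height, width) by assumption.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis of a standard convolution."""
    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv_output_hw(in_hw, kernel_hw, stride_hw=(1, 1),
                   padding_hw=(0, 0), dilation_hw=(1, 1)):
    """Apply the one-axis formula to height and width independently."""
    return tuple(
        conv_output_size(i, k, s, p, d)
        for i, k, s, p, d in zip(in_hw, kernel_hw, stride_hw,
                                 padding_hw, dilation_hw)
    )


in_hw = (360, 640)   # input size used on the VIS benchmarks
kernel_hw = (3, 5)   # non-square kernel: height 3, width 5

# Padding matched to each axis keeps the spatial size unchanged.
matched = conv_output_hw(in_hw, kernel_hw, padding_hw=(1, 2))

# Swapping the two padding values changes the output shape.
swapped = conv_output_hw(in_hw, kernel_hw, padding_hw=(2, 1))

print(matched)  # (360, 640)
print(swapped)  # (362, 638)
```

With a square kernel both paddings are equal, so a swapped pair goes unnoticed; the mismatch only appears once height and width differ.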
The input size on all VIS benchmarks is 360*640 here.
Here are our STMask models (released in April 2021) along with their FPS on a 2080Ti and mAP on the validation set, where mAP and mAP* are obtained under cross-class fast NMS and fast NMS, respectively.
Note that FCB(ali) and FCB(ada) are only executed on the classification branch.
Backbone | FCA | FCB | TF | FPS | mAP | mAP* | Weights |
---|---|---|---|---|---|---|---|
R50-DCN-FPN | FCA | - | TF | 29.3 | 32.6 | 33.4 | STMask_plus_resnet50.pth |
R50-DCN-FPN | FCA | FCB(ali) | TF | 27.8 | - | 32.1 | STMask_plus_resnet50_ali.pth |
R50-DCN-FPN | FCA | FCB(ada) | TF | 28.6 | 32.8 | 33.0 | STMask_plus_resnet50_ada.pth |
R101-DCN-FPN | FCA | - | TF | 24.5 | 36.0 | 36.3 | STMask_plus_base.pth |
R101-DCN-FPN | FCA | FCB(ali) | TF | 22.1 | 36.3 | 37.1 | STMask_plus_base_ali.pth |
R101-DCN-FPN | FCA | FCB(ada) | TF | 23.4 | 36.8 | 37.9 | STMask_plus_base_ada.pth |
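The difference between the two NMS settings behind the mAP and mAP* columns can be illustrated as follows. This is a minimal pure-Python sketch of the general idea of fast NMS (a detection is dropped if it overlaps any higher-scoring detection above a threshold), not the code used in this repository. The function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fast_nms(boxes, scores, iou_threshold=0.5):
    """Return kept indices in descending score order.

    A box is kept only if its maximum IoU with every higher-scoring box
    (whether or not that box was itself suppressed) is at most the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for pos, j in enumerate(order):
        max_iou = max((iou(boxes[order[r]], boxes[j]) for r in range(pos)),
                      default=0.0)
        if max_iou <= iou_threshold:
            keep.append(j)
    return keep


def per_class_fast_nms(boxes, scores, labels, iou_threshold=0.5):
    """Run fast NMS separately for each class label."""
    keep = []
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        kept = fast_nms([boxes[i] for i in idx],
                        [scores[i] for i in idx], iou_threshold)
        keep.extend(idx[k] for k in kept)
    return sorted(keep, key=lambda i: scores[i], reverse=True)


def cross_class_fast_nms(boxes, scores, labels, iou_threshold=0.5):
    """Run fast NMS once over all detections, ignoring class labels."""
    return fast_nms(boxes, scores, iou_threshold)


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
labels = [0, 1, 0]
print(per_class_fast_nms(boxes, scores, labels))    # [0, 1, 2]
print(cross_class_fast_nms(boxes, scores, labels))  # [0, 2]
```

In this toy example the first two boxes overlap heavily but carry different labels, so only the cross-class variant removes the second one.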

YTVIS2021 models:

Backbone | FCA | FCB | TF | mAP* | Weights | Results |
---|---|---|---|---|---|---|
R50-DCN-FPN | FCA | - | TF | 30.6 | STMask_plus_resnet50_YTVIS2021.pth | - |
R50-DCN-FPN | FCA | FCB(ada) | TF | 31.1 | STMask_plus_resnet50_ada_YTVIS2021.pth | stdout.txt |
R101-DCN-FPN | FCA | - | TF | 33.7 | STMask_plus_base_YTVIS2021.pth | - |
R101-DCN-FPN | FCA | FCB(ada) | TF | 34.6 | STMask_plus_base_ada_YTVIS2021.pth | stdout.txt |

OVIS models:

Backbone | FCA | FCB | TF | mAP* | Weights | Results |
---|---|---|---|---|---|---|
R50-DCN-FPN | FCA | - | TF | 15.4 | STMask_plus_resnet50_OVIS.pth | - |
R50-DCN-FPN | FCA | FCB(ada) | TF | 15.4 | STMask_plus_resnet50_ada_OVIS.pth | stdout.txt |
R101-DCN-FPN | FCA | - | TF | 17.3 | STMask_plus_base_OVIS.pth | stdout.txt |
R101-DCN-FPN | FCA | FCB(ada) | TF | 15.8 | STMask_plus_base_ada_OVIS.pth | - |
To evaluate a model, put the corresponding weights file in the ./weights directory and run one of the following commands. The name of each config is everything before the numbers in the file name (e.g., STMask_plus_base for STMask_plus_base.pth).
All STMask models here are trained from yolact_plus_base_54_80000.pth or yolact_plus_resnet_54_80000.pth, released with Yolact++.
We also provide quantitative results of Yolact++ with our proposed feature calibration for anchors and boxes on COCO (without the temporal fusion module). Here are the results on the COCO validation set.
Image Size | Backbone | FCA | FCB | B_AP | M_AP | Weights |
---|---|---|---|---|---|---|
[550,550] | R50-DCN-FPN | FCA | - | 34.5 | 32.9 | yolact_plus_resnet50_54.pth |
[550,550] | R50-DCN-FPN | FCA | FCB(ali) | 34.6 | 33.3 | yolact_plus_resnet50_ali_54.pth |
[550,550] | R50-DCN-FPN | FCA | FCB(ada) | 34.7 | 33.2 | yolact_plus_resnet50_ada_54.pth |
[550,550] | R101-DCN-FPN | FCA | - | 35.7 | 33.3 | yolact_plus_base_54.pth |
[550,550] | R101-DCN-FPN | FCA | FCB(ali) | 35.6 | 34.1 | yolact_plus_base_ali_54.pth |
[550,550] | R101-DCN-FPN | FCA | FCB(ada) | 36.4 | 34.8 | yolact_plus_baseada_54.pth |
# Output a YTVOSEval json to submit to the website.
# This command will create './weights/results.json' for instance segmentation.
python eval.py --config=STMask_plus_base_ada_config --trained_model=weights/STMask_plus_base_ada.pth --mask_det_file=weights/results.json
# Display visual segmentation results.
python eval.py --config=STMask_plus_base_ada_config --trained_model=weights/STMask_plus_base_ada.pth --mask_det_file=weights/results.json --display
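The config-naming rule used by the evaluation commands can be sketched in a few lines. This is a hypothetical helper, not part of the repository. It assumes that the config name is the weights file stem with any trailing `_<epoch>_<iter>` removed and `_config` appended, which matches the commands shown here, and it only uses the flags that appear in those commands.

```python
import re
from pathlib import Path


def config_name(weights_path):
    """Derive the config name from a weights file name (assumed rule)."""
    stem = Path(weights_path).stem
    # Drop a trailing "_<epoch>_<iter>" suffix if one is present.
    stem = re.sub(r"_\d+_\d+$", "", stem)
    return f"{stem}_config"


def build_eval_command(weights_path, results_json="weights/results.json",
                       display=False):
    """Assemble the eval.py argument list for a given weights file."""
    cmd = [
        "python", "eval.py",
        f"--config={config_name(weights_path)}",
        f"--trained_model={weights_path}",
        f"--mask_det_file={results_json}",
    ]
    if display:
        cmd.append("--display")
    return cmd


print(" ".join(build_eval_command("weights/STMask_plus_base_ada.pth")))
```

The returned list can be passed to `subprocess.run` from the repository root. Weights whose names contain other digits, such as resnet50, are left untouched because only a trailing pair of numbers is stripped.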
By default, we train on the YouTubeVOS2019 dataset. Make sure to download the entire dataset using the commands above.
To train, grab a COCO-pretrained model and put it in ./weights.
Download yolact_plus_base_54_80000.pth or yolact_plus_resnet_54_80000.pth from Yolact++.
Run one of the training commands below.
Interrupting training saves an *_interrupt.pth file at the current iteration.
All weights are saved in the ./weights directory by default with the file name <config>_<epoch>_<iter>.pth.
# Trains STMask_plus_base_config with a batch_size of 8.
CUDA_VISIBLE_DEVICES=0,1 python train.py --config=STMask_plus_base_config --batch_size=8 --lr=1e-4 --save_folder=weights/weights_r101
# Resume training STMask_plus_base with a specific weight file and start from the iteration specified in the weight file's name.
CUDA_VISIBLE_DEVICES=0,1 python train.py --config=STMask_plus_base_config --resume=weights/STMask_plus_base_10_32100.pth
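Resuming relies on the epoch and iteration encoded in the weights file name. The sketch below shows how a name of the form `<config>_<epoch>_<iter>.pth` could be split into its parts. It is a hypothetical illustration under that naming assumption, not the parsing code used by train.py.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    config: str
    epoch: int
    iteration: int


# <config>_<epoch>_<iter>, where epoch and iter are plain integers.
_PATTERN = re.compile(r"^(?P<config>.+)_(?P<epoch>\d+)_(?P<iter>\d+)$")


def parse_checkpoint_name(path) -> Optional[Checkpoint]:
    """Split a checkpoint file name; return None if it does not match."""
    match = _PATTERN.match(Path(path).stem)
    if match is None:
        return None
    return Checkpoint(match["config"], int(match["epoch"]), int(match["iter"]))


print(parse_checkpoint_name("weights/STMask_plus_base_10_32100.pth"))
# Checkpoint(config='STMask_plus_base', epoch=10, iteration=32100)
```

A released model such as STMask_plus_base.pth has no epoch or iteration in its name, so the function returns None for it.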
If you use STMask or this code base in your work, please cite:
@inproceedings{STMask-CVPR2021,
author = {Minghan Li and Shuai Li and Lida Li and Lei Zhang},
title = {Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation},
booktitle = {CVPR},
year = {2021},
}
For questions about our paper or code, please contact Li Minghan ([email protected] or [email protected]).