ICNet implemented in PyTorch for real-time semantic segmentation on high-resolution images: mIoU = 71.0% on Cityscapes, with a single-image inference time of 19 ms (52.6 FPS).
This repo contains ICNet implemented in PyTorch, based on the paper by Hengshuang Zhao et al. (ECCV'18). Training and evaluation are done on the Cityscapes dataset by default.
Python 3.6 or later is required; install the dependencies with `pip3 install -r requirements.txt`.
With `crop_size=960`, the best mIoU increased to 71.0%. Training took about 2 days and produced `icnet_resnet50_197_0.710_best_model.pth`.
Method | mIoU (%) | Time (ms) | FPS | Memory (GB) | GPU
---|---|---|---|---|---
ICNet (paper) | 67.7 | 33 | 30.3 | 1.6 | TitanX
ICNet (ours) | 71.0 | 19 | 52.6 | 1.86 | GTX 1080Ti
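Numbers like the time and FPS above are typically obtained by averaging many forward passes after a warm-up. The repo's actual benchmarking code is not shown here; the harness below is only a sketch of that timing logic, with a placeholder callable standing in for the model forward pass (for a real GPU model you would also call `torch.cuda.synchronize()` before reading the clock):

```python
import time

def benchmark(infer, n_warmup=10, n_runs=100):
    """Average latency (ms) and FPS of a callable over n_runs calls.

    NOTE: this is a hypothetical helper, not code from this repo.
    Warm-up runs are executed first and excluded from the timing.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_runs * 1000.0
    return latency_ms, 1000.0 / latency_ms

# Placeholder workload standing in for model(image):
latency_ms, fps = benchmark(lambda: sum(range(1000)))
```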
See the `demo/` directory for image/prediction demo results.

### 3. Training

First, modify the training configuration in the `configs/icnet.yaml` file:
```yaml
train:
  specific_gpu_num: "1"   # for example: "0", "1" or "0, 1"
  train_batch_size: 7     # adjust according to GPU resources
  cityscapes_root: "/home/datalab/ex_disk1/open_dataset/Cityscapes/"
  ckpt_dir: "./ckpt/"     # checkpoints and the training log will be saved here
```
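A `specific_gpu_num` string like `"0, 1"` presumably ends up in the `CUDA_VISIBLE_DEVICES` environment variable, which must be set before CUDA is initialized. A minimal sketch of that mapping (the `select_gpus` helper is hypothetical, not a function from this repo):

```python
import os

def select_gpus(specific_gpu_num: str) -> int:
    """Restrict the GPUs visible to this process, e.g. "1" or "0, 1".

    Hypothetical helper: parses the config string, normalizes it, and
    exports CUDA_VISIBLE_DEVICES. Returns how many GPUs will be visible.
    """
    ids = [g.strip() for g in specific_gpu_num.split(",")]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
    return len(ids)

num_gpus = select_gpus("0, 1")  # process now sees GPUs 0 and 1 only
```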
Then, run: `python3 train.py`
### 4. Test

First, modify the test configuration in the `configs/icnet.yaml` file:
```yaml
test:
  ckpt_path: "./ckpt/icnet_resnet50_197_0.710_best_model.pth"  # set the pretrained model path correctly
```
Then, run: `python3 evaluate.py`
The structure of ICNet is mainly composed of `sub4`, `sub2`, `sub1` and `head`:

- `sub4`: basically a PSPNet; the biggest difference is a modified pyramid pooling module.
- `sub2`: the first three stages of convolutional layers of `sub4`; `sub2` and `sub4` share these layers.
- `sub1`: three consecutive strided convolutional layers that quickly downsample the original large-size input image.
- `head`: fuses the outputs of the three cascaded branches (`sub4`, `sub2` and `sub1`) through the CFF (cascade feature fusion) module. Finally, a 1x1 convolution and interpolation produce the output.

During training, I found that the pyramid pooling module in `sub4` is very important: it significantly improves the performance of the network while remaining lightweight.
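To make the cascade concrete, here is the feature-map resolution bookkeeping for the three branches, assuming a 2048x1024 Cityscapes input and overall output strides of 8, 16 and 32 for `sub1`, `sub2` and `sub4` as described in the paper. The helper below is illustrative arithmetic only, not code from this repo:

```python
def branch_resolutions(width, height):
    """Output sizes of the three ICNet branches for a given input size.

    sub1: full-res input through three stride-2 convs -> 1/8 of the input
    sub2: half-res input through the shared backbone stem -> 1/16
    sub4: quarter-res input through the full backbone -> 1/32
    """
    return {
        "sub1": (width // 8, height // 8),
        "sub2": (width // 16, height // 16),
        "sub4": (width // 32, height // 32),
    }

# For a 2048x1024 Cityscapes image: sub4 (1/32) is upsampled and fused
# with sub2 (1/16) by the first CFF, the result is upsampled and fused
# with sub1 (1/8) by the second CFF, and the head interpolates back to
# the full 2048x1024 resolution.
sizes = branch_resolutions(2048, 1024)
```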
The most important thing in the data preprocessing phase is to set `crop_size` reasonably: it should be as close as possible to the input size of the prediction phase. Here are my experiments:

- `base_size` set to 520 (resize the shorter side of the image to between 520x0.5 and 520x2) and `crop_size` set to 480 (randomly crop 480x480 patches for training): the final best mIoU is 66.7%.
- `base_size` set to 1024 (resize the shorter side of the image to between 1024x0.5 and 1024x2) and `crop_size` set to 720 (randomly crop 720x720 patches for training): the final best mIoU is 69.9%.

The larger `crop_size` (720x720) is clearly better. I have not tried an even larger `crop_size` (such as 960x960 or 1024x1024) yet, because it would result in a very small batch size and be very time-consuming; in addition, the current mIoU is already high. But I believe that a larger `crop_size` would bring a higher mIoU.

In addition, I found a small training technique that can improve the performance of the model:
- Set the learning rate of `sub4` to the original initial learning rate (0.01), because it has pretrained backbone weights.
- Set the learning rate of `sub1` and `head` to 10 times the initial learning rate (0.1), because there are no pretrained weights for them.

This small training technique is really effective: it can improve the mIoU by 1~2 percentage points.
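The random scale-and-crop scheme described in the preprocessing notes above can be sketched as follows: pick a random scale in [0.5, 2.0] of `base_size` for the shorter side, then take a random `crop_size` square patch. The helper name is hypothetical, and the padding that real code would apply to images smaller than the crop is left out for brevity:

```python
import random

def scale_and_crop_params(w, h, base_size=1024, crop_size=720, rng=random):
    """Return the resized (w, h) and a crop box (x0, y0, x1, y1).

    Hypothetical helper illustrating the augmentation parameters only;
    it does not touch pixel data.
    """
    # Resize the shorter side to a random scale in [0.5, 2.0] * base_size,
    # preserving the aspect ratio.
    short = rng.uniform(0.5, 2.0) * base_size
    if w < h:
        new_w, new_h = int(round(short)), int(round(short * h / w))
    else:
        new_w, new_h = int(round(short * w / h)), int(round(short))
    # Randomly pick a crop_size x crop_size patch (assumes the resized
    # image is at least crop_size on each side; real code would pad first).
    x = rng.randint(0, max(0, new_w - crop_size))
    y = rng.randint(0, max(0, new_h - crop_size))
    return (new_w, new_h), (x, y, x + crop_size, y + crop_size)
```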
Any other questions, or mistakes on my part, can be reported in the comments section. I will reply as soon as possible.