OpenMMLab Detection Toolbox and Benchmark
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.
Detail: https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino
v3.2.0 was released on October 12, 2023:
1. Detection Transformer SOTA Model Collection
(1) Supported four updated and stronger SOTA Transformer models: DDQ, CO-DETR, AlignDETR, and H-DINO.
(2) Based on CO-DETR, MMDet released a model with a COCO performance of 64.1 mAP.
(3) Algorithms such as DINO support AMP/Checkpoint/FrozenBN, which can effectively reduce memory usage.
2. Comprehensive performance comparison between CNN and Transformer models. RF100 is a collection of 100 real-world datasets covering 7 domains. It can be used to assess the performance differences between Transformer models such as DINO and CNN-based algorithms under different scenarios and data volumes. Users can use this benchmark to quickly evaluate the robustness of their algorithms in various scenarios.
3. Support for GLIP and Grounding DINO fine-tuning; MMDet is the only algorithm library that supports Grounding DINO fine-tuning. Its fine-tuned performance is one point higher than the official version, and GLIP likewise outperforms the official results. We also provide a detailed walkthrough for training and evaluating Grounding DINO on custom datasets (a minimal fine-tuning sketch is provided after this list). Everyone is welcome to give it a try.
Model | Backbone | Style | COCO mAP | Official COCO mAP |
---|---|---|---|---|
Grounding DINO-T | Swin-T | Zero-shot | 48.5 | 48.4 |
Grounding DINO-T | Swin-T | Finetune | 58.1(+0.9) | 57.2 |
Grounding DINO-B | Swin-B | Zero-shot | 56.9 | 56.7 |
Grounding DINO-B | Swin-B | Finetune | 59.7 | |
Grounding DINO-R50 | R50 | Scratch | 48.9(+0.8) | 48.1 |
4. Support for the open-vocabulary detection algorithm Detic and multi-dataset joint training.
5. Easily train detection models using FSDP and DeepSpeed.
ID | AMP | GC of Backbone | GC of Encoder | FSDP | Peak Mem (GB) | Iter Time (s) |
---|---|---|---|---|---|---|
1 | | | | | 49 (A100) | 0.9 |
2 | √ | | | | 39 (A100) | 1.2 |
3 | | √ | | | 33 (A100) | 1.1 |
4 | √ | √ | | | 25 (A100) | 1.3 |
5 | | √ | √ | | 18 | 2.2 |
6 | √ | √ | √ | | 13 | 1.6 |
7 | | √ | √ | √ | 14 | 2.9 |
8 | √ | √ | √ | √ | 8.5 | 2.4 |
6. Support for the V3Det dataset, a large-scale detection dataset with over 13,000 categories.
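As a pointer for the fine-tuning support described in item 3, here is a minimal, hypothetical sketch of launching Grounding DINO fine-tuning from Python through MMEngine's `Runner` (the `tools/train.py` script wraps the same logic). The config name is an assumption based on the `configs/grounding_dino` directory and the checkpoint path is a placeholder, so adjust both to your environment:

```python
# Minimal sketch: fine-tune Grounding DINO on COCO via MMEngine's Runner.
# The config file name is assumed from configs/grounding_dino; the
# pre-trained checkpoint path below is a placeholder.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile(
    'configs/grounding_dino/grounding_dino_swin-t_finetune_16xb2_1x_coco.py')
cfg.work_dir = './work_dirs/grounding_dino_finetune'  # logs and checkpoints go here
cfg.load_from = 'grounding_dino_swin-t_pretrain.pth'  # placeholder pre-trained weights

Runner.from_cfg(cfg).train()
```

In practice the same run is usually started with `python tools/train.py <config>` (or the distributed launcher), as described in the linked fine-tuning documentation.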
As multimodal vision algorithms continue to evolve, MMDetection has added support for them. This section demonstrates how to use the demo and evaluation scripts for multimodal algorithms, using the GLIP algorithm and model as an example. Moreover, MMDetection integrates a gradio_demo project, which allows developers to quickly try out all of MMDetection's image-input tasks on their local devices. Check the document for more details.
Please first make sure that you have the correct dependencies installed:
# if source
pip install -r requirements/multimodal.txt
# if wheel
mim install mmdet[multimodal]
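As a quick sanity check that the multimodal extras were installed, a minimal sketch (assuming the HuggingFace `transformers` package is among the dependencies pulled in above, since GLIP's text branch relies on it) is:

```python
# Sanity check (sketch): confirm MMDetection and the text-encoder dependency import.
import mmdet
import transformers

print('mmdet:', mmdet.__version__)
print('transformers:', transformers.__version__)
```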
MMDetection has already implemented GLIP and provides the pre-trained weights; you can download them directly from the URL below:
cd mmdetection
wget https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth
Once the model is successfully downloaded, you can use the demo/image_demo.py script to run inference.
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts bench
Demo result will be similar to this:
If users would like to detect multiple targets, please declare them in the format `xx . xx .` after `--texts`.
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts 'bench . car .'
And the result will be like this one:
You can also use a sentence as the input prompt for the `--texts` field, for example:
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts 'There are a lot of cars here.'
The result will be similar to this:
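If you prefer the Python API over the CLI, the same inference can be sketched with `DetInferencer`, which `demo/image_demo.py` builds on; the config path follows the evaluation example later in this section, and passing `texts` to the call is assumed to mirror the `--texts` flag:

```python
# Sketch: GLIP open-vocabulary inference through the Python API.
from mmdet.apis import DetInferencer

inferencer = DetInferencer(
    model='configs/glip/glip_atss_swin-t_fpn_dyhead_pretrain_obj365.py',
    weights='glip_tiny_a_mmdet-b3654169.pth',
    device='cuda:0')

# Multiple targets are separated by " . ", exactly as with --texts on the CLI.
results = inferencer('demo/demo.jpg', texts='bench . car .', out_dir='outputs/')
print(results['predictions'][0].keys())  # e.g. labels, scores, bboxes
```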
The GLIP implementation in MMDetection does not have any performance degradation; our benchmark is as follows:
Model | official mAP | mmdet mAP |
---|---|---|
glip_A_Swin_T_O365.yaml | 42.9 | 43.0 |
glip_Swin_T_O365.yaml | 44.9 | 44.9 |
glip_Swin_L.yaml | 51.4 | 51.3 |
Users can use the test script we provided to run evaluation as well. Here is a basic example:
# 1 gpu
python tools/test.py configs/glip/glip_atss_swin-t_fpn_dyhead_pretrain_obj365.py glip_tiny_a_mmdet-b3654169.pth
# 8 GPU
./tools/dist_test.sh configs/glip/glip_atss_swin-t_fpn_dyhead_pretrain_obj365.py glip_tiny_a_mmdet-b3654169.pth 8
The result will be similar to this:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.428
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.594
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.466
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.300
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.477
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.534
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.634
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.634
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.634
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.473
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.690
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.789
# if source
pip install -r requirements/multimodal.txt
# if wheel
mim install mmdet[multimodal]
For convenience, you can download the weights to the mmdetection root directory:
wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_last_novg.pt
wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_best_openseg.pt
The above two weights are copied directly from the official repository without any modification. The specific source is https://github.com/microsoft/X-Decoder.
For convenience of demonstration, please download the example images folder and place it in the root directory of mmdetection.
(1) Open Vocabulary Semantic Segmentation
cd projects/XDecoder
python demo.py ../../images/animals.png configs/xdecoder-tiny_zeroshot_open-vocab-semseg_coco.py --weights ../../xdecoder_focalt_last_novg.pt --texts zebra.giraffe
(2) Open Vocabulary Instance Segmentation
cd projects/XDecoder
python demo.py ../../images/owls.jpeg configs/xdecoder-tiny_zeroshot_open-vocab-instance_coco.py --weights ../../xdecoder_focalt_last_novg.pt --texts owl
(3) Open Vocabulary Panoptic Segmentation
cd projects/XDecoder
python demo.py ../../images/street.jpg configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_coco.py --weights ../../xdecoder_focalt_last_novg.pt --text car.person --stuff-text tree.sky
(4) Referring Expression Segmentation
cd projects/XDecoder
python demo.py ../../images/fruit.jpg configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py --weights ../../xdecoder_focalt_last_novg.pt --text "The larger watermelon. The front white flower. White tea pot."
(5) Image Caption
cd projects/XDecoder
python demo.py ../../images/penguin.jpeg configs/xdecoder-tiny_zeroshot_caption_coco2014.py --weights ../../xdecoder_focalt_last_novg.pt
(6) Referring Expression Image Caption
cd projects/XDecoder
python demo.py ../../images/fruit.jpg configs/xdecoder-tiny_zeroshot_ref-caption.py --weights ../../xdecoder_focalt_last_novg.pt --text 'White tea pot'
(7) Text Image Region Retrieval
cd projects/XDecoder
python demo.py ../../images/coco configs/xdecoder-tiny_zeroshot_text-image-retrieval.py --weights ../../xdecoder_focalt_last_novg.pt --text 'pizza on the plate'
The image that best matches the given text is ../../images/coco/000.jpg and probability is 0.998
We have also prepared a gradio program in the projects/gradio_demo directory, with which you can interactively run all the inference tasks supported by MMDetection in your browser.
Prepare your dataset according to the docs.
Test Command
Since semantic segmentation is a pixel-level task, we don't need to use a threshold to filter out low-confidence predictions. So we set `model.test_cfg.use_thr_for_mc=False` in the test command.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-semseg_ade20k.py xdecoder_focalt_best_openseg.pt 8 --cfg-options model.test_cfg.use_thr_for_mc=False
Model | mIoU | mIoU (official) | Config |
---|---|---|---|
xdecoder_focalt_best_openseg.pt | 25.24 | 25.13 | config |
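If you run this evaluation often, the same override can be baked into a small derived config instead of passing `--cfg-options` each time; a minimal sketch (assuming the base config shipped with the XDecoder project, as used above) is:

```python
# Sketch of a derived config that permanently disables the multi-class threshold.
_base_ = ['./xdecoder-tiny_zeroshot_open-vocab-semseg_ade20k.py']

# Semantic segmentation is a pixel-level task, so no score threshold is needed
# to filter low-confidence predictions.
model = dict(test_cfg=dict(use_thr_for_mc=False))
```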
Prepare your dataset according to the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-instance_ade20k.py xdecoder_focalt_best_openseg.pt 8
Model | Mask mAP | Mask mAP (official) | Config |
---|---|---|---|
xdecoder_focalt_best_openseg.pt | 10.1 | 10.1 | config |
Prepare your dataset according to the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_ade20k.py xdecoder_focalt_best_openseg.pt 8
Model | PQ | PQ (official) | Config |
---|---|---|---|
xdecoder_focalt_best_openseg.pt | 19.11 | 18.97 | config |
Prepare your dataset according to the "(2) use panoptic dataset" part of the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-semseg_coco.py xdecoder_focalt_last_novg.pt 8 --cfg-options model.test_cfg.use_thr_for_mc=False
Model | mIoU | mIoU (official) | Config |
---|---|---|---|
xdecoder-tiny_zeroshot_open-vocab-semseg_coco | 62.1 | 62.1 | config |
Prepare your dataset according to the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-instance_coco.py xdecoder_focalt_last_novg.pt 8
Model | Mask mAP | Mask mAP (official) | Config |
---|---|---|---|
xdecoder-tiny_zeroshot_open-vocab-instance_coco | 39.8 | 39.7 | config |
Prepare your dataset according to the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_coco.py xdecoder_focalt_last_novg.pt 8
Model | PQ | PQ (official) | Config |
---|---|---|---|
xdecoder-tiny_zeroshot_open-vocab-panoptic_coco | 51.42 | 51.16 | config |
Prepare your dataset according to the docs.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py xdecoder_focalt_last_novg.pt 8 --cfg-options test_dataloader.dataset.split='val'
Model | text mode | cIoU | cIoU (official) | Config |
---|---|---|---|---|
xdecoder_focalt_last_novg.pt | select first | 58.8415 | 57.85 | config |
xdecoder_focalt_last_novg.pt | original | 60.0321 | - | config |
xdecoder_focalt_last_novg.pt | concat | 60.3551 | - | config |
Note:

1. If the `Resize` scale is set to (1024, 512), the result will be `57.69`.
2. `text mode` is the `RefCoCoDataset` parameter in MMDetection; it determines the texts loaded into the data list. It can be set to `select_first`, `original`, `concat`, and `random`.
   - `select_first`: select the first text in the text list as the description of an instance.
   - `original`: use all texts in the text list as the description of an instance.
   - `concat`: concatenate all texts in the text list as the description of an instance.
   - `random`: randomly select one text in the text list as the description of an instance; usually used for training.
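To evaluate with a different text mode, a minimal sketch of a derived config is shown below; the dataset key is assumed to be named `text_mode` (matching the modes listed above), and `split='val'` mirrors the `--cfg-options` override in the test command:

```python
# Sketch: override the referring-expression dataset's text mode in a derived config.
_base_ = ['./xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py']

test_dataloader = dict(
    dataset=dict(
        split='val',          # mirrors --cfg-options test_dataloader.dataset.split='val'
        text_mode='concat'))  # select_first / original / concat / random
```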
Prepare your dataset according to the docs.

Before testing, you need to install JDK 1.8; otherwise, it will report that java does not exist during the evaluation process.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_caption_coco2014.py xdecoder_focalt_last_novg.pt 8
Model | BLEU-4 | CIDEr | Config |
---|---|---|---|
xdecoder-tiny_zeroshot_caption_coco2014 | 35.26 | 116.81 | config |
Please refer to https://github.com/open-mmlab/mmdetection/blob/dev-3.x/projects/gradio_demo/README.md for details.
A total of 30 developers contributed to this release.
Thanks @jjjkkkjjj, @lovelykite, @minato-ellie, @freepoet, @wufan-tb, @yalibian, @keyakiluo, @gihanjayatilaka, @i-aki-y, @xin-li-67, @RangeKing, @JingweiZhang12, @MambaWong, @lucianovk, @tall-josh, @xiuqhou, @jamiechoi1995, @YQisme, @yechenzhi, @bjzhb666, @xiexinch, @yarkable, @Renzhihan, @nijkah, @amaizr, @Lum1104, @zwhus, @Czm369, @hhaAndroid
We have released the official version of MMDetection v3.0.0.

- Fix `RTMDetIns` prior generator device error (#9964)
- Fix `img_shape` in data pipeline (#9966)
- Fix `solov2_r50_fpn_ms-3x_coco.py` config error (#10030)
- Fix `common/ms_3x_coco-instance.py` config error (#10056)
- Update `data_root` in `CocoOccludedSeparatedMetric` to fix a bug (#9969)

A total of 19 developers contributed to this release.
Thanks @IRONICBo, @vansin, @RangeKing, @Ghlerrix, @okotaku, @JosonChan1998, @zgzhengSEU, @bobo0810, @yechenzhi, @Zheng-LinXiao, @LYMDLUT, @yarkable, @xiejiajiannb, @chhluo, @BIGWangYuDong, @RangiLyu, @zwhus, @hhaAndroid, @ZwwWayne
- Update `customize_runtime.md` (#9797)
- Fix `WIDERFace SSD` loss for NaN problem (#9734)

A total of 4 developers contributed to this release. Thanks @co63oc, @Ginray, @vansin, @RangiLyu
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v2.28.1...v2.28.2
- Support `DetInferencer` for inference, Test Time Augmentation, and automatically importing modules from the registry
- Support ConvNeXt-V2 in `Projects` (#9619)
- Support DiffusionDet in `Projects` (#9639, #9768)
- Support inference of EfficientDet in `Projects` (#9645)
- Support inference of Detic in `Projects` (#9645)
- Support `DetInferencer` for inference (#9561)
- Fix `use_depthwise` in RTMDet (#9624)
- Fix `albumentations` augmentation post process with masks (#9551)
- Fix `LoadPanopticAnnotations` bug (#9703)
- Fix `isort` CI (#9680)
- Fix potential bug in `MultiImageMixDataset` (#9764)
- `sklearn` (#9725)
- `Project` (#9599)
- Replace `github` with `gitee` in the `.pre-commit-config-zh-cn.yaml` file (#9586)
- Update `isort` in the `.pre-commit-config.yaml` file (#9701)
- `2.0.0rc4` for `dev-3.x` (#9695)
- `DarknetBottleneck` (#9591)
- `non_blocking` parameters (#9723)
- `finetune.md` and `inference.md` (#9578)

A total of 27 developers contributed to this release.
Thanks @JosonChan1998, @RangeKing, @NoFish-528, @likyoo, @Xiangxu-0103, @137208, @PeterH0323, @tianleiSHI, @wufan-tb, @lyviva, @zwhus, @jshilong, @Li-Qingyun, @sanbuphy, @zylo117, @triple-Mu, @KeiChiTse, @LYMDLUT, @nijkah, @chg0901, @DanShouzhu, @zytx121, @vansin, @BIGWangYuDong, @hhaAndroid, @RangiLyu, @ZwwWayne
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v3.0.0rc5...v3.0.0rc6
A total of 4 developers contributed to this release. Thanks @triple-Mu, @i-aki-y, @twmht, @RangiLyu
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v2.28.0...v2.28.1
- Change `-` to `--format-only` in the documentation.
- `DeformableDETRHead` (#9607)

A total of 11 developers contributed to this release. Thanks @eantono, @akstt, @lpizzinidev, @RangiLyu, @kbumsik, @tianleiSHI, @nijkah, @BIGWangYuDong, @wangjiangben-hw, @jamiechoi1995, @ZwwWayne
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v2.27.0...v2.28.0
A total of 12 developers contributed to this release. Thanks @Min-Sheng, @gasvn, @lzyhha, @jbwang1997, @zachcoleman, @chenyuwang814, @MilkClouds, @Fizzez, @boahc077, @apatsekin, @zytx121, @DonggeunYu
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v2.26.0...v2.27.0
- Fix the case where `batch_size` is greater than 1 in inference (#9400)
- Fix `analyze_logs.py` to plot mAP and calculate train time correctly (#9409)
- `PAFPN` (#9450)
- Fix the `DeformableDETRHead` object has no attribute `loss_single` error (#9477)
- `analyze_results` (#9380)
- `builder.py` (#9479)
- `(width, height)` order (#9324)
- Add the `.pre-commit-config-zh-cn.yaml` file (#9388)
- Update `FocalLoss` and `QualityFocalLoss` to allow different kinds of targets (#9481)
- `setup.cfg` (#9370)
- `[0, 1]` (#9391)
- `faq.md` (#9396)
- `get_started` (#9480)
- `useful_tools.md` and `useful_hooks.md` (#9453)
- `bfp` and `channel_mapper` (#9410)

A total of 20 developers contributed to this release.
Thanks @liuyanyi, @RangeKing, @lihua199710, @MambaWong, @sanbuphy, @Xiangxu-0103, @twmht, @JunyaoHu, @Chan-Sun, @tianleiSHI, @zytx121, @kitecats, @QJC123654, @JosonChan1998, @lvhan028, @Czm369, @BIGWangYuDong, @RangiLyu, @hhaAndroid, @ZwwWayne
Full Changelog: https://github.com/open-mmlab/mmdetection/compare/v3.0.0rc4...v3.0.0rc5