
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (EMNLP 2022)


https://arxiv.org/abs/2205.12005

Introduction

We present mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from computational inefficiency, and their linguistic signal is overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
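To make the idea concrete, below is a minimal, conceptual PyTorch sketch of a fusion layer in which text cross-attends to the visual sequence while a skip-connection carries the original text features around the fusion path. It is illustrative only: the layer sizes, normalization placement, and exact skip pattern are assumptions and do not reproduce the released mPLUG implementation.

```python
# Conceptual sketch of a cross-modal skip-connection (not the released code).
# Text features are fused with visual features via cross-attention; a skip
# path re-injects the un-fused text so the short linguistic sequence is not
# overwhelmed by the much longer visual sequence.
import torch
import torch.nn as nn

class CrossModalSkipLayer(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (B, L_text, dim)   short linguistic sequence
        # visual: (B, L_visual, dim) long visual sequence (e.g. ViT patch tokens)
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        fused = text + self.cross_attn(self.norm2(text), visual, visual)[0]
        # cross-modal skip-connection: add the original text path back in
        return text + fused + self.ffn(self.norm3(fused))

x_text, x_vis = torch.randn(2, 16, 768), torch.randn(2, 577, 768)
print(CrossModalSkipLayer()(x_text, x_vis).shape)  # torch.Size([2, 16, 768])
```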

News

  • 2023.5.08: Moved from the AliceMind repo for further updates.
  • 2022.8.28: Released mPLUG downstream tasks!

Pre-trained models and datasets

  • Pre-trained models

For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2.

| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download |
| --- | --- | --- | --- | --- | --- | --- |
| mplug.en.base | vit-b-16 | 6 | 6 | 12 | 350M | mplug.en.base |
| mplug.en.large | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large |
| mplug.en.large.v2 | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large.v2 |
| mplug.en.huge | vit-l-14 | 24 | 6 | 12 | 1.1B | coming soon |
  • Pre-train Datasets
|  | COCO | VG | SBU | CC3M | CC13M |
| --- | --- | --- | --- | --- | --- |
| image | 113K | 100K | 860K | 3M | 10M |
| text | 567K | 769K | 860K | 3M | 10M |

Results

  • Image-text
| Task | VQA | Image Captioning | Retrieval | Retrieval | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Visual Entailment | NLVR2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | VQA v2 | COCO | MSCOCO | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | SNLI-VE | NLVR2 |
| Split | test-dev/test-std | Karpathy test (CE/CIDEr) | 5k test (TR/IR) | 1k test (TR/IR) | val/test-a/test-b | val/test-a/test-b | val-u/test-u | val/test | dev/test-P |
| Metric | Acc. | CIDEr | R@1 | R@1 | Acc. | Acc. | Acc. | Acc. | Acc. |
| mPLUG-Base | 79.79/79.98 | 137.5/150.4 | -/- | -/- | -/- | -/- | -/- | -/- | -/- |
| mPLUG-Large | 81.27/81.26 | 141.0/155.1 | 82.8/65.8 | 97.6/88.4 | 92.40/94.51/88.42 | 86.02/90.17/78.17 | 85.88/86.42 | 89.45/89.29 | 84.58/84.95 |
| mPLUG-Huge | 82.27/82.41 | 142.3/158.7 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/- |
  • Video-text
| Task | Video Retrieval | Video QA | Video QA | Video Captioning |
| --- | --- | --- | --- | --- |
| Dataset | MSRVTT | MSRVTT-QA | MSVD-QA | VATEX |
| Split | test | test | test | test (CE) |
| Metric | R@1 | Acc. | Acc. | CIDEr |
| mPLUG | 38.1 | 21.1 | 37.2 | 42.0 |

Requirements

  • PyTorch version >= 1.11.0

  • Install other libraries via

pip install -r requirements.txt

Pre-training

Coming soon.

Fine-tuning

Download json files of downstream tasks

Visual Question Answering

  1. Download the VQA v2 and Visual Genome datasets from the original websites (VQA 2.0).
  2. Download and extract the provided dataset json files.
  3. In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
  4. Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/vqa_mplug_base.sh
sh scripts/vqa_mplug_large.sh
  5. Evaluate the result using the official evaluation server (a sketch of the accuracy metric it applies follows this list).
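The evaluation server scores open-ended answers with the standard VQA accuracy formula: an answer receives min(#matching human answers / 3, 1), averaged over all leave-one-out subsets of the ten annotations. The minimal sketch below reproduces that formula for local sanity checks; it omits the official answer normalization (lower-casing, article and punctuation stripping) and is not part of the repo's scripts.

```python
# VQA accuracy for one question: average over the ten leave-one-out subsets of
# the human answers, scoring min(count / 3, 1) in each subset.
def vqa_accuracy(pred: str, human_answers: list) -> float:
    scores = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        scores.append(min(others.count(pred) / 3.0, 1.0))
    return sum(scores) / len(scores)

print(vqa_accuracy("yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
print(vqa_accuracy("no", ["yes"] * 8 + ["no"] * 2))   # ~0.6
```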

Image Captioning

  1. Download COCO Caption dataset from the original websites.
  2. Download and extract the provided dataset json files.
  3. Download the language evaluation tool (language_evaluation), used to compute metrics such as CIDEr (see the sketch after this list).
  4. In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
  5. Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/caption_mplug_base.sh
sh scripts/caption_mplug_large.sh
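The CIDEr numbers in the Results table come from the language evaluation tool mentioned in step 3. As a stand-alone illustration of the metric (not the repo's tool), the widely used pycocoevalcap package computes the same score from a references dict and a hypotheses dict keyed by image id; the captions below are made up.

```python
# CIDEr illustration with pycocoevalcap (pip install pycocoevalcap), used here
# as a stand-in for the repo's language_evaluation tool. Keys are image ids;
# values are whitespace-tokenized captions.
from pycocoevalcap.cider.cider import Cider

refs = {  # image id -> list of reference captions
    "img1": ["a dog runs on the grass", "a brown dog running in a field"],
    "img2": ["two people riding bikes on a road"],
}
hyps = {  # image id -> single generated caption, as a one-element list
    "img1": ["a dog running on grass"],
    "img2": ["two people ride bicycles down a road"],
}

corpus_score, per_image_scores = Cider().compute_score(refs, hyps)
print(f"corpus CIDEr: {corpus_score:.3f}")
```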

Image-text Retrieval

  1. Download MSCOCO or Flickr30k datasets from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/retrieval_flickr30k_mplug_large.sh
sh scripts/retrieval_coco_mplug_large.sh
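The retrieval scores reported above are Recall@K over an image-text similarity matrix: a query counts as a hit if its ground-truth match appears among its top-K scored candidates. A minimal sketch follows; the random matrix only shows the shapes involved, real evaluation uses the model's scores, and MSCOCO/Flickr30K pair each image with several captions rather than the one-to-one pairing assumed here.

```python
# Recall@K from a (num_queries x num_candidates) similarity matrix, assuming
# the ground-truth match of query i is candidate i.
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    topk = sim.topk(k, dim=1).indices                  # (N, k) candidate ids
    targets = torch.arange(sim.size(0)).unsqueeze(1)   # (N, 1) ground truth
    return (topk == targets).any(dim=1).float().mean().item()

sim = torch.randn(1000, 1000)  # placeholder scores, e.g. Flickr30K 1k test
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))
```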

Visual Grounding

  1. Download RefCOCO datasets from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/grounding_mplug_large.yaml, set the paths for the json files and the image path. Data preparation can follow TransVG.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
 sh scripts/grounding_mplug_base.sh 
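The referring expression accuracy reported in the Results table counts a prediction as correct when the IoU between the predicted and ground-truth box exceeds 0.5. A minimal sketch of that check, with boxes in (x1, y1, x2, y2) pixel coordinates and illustrative values:

```python
# IoU between two boxes in (x1, y1, x2, y2) format, and the Acc@0.5 criterion
# used for referring expression comprehension.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (10, 20, 110, 220), (15, 25, 115, 230)
print("correct" if iou(pred, gt) > 0.5 else "incorrect")  # correct (IoU ~0.84)
```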

Zero-shot Video-text Retrieval

  1. Download MSRVTT datasets from the original websites.
  2. In configs/retrieval_msrvtt_mplug_large.yaml, set the paths for the json files and the video paths.
  3. To perform zero-shot evaluation, run:
sh scripts/retrieval_msrvtt_mplug_large.sh
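Video-text evaluation typically encodes a small, fixed number of frames sampled uniformly from each clip before computing the same retrieval scores as in the image-text case. The sketch below only computes uniform frame indices; the actual number of frames and the video-decoding library used by the released config are not assumed here.

```python
# Uniformly sample frame indices: split the clip into equal segments and take
# each segment's midpoint frame.
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int) -> np.ndarray:
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

print(uniform_frame_indices(300, 8))  # [ 18  56  93 131 168 206 243 281]
```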

Zero-shot Video Question Answering

  1. Download MSRVTT-QA datasets from the original websites.
  2. In configs/videoqa_msrvtt_mplug_base.yaml, set the paths for the json files and the video paths.
  3. To perform zero-shot evaluation, run:
sh scripts/videoqa_msrvtt_mplug_base.sh
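The video QA accuracy in the Results table is typically computed as exact match between the generated answer and the single ground-truth answer. A minimal sketch of that scoring; the question ids and answers below are made up and do not reflect the script's actual output format.

```python
# Exact-match accuracy for open-ended video QA: a prediction is correct only
# if it equals the ground-truth answer after simple normalization.
predictions = [("q1", "a dog"), ("q2", "two"), ("q3", "kitchen")]  # (qid, answer)
ground_truth = {"q1": "dog", "q2": "two", "q3": "kitchen"}

correct = sum(ans.strip().lower() == ground_truth[qid] for qid, ans in predictions)
print(f"accuracy: {correct / len(predictions):.2%}")  # 66.67%
```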

Zero-shot Video Captioning

  1. Download VATEX datasets from the original websites.
  2. In configs/videocap_vatex_mplug_large.yaml, set the paths for the json files and the video paths.
  3. To perform zero-shot evaluation, run:
sh scripts/videocap_vatex_mplug_large.sh

Citation

If you use our work, please cite:

@article{li2022mplug,
  title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
  author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and others},
  journal={arXiv preprint arXiv:2205.12005},
  year={2022}
}

Acknowledgement

The implementation of mPLUG relies on resources from ALBEF, BLIP, and timm. We thank the original authors for open-sourcing their work.
