Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering (VQA)
The network has two attention branches built on the proposed multiplicative feature embedding scheme: one branch attends to free-form image regions, while the other attends to detection boxes, to encode question-related visual features.
The current code achieves 66.09 on Open-Ended and 69.97 on Multiple-Choice on the test-standard split of the VQA 1.0 dataset.
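The core of the multiplicative feature embedding is an element-wise product of the projected question and visual features. Below is a minimal Torch sketch of that fusion step only, not the authors' full architecture; the dimensions and layer choices are illustrative assumptions:

require 'nn'

-- project each modality into a common space, then fuse multiplicatively
local qDim, vDim, embDim = 2400, 2048, 1200   -- illustrative sizes
local qProj = nn.Sequential():add(nn.Linear(qDim, embDim)):add(nn.Tanh())
local vProj = nn.Sequential():add(nn.Linear(vDim, embDim)):add(nn.Tanh())
local fuse = nn.Sequential()
   :add(nn.ParallelTable():add(qProj):add(vProj))
   :add(nn.CMulTable())                        -- element-wise product

local q, v = torch.randn(16, qDim), torch.randn(16, vDim)
local z = fuse:forward({q, v})                 -- 16 x embDim fused feature

Each attention branch can then score its candidate regions (or detection boxes) from such fused features.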
Requirements
The main part of the code is written in Lua and requires Torch. After installing Torch, you can install the remaining dependencies by running the following:
cd ~/torch
luarocks install loadcaffe
luarocks install hdf5
pip install h5py
luarocks install optim
luarocks install nn
luarocks install math
luarocks install image
luarocks install dp
cd ~/torch
git clone [email protected]:Element-Research/rnn.git
cd rnn
luarocks make rocks/rnn-scm-1.rockspec
cd /usr/local/
sudo wget https://www.kyne.com.au/~mark/software/download/lua-cjson-2.1.0.tar.gz
sudo tar -xzvf lua-cjson-2.1.0.tar.gz
cd lua-cjson-2.1.0
sudo luarocks make
sudo rm ../lua-cjson-2.1.0.tar.gz
cd /usr/share/
sudo mkdir nltk_data
sudo pip install -U nltk
python -m nltk.downloader all
luarocks install cutorch
luarocks install cunn
luarocks install cudnn
cd ~/torch
# download the cuDNN file matching your CUDA version
tar -xzvf cudnn-7.5-linux-x64-v5.1.tgz
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-7.5/lib64/
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include/
luarocks install cudnn
cd ~/torch
git clone https://github.com/NVIDIA/nccl.git
# build the library
cd nccl/
make CUDA_HOME=/usr/local/cuda-7.5 test
# update LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/torch/nccl/build/lib
source ~/.bashrc
# test demo
./build/test/single/all_reduce_test
./build/test/single/all_reduce_test 10000000
luarocks install nccl
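To confirm that the GPU stack installed above works from Torch, here is a quick sanity check (nothing repo-specific):

require 'cutorch'
require 'cunn'
require 'cudnn'
print('GPUs visible to Torch: ' .. cutorch.getDeviceCount())
print('Device 1: ' .. cutorch.getDeviceProperties(1).name)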
Extracting and visualizing bounding boxes is supported by Caffe and py-faster-rcnn. You can install Caffe and py-faster-rcnn by following their instructions.
Then copy the faster-rcnn VQA files to the target folder:
cp ~/dual-mfa-vqa/faster-rcnn-vqa/tools/*.py ~/py-faster-rcnn/tools/
mkdir -p ~/VQA/Images/mscoco
cd ~/VQA/Images/mscoco
wget http://msvocds.blob.core.windows.net/coco2014/train2014.zip
unzip train2014.zip
cd ~/VQA/Images/mscoco
wget http://msvocds.blob.core.windows.net/coco2014/val2014.zip
unzip val2014.zip
cd ~/VQA/Images/mscoco
wget http://msvocds.blob.core.windows.net/coco2015/test2015.zip
unzip test2015.zip
ln -s test2015 test-dev2015
mkdir -p ~/VQA/Annotations
cd ~/dual-mfa-vqa/data_train-val_test-dev_2k
python vqa_preprocess.py --download 1
python prepro_vqa.py
cd ~/dual-mfa-vqa/data_train_test-dev_2k
python vqa_preprocess.py
python prepro_vqa.py
cd ~/dual-mfa-vqa
th prepro/prepro_seconds.lua
mkdir -p ~/VQA/Images/Image_model
cd ~/VQA/Images/Image_model
wget https://d2j0dndfm35trm.cloudfront.net/resnet-152.t7
wget https://raw.githubusercontent.com/facebook/fb.resnet.torch/master/datasets/transforms.lua
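For reference, here is a minimal sketch of how the downloaded ResNet-152 and transforms can embed a single image. The actual batch extraction is done by the prepro scripts below; example.jpg is a placeholder, and the mean/std values follow fb.resnet.torch's defaults:

require 'cudnn'
require 'image'
local t = dofile('transforms.lua')          -- downloaded above
local resnet = torch.load('resnet-152.t7')
resnet:remove()                             -- drop the final classifier to get 2048-d pooled features
resnet:evaluate()
local transform = t.Compose{
   t.Scale(256),
   t.CenterCrop(224),
   t.ColorNormalize({mean = {0.485, 0.456, 0.406},
                     std  = {0.229, 0.224, 0.225}}),
}
local img = transform(image.load('example.jpg', 3, 'float'))
local feat = resnet:forward(img:view(1, 3, 224, 224):cuda())   -- 1 x 2048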
cd ~/py-faster-rcnn/data/
mkdir faster_rcnn_models
cd faster_rcnn_models
wget https://dl.dropboxusercontent.com/s/cotx0y81zvbbhnt/coco_vgg16_faster_rcnn_final.caffemodel?dl=0
mv coco_vgg16_faster_rcnn_final.caffemodel?dl=0 coco_vgg16_faster_rcnn_final.caffemodel
You can download the pretrained skip-thoughts models to the folder skipthoughts_model/ for learning (see more details).
The current code achieves 66.01 on Open-Ended and 70.04 on Multiple-Choice on the test-dev split of the VQA 1.0 dataset. Download the pre-trained model vqa_model_dual-mfa_6601.t7 (315M) from here into the folder dual-mfa-vqa/model/save/.
Extract the image and bounding-box features:
cd prepro
th prepro_res_train.lua -batch_size 8
th prepro_res_test.lua -batch_size 8
python extract_box_feat_train.py
python extract_box_feat_test.py
Alternatively, you can download the extracted test box features faster-rcnn_box4_19_test.h5 from here. To extract the bounding boxes themselves, run:
python extract_box_test.py
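If you want to sanity-check an extracted-feature file, torch-hdf5 (installed above) can list its contents; the file name below is the test box feature file from this step:

require 'hdf5'
local f = hdf5.open('faster-rcnn_box4_19_test.h5', 'r')
local data = f:all()                               -- read every dataset
for name, v in pairs(data) do
   print(name, torch.isTensor(v) and v:size() or type(v))
end
f:close()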
Now everything is ready; let's train the VQA network. Here are some common training commands for different needs.
th train.lua -phase 1 -val_nqs -1 -nGPU 4
th train.lua -phase 2 -nGPU 4 -batch_size 300
th train.lua -phase 1 -val_nqs 10000 -nGPU 4 -memory_ms -memory_frms
th train.lua -phase 2 -nGPU 4 -memory_ms -load_checkpoint_path model/save/vqa_model_dual-mfa_6601.t7 -previous_iters 350000
- phase: training phase; 1: train on Train, 2: train on Train+Val
- vqa_type: VQA dataset type, vqa or coco-qa
- memory_ms: load the image ResNet features into memory
- memory_frms: load the image faster-rcnn features into memory
- val: run validation
- val_nqs: number of validation questions; -1 for all questions
- batch_size: batch size for each iteration; change it to a smaller value if you run out of memory
- run_id: running model id
- model_label: model label name
- save_checkpoint_every: how often to save a model checkpoint
- skip_save_model: skip saving the t7 model
- cg_every: how often collectgarbage is called during training; change it to a smaller value if you run out of memory
- quick_check: quick check of the code
- quickquick_check: very quick check of the code
- nGPU: how many GPUs to use (1 = one GPU); change it to a larger value if you run out of memory

Evaluate the pre-trained model on the VQA dataset:
cd ~/dual-mfa-vqa
th eval.lua -model_path model/vqa_model_dual-mfa_6601.t7 -output_model_name vqa_model_dual-mfa_6601 -batch_size 10
Then you can submit the result JSON files and obtain the evaluation scores.
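The VQA evaluation server expects a JSON array of {question_id, answer} records. A minimal sketch for writing one with lua-cjson (installed above); the ids, answers, and file name are placeholders for your model's predictions:

local cjson = require 'cjson'
-- dummy predictions; replace with your model's question ids and answers
local question_ids = {1234567, 1234568}
local answers = {'yes', '2'}
local results = {}
for i = 1, #question_ids do
   results[i] = {question_id = question_ids[i], answer = answers[i]}
end
local f = io.open('vqa_results.json', 'w')
f:write(cjson.encode(results))
f:close()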
To train and evaluate on the COCO-QA dataset, first preprocess it:
cd data_coco
python cocoqa_preprocess.py --download 1
python prepro_cocoqa.py
cd prepro
th prepro_res_coco.lua -batch_size 8
th train.lua -vqa_type coco-qa -learning_rate 4e-4 -nGPU 4 -batch_size 300 \
-model_id 1 -model_label dual-mfa
To compute the WUPS metrics on COCO-QA:
cd ~/dual-mfa-vqa/metric
python gen_wups_input.py
python calculate_wups.py gt_ans_save.txt pd_ans_save.txt 0.9
python calculate_wups.py gt_ans_save.txt pd_ans_save.txt 0.0
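The two runs above compute WUPS@0.9 and WUPS@0.0 (the last argument is the threshold). WUPS scores a predicted word against the ground truth by Wu-Palmer WordNet similarity and down-weights pairs below the threshold; a minimal sketch of that thresholding rule, with the similarity value assumed given:

-- s: Wu-Palmer similarity of a predicted/ground-truth pair, t: threshold
local function thresholded_wups(s, t)
   if s >= t then return s end
   return 0.1 * s                  -- below-threshold pairs are scaled by 0.1
end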
To visualize the attention maps, first run the evaluation script that saves them:
cd ~/dual-mfa-vqa
th eval_vis_att.lua -model_path model/vqa_model_dual-mfa_6601.t7 -output_model_name vqa_model_dual-mfa_6601 -batch_size 8
cd vis_att
python vis_prepro.py
Run vis_attention_demo.m to show the results of attention maps, or vis_attention.m to save the results of attention maps.

If you use this code as part of any published research, please acknowledge the following paper.
@inproceedings{lu2018co-attending,
title={Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering},
author={Lu, Pan and Li, Hongsheng and Zhang, Wei and Wang, Jianyong and Wang, Xiaogang},
booktitle={AAAI 2018},
pages={7218--7225},
year={2018}
}