Sam Textvqa

Official code for paper "Spatially Aware Multimodal Transformers for TextVQA" published at ECCV, 2020.

Spatially Aware Multimodal Transformers for TextVQA

Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal

Published at ECCV, 2020

Paper: arxiv.org/abs/2007.12146

Project Page: yashkant.github.io/projects/sam-textvqa

We propose a novel spatially aware self-attention layer in which each visual entity attends only to the neighboring entities defined by a spatial graph, and we use it to solve TextVQA.
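To make the idea of attention restricted by a spatial graph concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not the layer implemented in this repository: it has no learned projections or multiple heads, and the function name `masked_attention` and the `neighbors` argument are assumptions made for this example.

```python
import math


def masked_attention(queries, keys, values, neighbors):
    """Single-head attention where entity i attends only to neighbors[i].

    queries, keys, values: lists of float vectors, one per entity.
    neighbors: list of sets; neighbors[i] holds the indices that entity i
    may attend to (hypothetical encoding of the spatial graph).
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Fall back to self-attention if an entity has no neighbors.
        allowed = sorted(neighbors[i]) or [i]
        # Scaled dot-product scores, computed only for allowed neighbors.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighbors' value vectors.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Entities outside `neighbors[i]` receive exactly zero weight, which is the same effect as masking their scores to negative infinity before the softmax in a dense implementation.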

Repository Setup

Create a fresh conda environment and install all dependencies:

conda create -n sam python=3.6
conda activate sam
cd sam-textvqa
pip install -r requirements.txt

Install PyTorch:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Finally, install apex from: https://github.com/NVIDIA/apex
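After the steps above, a quick check can confirm that the environment sees the packages named in this section. The snippet below is a small sketch using only the standard library; the helper name `check_environment` is an assumption for illustration and is not part of this repository.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "apex")):
    """Report the Python version and which named packages are importable."""
    found = {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }
    version = "%d.%d" % sys.version_info[:2]
    return {"python": version, "packages": found}


if __name__ == "__main__":
    report = check_environment()
    print("Python", report["python"])
    for name, ok in report["packages"].items():
        print(name, "found" if ok else "MISSING")
```

The check only tests whether each package can be located. It does not verify versions or CUDA support.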

Data Setup

Download the files from the Dropbox link and place them in the data/ folder. Ensure that the data paths match the directory structure described in data/README.md.
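One way to confirm the layout before training is to list the expected paths and report any that are absent. The sketch below assumes you copy the expected relative paths from data/README.md yourself; the helper name `find_missing` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_missing(root, expected):
    """Return the entries of `expected` (relative to `root`) that do not exist."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Fill this list from data/README.md; the single entry here is the
    # checkpoint path used by the evaluation command in this README.
    expected = ["pretrained-models/best_model.tar"]
    for rel in find_missing("data", expected):
        print("missing:", rel)
```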

Run Experiments

Pick a suitable configuration file from the table below:

Method	Context (c)	Train Splits	Evaluation Splits	Config File
SA-M4C	3	TextVQA	TextVQA	train-tvqa-eval-tvqa-c3.yml
SA-M4C	3	TextVQA + STVQA	TextVQA	train-tvqa_stvqa-eval-tvqa-c3.yml
SA-M4C	3	STVQA	STVQA	train-stvqa-eval-stvqa-c3.yml
SA-M4C	5	TextVQA	TextVQA	train-tvqa-eval-tvqa-c5.yml
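The table can also be expressed as a lookup, which is convenient when scripting several runs. This is a sketch for illustration: the names `CONFIGS` and `config_path` are assumptions, and the `configs` directory is taken from the evaluation command shown in this README.

```python
# (context, train splits, evaluation splits) -> config file, as in the table.
CONFIGS = {
    (3, "TextVQA", "TextVQA"): "train-tvqa-eval-tvqa-c3.yml",
    (3, "TextVQA + STVQA", "TextVQA"): "train-tvqa_stvqa-eval-tvqa-c3.yml",
    (3, "STVQA", "STVQA"): "train-stvqa-eval-stvqa-c3.yml",
    (5, "TextVQA", "TextVQA"): "train-tvqa-eval-tvqa-c5.yml",
}


def config_path(context, train, evaluate, config_dir="configs"):
    """Return the config file path for one row of the table."""
    key = (context, train, evaluate)
    if key not in CONFIGS:
        raise KeyError("no configuration listed for %r" % (key,))
    return f"{config_dir}/{CONFIGS[key]}"
```

For example, `config_path(3, "TextVQA + STVQA", "TextVQA")` gives the path of the configuration used with the pretrained checkpoint.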

To run an experiment, replace config.yml with the chosen configuration file and use:

python train.py \
--config config.yml \
--tag experiment-name

To evaluate the provided pretrained checkpoint, use:

python train.py \
--config configs/train-tvqa_stvqa-eval-tvqa-c3.yml \
--pretrained_eval data/pretrained-models/best_model.tar
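When launching runs from a script, the two commands above can be assembled as argument lists. The sketch below uses only the flags shown in this README (`--config`, `--tag`, `--pretrained_eval`); the helper name `train_command` is an assumption, and whether `--tag` and `--pretrained_eval` can be combined is not stated here.

```python
import subprocess
import sys


def train_command(config, tag=None, pretrained_eval=None):
    """Build the argument list for train.py from the flags shown above."""
    cmd = [sys.executable, "train.py", "--config", config]
    if tag is not None:
        cmd += ["--tag", tag]
    if pretrained_eval is not None:
        cmd += ["--pretrained_eval", pretrained_eval]
    return cmd


if __name__ == "__main__":
    cmd = train_command(
        "configs/train-tvqa_stvqa-eval-tvqa-c3.yml",
        pretrained_eval="data/pretrained-models/best_model.tar",
    )
    # Run from the repository root so that train.py resolves.
    subprocess.run(cmd, check=True)
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.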

Note: The beam-search evaluation is undergoing changes and will be updated.

Resources used: We ran all experiments on 2 Titan Xp GPUs.

Citation

@inproceedings{kant2020spatially,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Kant, Yash and Batra, Dhruv and Anderson, Peter 
          and Schwing, Alexander and Parikh, Devi and Lu, Jiasen
          and Agrawal, Harsh},
  booktitle={ECCV},
  year={2020}}

Acknowledgements

Parts of this codebase were borrowed from the following repositories:

12-in-1: Multi-Task Vision and Language Representation Learning: Training Setup
MMF: A multimodal framework for vision and language research: Dataset processors and M4C model

We thank Abhishek Das and Abhinav Moudgil for their feedback, and Ronghang Hu for sharing an early version of his work. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.

License

MIT
