Document Visual Question Answering
This repo hosts the basic functional code for our approach, HyperDQA, in the Document Visual Question Answering competition, hosted as part of the Workshop on Text and Documents in the Deep Learning Era at CVPR 2020. Our approach stands at position 4 on the leaderboard.
Read more about our approach in this blogpost!
git clone https://github.com/anisha2102/docvqa.git
pip install -r requirements.txt
Download the dataset
The dataset for Task 1 can be downloaded from the Downloads section of the competition website. It consists of document images and their corresponding OCR transcriptions.
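Before preprocessing, it can help to confirm that every document image has a matching OCR transcription. The sketch below assumes that the images and OCR files sit in two separate folders and share a file stem (for example `abc_1.png` and `abc_1.json`). The folder layout, the file extensions, and the helper name `find_unpaired` are assumptions for illustration, not part of this repo.

```python
from pathlib import Path


def find_unpaired(documents_dir, ocr_dir,
                  image_exts=(".png", ".jpg", ".jpeg"), ocr_ext=".json"):
    """Return (images without OCR, OCR files without image) as sorted stem lists.

    Assumes an image and its OCR file share the same file stem; adjust the
    extensions if the downloaded dataset uses different ones.
    """
    image_stems = {
        p.stem for p in Path(documents_dir).iterdir()
        if p.is_file() and p.suffix.lower() in image_exts
    }
    ocr_stems = {
        p.stem for p in Path(ocr_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ocr_ext
    }
    return sorted(image_stems - ocr_stems), sorted(ocr_stems - image_stems)
```

Calling `find_unpaired("<data-documents-folder>", "<data-ocr-folder>")` returns two lists; both should be empty if the download is complete under these assumptions.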
Download the pretrained model
Download the pretrained LayoutLM-Base, Uncased model from here
python create_dataset.py \
<data-ocr-folder> \
<data-documents-folder> \
<path-to-train_v1.0.json> \
<train-output-json-path> \
<validation-output-json-path>
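The command above takes one annotation file and writes separate training and validation files. How `create_dataset.py` chooses the split is not described here, so the following is only a sketch of one common approach: holding out a random fraction of the question records. The function name `split_records`, the `val_fraction` and `seed` parameters, and the assumption that the records form a plain list are all hypothetical.

```python
import random


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and split it into (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(records)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed keeps the validation set stable across runs, which makes results from different experiments comparable.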
# Example <pretrained-model-path>: ./models/layoutlm-base-uncased
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py \
--data_dir <data-folder> \
--model_type layoutlm \
--model_name_or_path <pretrained-model-path> \
--do_lower_case \
--max_seq_length 512 \
--do_train \
--num_train_epochs 15 \
--logging_steps 500 \
--evaluate_during_training \
--save_steps 500 \
--do_eval \
--output_dir <data-folder>/<exp-folder> \
--per_gpu_train_batch_size 8 \
--overwrite_output_dir \
--cache_dir <data-folder>/models \
--skip_match_answers \
--val_json <validation-output-json-path> \
--train_json <train-output-json-path>
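For scripted experiments, the same invocation can be assembled in Python, which avoids shell line-continuation mistakes. This is a sketch: the flags are copied from the command above, while the helper name `build_train_command` and its parameters are hypothetical. The function only builds the argument list and does not start training.

```python
import sys


def build_train_command(data_dir, model_path, exp_folder, train_json, val_json,
                        epochs=15, batch_size=8, steps=500):
    """Return the argument list for run_docvqa.py using the flags shown above."""
    return [
        sys.executable, "run_docvqa.py",
        "--data_dir", data_dir,
        "--model_type", "layoutlm",
        "--model_name_or_path", model_path,
        "--do_lower_case",
        "--max_seq_length", "512",
        "--do_train",
        "--num_train_epochs", str(epochs),
        "--logging_steps", str(steps),
        "--evaluate_during_training",
        "--save_steps", str(steps),
        "--do_eval",
        "--output_dir", f"{data_dir}/{exp_folder}",
        "--per_gpu_train_batch_size", str(batch_size),
        "--overwrite_output_dir",
        "--cache_dir", f"{data_dir}/models",
        "--skip_match_answers",
        "--val_json", val_json,
        "--train_json", train_json,
    ]
```

To launch training, pass the list to `subprocess.run(cmd, check=True, env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"})` from the repo root.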
Download the pytorch_model.bin file from the link below and copy it to the models folder. Google Drive Link
Try out the demo on a sample datapoint with demo.ipynb
The code and pretrained models are based on LayoutLM and HuggingFace Transformers. Many thanks for their amazing open source contributions.