[ACL 2020] DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
This repo contains the code for the DeFormer paper (accepted to ACL 2020).
Tested on Ubuntu 16.04, 18.04, and macOS. (Windows should also work, but is untested.)
You can create a separate Python environment, e.g. virtualenv -p python3.7 .env, and activate it with source .env/bin/activate.
Requirements: Python>=3.5 and TensorFlow >=1.14.0,<2.0
pip install "tensorflow>=1.14.0,<2.0"
or pip install tensorflow-gpu==1.15.3
(for GPU)
pip install -r requirements.txt
NOTE: in the code, ebert refers to the DeFormer version of BERT, and sbert to the version trained with the knowledge distillation (KD) and layerwise representation similarity (LRS) auxiliary losses from the paper.
For XLNet, you can check my fork for a reference implementation.
The dataset dir should look like below (check with tree -L 2 data/datasets):
data/datasets
├── BoolQ
│ ├── test.jsonl
│ ├── train.jsonl
│ └── val.jsonl
├── mnli
│ ├── dev_matched.tsv
│ ├── dev_mismatched.tsv
│ └── train.tsv
├── qqp
│ ├── dev.tsv
│ ├── test.tsv
│ └── train.tsv
├── RACE
│ ├── dev
│ ├── test
│ └── train
└── squad_v1.1
├── dev-v1.1.json
└── train-v1.1.json
convert the raw datasets to jsonl:
deformer_dir=data/datasets/deformer
mkdir -p ${deformer_dir}
# squad v1.1
for version in 1.1; do
data_dir=data/datasets/squad_v${version}
for split in dev train; do
python tools/convert_squad.py ${data_dir}/${split}-v${version}.json \
${deformer_dir}/squad_v${version}-${split}.jsonl
done
done
# mnli
data_dir=data/datasets/mnli
python tools/convert_pair_dataset.py ${data_dir}/train.tsv ${deformer_dir}/mnli-train.jsonl -t mnli
python tools/convert_pair_dataset.py ${data_dir}/dev_matched.tsv ${deformer_dir}/mnli-dev.jsonl -t mnli
# qqp
data_dir=data/datasets/qqp
python tools/convert_pair_dataset.py ${data_dir}/train.tsv ${deformer_dir}/qqp-train.jsonl -t qqp
python tools/convert_pair_dataset.py ${data_dir}/dev.tsv ${deformer_dir}/qqp-dev.jsonl -t qqp
# boolq
data_dir=data/datasets/BoolQ
python tools/convert_pair_dataset.py ${data_dir}/train.jsonl ${deformer_dir}/boolq-train.jsonl -t boolq
python tools/convert_pair_dataset.py ${data_dir}/val.jsonl ${deformer_dir}/boolq-dev.jsonl -t boolq
# race
data_dir=data/datasets/RACE
python tools/convert_race.py ${data_dir}/train ${deformer_dir}/race-train.jsonl
python tools/convert_race.py ${data_dir}/dev ${deformer_dir}/race-dev.jsonl
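After the conversion finishes, ${deformer_dir} should contain one jsonl file per task and split (the file names come from the commands above); a quick check:
ls ${deformer_dir}
# expected: boolq-{train,dev}.jsonl mnli-{train,dev}.jsonl qqp-{train,dev}.jsonl race-{train,dev}.jsonl squad_v1.1-{train,dev}.jsonl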
split 10% of the training data for tuning hyper-parameters:
cd ${deformer_dir}
cat squad_v1.1-train.jsonl | shuf > squad_v1.1-train-shuf.jsonl
head -n8760 squad_v1.1-train-shuf.jsonl > squad_v1.1-tune.jsonl
tail -n78839 squad_v1.1-train-shuf.jsonl > squad_v1.1-train.jsonl
cat boolq-train.jsonl | shuf > boolq-train-shuf.jsonl
head -n943 boolq-train-shuf.jsonl > boolq-tune.jsonl
tail -n8484 boolq-train-shuf.jsonl > boolq-train.jsonl
cat race-train.jsonl | shuf > race-train-shuf.jsonl
head -n8786 race-train-shuf.jsonl > race-tune.jsonl
tail -n79080 race-train-shuf.jsonl > race-train.jsonl
cat qqp-train.jsonl | shuf > qqp-train-shuf.jsonl
head -n36385 qqp-train-shuf.jsonl > qqp-tune.jsonl
tail -n327464 qqp-train-shuf.jsonl > qqp-train.jsonl
cat mnli-train.jsonl | shuf > mnli-train-shuf.jsonl
head -n39270 mnli-train-shuf.jsonl > mnli-tune.jsonl
tail -n353432 mnli-train-shuf.jsonl > mnli-train.jsonl
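A quick sanity check that the tune/train splits have the expected sizes (the counts follow the head/tail numbers above):
wc -l squad_v1.1-tune.jsonl squad_v1.1-train.jsonl  # expect 8760 and 78839
wc -l boolq-tune.jsonl boolq-train.jsonl  # expect 943 and 8484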
download bert.vocab to data/res
usage: python prepare.py -h
e.g., convert squad_v1.1 for bert:
python prepare.py -m bert -t squad_v1.1 -s dev
python prepare.py -m bert -t squad_v1.1 -s tune
python prepare.py -m bert -t squad_v1.1 -s train -sm tf
e.g., convert squad_v1.1 for xlnet:
model=xlnet
task=squad_v1.1
python prepare.py -m ${model} -t ${task} -s dev
python prepare.py -m ${model} -t ${task} -s train -sm tf
convert all available tasks and all models:
for model in bert ebert; do
for task in squad_v1.1 mnli qqp boolq race; do
python prepare.py -m ${model} -t ${task} -s dev
python prepare.py -m ${model} -t ${task} -s tune
python prepare.py -m ${model} -t ${task} -s train -sm tf
done
done
download the original fine-tuned BERT-base checkpoint from bert-base-squad_v1.1.tgz and the DeFormer fine-tuned version from ebert-base-s9-squad_v1.1.tgz
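A sketch of extracting them, assuming the archives follow the checkpoint layout used in the download section near the end (data/ckpt/bert/qa and data/ckpt/ebert_s9/qa); the exact checkpoint_dir can be adjusted in config/*.ini:
mkdir -p data/ckpt
tar xzf bert-base-squad_v1.1.tgz -C data/ckpt
tar xzf ebert-base-s9-squad_v1.1.tgz -C data/ckpt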
python eval.py -m bert -t squad_v1.1 2>&1 | tee data/bert-base-eval.log
example output:
INFO:2020-07-01_15:36:30.339:eval.py:65: model.ckpt-8299, em=80.91769157994324, f1=88.33819502660548, metric=88.33819502660548
python eval.py -m ebert -t squad_v1.1 2>&1 | tee data/ebert-base-s9-eval.log
example output:
INFO:2020-07-01_15:39:15.418:eval.py:65: model.ckpt-8321, em=79.12961210974456, f1=86.99636369864814, metric=86.99636369864814
See config/*.ini for customizing the training and evaluation scripts.
train: python train.py, specify the model with -m (--model) and the task with -t (--task); eval is similar.
see below example commands for boolq:
# for running on tpu, should specify gcs bucket data_dir, and set use_tpu to yes
# also need to set tpu_name=<some_ip_or_just_name> if not exported to environment
base_dir=<your google cloud storage bucket>
data_dir=${base_dir} use_tpu=yes \
python train.py -m bert -t boolq 2>&1 | tee data/boolq-bert-train.log
data_dir=${base_dir} use_tpu=yes \
python eval.py -m bert -t boolq 2>&1 | tee data/boolq-bert-eval.log
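If you are not running on a TPU, the same commands should work locally by dropping the TPU overrides (data_dir then presumably falls back to the default in config/*.ini):
python train.py -m bert -t boolq 2>&1 | tee data/boolq-bert-train.log
python eval.py -m bert -t boolq 2>&1 | tee data/boolq-bert-eval.log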
# loop over a list of models and a list of tasks
for task in boolq mnli qqp squad_v1.1; do
for model in bert ebert; do
data_dir=${base_dir} use_tpu=yes \
python train.py -m ${model} -t ${task} 2>&1 | tee data/${task}-${model}-train.log
data_dir=${base_dir} use_tpu=yes \
python eval.py -m ${model} -t ${task} 2>&1 | tee data/${task}-${model}-eval.log
done
done
BERT whole-word-masking (wwm) large:
base_dir=<your google cloud storage bucket>
for t in boolq qqp squad_v1.1 mnli; do
use_tpu=yes data_dir=${base_dir} \
learning_rate=1e-5 epochs=2 keep_checkpoint_max=1 \
init_checkpoint=${base_dir}/ckpt/init/wwm_uncased_large/bert_model.ckpt \
checkpoint_dir=${base_dir}/ckpt/bert_large/${t} \
hidden_size=1024 intermediate_size=4096 num_heads=16 num_hidden_layers=24 \
python train.py -m bert -t ${t} 2>&1 | tee data/${t}-large-train.log
data_dir=${base_dir} use_tpu=yes init_checkpoint="" \
checkpoint_dir=${base_dir}/ckpt/bert_large/${t} \
hidden_size=1024 intermediate_size=4096 num_heads=16 num_hidden_layers=24 \
python eval.py -m bert -t ${t} 2>&1 | tee data/${t}-large-eval.log
done || exit 1
fine-tuning with separation at different layers for BERT base:
for t in boolq qqp mnli squad_v1.1; do
for n in `seq 1 1 11`; do
echo "n=${n}, t=${t}"
base_dir=${base_dir}
sep_layers=${n} use_tpu=yes data_dir=${base_dir} keep_checkpoint_max=1 \
checkpoint_dir="${base_dir}/ckpt/separation/${t}/ebert_s${n}" \
python train.py -m ebert -t ${t} 2>&1 | tee data/${t}-base-sep${n}-train.log
sep_layers=${n} use_tpu=yes data_dir=${base_dir} init_checkpoint="" \
checkpoint_dir="${base_dir}/ckpt/separation/${t}/ebert_s${n}" \
python eval.py -m ebert -t ${t} 2>&1 | tee data/${t}-base-sep${n}-eval.log
done
done
fine-tuning with separation at different layers for BERT wwm large:
for t in boolq qqp mnli squad_v1.1; do
for n in `seq 10 1 23`; do
echo "n=${n}, t=${t}"
base_dir=${base_dir}
sep_layers=${n} use_tpu=yes data_dir=${base_dir} \
learning_rate=1e-5 epochs=2 keep_checkpoint_max=1 \
init_checkpoint=${base_dir}/ckpt/init/wwm_uncased_large/bert_model.ckpt \
checkpoint_dir=${base_dir}/ckpt/separation/${t}/ebert_large_s${n} \
hidden_size=1024 intermediate_size=4096 num_heads=16 num_hidden_layers=24 \
python train.py -m ebert -t ${t} 2>&1 | tee data/${t}-large-sep${n}-train.log
sep_layers=${n} use_tpu=yes data_dir=${base_dir} init_checkpoint="" \
checkpoint_dir=${base_dir}/ckpt/separation/${t}/ebert_large_s${n} \
hidden_size=1024 intermediate_size=4096 num_heads=16 num_hidden_layers=24 \
output_file=${base_dir}/predictions/${t}-large-sep${n}-dev.json \
python eval.py -m ebert -t ${t} 2>&1 | tee data/${t}-large-sep${n}-eval.log
done || exit 1
done || exit 1
NOTE: the sbert training script needs further verification (it was migrated from an older codebase).
sbert procedure: first train ebert_s0 (ebert with sep_layers=0), then merge the bert_base and ebert_s0 checkpoints using tools/merge_checkpoints.py to get the initial checkpoint for sbert, then run the training.
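A sketch of training ebert_s0, mirroring the separation loop above with n=0 (assuming sep_layers=0 is accepted by the config):
for t in boolq qqp mnli squad_v1.1; do
sep_layers=0 use_tpu=yes data_dir=${base_dir} keep_checkpoint_max=1 \
checkpoint_dir="${base_dir}/ckpt/separation/${t}/ebert_s0" \
python train.py -m ebert -t ${t} 2>&1 | tee data/${t}-base-sep0-train.log
done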
base_dir=gs://xxx
init_dir="data/ckpt/init"
large_model="${init_dir}/wwm_uncased_large/bert_model.ckpt"
base_model="${init_dir}/uncased_base/bert_model.ckpt"
for t in squad_v1.1 boolq qqp mnli; do
mkdir -p data/ckpt/separation/${t}
# sbert large init
large_init="data/ckpt/separation/${t}/ebert_large_s0"
gsutil -m cp -r "${base_dir}/ckpt/separation/${t}/ebert_large_s0" data/ckpt/separation/${t}/
python tools/merge_checkpoints.py -c1 "${large_init}" \
-c2 "${large_model}" -o ${init_dir}/${t}_sbert_large.ckpt
gsutil -m cp -r "${init_dir}/${t}_sbert_large.ckpt*" "${base_dir}/ckpt/init"
# sbert large init from ebert_large_s0 all
python tools/merge_checkpoints.py -c1 "${large_init}" -c2 "${large_model}" \
-o ${init_dir}/${t}_sbert_large_all.ckpt -fo
gsutil -m cp -r "${init_dir}/${t}_sbert_large_all.ckpt*" "${base_dir}/ckpt/init"
# sbert large init from ebert_large_s0 upper, e.g. 20
python tools/merge_checkpoints.py -c1 "${large_init}" -c2 "${large_model}" \
-o ${init_dir}/${t}_sbert_large_upper20.ckpt -fo -fou 20
gsutil -m cp -r "${init_dir}/${t}_sbert_large_upper20.ckpt*" "${base_dir}/ckpt/init"
# sbert base init
base_init="data/ckpt/separation/${t}/ebert_s0"
gsutil -m cp -r "${base_dir}/ckpt/separation/${t}/ebert_s0" data/ckpt/separation/${t}/
python tools/merge_checkpoints.py -c1 "${base_init}" -c2 "${base_model}" \
-o ${init_dir}/${t}_sbert_base.ckpt
gsutil -m cp -r "${init_dir}/${t}_sbert_base.ckpt*" "${base_dir}/ckpt/init"
python tools/merge_checkpoints.py -c1 "${base_init}" -c2 "${base_model}" \
-o ${init_dir}/${t}_sbert_base_all.ckpt -fo
gsutil -m cp -r "${init_dir}/${t}_sbert_base.ckpt*" "${base_dir}/ckpt/init"
python tools/merge_checkpoints.py -c1 "${base_init}" -c2 "${base_model}" \
-o ${init_dir}/${t}_sbert_base_upper9.ckpt -fo -fou 9
gsutil -m cp -r "${init_dir}/${t}_sbert_base.ckpt*" "${base_dir}/ckpt/init"
done || exit 1
sbert finetuning:
# squad_v1.1, search 50 params for bert large separated at layer 21
python tools/explore_hp.py -p data/sbert-squad-large.json -n 50 \
-s large -sp 1.4 0.3 0.8 -hp 5e-5,3,32 2>&1 | tee data/sbert-squad-explore-s21.log
./search.sh squad_v1.1 large 21 bert-tpu2
# race search 50
python tools/explore_hp.py -p data/race-sbert-s9.json -n 50 -t race 2>&1 | \
tee data/race-sbert-explore-s9.log
./search.sh race base 9
profile model FLOPs:
for task in race boolq qqp mnli squad_v1.1; do
for size in base large; do
profile_dir=data/log2-${task}-${size}-profile
mkdir -p "${profile_dir}"
if [[ "${task}" == "mnli" ]]; then
cs=1 # cache_segment
else
cs=2
fi
if [[ ${size} == "base" ]] ; then
allowed_layers="9 10" # $(seq 1 1 11)
large_params=""
else
allowed_layers="20 21" #$(seq 1 1 23)
large_params="hidden_size=1024 intermediate_size=4096 num_heads=16 num_hidden_layers=24"
fi
if [[ ${task} == "race" ]] ; then
large_params="num_choices=4 ${large_params}"
fi
# bert
eval "${large_params}" python profile.py -m bert -t ${task} -pm 2>&1 | \
tee ${profile_dir}/bert-profile.log
# ebert
for n in "${(@s/ /)allowed_layers}"; do
eval "${large_params}" sep_layers="${n}" \
python profile.py -m ebert -t ${task} -pm 2>&1 | \
tee ${profile_dir}/ebert-s${n}-profile.log
eval "${large_params}" sep_layers="${n}" \
python profile.py -m ebert -t ${task} -pm -cs ${cs} 2>&1 | \
tee ${profile_dir}/ebert-s${n}-profile-cache.log
done
done
done
benchmarking inference latency:
python profile.py -npf -pt -b 32 2>&1 | tee data/batch-time-bert.log
python profile.py -npf -pt -b 32 -m ebert -cs 2 2>&1 | tee data/batch-time-ebert.log
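To sweep over batch sizes (assuming -b sets the inference batch size, as in the commands above):
for b in 1 8 16 32; do
python profile.py -npf -pt -b ${b} 2>&1 | tee data/batch-time-bert-b${b}.log
python profile.py -npf -pt -b ${b} -m ebert -cs 2 2>&1 | tee data/batch-time-ebert-b${b}.log
done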
analyze bert, ebert, sbert:
python analyze.py -o data/qa-outputs -m bert 2>&1 | tee data/ana-bert.log
python tools/compute_rep_variance.py data/qa-outputs -n 20
python tools/compare_rep.py data/qa-outputs -m sbert
python tools/compare_rep.py data/qa-outputs -m ebert
python infer_qa.py -m bert (add -e for eager mode)
tools/get_dataset_stats.py: get dataset statistics (mainly token lengths)
tools/inspect_checkpoint.py: print variable info in checkpoints (supports monitoring variables during training)
tools/rename_checkpoint_variables.py: rename variable names in a checkpoint (add -dr for a dry run),
e.g. python tools/rename_checkpoint_variables.py "data/ckpt/bert/mnli/" -p "bert_mnli" "mnli" -dr
tools/visualize_model.py: visualize the TensorFlow model structure given an inference graph
redis:
redis-cli -p 60001 lrange queue:params 0 -1
redis-cli -p 60001 lrange queue:results 0 -1
redis-cli -p 60001 lpop queue:params
redis-cli -p 60001 rpush queue:results 89.532
gcloud sdk for TPU access: pip install --upgrade google-api-python-client oauth2client
TPU start: ctpu up --tpu-size=v3-8 --tpu-only --name=bert-tpu --noconf (you can specify the TF version, e.g. --tf-version=1.13)
TPU stop: ctpu pause --tpu-only --name=bert-tpu --noconf
move instances: gcloud compute instances move bert-vm --zone us-central1-b --destination-zone us-central1-a
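After starting a TPU, you can export its name so the scripts pick it up without setting tpu_name each time (cf. the tpu_name note in the training section); a minimal sketch:
export tpu_name=bert-tpu  # the --name passed to ctpu up above
data_dir=<your google cloud storage bucket> use_tpu=yes python train.py -m bert -t boolq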
upload and download:
cd data
# upload
gsutil -m cp -r datasets/qqp/ebert "gs://xxx/datasets/qqp/ebert"
gsutil -m cp -r datasets/qa/ebert "gs://xxx/datasets/qa/ebert"
gsutil -m cp -r datasets/mnli/ebert "gs://xxx/datasets/mnli/ebert"
gsutil -m cp -r "datasets/qa/bert/hotpot-*" "gs://xxx/datasets/qa/bert"
# download
gsutil -m cp -r "gs://xxx/datasets/qqp/ebert" qqp/ebert
cd data/ckpt
# download
gsutil -m cp -r "gs://xxx/ckpt/bert/qa/model.ckpt-8299*" bert/qa/
gsutil -m cp -r "gs://xxx/ckpt/ebert_s9/qa/model.ckpt-8321*" ebert_s9/qa/
gsutil -m cp -r "gs://xxx/ckpt/ebert_s9/mnli/model.ckpt-18407*" ebert_s9/mnli/
gsutil -m cp -r "gs://xxx/ckpt/ebert_s9/qqp/model.ckpt-17055*" ebert_s9/qqp/
# dl <ckpt_subdir> <step>: download the checkpoint files for a given step from GCS
# into the current directory and write a checkpoint file pointing at it
function dl()
{
    num=$2
    for suffix in meta index data-00000-of-00001; do
        gsutil cp gs://xxx/ckpt/$1/model.ckpt-${num}.${suffix} .
    done
    echo model_checkpoint_path: \"model.ckpt-${num}\" > checkpoint
}
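For example, to fetch the BERT-base SQuAD checkpoint at step 8299 (following the same paths as the download commands above):
cd data/ckpt/bert/qa && dl bert/qa 8299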
If you have any questions, please create an issue.
If you find our work useful to your research, please consider using the following citation:
@inproceedings{cao-etal-2020-deformer,
title = "{D}e{F}ormer: Decomposing Pre-trained Transformers for Faster Question Answering",
author = "Cao, Qingqing and
Trivedi, Harsh and
Balasubramanian, Aruna and
Balasubramanian, Niranjan",
booktitle = "Proceedings of the 58th Annual Mdeformering of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.411",
pages = "4487--4497",
}