Transformer Abstractive Summarization

Code for the paper "Efficient Adaptation of Pretrained Transformers for Abstractive Summarization"

Requirements

To run the training script in train.py, you will additionally need the packages below (an example install command follows the list):

  • PyTorch (version >=0.4)
  • tqdm
  • pyrouge
  • newsroom
  • tensorflow (the CPU version is fine)
  • nltk
  • spacy (and the 'en' model)
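
A minimal install along these lines should cover the list above; treat it as a sketch rather than a pinned environment (the newsroom package, for example, may need to be installed from its GitHub repository if it is not available on PyPI):

pip install torch tqdm pyrouge newsroom tensorflow nltk spacy
python -m spacy download en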

You can download the weights of the OpenAI pre-trained model by cloning Alec Radford's repo and placing the model folder containing the pre-trained weights in this repo.
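
For example, assuming Alec Radford's repo is openai/finetune-transformer-lm (an assumption, not something this README states), the model folder with the pre-trained weights can be copied into this repo with:

git clone https://github.com/openai/finetune-transformer-lm
cp -r finetune-transformer-lm/model ./model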

Before running this code, you will need to pre-process the datasets with BPE using the scripts provided in the scripts/ directory.

Dataset Preprocessing

The training and evaluation scripts expect three output files in total: train_encoded.jsonl, val_encoded.jsonl, and test_encoded.jsonl.

CNN/Daily Mail

The data and splits used in the paper can be downloaded from OpenNMT. First, remove the start and end sentence tags using the sed command in the link provided. To process the data, run the following command:

python scripts/encode_cnndm.py --src_file {source file} --tgt_file {target file} --out_file {output file}
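
For example, assuming the OpenNMT target files wrap each sentence in <t> and </t> tags and using illustrative file names (train.txt.src and train.txt.tgt are placeholders, not guaranteed names), the training split might be prepared like this:

sed -i 's/ <\/t>//g; s/<t> //g' train.txt.tgt
python scripts/encode_cnndm.py --src_file train.txt.src --tgt_file train.txt.tgt --out_file train_encoded.jsonl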

XSum

The data and splits used in the paper can be scraped using the XSum repo. Run its commands up through the "Extract text from HTML Files" section. To process the data, run the following command:

python scripts/encode_xsum.py --summary_dir {summary directory} --splits_file {split file} --train_file {train file} --val_file {val file} --test_file {test file}
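
A concrete invocation might look like the following; the summary directory and split file names here are illustrative placeholders for whatever the XSum scripts produce on your machine:

python scripts/encode_xsum.py \
  --summary_dir xsum-extracted-summaries \
  --splits_file xsum-train-dev-test-split.json \
  --train_file train_encoded.jsonl \
  --val_file val_encoded.jsonl \
  --test_file test_encoded.jsonl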

Newsroom

The data and splits used in the paper can be downloaded from Newsroom. To process the data, run the following command:

python scripts/encode_newsroom.py --in_file {input split file} --out_file {output file}
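
For example, assuming the released Newsroom splits are named train.jsonl, dev.jsonl, and test.jsonl (illustrative names), each split would be encoded separately:

python scripts/encode_newsroom.py --in_file train.jsonl --out_file train_encoded.jsonl
python scripts/encode_newsroom.py --in_file dev.jsonl --out_file val_encoded.jsonl
python scripts/encode_newsroom.py --in_file test.jsonl --out_file test_encoded.jsonl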

Training

To train a model, run the following command:

python train.py \
  --data_dir {directory containing encoded data} \
  --output_dir {name of folder to save data in} \
  --experiment_name {name of experiment to save data with} \
  --show_progress \
  --doc_model \
  --num_epochs_dat 10 \
  --num_epochs_ft 10 \
  --n_batch 16 \
  --accum_iter 4 \
  --use_pretrain

This trains the pre-trained document embedding model on the dataset for 10 epochs of domain-adaptive training followed by 10 epochs of fine-tuning. The model is trained with an effective batch size of 64, since the actual batch size is 16 and gradients are accumulated over 4 batches. The batch size must be divisible by the number of available GPUs. Training is currently optimized for multi-GPU usage and may not work on single-GPU machines.

Evaluation

To evaluate a model, run the following command:

python evaluate.py \
  --data_file {path to encoded data file} \
  --checkpoint {checkpoint to load model weights from} \
  --beam {beam size to do beam search with} \
  --doc_model \
  --save_file {file to output results to} \
  --n_batch {batch size for evaluation, must be divisible by the number of GPUs}

This evaluates the document embedding model on the test set. Evaluation is currently optimized for multi-GPU usage and may not work on single-GPU machines. Because the evaluation script drops some examples when the number of data points isn't divisible by the number of GPUs, you may need to run the create_small_test.py script to recover the examples that were left out and aggregate the results at the end.
