Guide: Finetune GPT2-XL (1.5 billion parameters) and GPT-NEO (2.7 billion parameters) on a single GPU with Huggingface Transformers using DeepSpeed
Note: The GPT2-XL model runs on any server with a GPU with at least 16 GB VRAM and 60 GB RAM. The GPT-NEO model needs at least 70 GB RAM. If you use your own server instead of the setup described here, you will need to install CUDA and PyTorch on it.
Run gcloud auth login and then gcloud init and follow the steps until you are set up. You can remove the --preemptible flag from the command below, but keeping it reduces your cost to about 1/3 and allows Google to shut down your instance at any point. At the time of writing, this configuration only costs about $1.28 / hour in GCE when using the --preemptible flag. Depending on the size of your dataset, finetuning usually only takes a few hours.
Run this to create the instance:
gcloud compute instances create gpuserver \
--project YOURPROJECTID \
--zone us-west1-b \
--custom-cpu 12 \
--custom-memory 78 \
--maintenance-policy TERMINATE \
--image-family pytorch-1-7-cu110 \
--image-project deeplearning-platform-release \
--boot-disk-size 200GB \
--metadata "install-nvidia-driver=True" \
--accelerator="type=nvidia-tesla-v100,count=1" \
--preemptible
After 5 minutes or so (the server needs to install the Nvidia drivers first), you can connect to your instance with the command below. If you changed the zone, you will also need to change it here.
gcloud compute ssh YOURSDKACCOUNT@gpuserver --zone=us-west1-b
Don't forget to shut down the server once you're done, otherwise you will keep getting billed for it. You can do this from the GCE web console, where you can also restart the server the next time, or from the command line as shown below.
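If you prefer the command line, you can also stop and later restart the instance with gcloud, using the instance name and zone from the command above:
gcloud compute instances stop gpuserver --zone=us-west1-b
gcloud compute instances start gpuserver --zone=us-west1-b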
Run this to download the script and to install all libraries:
git clone https://github.com/Xirider/finetune-gpt2xl.git
chmod -R 777 finetune-gpt2xl/
cd finetune-gpt2xl
pip install -r requirements.txt
(Optional) If you want to use Wandb.ai for experiment tracking, you have to log in:
wandb login
Then add your training data: replace the example train.txt and validation.txt files in the folder with your own training data (keeping the same file names) and then run
python text2csv.py
This converts your .txt files into one-column .csv files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files, because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.
Run this to start the finetuning:
deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config.json \
--model_name_or_path gpt2-xl \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--eval_steps 200 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 8
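The memory savings that make this possible come from the DeepSpeed ZeRO optimizations configured in ds_config.json (included in the repo), which among other things offload the optimizer state to normal CPU RAM; that is why so much RAM is required. As a rough, illustrative sketch of what such a config looks like (not the exact file from the repo, and key names such as the offload setting differ between DeepSpeed versions):
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_bucket_size": 2e8,
    "reduce_bucket_size": 2e8
  }
}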
You can also add the --learning_rate and --warmup_steps flags to improve the finetuning.
You can test your finetuned GPT2-xl model with this script from Huggingface Transformers (included in the folder):
python run_generation.py --model_type=gpt2 --model_name_or_path=finetuned --length 200
Or you can use it now in your own code like this to generate text in batches:
# credit to Niels Rogge - https://github.com/huggingface/transformers/issues/10704
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = GPT2Tokenizer.from_pretrained('finetuned')
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('finetuned').to(device)
print("model loaded")
# this is a single input batch with size 3
texts = ["From off a hill whose concave womb", "Another try", "A third test"]
encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding, max_length=100)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
Finetuning GPT-NEO (2.7 billion parameters) works as well. I tested it with a server with one V100 GPU (16 GB VRAM) and 78 GB of normal RAM, but it might not actually need that much RAM.
Add your training data like you would for GPT2-xl:
Replace the example train.txt and validation.txt files in the folder with your own training data (keeping the same file names) and then run
python text2csv.py
This converts your .txt files into one-column .csv files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files, because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.
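As a rough illustration of what text2csv.py does (a minimal sketch of the idea, not the actual script from the repo), the conversion is essentially:
import csv

# each .txt file becomes a one-column .csv with a "text" header, where the whole
# file is stored as a single record, so the datasets loader keeps line breaks intact
for name in ["train", "validation"]:
    with open(f"{name}.txt", encoding="utf-8") as f:
        text = f.read()
    with open(f"{name}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerow([text])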
If you want to feed the model separate examples instead of one continuous block of text, you need to modify the function group_texts in run_clm.py (see the sketch below).
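For reference, the default group_texts in Huggingface's run_clm.py example roughly works like this: it concatenates all tokenized texts and then cuts the result into fixed-size blocks, which is why the data is treated as one continuous stream (illustrative sketch; block_size is defined elsewhere in the script):
def group_texts(examples):
    # concatenate the token lists of all examples into one long list per column
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the remainder so everything fits into full blocks of block_size tokens
    total_length = (total_length // block_size) * block_size
    # cut the long stream into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # for causal language modeling the labels are simply the inputs
    result["labels"] = result["input_ids"].copy()
    return result
To train on separate examples, one option is to skip the concatenation step and instead pad or truncate each example to block_size on its own.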
Be careful with the encoding of your text. If you don't clean your text files or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.
Be sure to either log in to wandb.ai with wandb login or uninstall wandb completely. Otherwise it might cause a memory error during the run.
Then start the training with this command:
deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-2.7B \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 4 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10
If you run out of memory, try setting --per_device_train_batch_size to 1 and --gradient_accumulation_steps to 8. You can also then try to reduce the values for "allgather_bucket_size" and "reduce_bucket_size" in the ds_config_gptneo.json file to 5e7.
I provided a script that allows you to interactively prompt your GPT-NEO model. If you just want to sample from the pretrained model without finetuning it yourself, replace "finetuned" with "EleutherAI/gpt-neo-2.7B". Start it with this:
python run_generate_neo.py finetuned
Or use this snippet to generate text from your finetuned model within your code:
# credit to Suraj Patil - https://github.com/huggingface/transformers/pull/10848 - modified to create multiple texts and use deepspeed inference
import torch
from transformers import GPTNeoForCausalLM, AutoTokenizer
import deepspeed
# casting to fp16 ("half") gives a large speedup during model loading
model = GPTNeoForCausalLM.from_pretrained("finetuned").half().to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")
# GPT-NEO has no pad token by default, so reuse the eos token and pad on the left for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
# using deepspeed inference is optional: it gives about a 2x speed up
deepspeed.init_inference(model, mp_size=1, dtype=torch.half, replace_method='auto')
texts = ["From off a hill whose concave", "Parallel text 2"]
encoding = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")
gen_tokens = model.generate(
    encoding.input_ids,
    attention_mask=encoding.attention_mask,
    do_sample=True,
    min_length=0,
    max_length=200,
    temperature=1.0,
    top_p=0.8,
    use_cache=True,
)
gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
print(gen_text)
You can change the learning rate, weight decay and warmup by setting them as flags in the training command. The warmup and learning rate settings in the DeepSpeed config are ignored, as the script always uses the Huggingface optimizer/trainer default values. If you want to override them, you need to use flags. You can check all the explanations here:
https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
The rest of the training arguments can be provided as flags and are all listed here:
https://huggingface.co/transformers/master/main_classes/trainer.html#trainingarguments
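For example, appending flags like these to the training command overrides the optimizer defaults (the values here are just placeholders):
--learning_rate 5e-05 --weight_decay 0.01 --warmup_steps 100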