Unsloth Versions

Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory

April-Llama-3-2024

3 weeks ago

Llama-3 (trained on 15 trillion tokens, GPT-3.5 level) is fully supported! Get 2x faster finetuning and 60% less VRAM usage than HF + FA2!

Colab notebook: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing

Pre-quantized 8b and 70b weights (4x faster downloading) via https://huggingface.co/unsloth
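
A minimal loading sketch, assuming the pre-quantized 4-bit repo name below (check the Hugging Face page for the exact model names):

from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit Llama-3 8b checkpoint; pre-quantized weights
# download ~4x faster since only the 4-bit shards are fetched.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # assumed repo name
    max_seq_length = 2048,
    dtype = None,        # None auto-detects float16 / bfloat16
    load_in_4bit = True,
)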

What's Changed

Full Changelog: https://github.com/unslothai/unsloth/compare/April-2024...April-Llama-3-2024

April-2024

1 month ago

Long Context Window support

You can now 2x your batch size or train on long context windows with Unsloth! 228K context windows on H100s are now possible (4x longer than HF+FA2) with Mistral 7b.

How? We coded up async offloaded gradient checkpointing in about 20 lines of pure @PyTorch, reducing VRAM usage by over 30% with only +1.9% extra overhead. We carefully overlap (mask) data movement between CPU RAM and the GPU. No extra dependencies are needed.
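
Unsloth's actual kernel is not reproduced here, but the idea can be sketched in plain PyTorch: during the forward pass a block's input activations are copied off to CPU, and during the backward pass they are copied back and the block is recomputed (all names below are illustrative):

import torch

class OffloadedCheckpoint(torch.autograd.Function):
    # Sketch only: assumes forward_fn returns a single tensor and runs on CUDA.
    @staticmethod
    def forward(ctx, forward_fn, hidden_states, *args):
        # Copy activations to CPU; a pinned buffer + a side CUDA stream would
        # make this copy overlap with compute (the "async" part).
        saved = hidden_states.to("cpu", non_blocking = True)
        with torch.no_grad():
            output = forward_fn(hidden_states, *args)
        ctx.save_for_backward(saved)
        ctx.forward_fn, ctx.args = forward_fn, args
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (saved,) = ctx.saved_tensors
        hidden_states = saved.to("cuda", non_blocking = True).detach()
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.forward_fn(hidden_states, *ctx.args) # recompute the block
        torch.autograd.backward(output, grad_output)
        return (None, hidden_states.grad) + (None,) * len(ctx.args)

# Usage sketch: output = OffloadedCheckpoint.apply(decoder_layer, hidden_states)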

Try our Colab notebook with Mistral's new long-context v2 7b model and our new VRAM savings.

You can turn it on with use_gradient_checkpointing = "unsloth":

from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
)

The table below shows the maximum possible sequence length (in tokens) with Mistral 7b QLoRA at rank = 32:

GPU        Memory   HF+FA2    Unsloth   Unsloth New
RTX 4060   8 GB     1,696     3,716     7,340
RTX 4070   12 GB    4,797     11,055    19,610
RTX 4080   16 GB    7,898     18,394    31,880
RTX 4090   24 GB    14,099    33,073    56,420
A100       40 GB    26,502    62,431    105,500
A6000      48 GB    32,704    77,110    130,040
H100       80 GB    57,510    135,826   228,199

Self Healing Tokenizers

We can now smartly convert a slow HF tokenizer to a fast one on the fly. We also automatically load the tokenizer and fix some dangling incorrect tokens. What is this useful for?

  1. Broken tokenizers like Starling or CodeLlama can be “self healed” to work. Not healing them can cause unlucky out-of-bounds memory accesses.
  2. No need to manually edit the tokenizer files to support the ChatML format. Unsloth automatically edits the sentencepiece tokenizer.model and other files.
  3. Sometimes model uploaders require you to use the slow tokenizer, because the fast tokenizer (HF’s Rust version) gives wrong results. We try to convert it to a fast variant and confirm that it tokenizes correctly (a sketch of such a check follows this list).
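
A minimal sketch (not Unsloth's internal code) of the fast-vs-slow tokenizer check mentioned above:

from transformers import AutoTokenizer

def fast_matches_slow(model_name, probe_texts):
    # Load both variants and trust the fast (Rust) tokenizer only if it
    # reproduces the slow tokenizer's ids on every probe string.
    slow = AutoTokenizer.from_pretrained(model_name, use_fast = False)
    fast = AutoTokenizer.from_pretrained(model_name, use_fast = True)
    return all(slow(t).input_ids == fast(t).input_ids for t in probe_texts)

# Usage: fall back to the slow tokenizer if this returns False.
# ok = fast_matches_slow("mistralai/Mistral-7B-v0.1", ["Hello!", "2+2=4", "café"])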

28% Faster RoPE Embeddings

@HuyNguyen-hust managed to make Unsloth's RoPE embeddings around 28% faster! This is primarily useful for long context windows. Per the torch profiler, Unsloth's original kernel already kept RoPE under 2% of total runtime, so you will see maybe 0.5 to 1% speedups, especially for large training runs. Any speedup is vastly welcome! See #238 for more details.
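
For reference, the operation the kernel accelerates is the standard rotary embedding applied to the query and key projections; a plain PyTorch version (shapes are assumptions, not Unsloth's Triton kernel) looks like this:

import torch

def rotate_half(x):
    # Split the head dimension in half and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def apply_rope(q, k, cos, sin):
    # q, k: (batch, heads, seq_len, head_dim); cos, sin: (seq_len, head_dim)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin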

Bug Fixes

  • Gemma would not convert to GGUF correctly due to tied weights. Now fixed.
  • Merging to 16bit on Kaggle used to break since Kaggle only provides 20GB of disk space - we now smartly delete the 4GB model.safetensors file, allowing you to merge to 16bit.
  • Inference is finally fixed for batched generation. We had accidentally not accounted for the attention mask and position ids (see the batched-generation sketch after this list). Reminder: inference is 2x faster natively!
  • Finetuning on lm_head and embed_tokens now works correctly! See https://github.com/unslothai/unsloth/wiki#finetuning-the-lm_head-and-embed_tokens-matrices. Remember to set modules_to_save.
  • @oKatanaaa via #305 noticed you must downgrade protobuf to <4.0.0. We edited the pyproject.toml to make it work.
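
Not Unsloth-specific, but as a reminder of why the attention mask matters: batched generation should left-pad the prompts and pass the mask explicitly. A sketch, assuming a model and tokenizer are already loaded (prompts are placeholders):

# Left padding keeps each prompt flush against its generated continuation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(
    ["Hello there!", "The capital of France is"], # placeholder prompts
    return_tensors = "pt", padding = True,
).to("cuda")

outputs = model.generate(
    input_ids = batch.input_ids,
    attention_mask = batch.attention_mask, # mask out the left padding
    max_new_tokens = 32,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))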

As always, Colab and Kaggle do not need updating. On local machines, please use pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git to update Unsloth with no dependency changes.

February-Gemma-2024

2 months ago

You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM. Blog post: https://unsloth.ai/blog/gemma. On local machines, update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

On 1x A100 80GB GPU, Unsloth can fit 40K total tokens (8192 * bsz of 5), whilst FA2 can fit ~15K tokens and vanilla HF can fit 9K tokens.

Gemma 7b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing

Gemma 2b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing

To use Gemma, simply use FastLanguageModel:

from unsloth import FastLanguageModel

# Load the Gemma model in 4 bits
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None, # None auto-detects float16 / bfloat16
    load_in_4bit = True,
)
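
The loaded Gemma model can then be wrapped with LoRA adapters exactly as for Llama or Mistral; a sketch with illustrative hyperparameters:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    use_gradient_checkpointing = True,
    random_state = 3407,
)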

February-2024

2 months ago

Update Unsloth on local machines with no dependency updates with pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

2.43x faster Gemma support

Read our blog post for more info: https://unsloth.ai/blog/gemma. You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM. On 1x A100 80GB GPU, Unsloth can fit 40K total tokens (8192 * bsz of 5), whilst FA2 can fit ~15K tokens and vanilla HF can fit 9K tokens.

2x Faster Inference

Unsloth natively supports 2x faster inference. All QLoRA, LoRA and non-LoRA inference paths are 2x faster. This requires no code changes or any new dependencies.

from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(["Write a haiku about llamas."], return_tensors = "pt").to("cuda") # example prompt
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

Chat Templates

Assuming your dataset is a list of lists of dictionaries like the one below:

[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt', 'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt', 'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]

You can use our get_chat_template to format it. Set chat_template to any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old or unsloth, and use mapping to map the dictionary keys (from, value, etc.). map_eos_token allows you to map <|im_end|> to the EOS token without any training.

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

You can also make your own custom chat templates! For example, the internal chat template we use is below. You must pass in a tuple of (custom_template, eos_token), and the eos_token must be used inside the template.

unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

And many bug fixes!

January-2024

3 months ago

Upgrade Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git. No dependency updates will be done.

  1. 6x faster GGUF conversion and QLoRA to float16 merging support
# To merge to 16bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
# To merge to 4bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_4bit")
# To save to GGUF:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
# All methods supported (listed below)

To push to HF:

model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
  2. 4x faster model downloading + >= 500MB less GPU fragmentation with pre-quantized models:
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
  3. packing = True support, making training 5x faster via TRL (a minimal trainer sketch follows this list).
  4. DPO support! 188% faster DPO training + no OOMs!
  5. Dropout and bias LoRA support.
  6. RSLoRA (rank-stabilized LoRA) and LoftQ support.
  7. LLaMA-Factory support as a UI - https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison
  8. Tonnes of bug fixes.
  9. And if you can - please support our work via Ko-fi! https://ko-fi.com/unsloth
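
For the packing support in item 3, a minimal trainer sketch assuming a model, tokenizer and a dataset with a "text" column prepared as shown elsewhere in these notes (TRL's SFTTrainer; hyperparameters are illustrative):

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    packing = True, # concatenate short examples into full-length sequences
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
    ),
)
trainer.train()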

GGUF:

Choose `quantization_method` to be one of:
"not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
"f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k"    : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s"  : "Uses Q3_K for all tensors",
"q4_0"    : "Original quant method, 4-bit.",
"q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s"  : "Uses Q4_K for all tensors",
"q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
"q5_1"    : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s"  : "Uses Q5_K for all tensors",
"q6_k"    : "Uses Q8_K for all tensors",

December-2023

4 months ago

  1. Preliminary Mistral support (4K context). Solves #2.
  2. FINAL Mistral support (Sliding Window Attention). Solves #2.
  3. Solves #10.
  4. Preliminary fix for #8 and #6: now supports Yi, TinyLlama and all models with Grouped Query Attention.
  5. FINAL GQA support - allow the Flash Attention v2 install path.
  6. Solves #5.
  7. Solves #7, which supports larger vocab sizes over 2^15 but below 2^16.
  8. Updated the README.
  9. Preliminary DPO support, by example from https://github.com/152334H.
  10. WSL (Windows) support confirmed by https://github.com/RandomInternetPreson.

Use Mistral as follows:

pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
from unsloth import FastMistralModel
import torch

model, tokenizer = FastMistralModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastMistralModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

See https://unsloth.ai/blog/mistral-benchmark for full benchmarks and more details.