🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
A patch release to resolve import errors from removed custom types in generation utils
Qwen2 is the new model series of large language models from the Qwen team. It follows the earlier Qwen series releases, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.
Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.
* `phi-2` example by @susnato in #28392
* Fix `softmax_scale` in `PhiFlashAttention2` by @gugarosa in #28537

The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.
The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
The FastSpeech2Conformer model was proposed with the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis, which builds upon FastSpeech, showing improvements in training speed, inference speed and voice quality. It consists of a variance adapter (with duration, energy and pitch predictors) and a waveform and mel-spectrogram decoder.
The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.
Enables saving and loading transformers models in 4-bit formats - you can now push bitsandbytes 4-bit weights on the Hugging Face Hub. To save 4-bit models and push them to the Hub, simply install the latest `bitsandbytes` package from PyPI (`pip install -U bitsandbytes`), load your model in 4-bit precision and call `save_pretrained` / `push_to_hub`. An example repo is available here.
from transformers import AutoModelForCausalLM
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
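Anyone can then reload the quantized weights straight from the Hub; a minimal sketch reusing the repository pushed above:

```python
from transformers import AutoModelForCausalLM

# the quantization config is stored with the checkpoint, so no extra
# flags are needed (bitsandbytes must be installed)
model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-125m-bnb-4bit")
```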
* [`Docs`] Add 4-bit serialization docs by @younesbelkada in #28182

Enable passing in 4D attention masks to models that support it. This is useful for reducing the memory footprint of certain generation tasks.

* 4D `attention_mask` support by @poedator in #27539

Ability to customise which modules are quantized and which are not:

* [`Awq`] Enable the possibility to skip quantization for some target modules by @younesbelkada in #27950
* Add `modules_in_block_to_quantize` arg in GPTQconfig by @SunMarc in #27956

Added fused modules support:

* [`Awq`] Add llava fused modules support by @younesbelkada in #28239
* [`Mixtral` / `Awq`] Add mixtral fused modules for Awq by @younesbelkada in #28240

SDPA support was extended to more architectures:

* [`Llava` / `Vip-Llava`] Add SDPA into llava by @younesbelkada in #28107
* [`Mixtral` & `Mistral`] Add support for sdpa by @ArthurZucker in #28133

All decoding strategies (temperature fallback, compression/log-prob/no-speech threshold, ...) of OpenAI's long-form transcription (see: https://github.com/openai/whisper or section 4.5 in the paper) have been added. Contrary to https://github.com/openai/whisper, Transformers long-form transcription is fully compatible with pure FP16 and batching!
For more information see: https://github.com/huggingface/transformers/pull/27658.
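A rough sketch of the new long-form API (the generation arguments follow the PR; `raw_audio` is a placeholder for a long, >30s, 16 kHz mono waveform):

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", torch_dtype=torch.float16).to("cuda")

# `raw_audio` is a placeholder: a numpy array with more than 30 seconds of 16 kHz audio
inputs = processor(raw_audio, sampling_rate=16_000, return_tensors="pt", truncation=False)
inputs = inputs.to("cuda", torch.float16)

generated = model.generate(
    **inputs,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature fallback
    compression_ratio_threshold=1.35,            # compression-ratio threshold
    logprob_threshold=-1.0,                      # average log-probability threshold
    no_speech_threshold=0.6,                     # no-speech probability threshold
    return_timestamps=True,
)
transcription = processor.batch_decode(generated, skip_special_tokens=True)
```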
Assisted generation was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate ngram speculation, and opens the door for new candidate generation methods. Additionally, we've added the speculative decoding strategy on top of assisted generation: when you call assisted generation with an assistant model and do_sample=True
, you'll benefit from the faster speculative decoding sampling 🏎️💨
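A minimal sketch (the checkpoints are illustrative; any assistant sharing the main model's tokenizer works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # small, fast drafter

inputs = tokenizer("Speculative decoding lets us", return_tensors="pt")
# with an assistant model and do_sample=True, generate uses the speculative decoding sampling scheme
outputs = model.generate(**inputs, assistant_model=assistant, do_sample=True, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```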
* `assisted_decoding` now accepts arbitrary candidate generators by @gante in #27751
* assisted decoding now uses `generate` for the assistant by @gante in #28031

Adding pickle protection via `weights_only=True` in the `torch.load` calls.
Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper build()
methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues, as well as significantly accelerating model load times.
The last version to support PyTorch 1.10 was 4.36.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.11 and up, we do not support PyTorch 1.10 for v4.37 (i.e. we don't run the tests against torch 1.10).
You can now add custom tags into your model before pushing it to the Hub! This enables you to filter models that contain that tag on the Hub with a simple URL filter. For example, if you want to filter models that have the `trl` tag you can search: https://huggingface.co/models?other=trl&sort=created

* [`core` / `FEAT`] Add the possibility to push custom tags using `PreTrainedModel` itself by @younesbelkada in #28405

e.g.:

from transformers import AutoModelForCausalLM
model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)
model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")
* [`Mixtral`] Change mistral op order by @younesbelkada in #27955
* [`Tokenizer Serialization`] Fix the broken serialisation by @ArthurZucker in #27099
* [`Whisper`] raise better errors by @ArthurZucker in #27971
* [`CI slow`] Fix expected values by @ArthurZucker in #27999
* [`SeamlessM4TTokenizer`] Safe import by @ArthurZucker in #28026
* [`core` / `modeling`] Fix training bug with PEFT + GC by @younesbelkada in #28031
* `test_retain_grad_hidden_states_attentions` is flaky by @gante in #28035
* [`FA-2`] Fix fa-2 issue when passing `config` to `from_pretrained` by @younesbelkada in #28043
* [`Modeling` / `Mixtral`] Fix GC + PEFT issues with Mixtral by @younesbelkada in #28061
* [`Mixtral`] update conversion script to reflect new changes by @younesbelkada in #28068
* `test_retain_grad_hidden_states_attentions` by @ylacombe in #28060
* Fix "`low_cpu_mem_usage` Flag Conflict with DeepSpeed Zero 3 in `from_pretrained` for Models with `keep_in_fp32_modules`" by @kotarotanahashi in #27762
* `DISABLE_TELEMETRY` is used by @Wauplin in #28113
* [`Mixtral`] Fix loss + nits by @ArthurZucker in #28115
* `CLIPConfig` by @ydshieh in #28108
* `input_embeds` docstring in encoder-decoder architectures by @gante in #28168
* `docs/source/en/perf_infer_gpu_one.md` by @ydshieh in #28198
* `training_args.py` fix missing import with accelerate with version `accelerate==0.20.1` by @michaelfeil in #28171
* `feature_extractor_type` when loading an image processor file by @ydshieh in #28195
* [`Llava`] Fix llava index errors by @younesbelkada in #28032
* `from_pretrained` under ZeRO-3 by @XuehaiPan in #28245
* `_merge_input_ids_with_image_features` for llava model by @VictorSanh in #28333
* `DeepSpeed` when using auto find batch size by @muellerzr in #28088
* `cache_dir` for `evaluate.load()` in example scripts by @aphedges in #28422
* `TFTrainer` by @gante in #28483
* [`chore`] Update warning text, a word was missing by @tomaarsen in #28017
* `finetuned_from` if it is a local path by @ydshieh in #28482
* `task` arg in `load_dataset` in image-classification example by @regisss in #28408
* [`TokenizationUtils`] Fix `add_special_tokens` when the token is already there by @ArthurZucker in #28520
* [`TokenizationRoformerFast`] Fix the save and loading by @ArthurZucker in #28527
* [`SpeechT5Tokenization`] Add copied from and fix the `convert_tokens_to_string` to match the fast decoding scheme by @ArthurZucker in #28522
* `Processor` by @ydshieh in #27761
* `weights_only` only if torch >= 1.13 by @ydshieh in #28506
* [`Core Tokenization`] Support a fix for spm fast models by @ArthurZucker in #26678
* `LoggingLevel` context manager in 3 tests by @ydshieh in #28575
* `processor_config.json` if a processor has no extra attribute by @ydshieh in #28584

The following contributors have made significant changes to the library over the last release:
* @poedator: 4D `attention_mask` support (#27539)

Patch release to resolve some critical issues relating to the recent cache refactor, flash attention refactor and training in the multi-gpu and multi-node settings:

* Fix fa-2 issue when passing `config` to `from_pretrained` with FA (#28043)

A patch release for critical torch issues mostly:
Mixtral is the new open-source model from Mistral AI announced by the blogpost Mixtral of Experts. The model has been shown to have capabilities comparable to ChatGPT according to the benchmark results shared in the release blogpost.

The architecture is a sparse Mixture of Experts with a Top-2 routing strategy, similar to the `NllbMoe` architecture in transformers. You can use it through the `AutoModelForCausalLM` interface:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B", torch_dtype=torch.float16, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B")

>>> prompt = "My favourite condiment is"

>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
The model is compatible with existing optimisation tools such as Flash Attention 2, `bitsandbytes` and the PEFT library. The checkpoints are released under the `mistralai` organisation on the Hugging Face Hub.
Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat / instructions.
The Llava model was proposed in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.
* [`Llava`] Add Llava to transformers by @younesbelkada in #27662

The integration also includes `BakLlava`, which is a Llava model trained with a Mistral backbone.

The model is compatible with the `"image-to-text"` pipeline:
from transformers import pipeline
from PIL import Image
import requests
model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
And you can find all Llava weights under the `llava-hf` organisation on the Hub.
SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version and was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.
SeamlessM4T enables multiple tasks without relying on separate models:

* Speech-to-speech translation (S2ST)
* Speech-to-text translation (S2TT)
* Text-to-speech translation (T2ST)
* Text-to-text translation (T2TT)
* Automatic speech recognition (ASR)
The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.
At a high level, the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer that then outputs the prediction length forecast via an appropriate head. The model is illustrated in the following figure:
The PatchTSMixer model was proposed in TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.
PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer’s capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression.
The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in Better speech synthesis through scaling by James Betker.
The Phi-1 model was proposed in Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.
The Phi-1.5 model was proposed in Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
The text-visual prompting (TVP) framework was proposed in the paper Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP involves integrating specially designed patterns, known as ‘prompts’, into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model’s ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. The use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently.
Depth estimation is added to the DINO v2 implementation.
AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.
`scaled_dot_product_attention` native support

PyTorch's `torch.nn.functional.scaled_dot_product_attention` operator is now supported in the most-used Transformers models and is used by default when `torch>=2.1.1`, allowing dispatch to memory-efficient attention and Flash Attention backend implementations with no package other than `torch` required. This should significantly speed up attention computation on hardware that supports these fastpaths.
While Transformers automatically handles the dispatch to use SDPA when available, it is possible to force the usage of a given attention implementation ("eager"
being the manual implementation, where each operation is implemented step by step):
# or attn_implementation="sdpa", or attn_implementation="flash_attention_2"
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", attn_implementation="eager")
Training benchmark, run on A100-SXM4-80GB.

| Model | Batch size | Sequence length | Time per batch ("eager", s) | Time per batch ("sdpa", s) | Speedup | Peak memory ("eager", MB) | Peak memory ("sdpa", MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|
| llama2 7b | 4 | 1024 | 1.065 | 0.90 | 19.4% | 73878.28 | 45977.81 | 60.7% |
| llama2 7b | 4 | 2048 | OOM | 1.87 | / | OOM | 78394.58 | SDPA does not OOM |
| llama2 7b | 1 | 2048 | 0.64 | 0.48 | 32.0% | 55557.01 | 29795.63 | 86.4% |
| llama2 7b | 1 | 3072 | OOM | 0.75 | / | OOM | 37916.08 | SDPA does not OOM |
| llama2 7b | 1 | 4096 | OOM | 1.03 | / | OOM | 46028.14 | SDPA does not OOM |
| llama2 7b | 2 | 4096 | OOM | 2.05 | / | OOM | 78428.14 | SDPA does not OOM |
Inference benchmark, run on A100-SXM4-80GB.

| Model | Batch size | Prompt length | Num new tokens | Per token latency "eager" (ms) | Per token latency "sdpa" (ms) | Speedup |
|---|---|---|---|---|---|---|
| llama2 13b | 1 | 1024 | 1 (prefill) | 178.66 | 159.36 | 12.11% |
| llama2 13b | 1 | 100 | 100 | 40.35 | 37.62 | 7.28% |
| llama2 13b | 8 | 100 | 100 | 40.55 | 38.06 | 6.53% |
| Whisper v3 large | 1 | / | 62 | 20.05 | 18.90 | 6.10% |
| Whisper v3 large | 8 | / | 77 | 25.42 | 24.77 | 2.59% |
| Whisper v3 large | 16 | / | 77 | 28.51 | 26.32 | 8.34% |
We are rolling out a new abstraction for the `past_key_values` cache, which enables the use of different types of caches. For now, only `llama` and llama-inspired architectures (`mistral`, `persimmon`, `phi`) support it, with other architectures scheduled to have support in the next release. By default, a growing cache (`DynamicCache`) is used, which preserves the existing behavior.
This release also includes a new `SinkCache` cache, which implements the Attention Sinks paper. With `SinkCache`, the model is able to continue generating high-quality text well beyond its training sequence length! Note that it does not expand the context window, so it can’t digest very long inputs — it is suited for streaming applications such as multi-round dialogues. Check this colab for an example.
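A minimal sketch of generating with the new cache classes (the checkpoint is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative llama-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The attention sinks trick works by", return_tensors="pt").to(model.device)
# keep the first 4 "sink" tokens plus a sliding window of the most recent tokens
past_key_values = SinkCache(window_length=1024, num_sink_tokens=4)
outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=64)
```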
* `Cache` abstraction and Attention Sinks support by @tomaarsen in #26681

We continue toggling features enabling safetensors as a default across the board, in PyTorch, Flax, and TensorFlow.
When using a PyTorch model and forcing the load of `safetensors` files with `use_safetensors=True`, if the repository does not contain the safetensors files, they will now be converted on the fly server-side.

* `from_pt` flag when loading with `safetensors` by @LysandreJik in #27394

We now disallow the use of `pickle.load` internally for security purposes. To circumvent this, you can set the environment variable `TRUST_REMOTE_CODE=True` to indicate that you would still like to load it.

* Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` by @ydshieh in #27776

In the previous implementation of beam search, when `length_penalty` is active, the beam score for decoder-only models was penalized by the total length of both prompt and generated sequence. However, the length of the prompt should not be included in the penalization step -- this release fixes it.
* Make `AttentionMaskConverter` compatible with `torch.compile(..., fullgraph=True)` by @fxmarty in #27868
* [`PEFT` / `Tests`] Fix peft integration failing tests by @younesbelkada in #27258
* [`Docs` / `SAM`] Reflect correct changes to run inference without OOM by @younesbelkada in #27268
* [`FA2`] Add flash attention for `DistilBert` by @susnato in #26489
* [`PretrainedTokenizer`] add some of the most important functions to the doc by @ArthurZucker in #27313
* Fix `Kosmos2Processor` batch mode by @ydshieh in #27323
* [`FA2`] Add flash attention for `GPT-Neo` by @susnato in #26486
* [`Whisper`] Add conversion script for the tokenizer by @ArthurZucker in #27338
* `gpt_bigcode` by @susnato in #27348
* [`Whisper`] Nit converting the tokenizer by @ArthurZucker in #27349
* Fix `Kosmos-2` device issue by @ydshieh in #27346
* `from_pt=True` by @ydshieh in #27372
* `torch.range` in `test_modeling_ibert.py` by @kit1980 in #27355
* [`CodeLlamaTokenizer`] Nit, update init to make sure the AddedTokens are not normalized because they are special by @ArthurZucker in #27359
* `pyproject.toml` by @ydshieh in #27366
* `pytest.mark` directly by @ydshieh in #27390
* `FuyuConfig` by @ydshieh in #27399
* `Owlv2` checkpoint name and a default value in `Owlv2VisionConfig` by @ydshieh in #27402
* Run tests if `circleci/create_circleci_config.py` is modified by @ydshieh in #27413
* [`Quantization`] Add str to enum conversion for AWQ by @younesbelkada in #27320
* [`AttentionMaskConverter`] Fix-mask-inf by @ArthurZucker in #27114
* Make `examples_torch_job` faster by @ydshieh in #27437
* `utils/not_doctested.txt` by @ydshieh in #27459
* [`Llama + Mistral`] Add attention dropout by @ArthurZucker in #27315
* `gradient_checkpointing_kwargs` by @tomaszcichy98 in #27470
* Install `python-Levenshtein` for `nougat` in CI image by @ydshieh in #27465
* [`AWQ`] Addresses TODO for awq tests by @younesbelkada in #27467
* [`Peft`] `modules_to_save` support for peft integration by @younesbelkada in #27466
* [`CI-test_torch`] skip `test_tf_from_pt_safetensors` for 4 models by @ArthurZucker in #27481
* `ExponentialDecayLengthPenalty` doctest by @gante in #27485
* `GenerationConfig.from_pretrained` can return unused kwargs by @gante in #27488
* [`CI-test_torch`] skip test_tf_from_pt_safetensors and `test_assisted_decoding_sample` by @ArthurZucker in #27508
* [`CircleCI`] skip test_assisted_decoding_sample for everyone by @ArthurZucker in #27511
* [`tokenizers`] update `tokenizers` version pin by @ArthurZucker in #27494
* [`PretrainedConfig`] Improve messaging by @ArthurZucker in #27438
* Translated `en/model_doc` docs to Japanese by @Yuki-Imajuku in #27401
* [`pytest`] Avoid flash attn test marker warning by @ArthurZucker in #27509
* Use `usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in #27483
* Skip `latest-pytorch-amd` for now by @ydshieh in #27541
* [`Styling`] stylify using ruff by @ArthurZucker in #27144
* Add `convert_hf_to_openai.py` script to Whisper documentation resources by @zuazo in #27590
* [`FA-2`] Add fa2 support for `from_config` by @younesbelkada in #26914
* `large-v3` version support by @flyingleafe in #27336
* [`core` / `gradient_checkpointing`] add support for old GC method by @younesbelkada in #27610
* `past_key_values` in `generate` by @gante in #27612
* `init_git_repo` by @statelesshz in #27617
* `use_cache=True` in Flash Attention tests by @fxmarty in #27635
* `resize_token_embeddings` by @czy-orange in #26861
* [`dependency`] update pillow pins by @ArthurZucker in #27409
* `max_steps` documentation regarding the end-of-training condition by @qgallouedec in #27624
* [`FA2`] Add flash attention for opt by @susnato in #26414
* `save_only_model` arg and simplifying FSDP integration by @pacman100 in #27652
* Deprecate `TransfoXL` by @ydshieh in #27607
* [`DocString`] Support a revision in the docstring `add_code_sample_docstrings` to facilitate integrations by @ArthurZucker in #27645
* `TVPModelTest` by @ydshieh in #27695
* Translated `en/model_doc` to JP by @rajveer43 in #27264
* `TransfoXLTokenizer.__init__` by @ydshieh in #27721
* Run tests if `tests/utils/tiny_model_summary.json` is modified by @ydshieh in #27693
* `~transformer.` -> `~transformers.` by @tomaarsen in #27740
* `check_runner_status.yml` by @ydshieh in #27767
* `GenerationConfig` throws an exception when `generate` args are passed by @gante in #27757
* Add `persistent_workers` parameter to `TrainingArguments` by @Sorrow321 in #27189
* [`ModelOnTheFlyConversionTester`] Mark as slow for now by @ArthurZucker in #27823
* `TvpModelIntegrationTests` by @ydshieh in #27792
* `Owlv2ModelIntegrationTest::test_inference_object_detection` by @ydshieh in #27793
* Translated `en/tasks` folder docs to Japanese 🇯🇵 by @rajveer43 in #27098
* `ruff==0.1.5` by @ydshieh in #27849
* [`ClipVision`] `accelerate` support for clip-vision by @younesbelkada in #27851
* `VitDetModelTester.get_config` to use `pretrain_image_size` by @ydshieh in #27831
* [`Docs`] Update broken image on fused modules by @younesbelkada in #27856
* `_keep_in_fp32_modules` being modified by @ydshieh in #27867
* [`Flash Attention 2`] Add flash attention 2 for GPT-Neo-X by @younesbelkada in #26463
* Translated model docs (`blip` to `clap`) 🇯🇵 by @rajveer43 in #27673
* [`FA-2`] Add Flash Attention to `Phi` by @susnato in #27661
* `# Ignore copy` by @ydshieh in #27328
* `create_model_card` to properly save peft details when using Trainer with PEFT by @pacman100 in #27754
* `get_default_device` to v4.38 by @statelesshz in #27848
* Translated `model_doc` files from `clip` to `cpm` to JP by @rajveer43 in #27774
* `prefix_allowed_tokens_fn` return empty set of tokens by @Saibo-creator in #27797
* Mark `test_initialization` as flaky in 2 model tests by @ydshieh in #27906
* `notification_service.py` by @ydshieh in #27903
* `FillMaskPipelineTests` by @ydshieh in #27889
* `resume_from_checkpoint` to handle `auto_find_batch_size` by @muellerzr in #27568

The following contributors have made significant changes to the library over the last release:
* @susnato: [`FA2`] Add flash attention for `DistilBert` (#26489), [`FA2`] Add flash attention for `GPT-Neo` (#26486), `gpt_bigcode` (#27348), [`FA2`] Add flash attention for opt (#26414), [`FA-2`] Add Flash Attention to `Phi` (#27661)
* @Yuki-Imajuku: Translated `en/model_doc` docs to Japanese (#27401)
* @rajveer43: Translated `en/model_doc` to JP (#27264), Translated `en/tasks` folder docs to Japanese 🇯🇵 (#27098), Translated model docs (`blip` to `clap`) 🇯🇵 (#27673), Translated `model_doc` files from `clip` to `cpm` to JP (#27774)

A patch release was made for the following commit:

* [`tokenizers`] update tokenizers version pin (#27494)

to fix all the issues with versioning regarding `tokenizers` and `huggingface_hub`.

A patch release was made for the following three commits:
Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution data. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.
Distil-Whisper copies the entire encoder from Whisper, meaning it retains Whisper's robustness to different audio conditions. It only copies 2 decoder layers, which significantly reduces the time taken to auto-regressively generate text tokens:
Distil-Whisper is MIT licensed and directly available in the Transformers library with chunked long-form inference, Flash Attention 2 support, and Speculative Decoding. For details on using the model, refer to the following instructions.
Joint work from @sanchit-gandhi, @patrickvonplaten and @srush.
The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
Joint work from @molbap, @pcuenca, @amyeroberts, @ArthurZucker
The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T enables multiple tasks without relying on separate models:

* Speech-to-speech translation (S2ST)
* Speech-to-text translation (S2TT)
* Text-to-speech translation (T2ST)
* Text-to-text translation (T2TT)
* Automatic speech recognition (ASR)
SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.
The KOSMOS-2 model was proposed in Kosmos-2: Grounding Multimodal Large Language Models to the World by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale dataset of grounded image-text pairs GRIT. The spatial coordinates of the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective entity text spans (for example, a snowman followed by <patch_index_0044><patch_index_0863>). The data format is similar to “hyperlinks” that connect the object regions in an image to their text span in the corresponding caption.
* Add `Kosmos-2` model by @ydshieh in #24709

OWLv2 was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up OWL-ViT using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection.
🚨🚨🚨 Saving to `safetensors` rather than `torch` serialization 🚨🚨🚨

Version v4.35.0 now uses `safetensors` serialization by default. This is a significant change targeted at making users of the Hugging Face Hub, `transformers`, and any downstream library leveraging it safer.
The safetensors
library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).
It was already the default loading mechanism since v4.30.0 and would therefore already default to loading model.safetensors
files instead of pytorch_model.bin
if these were present in the repository.
With v4.35.0, any call to save_pretrained
for torch models will now save a safetensors
file. This safetensors
file is in the PyTorch format, but can be loaded in TensorFlow and Flax models alike.
⚠️ If you run into any issues with this, please let us know ASAP in the issues so that we may help you. Namely, the following errors may indicate something is up:

* Loading a `safetensors` file and having a warning mentioning missing weights unexpectedly
* Obtaining completely wrong/random results at inference after loading a pretrained model with `safetensors`

If you wish to continue saving files in the `.bin` format, you can do so by specifying `safe_serialization=False` in all your `save_pretrained` calls.
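For instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.save_pretrained("./checkpoint")                            # now writes model.safetensors
model.save_pretrained("./checkpoint", safe_serialization=False)  # writes pytorch_model.bin as before
```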
Chat templates have been expanded with the addition of the `add_generation_prompt` argument to `apply_chat_template()`. This has also enabled us to rework the `ConversationalPipeline` class to use chat templates. Any model with a chat template is now automatically usable through `ConversationalPipeline`.
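A minimal sketch with a chat-template-enabled checkpoint (the model id is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What do chat templates do?"},
]
# add_generation_prompt appends the tokens that cue the model to start its reply
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```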
Two new guides on LLMs were added to the library:
Exllama-v2 provides better GPTQ kernels for higher throughput and lower latency for GPTQ models. The original code can be found here.

You will need the latest versions of `optimum` and `auto-gptq`. Read more about the integration here.
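A sketch of opting into the exllamav2 kernels through `GPTQConfig`, following the documented integration (the checkpoint is illustrative):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # an already-quantized GPTQ checkpoint
    device_map="auto",
    quantization_config=gptq_config,
)
```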
AWQ is a new and popular quantization scheme, already used in various libraries such as TGI, vllm, etc. and known to be faster than GPTQ models according to some benchmarks. The original code can be found here, and here you can read more about the original paper.
We support AWQ inference with the original kernels as well as kernels provided through the `autoawq` package, which you can simply install with `pip install autoawq`.
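Pre-quantized AWQ checkpoints then load directly through `from_pretrained`; a minimal sketch (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"  # illustrative AWQ checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
```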
* [`core` / `Quantization`] AWQ integration by @younesbelkada in #27045

We also provide an example script on how to push quantized weights to the Hub in the original repository. Read more about the benchmarks and the integration here.
You can now run GPTQ models on CPU using the latest version of auto-gptq
thanks to @vivekkhandelwal1 !
We refactored the attention mask logic for major models in transformers. For instance, we removed the `padding_mask` argument, which was ambiguous for some users.

* Remove `padding_mask` and instead use a 2D->4D Attn Mask Mapper by @patrickvonplaten in #26792

`Gpt-bigcode` (starcoder), whisper, Bart and MBart now support FA-2! Use it by simply passing `use_flash_attention_2=True` to `from_pretrained`. Some bugfixes with respect to mixed precision training with FA2 have also been addressed.
* Add flash attention for `gpt_bigcode` by @susnato in #26479
* [`FA2`] Fix flash attention 2 fine-tuning with Falcon by @younesbelkada in #26852

A bugfix with respect to fine-tuning with FA-2 in bfloat16 was addressed. You should now smoothly fine-tune FA-2 models in bfloat16 using quantized base models.
* [`Quantization`] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
* [`FA-2`] Final fix for FA2 dtype by @younesbelkada in #26846

NEFTune is a new technique to boost Supervised Fine-tuning performance by adding random noise on the embedding vector. Read more about it in the original paper here.
We propose a very simple API for users to benefit from this technique: simply pass a valid `neftune_noise_alpha` parameter to `TrainingArguments`. Read more about the API here.
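For example:

```python
from transformers import TrainingArguments

# neftune_noise_alpha scales the random noise added to the embedding vectors during training
args = TrainingArguments(output_dir="out", neftune_noise_alpha=5.0)
```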
We have refactored the gradient checkpointing API so that users can pass keyword arguments supported by `torch.utils.checkpoint.checkpoint` directly through `gradient_checkpointing_kwargs` when calling `gradient_checkpointing_enable()`, e.g.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
`gradient_checkpointing_kwargs` is also supported with `Trainer` through `TrainingArguments`.
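For example:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # forwarded to torch.utils.checkpoint.checkpoint
)
```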
* [`Trainer` / `GC`] Add `gradient_checkpointing_kwargs` in trainer and training arguments by @younesbelkada in #27068
* [`core`] Refactor of `gradient_checkpointing` by @younesbelkada in #27020
* [`core` / `GC` / `tests`] Stronger GC tests by @younesbelkada in #27124

The refactor should be totally backward compatible with previous behaviour. For superusers, you can still use the attribute `gradient_checkpointing` on a model's submodules to control the activation / deactivation of gradient checkpointing.
* [`Quantization`] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
* [`Nougat`] from transformers import * by @ArthurZucker in #26562
* Translated `semantic_segmentation.md` to Korean by @jungnerd in #26515
* [`NougatProcessor`] Fix the default channel by @ArthurZucker in #26608
* [`GPTNeoX`] Faster rotary embedding for GPTNeoX (based on llama changes) by @ArthurZucker in #25830
* `use_cache=False` before creating `presents` which relies on `use_cache` by @yundai424 in #26328
* `main` due to torch 2.1 by @ydshieh in #26607
* Make `ModelOutput` serializable by @cbensimon in #26493
* [`core`] fix silent bug `keep_in_fp32` modules by @younesbelkada in #26589
* `transformers-pytorch-gpu` docker build by @ydshieh in #26615
* `pytorch-quantization` in Doc Builder docker file by @ydshieh in #26622
* `view`s of `position_ids` by @ramiro050 in #26059
* `MusicgenTest.test_pipeline_text_to_audio` by @ydshieh in #26586
* [`LlamaTokenizerFast`] Adds edge cases for the template processor by @ArthurZucker in #26606
* `AlbertConfig` by @ydshieh in #26636
* `CLIPImageProcessor` by @isaac-chung in #26676
* `CLIP` by @isaac-chung in #26691
* `LlamaConfig` by @pavaris-pm in #26685
* Replace `jnp.array` in types with `jnp.ndarray` by @hvaara in #26703
* `Copied from` for test files by @ydshieh in #26713
* `SwinModel` docstring fix by @shivanandmn in #26679
* `use_cuda_amp` is no more available by @pacman100 in #26731
* `no_trainer` scripts by @muellerzr in #26733
* `torch==2.1.0` by @ydshieh in #26735
* `LlamaTokenizer` and `LlamaTokenizerFast` by @minhoryang in #26669
* `CodeLlamaTokenizer` by @Bojun-Feng in #26709
* `Blip2ForConditionalGeneration` by @ydshieh in #26737
* Fix `PersimmonIntegrationTest` OOM by @ydshieh in #26750
* Fix `MistralIntegrationTest` OOM by @ydshieh in #26754
* `UniSpeech`, `UniSpeechSat`, `Wav2Vec2ForCTC` by @gizemt in #26664
* `GPT2` and `Whisper` by @McDonnellJoseph in #26642
* `PerceiverModelIntegrationTest::test_inference_masked_lm` by @ydshieh in #26760
* [`core`] Fix fa-2 import by @younesbelkada in #26785
* Skip `TrainerIntegrationFSDP::test_basic_run_with_cpu_offload` if torch < 2.1 by @ydshieh in #26764
* Translated `big_models.md` to Korean by @wonhyeongseo in #26245
* `IdeficsProcessorTest.test_tokenizer_padding` by @ydshieh in #26779
* `RwkvConfig` by @Bojun-Feng in #26782
* `DPRConfig` by @AVAniketh0905 in #26674
* [`Flava`] Fix flava doc by @younesbelkada in #26789
* `CanineConfig` by @Sparty in #26771
* `CodeLlamaTokenizerFast` by @Bojun-Feng in #26666
* Translated `en/internal` folder docs to Japanese 🇯🇵 by @rajveer43 in #26747
* [`Tokenizer`] Fix slow and fast serialization by @ArthurZucker in #26570
* [`FA-2`] Revert suggestion that broke FA2 fine-tuning with quantized models by @younesbelkada in #26916
* `ChineseCLIP` by @Sparty in #26880
* `CodeGen` by @daniilgaltsev in #26821
* [`FA-2` / `Mistral`] Support fa-2 + right padding + forward by @younesbelkada in #26912
* `max_shard_size` to smaller value by @younesbelkada in #26942
* [`NLLB-MoE`] Fix NLLB MoE 4bit inference by @younesbelkada in #27012
* [`SeamlessM4T`] fix copies with NLLB MoE int8 by @ArthurZucker in #27018
* Translated `pipeline_tutorial.md` to Chinese by @jiaqiw09 in #26954
* Translated `preprocessing.md` to Chinese by @jiaqiw09 in #26955
* Add `default_to_square_for_size` to `CLIPImageProcessor` by @ydshieh in #26965
* [`TFxxxxForSequenceClassifciation`] Fix the eager mode after #25085 by @ArthurZucker in #25751
* [`docs`] Add `MaskGenerationPipeline` in docs by @younesbelkada in #27063
* `set_epoch` for Accelerate-based dataloaders by @muellerzr in #26850
* `flash_attn` version to `2.1` by @younesbelkada in #27079
* [`T5Tokenizer`] Fix fast and extra tokens by @ArthurZucker in #27085
* [`core` / `gradient_checkpointing`] Refactor GC - part 2 by @younesbelkada in #27073
* [`FA2` / `Mistral`] Revert previous behavior with right padding + forward by @younesbelkada in #27125
* `"common_voice"` by @ydshieh in #27147
* [`tests` / `Quantization`] Fix bnb test by @younesbelkada in #27145
* `copied from` by @ydshieh in #27149
* Translated `en/main_classes` folder docs to Japanese 🇯🇵 by @rajveer43 in #26894
* `get_default_device` in `tools/base.py` by @statelesshz in #26774
* Run tests if `tiny_model_summary.json` is modified by @ydshieh in #27175
* [`Quantization` / `tests`] Fix bnb MPT test by @younesbelkada in #27178
* `StarCoder` by @susnato in #27182
* [`core` / `Quantization`] Fix for 8bit serialization tests by @younesbelkada in #27234

The following contributors have made significant changes to the library over the last release:
* @jungnerd: Translated `semantic_segmentation.md` to Korean (#26515)
* @statelesshz: `get_default_device` in `tools/base.py` (#26774)
* @rajveer43: Translated `en/internal` folder docs to Japanese 🇯🇵 (#26747), Translated `en/main_classes` folder docs to Japanese 🇯🇵 (#26894)
* @jiaqiw09: Translated `pipeline_tutorial.md` to Chinese (#26954), Translated `preprocessing.md` to Chinese (#26955)

A patch release was made for the following three commits:
Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:

* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
* GQA (Grouped Query Attention) - allowing faster inference and lower cache size
* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens
The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.
* [`Persimmon`] Add support for persimmon by @ArthurZucker in #26042

BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of using absolute spatial information.
ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.
Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.
We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.
* [`Tokenizer`] attempt to fix add_token issues by @ArthurZucker in #23909

🚨 Workflow Changes 🚨: These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes so we highlight them below.
➕ Most visible features:

* `tokenizer.added_tokens_decoder` is available for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
* Faster `from_pretrained`, faster `add_tokens` because special and non-special tokens can be mixed together and the trie is not always rebuilt.
* The added tokens state (`added_tokens_decoder/encoder`) is now serialized in `tokenizer_config.json`.

For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.
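A quick sketch of the most visible new attribute:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_tokens(["<custom_token>"])
# maps token ids to AddedToken objects, for fast and slow tokenizers alike
print(tokenizer.added_tokens_decoder)
```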
FA2 support has been added to transformers for the most popular architectures (llama, mistral, falcon), and more architectures are actively being contributed in this issue (https://github.com/huggingface/transformers/issues/26350). Simply pass `use_flash_attention_2=True` when calling `from_pretrained`.

In the future, PyTorch will support Flash Attention 2 through `torch.scaled_dot_product_attention`; users will then be able to benefit from both (transformers core & transformers + SDPA) implementations of Flash Attention 2 with simple changes (`model.to_bettertransformer()` and force-dispatching the SDPA kernel to FA-2 in the case of SDPA).
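A minimal sketch (the checkpoint is illustrative; FA2 requires a supported GPU and half precision):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,
    device_map="auto",
)
```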
* [`core`] Integrate Flash attention 2 in most used models by @younesbelkada in #25598

For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: https://github.com/huggingface/transformers/issues/26557
Support for lazy loading of integration libraries has been added. This will drastically speed up importing `transformers` and related objects from the library.
Example before this change:
2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel" 3.31s user 3.06s system 220% cpu 2.893 total
After this change:
python3 -c "from transformers import CLIPTextModel" 1.70s user 1.49s system 220% cpu 1.447 total
* `test_load_img_url_timeout` by @ydshieh in #25976
* `Pop2Piano` space demo by @susnato in #25975
* `generation_config` by @gante in #25987
* [`CI`] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
* [`VITS`] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
* Translated `llm_tutorial.md` to Korean by @harheem in #25791
* `tgs` speed metrics by @CokeDong in #25858
* Fix `activation_dropout` and fix DropOut docs for SEW-D by @gau-nernst in #26031
* Translated `llama.md` to Korean by @harheem in #26044
* [`CodeLlamaTokenizerFast`] Fix `set_infilling_processor` to properly reset by @ArthurZucker in #26041
* [`CITests`] skip failing tests until #26054 is merged by @ArthurZucker in #26063
* [`core`] Import tensorflow inside relevant methods in `trainer_utils` by @younesbelkada in #26106
* Ensure `generation_config` is untouched by @gante in #25962
* Translated `llama2.md` to Korean by @mjk0618 in #26047
* Translated `contributing.md` to Korean by @mjk0618 in #25877
* `MarianTokenizer` to remove metaspace character in `decode` by @tanaymeh in #26091
* [`core`] fix 4bit `num_parameters` by @younesbelkada in #26132
* [`RWKV`] Final fix RWKV 4bit by @younesbelkada in #26134
* `generation_config.max_length` is set to `None` by @gante in #26147
* `test_finetune_bert2bert` by @ydshieh in #25984
* `beam_scores` shape when token scores shape changes after `logits_processor` by @BakerBunker in #25980
* `accelerate` > 0.20.3 by @sam-scale in #26060
* [`PEFT`] Fix PEFT + gradient checkpointing by @younesbelkada in #25846
* `convert_bros_to_pytorch.py` by @ydshieh in #26212
* `utils/documentation_tests.txt` by @ydshieh in #26213
* Move `ctrl` to `Salesforce/ctrl` by @julien-c in #26183
* Translated `whisper.md` to Korean by @nuatmochoi in #26002
* `Error` not captured in PR doctesting by @ydshieh in #26215
* [`Trainer`] Refactor trainer + bnb logic by @younesbelkada in #26248
* `ALL_LAYERNORM_LAYERS` by @shijie-wu in #26227
* `model._keep_in_fp32_modules` is set even when `accelerate` is not installed by @fxmarty in #26225
* `store_test_results` by @ydshieh in #26223
* Translated `audio_classification.mdx` to Korean by @gabrielwithappy in #26200
* `RMSProp` optimizer by @natolambert in #26425
* Fix error when `transformers` is installed without `tokenizers` by @urialon in #26236
* [`FA` / `tests`] Add use_cache tests for FA models by @younesbelkada in #26415
* [`PEFT`] Fix PEFT multi adapters support by @younesbelkada in #26407
* `runs-on` in workflow files by @ydshieh in #26435
* Translated `debugging.md` to Korean by @wonhyeongseo in #26246
* Translated `perf_train_gpu_many.md` to Korean by @wonhyeongseo in #26244
* `cos_sin` device issue in Falcon model by @ydshieh in #26448
* [`PEFT`] introducing `adapter_kwargs` for loading adapters from different Hub location (`subfolder`, `revision`) than the base model by @younesbelkada in #26270
* [`PEFT`] Pass token when calling `find_adapter_config` by @younesbelkada in #26488
* [`core` / `auto`] Fix bnb test with code revision + bug with code revision by @younesbelkada in #26431
* [`PEFT`] Protect `adapter_kwargs` check by @younesbelkada in #26537
* Translated `tokenizer_summary.md` to Korean by @wonhyeongseo in #26243
* `configuration_encoder_decoder.py` by @SrijanSahaySrivastava in #26519

The following contributors have made significant changes to the library over the last release:
* @wonhyeongseo: Translated `debugging.md` to Korean (#26246), Translated `perf_train_gpu_many.md` to Korean (#26244), Translated `tokenizer_summary.md` to Korean (#26243)