🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
Kudos to @pcuenca for the prompt fix to support `EosTokenCriteria` on MPS while PyTorch adds this functionality.
Llama 3 is supported in this release through the Llama 2 architecture and some fixes in the tokenizers
library.
The Idefics2 model was created by the Hugging Face M4 team and authored by Léo Tronchon, Hugo Laurencon, Victor Sanh. The accompanying blog post can be found here.
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, or visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.
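As a rough usage sketch (the checkpoint id, image URL and generation settings below are illustrative assumptions rather than part of this announcement):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint, device_map="auto")

# Any RGB image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```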
Recurrent Gemma architecture. Taken from the original paper.
The Recurrent Gemma model was proposed in RecurrentGemma: Moving Past Transformers for Efficient Open Language Models by the Griffin, RLHF and Gemma Teams of Google.
The abstract from the paper is the following:
We introduce RecurrentGemma, an open language model which uses Google’s novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
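A minimal generation sketch, assuming the 2B checkpoint is published under google/recurrentgemma-2b (the prompt and generation settings are illustrative):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/recurrentgemma-2b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Griffin architecture combines", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```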
Jamba is a pretrained, mixture-of-experts (MoE) generative text model with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.
As depicted in the diagram below, Jamba’s architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
Jamba introduces the first `HybridCache` object that allows it to natively support assisted generation, contrastive search, speculative decoding, beam search and all of the awesome features from the `generate` API!
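A minimal sketch of loading Jamba through the standard causal-LM API (the checkpoint id and dtype are assumptions; the optional Mamba CUDA kernels are not required but speed things up):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ai21labs/Jamba-v0.1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 + device_map="auto" to spread the 52B-parameter MoE across the available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("In the recent Super Bowl LVIII,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```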
DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input.
It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
This provides 65x more possible combinations of experts and the authors found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).
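A hedged loading sketch (the checkpoint id is an assumption, and the full 132B-parameter model needs several high-memory GPUs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "databricks/dbrx-instruct"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("What does it take to build a great LLM?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```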
The OLMo model was proposed in OLMo: Accelerating the Science of Language Models by Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi.
OLMo is a series of Open Language Models designed to enable the science of language models. The OLMo models are trained on the Dolma dataset. We release all code, checkpoints, logs (coming soon), and details involved in training these models.
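A minimal sketch, assuming one of the HF-format OLMo checkpoints (the repo id below is an assumption):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "allenai/OLMo-1B-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Language modeling is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```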
Qwen2MoE is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
Model Details

Qwen2MoE is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. Qwen2MoE has the following architectural choices:
Qwen2MoE is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes. Qwen2MoE employs Mixture of Experts (MoE) architecture, where the models are upcycled from dense language models. For instance, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters in total and 2.7B activated parameters during runtime, while it achieves comparable performance with Qwen1.5-7B, with only 25% of the training resources.
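A minimal sketch using the upcycled checkpoint mentioned above (the repo id and prompt are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-MoE-A2.7B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```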
Taken from the original paper.
The Grounding DINO model was proposed in Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang. Grounding DINO extends a closed-set object detection model with a text encoder, enabling open-set object detection. The model achieves remarkable results, such as 52.5 AP on COCO zero-shot.
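A hedged sketch of open-set detection with text queries (the checkpoint id, image URL and thresholds are assumptions):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
# Queries are lower-cased phrases separated by periods.
text = "a cat. a remote control."
inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.4, text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```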
Static pretrained maps have been removed from the library's internals and are currently deprecated. These used to reflect all the available checkpoints for a given architecture on the Hugging Face Hub, but their presence does not make sense in light of the huge growth of checkpoints shared by the community.
With the objective of lowering the bar for model contributions and reviews, we are starting by removing legacy objects such as this one, which do not serve a purpose.
Processors are undergoing changes in order to make them more uniform and clearer to use.
Pipelines can now be pushed to the Hub using a convenient `push_to_hub` method.
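For instance (the repo id below is a placeholder for your own namespace):

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
# Uploads the pipeline's model, tokenizer and configuration to the Hub.
pipe.push_to_hub("my-username/my-text-classification-pipeline")
```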
- `push_to_hub` to pipeline by @not-lain in #29172

Thanks to the community contribution, Flash Attention 2 has been integrated for more architectures.

- from custom_tools.md by @windsonsea in #29767
- `-OO` mode for `docstring_decorator` by @matthid in #29689
- Latest PyTorch + TensorFlow [dev] by @ydshieh in #29764
- [`LlavaNext`] Fix llava next unsafe imports by @ArthurZucker in #29773
- `set_seed` by @muellerzr in #29778
- `torch_dtype` in the run_mlm example by @jla524 in #29776
- `bos token` to Blip generations by @zucchini-nlp in #29642
- [`quality`] update quality check to make sure we check imports 😈 by @ArthurZucker in #29771
- `vocab_size` by @fxmarty in #29389
- `AssistedCandidateGenerator` by @gante in #29787
- [`cleanup`] vestiges of causal mask by @ArthurZucker in #29806
- [`SuperPoint`] Fix doc example by @amyeroberts in #29816
- `bos_token_id is None` during the generation with `inputs_embeds` by @LZHgrla in #29772
- `cosine_with_min_lr` scheduler in Trainer by @liuyanyi in #29341
- `num_attention_heads` != `num_key_value_heads` in Flax Llama Implementation by @bminixhofer in #29557
- `slow_forward` gradient fix by @vasqu in #29563
- `eos_token_id` to stopping criteria by @zucchini-nlp in #29459
- [`make fix-copies`] update and help by @ArthurZucker in #29924
- [`GptNeox`] don't gather on pkv when using the trainer by @ArthurZucker in #29892
- [`pipeline`]. Zero shot add doc warning by @ArthurZucker in #29845
- `xpu` to the testing documentation by @faaany in #29894
- `torch.testing.assert_allclose` by `torch.testing.assert_close` by @gante in #29915
- [`Mamba`] from pretrained issue with `self.embeddings` by @ArthurZucker in #29851
- [`TokenizationLlama`] fix the way we convert tokens to strings to keep leading spaces 🚨 breaking fix by @ArthurZucker in #29453
- [`BC`] Fix BC for other libraries by @ArthurZucker in #29934
- [`LlamaSlowConverter`] Slow to Fast better support by @ArthurZucker in #29797
- [`StableLm`] Add QK normalization and Parallel Residual Support by @jon-tow in #29745
- `test_eager_matches_sdpa_generate` flaky for some models by @ydshieh in #29479
- `run_qa.py` by @jla524 in #29867
- [`BC`] Fix BC for AWQ quant by @TechxGenus in #29965
- `ImageToTextPipelineTests.test_conditional_generation_llava` by @faaany in #29975
- [`generate`] fix breaking change for patch by @ArthurZucker in #29976
- `_replace_with_bnb_linear` by @SunMarc in #29958
- `skip_special_tokens` for `Wav2Vec2CTCTokenizer._decode` by @msublee in #29311
- `remove_columns` in `text-classification` example by @mariosasko in #29351
- `tests/utils/tiny_model_summary.json` by @ydshieh in #29941
- `kwargs` handling in `generate_with_fallback` by @cifkao in #29225
- `WhisperNoSpeechDetection` when recomputing scores by @cifkao in #29248
- [`Main CIs`] Fix the red cis by @ArthurZucker in #30022
- [`ProcessingIdefics`] Attention mask bug with padding by @byi8220 in #29449
- `whisper` to `IMPORTANT_MODELS` by @ydshieh in #30046
- `test_encode_decode_fast_slow_all_tokens` for now by @ydshieh in #30044
- "`llm_int8_enable_fp32_cpu_offload=True`...." instead of "`load_in_8bit_fp32_cpu_offload=True`". by @miRx923 in #30013
- `torch.fx` symbolic tracing for LLama by @michaelbenayoun in #30047
- `require_bitsandbytes` marker by @faaany in #30116
- `mps` as device for `Pipeline` class by @fnhirwa in #30080
- `itemize` by @younesbelkada in #30162
- `ruff` configuration to avoid deprecated configuration warning by @Sai-Suraj-27 in #30179
- [`CI`] Add new workflow to run slow tests of important models on push main if they are modified by @younesbelkada in #29235
- `logger.warn` with `logger.warning` by @Sai-Suraj-27 in #30197
- `RecurrentGemmaIntegrationTest.test_2b_sample` by @ydshieh in #30222
- `assertEquals` with `assertEqual` by @Sai-Suraj-27 in #30241
- `typing.Text` with `str` by @Sai-Suraj-27 in #30230
- `type annotation` for compatability with python 3.8 by @Sai-Suraj-27 in #30243
- `docs/source/en`) by @ydshieh in #30247
- `require_torch_multi_gpu` flag by @faaany in #30250
- `ko/_toctree.yml` by @jungnerd in #30062
- `raise` statement by @Sai-Suraj-27 in #30275
- `Idefics2`'s doc example by @ydshieh in #30274
- `ExamplesTests::test_run_translation` by @ydshieh in #30281
- `Fatal Python error: Bus error` in `ZeroShotAudioClassificationPipelineTests` by @ydshieh in #30283

The following contributors have made significant changes to the library over the last release:
The `AWQ` issue persisted, and there was a regression reported with beam search and input embeddings.

Series of fixes for backwards compatibility (AutoAWQ and other quantization libraries, imports from `trainer_pt_utils`) and functionality (LLaMA tokenizer conversion):

- [`BC`] Fix BC for other libraries #29934
- [`LlamaSlowConverter`] Slow to Fast better support #29797

Patch release to fix some breaking changes to the LLaVA model, fixes/cleanup for Cohere & Gemma and a broken doctest:

- `vocab_size` #29389
- [`cleanup`] vestiges of causal mask #29806
- [`SuperPoint`] Fix doc example (https://github.com/huggingface/transformers/pull/29816)

The `Llama`, `Cohere` and `Gemma` models no longer cache the triangular causal mask unless `static` cache is used. This was reverted by #29753, which fixes the BC issues w.r.t. speed and memory consumption, while still supporting compile and static cache. Small note: `fx` is not supported for these models; a patch will be brought very soon!
Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with Cohere's industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excels at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
LLaVA-NeXT is the next version of LLaVA, which includes better support for non-padded images, improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.
Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:
LLaVa-NeXT incorporates a higher input resolution by encoding various patches of the input image. Taken from the original paper.
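A minimal sketch with the Mistral-based checkpoint (the checkpoint id, prompt format and image URL are assumptions):

```python
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

checkpoint = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint id
processor = LlavaNextProcessor.from_pretrained(checkpoint)
model = LlavaNextForConditionalGeneration.from_pretrained(checkpoint, device_map="auto")

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```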
The MusicGen Melody model was proposed in Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
MusicGen Melody is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.
Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass.
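A minimal text-conditioned sketch (the checkpoint id and generation length are assumptions; an audio prompt can additionally be passed to the processor for melody conditioning):

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

checkpoint = "facebook/musicgen-melody"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = MusicgenMelodyForConditionalGeneration.from_pretrained(checkpoint)

inputs = processor(text=["80s pop track with synths and heavy drums"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=256)  # roughly 5 seconds at the default frame rate
sampling_rate = model.config.audio_encoder.sampling_rate
```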
The PVTv2 model was proposed in PVT v2: Improved Baselines with Pyramid Vision Transformer by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. As an improved variant of PVT, it eschews position embeddings, relying instead on positional information encoded through zero-padding and overlapping patch embeddings. This lack of reliance on position embeddings simplifies the architecture, and enables running inference at any resolution without needing to interpolate them.
The UDOP model was proposed in Unifying Vision, Text, and Layout for Universal Document Processing by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.
UDOP architecture. Taken from the original paper.
This model is a new paradigm architecture based on state-space models, rather than attention like transformer models. The checkpoints are compatible with the original ones.

- [`Add Mamba`] Adds support for the `Mamba` models by @ArthurZucker in #28094

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.
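A minimal completion sketch with the 7B variant (the checkpoint id and prompt are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoder2-7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```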
The SegGPT model was proposed in SegGPT: Segmenting Everything In Context by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. SegGPT employs a decoder-only Transformer that can generate a segmentation mask given an input image, a prompt image and its corresponding prompt mask. The model achieves remarkable one-shot results with 56.1 mIoU on COCO-20 and 85.6 mIoU on FSS-1000.
With GaLore, you can pre-train large models on consumer-grade hardware, making LLM pre-training much more accessible to anyone in the community.
Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
GaLore is based on a low-rank approximation of the gradients and can be used out of the box for any model.

Below is a simple snippet that demonstrates how to pre-train mistralai/Mistral-7B-v0.1 on imdb:
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl
train_dataset = datasets.load_dataset('imdb', split='train')
args = TrainingArguments(
output_dir="./test-galore",
max_steps=100,
per_device_train_batch_size=2,
optim="galore_adamw",
optim_target_modules=["attn", "mlp"]
)
model_id = "mistralai/Mistral-7B-v0.1"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)
trainer = trl.SFTTrainer(
model=model,
args=args,
train_dataset=train_dataset,
dataset_text_field='text',
max_seq_length=512,
)
trainer.train()
Quanto has been integrated with transformers! You can apply simple quantization algorithms with a few lines of code and tiny changes. Quanto is also compatible with `torch.compile`.
Check out the announcement blogpost for more details
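A minimal sketch of weight-only int8 quantization through this integration (the model id is chosen purely for illustration):

```python
from transformers import AutoModelForCausalLM, QuantoConfig

model_id = "facebook/opt-125m"  # small model used here only as an example
quantization_config = QuantoConfig(weights="int8")
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
```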
Exllama and AWQ combined together for faster AWQ inference - check out the relevant documentation section for more details on how to use Exllama + AWQ.
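A hedged sketch of enabling the ExLlama kernels when loading a pre-quantized AWQ checkpoint (the checkpoint id is an assumption; see the documentation section mentioned above for the authoritative usage):

```python
from transformers import AutoModelForCausalLM, AwqConfig

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # assumed pre-quantized AWQ checkpoint
quantization_config = AwqConfig(version="exllama")  # switch the dequantization kernels to ExLlama
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"
)
```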
Allow models saved or fine-tuned with Apple’s MLX framework to be loaded in transformers (as long as the model parameters use the same names), and improve tensor interoperability. This leverages MLX's adoption of safetensors as their checkpoint format.
Notable memory reduction in Gemma/LLaMa by changing the causal mask buffer type from int64 to boolean.
- `torch.bool` instead of `torch.int64` for non-persistant causal mask buffer by @fxmarty in #29241

The PRs below introduced slightly breaking changes that we believed were necessary for the repository; if these seem to impact your usage of transformers, we recommend checking out the PR descriptions to get more insights into how to leverage the new behavior.

- [`Gemma`] Fix bad rebase with transformers main by @younesbelkada in #29170
- `torch.compile` with `fullgraph=True` when `attention_mask` input is used by @fxmarty in #29211
- [`Doc`] update model doc qwen2 by @ArthurZucker in #29238
- `is_vision_available` result by @bmuskalla in #29280
- `DS_DISABLE_NINJA=1` by @ydshieh in #29290
- `non_device_test` pytest mark to filter out non-device tests by @fxmarty in #29213
- `dtype` and `device` extraction for CUDA graph generation for quantizers compatibility by @BlackSamorez in #29079
- `attn_implementation` documentation by @fxmarty in #29295
- `GenerationMixin`'s docstring by @sadra-barikbin in #29277
- [`Gemma` / `CI`] Make sure our runners have access to the model by @younesbelkada in #29242
- [`require_read_token`] fix typo by @ArthurZucker in #29345
- [`T5 and Llama Tokenizer`] remove warning by @ArthurZucker in #29346
- [`Llama ROPE`] Fix torch export but also slow downs in forward by @ArthurZucker in #29198
- `output_router_logits` during inference by @LeonardoEmili in #29249
- [`CI` / `starcoder2`] Change starcoder2 path to correct one for slow tests by @younesbelkada in #29359
- [`CI`]: Fix failing tests for peft integration by @younesbelkada in #29330
- [`CI`] `require_read_token` in the llama FA2 test by @younesbelkada in #29361
- `get_values(MODEL_MAPPING)` by @ydshieh in #29362
- `offload_buffers` parameter of `accelerate` to `PreTrainedModel.from_pretrained` method by @notsyncing in #28755
- [`quantization` / `ESM`] Fix ESM 8bit / 4bit with bitsandbytes by @younesbelkada in #29329
- [`Llama + AWQ`] fix `prepare_inputs_for_generation` 🫠 by @ArthurZucker in #29381
- [`YOLOS`] Fix - return padded annotations by @amyeroberts in #29300
- `AutoProcessor` by @JingyaHuang in #29169
- `post_process_instance_segmentation` for panoptic tasks by @nickthegroot in #29304
- [`Generation`] Fix some issues when running the MaxLength criteria on CPU by @younesbelkada in #29317
- [`UdopTokenizer`] Fix post merge imports by @ArthurZucker in #29451
- [`Udop imports`] Processor tests were not run. by @ArthurZucker in #29456
- `import_path` location by @loadams in #29154
- `offload_weight()` takes from 3 to 4 positional arguments but 5 were given by @faaany in #29457
- [`Docs` / `Awq`] Add docs on exllamav2 + AWQ by @younesbelkada in #29474
- [`docs`] Add starcoder2 docs by @younesbelkada in #29454
- `pad_to_multiple_of` by @gante in #29462
- `TextGenerationPipeline.__call__` docstring by @alvarobartt in #29491
- `inputs` as kwarg in `TextClassificationPipeline` by @alvarobartt in #29495
- `VisionEncoderDecoder` Positional Arg by @nickthegroot in #29497
- `require_sacremoses` decorator by @faaany in #29504
- `torch_device` instead of `auto` for model testing by @faaany in #29531
- `TrainingArguments` by @yundai424 in #29189
- `n_gpu` in `TrainerIntegrationTest::test_train_and_eval_dataloaders` for XPU by @faaany in #29307
- `warning_advice` for tensorflow warning by @winstxnhdw in #29540
- [`Mamba doc`] Post merge updates by @ArthurZucker in #29472
- [`Docs`] fixed minor typo by @j-gc in #29555
- `main` branch by @ydshieh in #28816
- `max_position_embeddings` in the translation example by @gante in #29600
- [`Gemma`] Supports converting directly in half-precision by @younesbelkada in #29529
- [`MaskFormer`, `Mask2Former`] Use einsum where possible by @amyeroberts in #29544
- [`Mask2Former`] Move normalization for numerical stability by @amyeroberts in #29542
- `test_trainer_log_level_replica` to run on accelerators with more than 2 devices by @faaany in #29609
- `multi_gpu_data_parallel_forward` for `MusicgenTest` by @ydshieh in #29632
- [`PEFT`] Fix `save_pretrained` to make sure adapters weights are also saved on TPU by @shub-kris in #29388
- `dataset_revision` argument to `RagConfig` by @ydshieh in #29610
- `cache_position` update in `generate` by @gante in #29467
- `generation_config` by @gante in #29675
- `filter_models` by @ydshieh in #29673
- `glue` to `nyu-mll/glue` by @lhoestq in #29679
- `filter_models`" by @ydshieh in #29682
- `filter_models` by @ydshieh in #29710
- [`bnb`] Make `unexpected_keys` optional by @younesbelkada in #29420
- `gradio.Interface.from_pipeline` by @abidlabs in #29684

The following contributors have made significant changes to the library over the last release:
We mostly made sure that performance is not affected by the new change of paradigm with ROPE. We fixed the ROPE computation (it should always be in float32), and the `causal_mask` dtype was set to bool to take less RAM.

YOLOS had a regression, and Llama / T5Tokenizer had a warning popping up for random reasons.
TLDR:
- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+ attn_output = attn_output.view(bsz, q_len, -1)
Gemma is a new open-source Language Model series from Google AI that comes in 2B and 7B variants. The release comes with the pre-trained and instruction fine-tuned versions, and you can use them via `AutoModelForCausalLM`, `GemmaForCausalLM` or the `pipeline` interface!
Read more about it in the Gemma release blogpost: https://hf.co/blog/gemma
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
You can use the model with Flash Attention, SDPA, static cache and the quantization API for further optimizations!
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b", device_map="auto", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b", device_map="auto", load_in_4bit=True
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b", device_map="auto"
)
model.generation_config.cache_implementation = "static"
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
The Depth Anything model was proposed in Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the DPT architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation.
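A minimal sketch through the depth-estimation pipeline (the checkpoint id and image URL are assumptions):

```python
import requests
from PIL import Image
from transformers import pipeline

pipe = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")  # assumed checkpoint id
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
result = pipe(image)
depth_map = result["depth"]  # PIL image containing the predicted relative depth
```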
StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.
StableLM 3B 4E1T is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.
The team also provides StableLM Zephyr 3B, an instruction fine-tuned version of the model that can be used for chat-based applications.
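A minimal sketch, assuming the base checkpoint is loadable with the native implementation (the checkpoint id and prompt are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "stabilityai/stablelm-3b-4e1t"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```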
- `StableLM` by @jon-tow in #28810

Static past key value cache allows `LlamaForCausalLM`'s forward pass to be compiled using `torch.compile`!
This means that CUDA graphs can be used for inference, which speeds up the decoding step by 4x!
A forward pass of Llama 2 7B takes around 10.5 ms to run with this on an A100! Equivalent to TGI performance! ⚡️

- [`Core generation`] Adds support for static KV cache by @ArthurZucker in #27931
- [`CLeanup`] Revert SDPA attention changes that got in the static kv cache PR by @ArthurZucker in #29027

⚠️ Support for `generate` is not included yet. This feature is experimental and subject to changes in subsequent releases.
from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
import torch
import os
# compilation triggers multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "true"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto",
torch_dtype=torch.float16
)
# set up the static cache in advance of using the model
model._setup_cache(StaticCache, max_batch_size=1, max_cache_len=128)
# trigger compilation!
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
# run the model as usual
input_text = "A few facts about the universe: "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda").input_ids
model_outputs = compiled_model(input_ids)
`HfQuantizer` makes it easy for quantization method researchers and developers to add inference and/or quantization support in 🤗 transformers. If you are interested in adding support for new methods, please refer to this documentation page: https://huggingface.co/docs/transformers/main/en/hf_quantizer

- `HfQuantizer` class for quantization-related stuff in `modeling_utils.py` by @poedator in #26610
- [`HfQuantizer`] Move it to "Developer guides" by @younesbelkada in #28768
- [`HFQuantizer`] Remove `check_packages_compatibility` logic by @younesbelkada in #28789

AQLM is a new quantization method that enables no performance degradation in 2-bit precision. Check out this demo about how to run Mixtral in 2-bit on a free-tier Google Colab instance: https://huggingface.co/posts/ybelkada/434200761252287
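Loading a pre-quantized AQLM checkpoint is a plain from_pretrained call once the aqlm package is installed; the repository id below is an assumption used only for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires `pip install aqlm`; replace the id with any AQLM-quantized repository.
model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"  # assumed checkpoint id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```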
The canonical repositories on the Hugging Face Hub (models that did not have an organization, like `bert-base-cased`) have been moved under organizations.
You can find the entire list of models moved here: https://huggingface.co/collections/julien-c/canonical-models-65ae66e29d5b422218567567
Redirection has been set up so that your code continues working even if you continue calling the previous paths. We, however, still encourage you to update your code to use the new links so that it is entirely future proof.
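Both spellings keep working thanks to the redirection, for example:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")              # legacy path, still redirected
model = AutoModel.from_pretrained("google-bert/bert-base-cased")  # new organization-scoped path
```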
The Mistral model was added to the library in Flax.
With Keras 3 becoming the standard version of Keras in TensorFlow 2.16, we've made some internal changes to maintain compatibility. We now have full compatibility with TF 2.16 as long as the `tf-keras` compatibility package is installed. We've also taken the opportunity to do some cleanup - in particular, objects like `BatchEncoding` that are returned by our tokenizers and processors can now be directly passed to Keras methods like `model.fit()`, which should simplify a lot of code and eliminate a long-standing source of annoyances.
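A rough sketch of what this enables for a small TF fine-tune (the model id, data and training settings are illustrative; no loss is passed to compile so the model's internal default loss is used):

```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_id = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

texts = ["I loved this movie", "I did not enjoy it at all"]
labels = np.array([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="np")
model.compile(optimizer="adam")
# The BatchEncoding can now be handed straight to Keras.
model.fit(batch, labels, epochs=1)
```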
Enable loading in pretrained backbones in a new model, where all other weights are randomly initialized. Note: validation checks are still in place when creating a config. Passing in `use_pretrained_backbone` will raise an error. You can override this by setting `config.use_pretrained_backbone = True` after creating a config. However, it is not yet guaranteed to be fully backwards compatible.
from transformers import MaskFormerConfig, MaskFormerModel
config = MaskFormerConfig(
use_pretrained_backbone=False,
backbone="microsoft/resnet-18"
)
config.use_pretrained_backbone = True
# Both models have resnet-18 backbone weights and all other weights randomly
# initialized
model_1 = MaskFormerModel(config)
model_2 = MaskFormerModel(config)
Introduce a helper function `load_backbone` to load a backbone from a backbone's model config, e.g. `ResNetConfig`, or from a model config which contains backbone information. This enables cleaner modeling files and crossloading between timm and transformers backbones.
from transformers import ResNetConfig, MaskFormerConfig
from transformers.utils.backbone_utils import load_backbone
# Resnet defines the backbone model to load
config = ResNetConfig()
backbone = load_backbone(config)
# Maskformer config defines a model which uses a resnet backbone
config = MaskFormerConfig(use_timm_backbone=True, backbone="resnet18")
backbone = load_backbone(config)
config = MaskFormerConfig(backbone_config=ResNetConfig())
backbone = load_backbone(config)
- [`Backbone`] Use `load_backbone` instead of `AutoBackbone.from_config` by @amyeroberts in #28661

Add in API references, list supported backbones, updated examples, clarification and moving information to better reflect usage and docs.

- [`Llava`] Update convert_llava_weights_to_hf.py script by @isaac-vidas in #28617
- [`GPTNeoX`] Fix GPTNeoX + Flash Attention 2 issue by @younesbelkada in #28645
- [`SigLIP`] Only import tokenizer if sentencepiece available by @amyeroberts in #28636
- `PartialState().default_device` as it has been officially released by @statelesshz in #27256
- `tensor_size` - fix copy/paste error msg typo by @scruel in #28660
- `CodeGenTokenizer` by @cmathw in #28628
- `GenerationConfig`, now the `generation_config.json` can be loaded successfully by @ParadoxZW in #28604
- [`chore`] Add missing space in warning by @tomaarsen in #28695
- [`Vilt`] align input and model dtype in the ViltPatchEmbeddings forward pass by @faaany in #28633
- [`docs`] Improve visualization for vertical parallelism by @petergtz in #28583
- `LocalEntryNotFoundError` during `processor_config.json` loading by @ydshieh in #28709
- [`docs`] Update preprocessing.md by @velaia in #28719
- `weights_only` by @ydshieh in #28725
- `GatedRepoError` to use cache file (fix #28558). by @scruel in #28566
- [`Siglip`] protect from imports if sentencepiece not installed by @amyeroberts in #28737
- `DepthEstimationPipeline`'s docstring by @ydshieh in #28733
- `Block`. by @xkszltl in #28727
- `load_in_8bit` and `load_in_4bit` at the same time by @osanseviero in #28266
- [`bnb`] Fix bnb slow tests by @younesbelkada in #28788
- `torch.arange` dtype on `float` usage to avoid incorrect initialization by @gante in #28760
- `is_torch_bf16_available_on_device` more strict by @ydshieh in #28796
- `-v` for `pytest` on CircleCI by @ydshieh in #28840
- `test_encoder_decoder_model_generate` for `vision_encoder_deocder` as flaky by @amyeroberts in #28842
- [`Doc`] update contribution guidelines by @ArthurZucker in #28858
- `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP by @pacman100 in #28866
- `FastSpeech2ConformerModelTest` and skip it on CPU by @ydshieh in #28888
- `torchaudio` get the correct version in `torch_and_flax_job` by @ydshieh in #28899
- `logging_first_step` by removing "evaluate" by @Sai-Suraj-27 in #28884
- `Exception` when trying to generate 0 tokens ⚠️ by @danielkorat in #28621
- `torch_dtype` as `str` to actual torch data type (i.e. "float16" …to `torch.float16`) by @KossaiSbai in #28208
- [`pipelines`] updated docstring with vqa alias by @cmahmut in #28951
- `test_save_load_fast_init_from_base` as flaky by @gante in #28930
- [`NllbTokenizer`] refactor with added tokens decoder by @ArthurZucker in #27717
- [`DETR`] Update the processing to adapt masks & bboxes to reflect padding by @amyeroberts in #28363
- `quantization_config` is in config but not passed as an arg by @younesbelkada in #28988
- [`AutoQuantizer`]: enhance trainer + not supported quant methods by @younesbelkada in #28991
- [`Doc`] Fix docbuilder - make `BackboneMixin` and `BackboneConfigMixin` importable from `utils`. by @amyeroberts in #29002
- `test_trainer` to float32 by @statelesshz in #28920
- [`Trainer` / tags]: Fix trainer + tags when users do not pass "tags" to `trainer.push_to_hub()` by @younesbelkada in #29009
- `logger.warning` + inline with recent refactor by @younesbelkada in #29039
- `test_save_load_low_cpu_mem_usage` tests by @amyeroberts in #29043
- `generation/utils.py::GenerateEncoderDecoderOutput`'s docstring by @sadra-barikbin in #29044
- `auto_find_batch_size` isn't yet supported with DeepSpeed/FSDP. Raise error accrodingly. by @pacman100 in #29058
- [`Awq`] Add peft support for AWQ by @younesbelkada in #28987
- [`bnb` / `tests`]: Fix currently failing bnb tests by @younesbelkada in #29092
- `bert-base-cased` tokenizer configuration test by @LysandreJik in #29105
- `examples/pytorch/text-classification/run_classification.py` by @Ja1Zhou in #29072
- `pipelines/base.py::Pipeline::_sanitize_parameters()`'s docstring by @sadra-barikbin in #29102
- [`gradient_checkpointing`] default to use it for torch 2.3 by @ArthurZucker in #28538
- [`Trainer` / `bnb`]: Add RMSProp from `bitsandbytes` to HF `Trainer` by @younesbelkada in #29082
- [`bnb` / `tests`] Propagate the changes from #29092 to 4-bit tests by @younesbelkada in #29122
- [`cuda kernels`] only compile them when initializing by @ArthurZucker in #29133
- [`PEFT` / `Trainer`] Handle better peft + quantized compiled models by @younesbelkada in #29055
- [`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues by @ArthurZucker in #28010
- [`pipeline`] Add pool option to image feature extraction pipeline by @amyeroberts in #28985

The following contributors have made significant changes to the library over the last release:

- `HfQuantizer` class for quantization-related stuff in `modeling_utils.py` (#26610)
- `StableLM` (#28810)
- Selection of fixes
- `torch.load`

Commits