Large Language Model Text Generation Inference
- Make `--cuda-graphs 0` work as expected (bis) by @fxmarty in https://github.com/huggingface/text-generation-inference/pull/1768
- `GenerateParameters` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1798
- `HF_HUB_OFFLINE` support in the router by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1789
- Add `tool_prompt` parameter to Python client by @maziyarpanahi in https://github.com/huggingface/text-generation-inference/pull/1825
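The `tool_prompt` field rides along in the chat request body together with the tool definitions. A minimal sketch of such a body, assuming an OpenAI-style function schema; the tool name, fields, and prompt text here are illustrative, not taken from the PR:

```python
import json

# Hypothetical OpenAI-style tool schema (illustrative names only).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Chat request body; `tool_prompt` is extra text the server can use to
# introduce the tool schemas to the model (exact server behavior is
# described in the PR, not here).
request = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_prompt": "You have access to the following tools:",
}
body = json.dumps(request)
```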
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.1...v2.0.2
- `/v1/chat/completions` and `/v1/completions` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1747
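These routes accept OpenAI-style request bodies. A minimal sketch of such a payload, assuming a TGI server listening on `localhost:8080` as in the Docker example in these notes; `"tgi"` is a placeholder model name, since the server serves whatever `--model-id` it was started with:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions route.
payload = {
    "model": "tgi",  # placeholder; TGI serves the model it was launched with
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "stream": False,
    "max_tokens": 64,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions
# (port mapping assumed from the Docker example in these notes).
```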
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.0...v2.0.1
Try out Command R+ with Medusa heads on 4xA100s with:
```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
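Once the container is serving on the mapped port, it can be queried through TGI's native `/generate` route. A minimal sketch of the request body (the prompt and parameter values are illustrative; speculation with Medusa heads happens server-side, so the client request is unchanged):

```python
import json

# Body for TGI's native /generate route; parameters follow the
# GenerateParameters schema (max_new_tokens, temperature, ...).
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8080/generate", json=payload)
```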
- `--trust-remote-code` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1704
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.4...v1.4.5
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.3...v1.4.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.2...v1.4.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.1...v1.4.2
- Add `name` field to OpenAI compatible API Messages by @amihalik in https://github.com/huggingface/text-generation-inference/pull/1563
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.0...v1.4.1
- Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in https://github.com/huggingface/text-generation-inference/pull/1470
- Add a `/tokenize` route to get the tokenized input by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1471
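The new route takes the same `inputs` field as `/generate` and returns the server-side tokenization of the prompt. A minimal sketch of the request body (the prompt is illustrative, and the exact response field names are not spelled out here):

```python
import json

# Body for the /tokenize route: the same `inputs` field as /generate.
req = {"inputs": "What is Deep Learning?"}
body = json.dumps(req)
# POST to http://localhost:8080/tokenize; the response lists the input's
# tokens as the server's tokenizer sees them.
```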
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.4...v1.4.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.3...v1.3.4