Large Language Model Text Generation Inference
- Make `--cuda-graphs 0` work as expected (bis) by @fxmarty in https://github.com/huggingface/text-generation-inference/pull/1768
- `GenerateParameters` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1798
- `HF_HUB_OFFLINE` support in the router by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1789
- Add `tool_prompt` parameter to Python client by @maziyarpanahi in https://github.com/huggingface/text-generation-inference/pull/1825
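The `tool_prompt` field rides along in the chat request body together with the tool definitions. A minimal sketch of such a body, assuming an OpenAI-style function schema; the tool name, fields, and prompt text here are illustrative, not taken from the PR:

```python
import json

# Hypothetical OpenAI-style tool schema (illustrative names only).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Chat request body; `tool_prompt` is extra text the server can use to
# introduce the tool schemas to the model (exact server behavior is
# described in the PR, not here).
request = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_prompt": "You have access to the following tools:",
}
body = json.dumps(request)
```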
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.1...v2.0.2
- `/v1/chat/completions` and `/v1/completions` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1747
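These routes accept OpenAI-style request bodies. A minimal sketch of such a payload, assuming a TGI server listening on `localhost:8080` as in the Docker example in these notes; `"tgi"` is a placeholder model name, since the server serves whatever `--model-id` it was started with:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions route.
payload = {
    "model": "tgi",  # placeholder; TGI serves the model it was launched with
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "stream": False,
    "max_tokens": 64,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions
# (port mapping assumed from the Docker example in these notes).
```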
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.0...v2.0.1
Try out Command R+ with Medusa heads on 4xA100s with:
```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
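Once the container is serving on the mapped port, it can be queried through TGI's native `/generate` route. A minimal sketch of the request body (the prompt and parameter values are illustrative; speculation with Medusa heads happens server-side, so the client request is unchanged):

```python
import json

# Body for TGI's native /generate route; parameters follow the
# GenerateParameters schema (max_new_tokens, temperature, ...).
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8080/generate", json=payload)
```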
- `--trust-remote-code` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1704
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.4...v1.4.5
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.3...v1.4.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.2...v1.4.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.1...v1.4.2
- Add `name` field to OpenAI compatible API Messages by @amihalik in https://github.com/huggingface/text-generation-inference/pull/1563
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.0...v1.4.1
- Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in https://github.com/huggingface/text-generation-inference/pull/1470
- Add a `/tokenize` route to get the tokenized input by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1471
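The new route takes the same `inputs` field as `/generate` and returns the server-side tokenization of the prompt. A minimal sketch of the request body (the prompt is illustrative, and the exact response field names are not spelled out here):

```python
import json

# Body for the /tokenize route: the same `inputs` field as /generate.
req = {"inputs": "What is Deep Learning?"}
body = json.dumps(req)
# POST to http://localhost:8080/tokenize; the response lists the input's
# tokens as the server's tokenizer sees them.
```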
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.4...v1.4.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.3...v1.3.4