A high-throughput and memory-efficient inference and serving engine for LLMs
torch==2.3.0 (#4454)
tensorizer==2.9.0 (#4467)
engine to executor package by @njhill in https://github.com/vllm-project/vllm/pull/4347
shutdown() method to ExecutorBase by @njhill in https://github.com/vllm-project/vllm/pull/4349
get_tokenizer by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4107
DistributedGPUExecutor abstract class by @njhill in https://github.com/vllm-project/vllm/pull/4348
min_tokens when eos_token_id is None by @njhill in https://github.com/vllm-project/vllm/pull/4389
torch==2.3.0 by @mgoin in https://github.com/vllm-project/vllm/pull/4454
num_readers, update version by @alpayariyak in https://github.com/vllm-project/vllm/pull/4467
/metrics by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4523
multiproc_worker_utils for multiprocessing-based workers by @njhill in https://github.com/vllm-project/vllm/pull/4357
tests directory from being packaged by @itechbear in https://github.com/vllm-project/vllm/pull/4552
_force_log from being garbage collected by @Atry in https://github.com/vllm-project/vllm/pull/4567
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2
Features
tensorizer (#3476)
Enhancements
mypy (#3816, #4006, #4161, #4043)
Hardwares
__init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in https://github.com/vllm-project/vllm/pull/3798
is_cpu() by @njhill in https://github.com/vllm-project/vllm/pull/3804
attention_bias Usage in Llama Model Configuration by @Ki6an in https://github.com/vllm-project/vllm/pull/3767
guided_json parameter in OpenAI-compatible Server by @dmarasco in https://github.com/vllm-project/vllm/pull/3945 (a usage sketch follows this list)
linear_weights directly on the layer by @Yard1 in https://github.com/vllm-project/vllm/pull/3977
merge_async_iterators to utils by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4026
tensorizer by @sangstar in https://github.com/vllm-project/vllm/pull/3476
tokenizer_revision when getting tokenizer in openai serving by @chiragjn in https://github.com/vllm-project/vllm/pull/4214
EngineArgs by @hmellor in https://github.com/vllm-project/vllm/pull/4219
EngineArgs by @hmellor in https://github.com/vllm-project/vllm/pull/4223
autodoc directives by @hmellor in https://github.com/vllm-project/vllm/pull/4272
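The guided_json parameter mentioned above constrains the server's output to match a JSON schema. A minimal sketch, assuming the OpenAI Python client pointed at a locally running vLLM server; the base URL, model name, and schema are placeholders, and guided_json is passed through extra_body as a vLLM-specific extension:

```python
# Hedged sketch: constrain generation with vLLM's guided_json extension.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",        # placeholder model name
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={"guided_json": person_schema},    # vLLM extension parameter
)
print(resp.choices[0].message.content)
```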
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.1
v0.4.0 lacks sm70/75 support; this release is a hotfix for it.
__init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in https://github.com/vllm-project/vllm/pull/3798
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.0.post1
--enable-prefix-caching to turn it on. json_object in the OpenAI server for arbitrary JSON, --use-delay flag to improve time to first token across many requests, and min_tokens for EOS suppression (sketched just below).
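A minimal sketch of two of the features highlighted above, assuming the offline LLM API: enable_prefix_caching mirrors the --enable-prefix-caching flag, and min_tokens keeps the EOS token suppressed until a minimum length has been generated; the model name and prompts are placeholders.

```python
# Hedged sketch: automatic prefix caching plus min_tokens-based EOS suppression.
from vllm import LLM, SamplingParams

# Equivalent to launching the server with --enable-prefix-caching.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# min_tokens suppresses EOS until at least 16 tokens have been generated.
params = SamplingParams(temperature=0.8, min_tokens=16, max_tokens=64)

shared_prefix = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [shared_prefix + q for q in ("What is vLLM?", "What is paged attention?")]

# The shared prefix is computed once and reused from the prefix cache.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```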
eos_token_id in Sequence for easy access by @njhill in https://github.com/vllm-project/vllm/pull/3166
dynamic_ncols=True by @chujiezheng in https://github.com/vllm-project/vllm/pull/3242
flash_attn optional by @WoosukKwon in https://github.com/vllm-project/vllm/pull/3269
/tmp/ to ~/.cache/vllm/locks/ dir by @mgoin in https://github.com/vllm-project/vllm/pull/3241
flash_attn in Docker image by @tdoublep in https://github.com/vllm-project/vllm/pull/3396
dist.broadcast stall without group argument by @GindaChen in https://github.com/vllm-project/vllm/pull/3408
lstrip() with removeprefix() to fix Ruff linter warning by @ronensc in https://github.com/vllm-project/vllm/pull/2958
LRUCache by @njhill in https://github.com/vllm-project/vllm/pull/3511
logits computation and gather to model_runner by @esmeetu in https://github.com/vllm-project/vllm/pull/3233
_prune_hidden_states by @rkooo567 in https://github.com/vllm-project/vllm/pull/3539
rotary_embedding.py file, get_device() -> device by @jikunshang in https://github.com/vllm-project/vllm/pull/3604
_get_ranks in Sampler by @Yard1 in https://github.com/vllm-project/vllm/pull/3623
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.3...v0.4.0
benchmark_serving.py by @ronensc in https://github.com/vllm-project/vllm/pull/2934
counter_generation_tokens by @ronensc in https://github.com/vllm-project/vllm/pull/2802
aioprometheus to prometheus_client by @hmellor in https://github.com/vllm-project/vllm/pull/2730
get_ip error in pure ipv6 environment by @Jingru in https://github.com/vllm-project/vllm/pull/2931
AttributeError in OpenAI-compatible server by @jaywonchung in https://github.com/vllm-project/vllm/pull/3018
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.2...v0.3.3
This version adds support for the OLMo and Gemma models, as well as a seed parameter.
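A minimal sketch of the new seed sampling parameter, assuming the offline LLM API; the Gemma model name is a placeholder and requires the usual Hugging Face access.

```python
# Hedged sketch: a fixed seed makes sampled output reproducible per request.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b")  # Gemma support is new in this release

params = SamplingParams(temperature=1.0, seed=42, max_tokens=32)
first = llm.generate(["The capital of France is"], params)[0].outputs[0].text
second = llm.generate(["The capital of France is"], params)[0].outputs[0].text
print(first == second)  # same seed, so the sampled continuations should match
```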
sampling_params by @njhill in https://github.com/vllm-project/vllm/pull/2881
vllm:prompt_tokens_total metric calculation by @ronensc in https://github.com/vllm-project/vllm/pull/2869
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.1...v0.3.2
This version fixes several major bugs, along with many smaller bug fixes listed below.
prefix_len. by @sighingnow in https://github.com/vllm-project/vllm/pull/2688
device="cuda" to support more devices by @jikunshang in https://github.com/vllm-project/vllm/pull/2503
LlamaForCausalLM instead by @pcmoritz in https://github.com/vllm-project/vllm/pull/2854
LLM class by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2882
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.0...v0.3.1
top_p and top_k Sampling by @chenxu2048 in https://github.com/vllm-project/vllm/pull/1885
benchmark_serving.py by @hmellor in https://github.com/vllm-project/vllm/pull/2172
group as an argument in broadcast ops by @GindaChen in https://github.com/vllm-project/vllm/pull/2522
scheduler.running as deque by @njhill in https://github.com/vllm-project/vllm/pull/2523
benchmark_serving.py by @hmellor in https://github.com/vllm-project/vllm/pull/2552
include_stop_str_in_output and length_penalty parameters to OpenAI API by @galatolofederico in https://github.com/vllm-project/vllm/pull/2562 (a usage sketch follows this list)
--engine-use-ray by @HermitSun in https://github.com/vllm-project/vllm/pull/2664
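A hedged sketch of how the two new OpenAI API parameters above might be sent to the server as plain HTTP; the endpoint, model name, and values are assumptions, and length_penalty only affects ranking when beam search is used.

```python
# Hedged sketch: vLLM-specific extra fields alongside a standard chat request.
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",   # placeholder model name
    "messages": [{"role": "user", "content": "Write one sentence about GPUs."}],
    "max_tokens": 64,
    "stop": ["\n"],
    "include_stop_str_in_output": True,  # keep the matched stop string in the returned text
    "length_penalty": 1.2,               # only affects scoring when beam search is enabled
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```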
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.7...v0.3.0
concurrency_count by @ronensc in https://github.com/vllm-project/vllm/pull/2315
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.6...v0.2.7
quantization argument by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2145
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.5...v0.2.6