vLLM Versions

A high-throughput and memory-efficient inference and serving engine for LLMs

v0.4.2

1 week ago

Highlights

Features

  • Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode requests (#4580); see the sketch after this list.
  • Speculative decoding functionality: logprobs (#4378) and n-gram speculation (#4237)
  • Support for FlashInfer as an attention backend (#4353)
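
A minimal sketch of trying chunked prefill with the offline LLM API, assuming the enable_chunked_prefill and max_num_batched_tokens engine arguments; the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams

# Assumption: enable_chunked_prefill splits long prompt (prefill) work into
# chunks so the scheduler can interleave it with decode steps.
llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,     # cap on tokens processed per scheduler step
)

outputs = llm.generate(
    ["A long prompt that would otherwise delay decode-heavy requests ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```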

Models and Enhancements

  • Add support for Phi-3-mini (#4298, #4372, #4380)
  • Add more histogram metrics (#2764, #4523)
  • Full tensor parallelism for LoRA layers (#3524)
  • Expand the Marlin kernel to support all GPTQ models (#3922, #4466, #4533)

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454)
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)

Progress and Dev Experience

  • Centralize and document all environment variables (#4548, #4574)
  • Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
  • Progress towards pipeline parallelism (#4512, #4444, #4566)
  • Progress towards multiprocessing based executors (#4348, #4402, #4419)
  • Progress towards FP8 support (#4343, #4332, #4527)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2

v0.4.1

3 weeks ago

Highlights

Features

  • Support and enhance Command R+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), and Mixtral 8x22B (#4073, #4002)
  • Support private model registration and update our model support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option to use LM Format Enforcer for guided decoding (#3868); see the sketch after this list
  • Make tokenizer and detokenizer initialization optional (#3748)
  • Add option to load models using tensorizer (#3476)
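
A rough sketch of guided JSON decoding through the OpenAI-compatible server, assuming the --guided-decoding-backend lm-format-enforcer server flag and the guided_json request extension; the server command, model name, and port are placeholders:

```python
# Assumed server launch (placeholder model):
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-Instruct-v0.2 \
#       --guided-decoding-backend lm-format-enforcer
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific extension field (assumed)
)
print(resp.choices[0].message.content)
```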

Enhancements

  • vLLM is now mostly type checked by mypy (#3816, #4006, #4161, #4043)
  • Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
  • Progress towards speculative decoding (#3250, #3706, #3894)
  • Initial FP8 support with dynamic per-tensor scaling (#4118)

Hardware

  • Added an Intel CPU inference backend (#3993, #3634)
  • Enhanced the AMD backend with Triton kernels and an e4m3fn KV cache (#3643, #3290)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.1

v0.4.0.post1

1 month ago

Highlight

v0.4.0 lacked support for sm70/sm75 GPUs (e.g. V100 and T4). This release is a hotfix that restores it.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.0.post1

v0.4.0

1 month ago

Major changes

Models

  • New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
  • New vision language model: LLaVA (#3042)

Production features

  • Automatic prefix caching (#2762, #3703), which allows long system prompts to be cached and reused across requests. Use the --enable-prefix-caching flag to turn it on; see the sketch after this list.
  • Support for json_object in the OpenAI server for arbitrary JSON output, the --use-delay flag to improve time to first token under many concurrent requests, and min_tokens for EOS suppression.
  • Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after more robustness fixes.
  • Replaced the cupy dependency due to its bugs.
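
A minimal sketch of automatic prefix caching with the offline API, assuming enable_prefix_caching is the Python-side counterpart of the --enable-prefix-caching server flag mentioned above; the model and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# Assumption: with prefix caching on, the KV cache for the shared system
# prompt is computed once and reused by later requests with the same prefix.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

system_prompt = "You are a concise assistant. Answer in one sentence. " * 20
questions = ["What is vLLM?", "What is paged attention?"]

outputs = llm.generate(
    [system_prompt + q for q in questions],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```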

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake-based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.3...v0.4.0

v0.3.3

2 months ago

Major changes

  • StarCoder2 support
  • Performance optimization and LoRA support for Gemma
  • 2/3/8-bit GPTQ support
  • Integrate Marlin kernels for INT4 GPTQ inference; see the sketch after this list
  • Performance optimization for MoE kernel
  • [Experimental] AWS Inferentia2 support
  • [Experimental] Structured output (JSON, Regex) in OpenAI Server
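
A small sketch of loading a GPTQ checkpoint, assuming the quantization engine argument; the checkpoint name is only an example, and whether the Marlin kernel is used depends on the checkpoint format and GPU:

```python
from vllm import LLM, SamplingParams

# quantization="gptq" loads GPTQ-quantized weights; a Marlin-format
# checkpoint could instead be loaded with quantization="marlin" (assumed).
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",  # example 4-bit GPTQ checkpoint
    quantization="gptq",
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```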

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.2...v0.3.3

v0.3.2

2 months ago

Major Changes

This version adds support for the OLMo and Gemma models, as well as the seed parameter.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.1...v0.3.2

v0.3.1

3 months ago

Major Changes

This version fixes the following major issues:

  • A memory leak with distributed execution (solved by using CuPy for collective communication).
  • Support for Python 3.8.

It also includes many smaller bug fixes; see the full changelog below.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.0...v0.3.1

v0.3.0

3 months ago

Major Changes

  • Experimental multi-LoRA support
  • Experimental prefix caching support
  • FP8 KV cache support
  • Optimized MoE performance and DeepSeek MoE support
  • CI-tested PRs
  • Support batch completion in the server; see the sketch after this list
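
A brief sketch of batch completion against the OpenAI-compatible server: the /v1/completions endpoint accepts a list of prompts, so several completions come back from a single request. The base URL and model name are placeholders for whatever the server is actually serving:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="facebook/opt-125m",  # placeholder: use the served model's name
    prompt=["The capital of France is", "The capital of Japan is"],
    max_tokens=8,
)
for choice in resp.choices:
    print(choice.text)
```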

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.7...v0.3.0

v0.2.7

4 months ago

Major Changes

  • Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
  • Fix tensor parallelism support for Mixtral + GPTQ/AWQ

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.6...v0.2.7

v0.2.6

4 months ago

Major changes

  • Fast model execution with CUDA/HIP graphs; see the sketch after this list
  • W4A16 GPTQ support (thanks to @chu-tianxiang)
  • Fix memory profiling with tensor parallelism
  • Fix *.bin weight loading for Mixtral models
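
A minimal sketch contrasting graph capture with eager execution, assuming the enforce_eager engine argument; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# By default the decode phase runs via captured CUDA/HIP graphs to cut
# per-step kernel-launch overhead; enforce_eager=True (assumed knob) falls
# back to eager PyTorch execution, e.g. for debugging or to save GPU memory.
llm = LLM(model="facebook/opt-125m", enforce_eager=False)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```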

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.5...v0.2.6