vLLM Versions

A high-throughput and memory-efficient inference and serving engine for LLMs

v0.4.2

1 week ago

Highlights

Features

  • Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode requests (#4580); see the sketch after this list.
  • Speculative decoding functionality: logprobs (#4378) and n-gram speculation (#4237)
  • Support for FlashInfer as an attention backend (#4353)
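
A minimal sketch of trying chunked prefill with the offline LLM API, assuming the enable_chunked_prefill and max_num_batched_tokens engine arguments; the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams

# Assumption: enable_chunked_prefill splits long prompt (prefill) work into
# chunks so the scheduler can interleave it with decode steps.
llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,     # cap on tokens processed per scheduler step
)

outputs = llm.generate(
    ["A long prompt that would otherwise delay decode-heavy requests ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```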

Models and Enhancements

  • Add support for Phi-3-mini (#4298, #4372, #4380)
  • Add more histogram metrics (#2764, #4523)
  • Full tensor parallelism for LoRA layers (#3524)
  • Expand the Marlin kernel to support all GPTQ models (#3922, #4466, #4533)

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454)
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)

Progress and Dev Experience

  • Centralize and document all environment variables (#4548, #4574)
  • Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
  • Progress towards pipeline parallelism (#4512, #4444, #4566)
  • Progress towards multiprocessing based executors (#4348, #4402, #4419)
  • Progress towards FP8 support (#4343, #4332, #4527)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2

v0.4.1

3 weeks ago

Highlights

Features

  • Support and enhance Command R+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), and Mixtral 8x22B (#4073, #4002)
  • Support private model registration and update our model support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option to use LM Format Enforcer for guided decoding (#3868); see the sketch after this list
  • Make tokenizer and detokenizer initialization optional (#3748)
  • Add option to load models using tensorizer (#3476)
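
A rough sketch of guided JSON decoding through the OpenAI-compatible server, assuming the --guided-decoding-backend lm-format-enforcer server flag and the guided_json request extension; the server command, model name, and port are placeholders:

```python
# Assumed server launch (placeholder model):
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-Instruct-v0.2 \
#       --guided-decoding-backend lm-format-enforcer
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific extension field (assumed)
)
print(resp.choices[0].message.content)
```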

Enhancements

  • vLLM is now mostly type checked by mypy (#3816, #4006, #4161, #4043)
  • Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
  • Progress towards speculative decoding (#3250, #3706, #3894)
  • Initial FP8 support with dynamic per-tensor scaling (#4118)

Hardware

  • Added an Intel CPU inference backend (#3993, #3634)
  • Enhanced the AMD backend with Triton kernels and an e4m3fn KV cache (#3643, #3290)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.1

v0.4.0.post1

1 month ago

Highlight

v0.4.0 lacked support for sm70/sm75 GPUs (e.g. V100 and T4). This release is a hotfix that restores it.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.0.post1

v0.4.0

1 month ago

Major changes

Models

  • New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
  • New vision language model: LLaVA (#3042)

Production features

  • Automatic prefix caching (#2762, #3703), which allows long system prompts to be cached and reused across requests. Use the --enable-prefix-caching flag to turn it on; see the sketch after this list.
  • Support for json_object in the OpenAI server for arbitrary JSON output, the --use-delay flag to improve time to first token under many concurrent requests, and min_tokens for EOS suppression.
  • Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after more robustness fixes.
  • Replaced the cupy dependency due to its bugs.
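
A minimal sketch of automatic prefix caching with the offline API, assuming enable_prefix_caching is the Python-side counterpart of the --enable-prefix-caching server flag mentioned above; the model and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# Assumption: with prefix caching on, the KV cache for the shared system
# prompt is computed once and reused by later requests with the same prefix.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

system_prompt = "You are a concise assistant. Answer in one sentence. " * 20
questions = ["What is vLLM?", "What is paged attention?"]

outputs = llm.generate(
    [system_prompt + q for q in questions],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```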

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake-based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.3...v0.4.0

v0.3.3

2 months ago

Major changes

  • StarCoder2 support
  • Performance optimization and LoRA support for Gemma
  • 2/3/8-bit GPTQ support
  • Integrate Marlin kernels for INT4 GPTQ inference; see the sketch after this list
  • Performance optimization for MoE kernel
  • [Experimental] AWS Inferentia2 support
  • [Experimental] Structured output (JSON, Regex) in OpenAI Server
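
A small sketch of loading a GPTQ checkpoint, assuming the quantization engine argument; the checkpoint name is only an example, and whether the Marlin kernel is used depends on the checkpoint format and GPU:

```python
from vllm import LLM, SamplingParams

# quantization="gptq" loads GPTQ-quantized weights; a Marlin-format
# checkpoint could instead be loaded with quantization="marlin" (assumed).
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",  # example 4-bit GPTQ checkpoint
    quantization="gptq",
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```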

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.2...v0.3.3

v0.3.2

2 months ago

Major Changes

This version adds support for the OLMo and Gemma models, as well as the seed parameter.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.1...v0.3.2

v0.3.1

3 months ago

Major Changes

This version fixes the following major issues:

  • A memory leak with distributed execution (solved by using CuPy for collective communication).
  • Support for Python 3.8.

It also includes many smaller bug fixes; see the full changelog below.

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.3.0...v0.3.1

v0.3.0

3 months ago

Major Changes

  • Experimental multi-LoRA support
  • Experimental prefix caching support
  • FP8 KV cache support
  • Optimized MoE performance and DeepSeek MoE support
  • CI-tested PRs
  • Support batch completion in the server; see the sketch after this list
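
A brief sketch of batch completion against the OpenAI-compatible server: the /v1/completions endpoint accepts a list of prompts, so several completions come back from a single request. The base URL and model name are placeholders for whatever the server is actually serving:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="facebook/opt-125m",  # placeholder: use the served model's name
    prompt=["The capital of France is", "The capital of Japan is"],
    max_tokens=8,
)
for choice in resp.choices:
    print(choice.text)
```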

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.7...v0.3.0

v0.2.7

4 months ago

Major Changes

  • Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
  • Fix tensor parallelism support for Mixtral + GPTQ/AWQ

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.6...v0.2.7

v0.2.6

4 months ago

Major changes

  • Fast model execution with CUDA/HIP graphs; see the sketch after this list
  • W4A16 GPTQ support (thanks to @chu-tianxiang)
  • Fix memory profiling with tensor parallelism
  • Fix *.bin weight loading for Mixtral models
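
A minimal sketch contrasting graph capture with eager execution, assuming the enforce_eager engine argument; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# By default the decode phase runs via captured CUDA/HIP graphs to cut
# per-step kernel-launch overhead; enforce_eager=True (assumed knob) falls
# back to eager PyTorch execution, e.g. for debugging or to save GPU memory.
llm = LLM(model="facebook/opt-125m", enforce_eager=False)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```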

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.5...v0.2.6