LMDeploy Versions

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

v0.4.1

6 days ago

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.4.0...v0.4.1

v0.4.0

2 weeks ago

Highlights

Support for Llama 3 and additional Vision-Language Models (VLMs):

  • We now support Llama 3 and an extended range of VLMs, including InternVL v1.1 and v1.2, Mini-Gemini, and InternLM-XComposer2.

Introduce online int4/int8 KV quantization and inference

  • Data-free online quantization: no calibration dataset is needed
  • Supports all NVIDIA GPUs with the Volta architecture (sm70) or newer
  • KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization accuracy remains within an acceptable range
  • Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (see the sketch after this list)
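
A minimal sketch of enabling this through the pipeline API, assuming the documented quant_policy values (8 selects int8, 4 selects int4, 0 keeps the fp16 KV cache); internlm2-chat-7b is just an example model:

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 turns on online int8 KV quantization;
# use 4 for int4, or 0 for the default fp16 KV cache
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('hi, please intro yourself'))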

The following table shows the evaluation results of three LLMs under different KV cache precisions:

| dataset | version | metric | llama2-7b-chat kv fp16 | kv int8 | kv int4 | internlm2-chat-7b kv fp16 | kv int8 | kv int4 | qwen1.5-7b-chat kv fp16 | kv int8 | kv int4 |
| ------- | ------- | ------ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with a quantized KV cache.

| model | kv type | test settings | RPS | v.s. kv fp16 |
| ----- | ------- | ------------- | --- | ------------ |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
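
For reference, the "test settings" column corresponds roughly to the engine configuration below. This is a sketch assuming the TurbomindEngineConfig parameter names, shown for the llama2-chat-7b int8 row:

from lmdeploy import pipeline, TurbomindEngineConfig

# Assumed mapping: tp1 -> tp=1, ratio 0.8 -> cache_max_entry_count=0.8
# (share of free GPU memory given to the k/v cache), bs 256 -> max_batch_size=256
engine_config = TurbomindEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
    max_batch_size=256,
    quant_policy=8,  # int8 KV rows; 4 for int4, 0 for fp16
)
pipe = pipeline('meta-llama/Llama-2-7b-chat-hf', backend_config=engine_config)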

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0

v0.3.0

1 month ago

Highlight

  • Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ RPS on internlm2-7b and 16+ RPS on internlm2-20b, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0

v0.2.6

1 month ago

Highlight

Support vision-language model (VLM) inference pipeline and serving. Currently, the following models are supported: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6), and Yi-VL.

  • VLM Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build a VLM pipeline from a hub model id or a local path
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

# Fetch the image, then pass prompt and image together as one tuple
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Please refer to the detailed inference pipeline guide in the documentation.

  • VLM serving with an OpenAI-compatible server (see the client sketch after this list)
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
  • VLM serving with Gradio
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
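
Once the api_server is running, it can be queried like any OpenAI-compatible endpoint. A minimal client sketch, assuming the server above on port 8000, the official openai Python package, and OpenAI-style image_url content for vision inputs:

from openai import OpenAI

# No real key is needed unless the server was started with one
client = OpenAI(api_key='none', base_url='http://0.0.0.0:8000/v1')

# Ask the server for its model id instead of hard-coding it
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)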

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.5...v0.2.6

v0.2.5

2 months ago

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.4...v0.2.5

v0.2.4

2 months ago

What's Changed

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.3...v0.2.4

v0.2.3

3 months ago

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.2...v0.2.3

v0.2.2

3 months ago

Highlight

  • The TurboMind engine changed its k/v cache allocation strategy. The parameter cache_max_entry_count now denotes the proportion of free GPU memory rather than total GPU memory, with a default value of 0.8. This helps prevent OOM issues. (See the sketch after this list.)
  • The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
# stream_infer yields responses incrementally as tokens are generated
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
  • Add api key and ssl support to api_server
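
A minimal sketch of overriding the new default through the pipeline API, assuming the TurbomindEngineConfig parameter of the same name (0.5 is just an illustrative value):

from lmdeploy import pipeline, TurbomindEngineConfig

# Give the k/v cache 50% of *free* GPU memory instead of the default 80%
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('hi, please intro yourself'))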

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.1...v0.2.2

v0.2.1

3 months ago

What's Changed

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.0...v0.2.1

v0.2.0

3 months ago

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.1.0...v0.2.0