LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Fix qkv bias by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1491
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.4.0...v0.4.1
Support for Llama3 and additional Vision-Language Models (VLMs):
Introduce online int4/int8 KV quantization and inference
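As a quick illustration (a minimal sketch, not part of the release notes; the model name and prompt are placeholders), online KV cache quantization is selected through the `quant_policy` field of the engine config, where 8 means int8 and 4 means int4:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables online int8 KV cache quantization,
# quant_policy=4 enables int4; 0 keeps the KV cache unquantized
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('hi, please intro yourself'))
```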
The following table shows the evaluation results of three LLMs with different KV cache numerical precisions:
| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with quantized KV cache.
| model | kv type | test settings | RPS | vs. kv fp16 |
|---|---|---|---|---|
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
Add `sm_89` and `sm_90` targets by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
Fix `ArgumentError` error that happened in Python 3.11 by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0
`[4,5,6,8]` by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1258
`max_prefill_token_num` for low gpu memory by @grimoire in https://github.com/InternLM/lmdeploy/pull/1373
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0
Support the vision-language model (VLM) inference pipeline and serving. Currently, it supports the following models: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6), and Yi-VL.
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# build a VLM inference pipeline from a huggingface model id
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

# run inference on a (prompt, image) pair
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
Please refer to the detailed guide here.
```shell
# launch an OpenAI-compatible RESTful API server
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
# or launch a gradio web demo
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
```
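Once the `api_server` is up, it can be queried through its OpenAI-compatible interface. The snippet below is a hedged sketch (host, port, and API key are assumptions; the image URL is the same one used above) using the official `openai` Python client:

```python
from openai import OpenAI

# the api_server exposes OpenAI-compatible endpoints under /v1
client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:8000/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {
                 'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'
             }},
        ],
    }],
)
print(response.choices[0].message.content)
```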
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.5...v0.2.6
`min_new_tokens` generation config in pytorch engine by @grimoire in https://github.com/InternLM/lmdeploy/pull/1096 (see the sketch below)
`model_name` by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1188
`max_prefill_token_num` by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1203
`None` session_len by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1230
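For the `min_new_tokens` entry above, a minimal sketch of how such a sampling option is typically passed (model name and values are placeholders; the PyTorch engine is selected explicitly since the entry refers to it):

```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig())
# require at least 32 new tokens before EOS is allowed, cap the output at 256
gen_config = GenerationConfig(min_new_tokens=32, max_new_tokens=256)
print(pipe('hi, please intro yourself', gen_config=gen_config))
```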
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.4...v0.2.5
`docker run` directly by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1162
`top_k` in ChatCompletionRequest by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1174
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.3...v0.2.4
`get_logger` to remove the dependency of MMLogger from mmengine by @yinfan98 in https://github.com/InternLM/lmdeploy/pull/1064
`ignore_eos` logic by @grimoire in https://github.com/InternLM/lmdeploy/pull/1099
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.2...v0.2.3
`cache_max_entry_count` now means the proportion of GPU FREE memory rather than TOTAL memory, and its default value is updated to 0.8. This helps prevent OOM issues.
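As a minimal sketch of how this ratio is adjusted in practice (model name and value are placeholders):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of *free* GPU memory reserved
# for the KV cache; lower it further if OOM still occurs (default: 0.8)
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
```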
```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')
# stream_infer yields the response incrementally as it is generated
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
`api_server`
`tp>1` by @grimoire in https://github.com/InternLM/lmdeploy/pull/942
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.1...v0.2.2
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.0...v0.2.1
`lmdeploy lite calibrate` and `lmdeploy lite auto_awq` by @pppppM in https://github.com/InternLM/lmdeploy/pull/849
Move `api_server` dependencies from serve.txt to runtime.txt by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/879
Remove the `flash-attn` dependency of the lmdeploy lite module by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/917
Remove `tp` from the pipeline argument list by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/947
Fix `calibrate` bug when `transformers>4.36` by @pppppM in https://github.com/InternLM/lmdeploy/pull/967
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.1.0...v0.2.0