LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Fix qkv bias by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1491
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.4.0...v0.4.1
Support for Llama3 and additional Vision-Language Models (VLMs):
Introduce online int4/int8 KV quantization and inference
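As a quick illustration (a minimal sketch, not part of the release notes; the model name and prompt are placeholders), online KV cache quantization is selected through the `quant_policy` field of the engine config, where 8 means int8 and 4 means int4:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables online int8 KV cache quantization,
# quant_policy=4 enables int4; 0 keeps the KV cache unquantized
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('hi, please intro yourself'))
```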
The following table shows the evaluation results of three LLMs with different KV cache numerical precisions:
| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with quantized KV cache.
| model | kv type | test settings | RPS | vs. kv fp16 |
|---|---|---|---|---|
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
Add `sm_89` and `sm_90` targets by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
Fix `ArgumentError` error that happened in Python 3.11 by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0
`[4,5,6,8]` by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1258
`max_prefill_token_num` for low gpu memory by @grimoire in https://github.com/InternLM/lmdeploy/pull/1373
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0
Support the vision-language model (VLM) inference pipeline and serving. Currently, it supports the following models: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6), and Yi-VL.
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# build a VLM inference pipeline from a huggingface model id
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

# run inference on a (prompt, image) pair
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
Please refer to the detailed guide here.
```shell
# launch an OpenAI-compatible RESTful API server
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
# or launch a gradio web demo
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
```
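Once the `api_server` is up, it can be queried through its OpenAI-compatible interface. The snippet below is a hedged sketch (host, port, and API key are assumptions; the image URL is the same one used above) using the official `openai` Python client:

```python
from openai import OpenAI

# the api_server exposes OpenAI-compatible endpoints under /v1
client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:8000/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {
                 'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'
             }},
        ],
    }],
)
print(response.choices[0].message.content)
```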
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.5...v0.2.6
`min_new_tokens` generation config in pytorch engine by @grimoire in https://github.com/InternLM/lmdeploy/pull/1096 (see the sketch below)
`model_name` by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1188
`max_prefill_token_num` by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1203
`None` session_len by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1230
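For the `min_new_tokens` entry above, a minimal sketch of how such a sampling option is typically passed (model name and values are placeholders; the PyTorch engine is selected explicitly since the entry refers to it):

```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig())
# require at least 32 new tokens before EOS is allowed, cap the output at 256
gen_config = GenerationConfig(min_new_tokens=32, max_new_tokens=256)
print(pipe('hi, please intro yourself', gen_config=gen_config))
```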
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.4...v0.2.5
`docker run` directly by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1162
`top_k` in ChatCompletionRequest by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1174
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.3...v0.2.4
`get_logger` to remove the dependency of MMLogger from mmengine by @yinfan98 in https://github.com/InternLM/lmdeploy/pull/1064
`ignore_eos` logic by @grimoire in https://github.com/InternLM/lmdeploy/pull/1099
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.2...v0.2.3
`cache_max_entry_count` now means the proportion of GPU FREE memory rather than TOTAL memory, and its default value is updated to 0.8. This helps prevent OOM issues.
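As a minimal sketch of how this ratio is adjusted in practice (model name and value are placeholders):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of *free* GPU memory reserved
# for the KV cache; lower it further if OOM still occurs (default: 0.8)
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
```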
```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')
# stream_infer yields the response incrementally as it is generated
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
`api_server`
`tp>1` by @grimoire in https://github.com/InternLM/lmdeploy/pull/942
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.1...v0.2.2
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.2.0...v0.2.1
`lmdeploy lite calibrate` and `lmdeploy lite auto_awq` by @pppppM in https://github.com/InternLM/lmdeploy/pull/849
Move `api_server` dependencies from serve.txt to runtime.txt by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/879
Remove the `flash-attn` dependency of the lmdeploy lite module by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/917
Remove `tp` from the pipeline argument list by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/947
Fix `calibrate` bug when `transformers>4.36` by @pppppM in https://github.com/InternLM/lmdeploy/pull/967
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.1.0...v0.2.0