- Support P-Tuning v2 finetuned models for the ChatGLM family
- Fix convert.py for LoRA models & chatglm3-6b-128k
- Fix RoPE theta config for 32k/128k sequence lengths
- Better CUDA CMake script that respects the nvcc version

v0.3.1 (4 months ago)
- Support function calling in the OpenAI API server (see the sketch after this list)
- Faster repetition penalty sampling
- Support the max_new_tokens generation option
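
For illustration, a function-calling request against the OpenAI-compatible endpoint might look like the sketch below. The host, port, model name, the get_weather tool, and the exact tools schema the server expects are all assumptions, since these notes only state that function calling is supported; max_tokens is the standard OpenAI field, and how it maps onto the new max_new_tokens option is not spelled out here.

```python
import requests

# Hypothetical local deployment of the OpenAI-compatible API server;
# host, port, and the "tools" field shape are assumptions.
payload = {
    "model": "chatglm3-6b",
    "messages": [{"role": "user", "content": "What is the weather in Beijing today?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "max_tokens": 256,
}
response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
# An OpenAI-style response carries the reply (or tool call) in choices[0].message.
print(response.json()["choices"][0]["message"])
```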

v0.3.0 (6 months ago)
- Full functionality of ChatGLM3, including system prompt, function call, and code interpreter
- Brand-new OpenAI-style chat API (example after this list)
- Add token usage information to the OpenAI API server for compatibility with the LangChain frontend
- Fix conversion error for chatglm3-6b-32k
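
The reworked chat API takes a list of role-tagged messages instead of raw prompt strings. A minimal sketch with the Python binding, assuming a locally converted GGML model file at a placeholder path; only the basic role/content message fields are shown:

```python
import chatglm_cpp

# Placeholder path to a converted ChatGLM3 model file (assumption).
pipeline = chatglm_cpp.Pipeline("./models/chatglm3-ggml.bin")

messages = [
    chatglm_cpp.ChatMessage(role="system", content="You are a helpful assistant."),
    chatglm_cpp.ChatMessage(role="user", content="What is the capital of France?"),
]

# chat() returns the assistant's reply as a message object.
reply = pipeline.chat(messages)
print(reply.content)
```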

v0.2.10 (6 months ago)
- Support ChatGLM3 in conversation mode
- Coming soon: a new prompt format for system messages and function calls

v0.2.9 (7 months ago)
- Support InternLM 7B & 20B model architectures

v0.2.8 (7 months ago)
- Metal backend support for all models (ChatGLM, ChatGLM2, Baichuan-7B & Baichuan-13B)
- Fix GLM generation on CUDA for long contexts

v0.2.7 (8 months ago)
- Support the Baichuan-7B model architecture (works for both Baichuan v1 & v2)
- Minor bug fixes and enhancements

v0.2.6 (8 months ago)
- Support Baichuan-13B on CPU & CUDA backends
- Bug fixes for Windows and Metal

v0.2.5 (9 months ago)
- Optimize context computation (GEMM) for the Metal backend
- Support a repetition penalty option for generation (see the sketch after this list)
- Update the Dockerfile for CPU & CUDA backends with full functionality; images are hosted on GHCR
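
The penalty option here, and the faster sampling in v0.3.1 above, refer to the CTRL-style rule common to ggml-based projects: logits of tokens that have already appeared are scaled down before the next token is sampled. A sketch of that conventional rule, not necessarily chatglm.cpp's exact kernel:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Scale down logits of previously generated tokens (CTRL-style rule).

    Dividing a positive logit (or multiplying a negative one) by the
    penalty makes repeated tokens less likely at the next sampling step.
    """
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

# Example: token 2 was already generated, so its logit shrinks.
print(apply_repetition_penalty([1.5, -0.3, 2.0], generated_ids=[2]))
# [1.5, -0.3, 1.8181...]
```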

v0.2.4 (9 months ago)
- Python binding enhancement: support loading and converting directly from original Hugging Face models, so intermediate GGML model files are no longer necessary (sketch below)
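
Per the note above, the binding can now ingest an original Hugging Face checkpoint and convert it on the fly. A minimal sketch, assuming the Pipeline constructor accepts a Hugging Face repo id or local model directory; the repo id shown is illustrative:

```python
import chatglm_cpp

# Previously a GGML file had to be produced first with convert.py;
# loading the original Hugging Face model directly is the new path.
# The repo id is an assumption for illustration.
pipeline = chatglm_cpp.Pipeline("THUDM/chatglm2-6b")

# The chat API of this era took a list of prompt strings; the
# ChatMessage-based API arrived later with v0.3.0.
print(pipeline.chat(["你好"]))
```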