CTranslate2 Releases

Fast inference engine for Transformer models

v4.2.1

2 weeks ago

Note: Because the package size grew beyond 100 MB, the v4.2.0 release could not be pushed successfully.

New features

  • Support load/unload for Generator/Whisper (#1670)
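
A minimal sketch of the load/unload workflow, assuming the load_model()/unload_model() methods already exposed on Translator are now also available on Generator and Whisper (the model path is hypothetical):

```python
import ctranslate2

generator = ctranslate2.Generator("gemma-ct2/", device="cuda")  # hypothetical path

# Move the weights off the GPU but keep them cached in host memory.
generator.unload_model(to_cpu=True)

# ... the GPU memory is free for other work here ...

# Restore the weights to the GPU before generating again.
generator.load_model()
```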

Fixes and improvements

  • Fix Llama 3 (#1671)

v4.2.0

1 month ago

New features

  • Support Flash Attention (#1651); see the sketch after this list
  • Implement GEMM for the FLOAT32 compute type with the Ruy backend (#1598)
  • Conv1D quantization on CPU only (the DNNL and CUDA backends are not supported) (#1601)
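
A minimal sketch of enabling Flash Attention, assuming the flash_attention constructor flag added by #1651; it requires a CUDA device with a supported GPU architecture, and the model path and tokens are hypothetical:

```python
import ctranslate2

# flash_attention is assumed to be a constructor option of Translator/Generator;
# it only takes effect on CUDA devices with a supported GPU architecture.
translator = ctranslate2.Translator("ende-ct2/", device="cuda", flash_attention=True)

# Inputs are pre-tokenized; the tokens below are illustrative.
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```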

Fixes and improvements

  • Fix a bug in tensor parallelism (#1643)
  • Use BestSampler when the temperature is 0 (#1659); see the sketch after this list
  • Fix a bug in the Gemma model (#1660)
  • Optimize loading/unloading time for Translator with a cache (#1645)
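
A short sketch of the temperature change: with sampling_temperature=0, decoding now falls back to the greedy BestSampler, so generation is deterministic (the model path and tokens are hypothetical):

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2/", device="cpu")  # hypothetical path

# sampling_temperature=0 now selects the greedy BestSampler instead of
# attempting to divide the logits by zero.
results = generator.generate_batch([["<s>", "▁Hello"]],
                                   sampling_temperature=0,
                                   max_length=16)
print(results[0].sequences[0])
```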

v4.1.1

1 month ago

Fixes and improvements

  • Fix the classifiers in setup.py so the PyPI package can be published

v4.1.0

1 month ago

New features

  • Support the Gemma model (#1631)
  • Support Tensor Parallelism (#1599)
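
A minimal sketch of tensor parallelism, assuming the tensor_parallel constructor flag introduced by #1599; the script is launched across GPUs with MPI, and the paths and process counts are illustrative:

```python
import ctranslate2

# Launch with: mpirun -np 2 python run.py
# Each rank is assumed to load its shard of the weights when
# tensor_parallel=True is set.
translator = ctranslate2.Translator("ende-ct2/", device="cuda", tensor_parallel=True)
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```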

Fixes and improvements

  • Avoid initializing unused GPUs (#1633)
  • Read very large tensors in chunks when the size exceeds the maximum int value (#1636)
  • Update the README

v4.0.0

2 months ago

This major version introduces a breaking change: the update to CUDA 12.

Breaking changes

Python

  • Support CUDA 12

New features

  • Add a to_device() method to the Python StorageView class to move data between host and device
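
A minimal sketch, assuming to_device() accepts a ctranslate2.Device value (the exact argument type is an assumption; check the Python reference for this release):

```python
import numpy as np
import ctranslate2

x = ctranslate2.StorageView.from_array(np.ones((2, 4), dtype=np.float32))

# Assumption: to_device() takes a ctranslate2.Device enum value.
x_cuda = x.to_device(ctranslate2.Device.cuda)    # host -> device
x_host = x_cuda.to_device(ctranslate2.Device.cpu)  # device -> host
```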

Fixes and improvements

  • Implement Conv1D with im2col and GEMM to improve performance; see the sketch after this list
  • Keep tokens within the range of the vocabulary size for Llama models
  • Fix a performance regression
  • Update cibuildwheel to 2.16.5
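
An illustrative NumPy sketch of the im2col idea (not the actual C++ implementation): the sliding windows of the input are unfolded into a matrix so the whole convolution reduces to a single GEMM:

```python
import numpy as np

def conv1d_im2col(x, w):
    """x: (in_channels, length), w: (out_channels, in_channels, kernel)."""
    ic, length = x.shape
    oc, _, k = w.shape
    out_len = length - k + 1
    # Unfold each sliding window into a column: cols[c, j, t] = x[c, t + j].
    cols = np.stack([x[:, j:j + out_len] for j in range(k)], axis=1)
    # A single GEMM then computes every output position at once.
    return w.reshape(oc, ic * k) @ cols.reshape(ic * k, out_len)
```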

v3.24.0

4 months ago

New features

  • Support a new offset option to ignore the token scores of special tokens

v3.23.0

5 months ago

New features

  • Support the Phi model

Fixes and improvements

  • Fix the conversion for Whisper models without "alignment_heads" in "generation_config.json"
  • Fix the forward_batch method

v3.22.0

5 months ago

New features

  • Support "sliding window" and "chunking input" for Mistral

Fixes and improvements

  • Take "generation_config.json" into account and fix the "lang_ids" getter in the Whisper converter
  • Accept a callback even in the generate_tokens method
  • Fix iomp5 linking with the latest Intel oneAPI on Ubuntu
  • Fix "decoder_start_token_id" for T5

v3.21.0

6 months ago

New features

  • Minimal support for Mistral (loader and rotary extension for long sequences); no sliding window yet
  • Support Distil-Whisper
  • Support Whisper-large-v3

v3.20.0

7 months ago

New features

  • Update the Transformers converter to support more model architectures:
    • MixFormerSequential (used by microsoft/phi-1_5)
  • Accept batch inputs in the generate_tokens methods
  • Add the method Generator.async_generate_tokens to return an asynchronous generator compatible with asyncio
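
A minimal sketch of asynchronous token streaming with asyncio (the model path and prompt tokens are hypothetical):

```python
import asyncio
import ctranslate2

async def main():
    generator = ctranslate2.Generator("model-ct2/", device="cpu")  # hypothetical path
    prompt = ["<s>", "▁Hello"]  # an already-tokenized prompt
    async for step in generator.async_generate_tokens(prompt, max_length=32):
        print(step.token, end="", flush=True)

asyncio.run(main())
```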

Fixes and improvements

  • Remove the epsilon value in the softmax CPU kernel for consistency with other implementations
  • Optimize the implementation of the Dynamic Time Warping (DTW) function (used for Whisper alignment)
  • Avoid an unnecessary copy of the input arguments in method Whisper::align