CTranslate2 Releases

Fast inference engine for Transformer models

v4.2.1

2 weeks ago

Note: Because the package size grew beyond 100 MB, the v4.2.0 release could not be pushed successfully.

New features

  • Support load/unload for Generator/Whisper (#1670)
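
A minimal sketch of the load/unload workflow, assuming the load_model()/unload_model() methods already exposed on Translator are now also available on Generator and Whisper (the model path is hypothetical):

```python
import ctranslate2

generator = ctranslate2.Generator("gemma-ct2/", device="cuda")  # hypothetical path

# Move the weights off the GPU but keep them cached in host memory.
generator.unload_model(to_cpu=True)

# ... the GPU memory is free for other work here ...

# Restore the weights to the GPU before generating again.
generator.load_model()
```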

Fixes and improvements

  • Fix Llama 3 (#1671)

v4.2.0

1 month ago

New features

  • Support Flash Attention (#1651); see the sketch after this list
  • Implement GEMM for the FLOAT32 compute type with the Ruy backend (#1598)
  • Conv1D quantization on CPU only (the DNNL and CUDA backends are not supported) (#1601)
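
A minimal sketch of enabling Flash Attention, assuming the flash_attention constructor flag added by #1651; it requires a CUDA device with a supported GPU architecture, and the model path and tokens are hypothetical:

```python
import ctranslate2

# flash_attention is assumed to be a constructor option of Translator/Generator;
# it only takes effect on CUDA devices with a supported GPU architecture.
translator = ctranslate2.Translator("ende-ct2/", device="cuda", flash_attention=True)

# Inputs are pre-tokenized; the tokens below are illustrative.
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```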

Fixes and improvements

  • Fix a bug in tensor parallelism (#1643)
  • Use BestSampler when the temperature is 0 (#1659); see the sketch after this list
  • Fix a bug in the Gemma model (#1660)
  • Optimize loading/unloading time for Translator with a cache (#1645)
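
A short sketch of the temperature change: with sampling_temperature=0, decoding now falls back to the greedy BestSampler, so generation is deterministic (the model path and tokens are hypothetical):

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2/", device="cpu")  # hypothetical path

# sampling_temperature=0 now selects the greedy BestSampler instead of
# attempting to divide the logits by zero.
results = generator.generate_batch([["<s>", "▁Hello"]],
                                   sampling_temperature=0,
                                   max_length=16)
print(results[0].sequences[0])
```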

v4.1.1

1 month ago

Fixes and improvements

  • Fix the classifiers in setup.py so the PyPI package can be published

v4.1.0

1 month ago

New features

  • Support the Gemma model (#1631)
  • Support Tensor Parallelism (#1599)
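
A minimal sketch of tensor parallelism, assuming the tensor_parallel constructor flag introduced by #1599; the script is launched across GPUs with MPI, and the paths and process counts are illustrative:

```python
import ctranslate2

# Launch with: mpirun -np 2 python run.py
# Each rank is assumed to load its shard of the weights when
# tensor_parallel=True is set.
translator = ctranslate2.Translator("ende-ct2/", device="cuda", tensor_parallel=True)
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```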

Fixes and improvements

  • Avoid initializing unused GPUs (#1633)
  • Read very large tensors in chunks when the size exceeds the maximum int value (#1636)
  • Update the README

v4.0.0

2 months ago

This major version introduces a breaking change: the update to CUDA 12.

Breaking changes

Python

  • Support CUDA 12

New features

  • Add a to_device() method to the Python StorageView class to move data between host and device
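
A minimal sketch, assuming to_device() accepts a ctranslate2.Device value (the exact argument type is an assumption; check the Python reference for this release):

```python
import numpy as np
import ctranslate2

x = ctranslate2.StorageView.from_array(np.ones((2, 4), dtype=np.float32))

# Assumption: to_device() takes a ctranslate2.Device enum value.
x_cuda = x.to_device(ctranslate2.Device.cuda)    # host -> device
x_host = x_cuda.to_device(ctranslate2.Device.cpu)  # device -> host
```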

Fixes and improvements

  • Implement Conv1D with im2col and GEMM to improve performance; see the sketch after this list
  • Keep tokens within the range of the vocabulary size for Llama models
  • Fix a performance regression
  • Update cibuildwheel to 2.16.5
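
An illustrative NumPy sketch of the im2col idea (not the actual C++ implementation): the sliding windows of the input are unfolded into a matrix so the whole convolution reduces to a single GEMM:

```python
import numpy as np

def conv1d_im2col(x, w):
    """x: (in_channels, length), w: (out_channels, in_channels, kernel)."""
    ic, length = x.shape
    oc, _, k = w.shape
    out_len = length - k + 1
    # Unfold each sliding window into a column: cols[c, j, t] = x[c, t + j].
    cols = np.stack([x[:, j:j + out_len] for j in range(k)], axis=1)
    # A single GEMM then computes every output position at once.
    return w.reshape(oc, ic * k) @ cols.reshape(ic * k, out_len)
```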

v3.24.0

4 months ago

New features

  • Support a new offset option to ignore the token scores of special tokens

v3.23.0

5 months ago

New features

  • Support the Phi model

Fixes and improvements

  • Fix the conversion for Whisper models without "alignment_heads" in "generation_config.json"
  • Fix the forward_batch method

v3.22.0

5 months ago

New features

  • Support "sliding window" and "chunking input" for Mistral

Fixes and improvements

  • Take "generation_config.json" into account and fix the "lang_ids" getter in the Whisper converter
  • Accept a callback even in the generate_tokens method
  • Fix iomp5 linking with the latest Intel oneAPI on Ubuntu
  • Fix "decoder_start_token_id" for T5

v3.21.0

6 months ago

New features

  • Minimal support for Mistral (loader and rotary extension for long sequences); no sliding window yet
  • Support Distil-Whisper
  • Support Whisper-large-v3

v3.20.0

7 months ago

New features

  • Update the Transformers converter to support more model architectures:
    • MixFormerSequential (used by microsoft/phi-1_5)
  • Accept batch inputs in the generate_tokens methods
  • Add the method Generator.async_generate_tokens to return an asynchronous generator compatible with asyncio
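
A minimal sketch of asynchronous token streaming with asyncio (the model path and prompt tokens are hypothetical):

```python
import asyncio
import ctranslate2

async def main():
    generator = ctranslate2.Generator("model-ct2/", device="cpu")  # hypothetical path
    prompt = ["<s>", "▁Hello"]  # an already-tokenized prompt
    async for step in generator.async_generate_tokens(prompt, max_length=32):
        print(step.token, end="", flush=True)

asyncio.run(main())
```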

Fixes and improvements

  • Remove the epsilon value in the softmax CPU kernel for consistency with other implementations
  • Optimize the implementation of the Dynamic Time Warping (DTW) function (used for Whisper alignment)
  • Avoid an unnecessary copy of the input arguments in method Whisper::align