Faster Whisper Versions

Faster Whisper transcription with CTranslate2

v0.5.1

Fix download_root to correctly set the cache directory where the models are downloaded.

v0.5.0

Improved logging

Some information is now logged at the INFO and DEBUG levels. The logging level can be configured like this:

import logging

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

More control over model downloads

New arguments were added to the WhisperModel constructor to better control how the models are downloaded:

  • download_root to specify where the model should be downloaded.
  • local_files_only to avoid downloading the model and directly return the path to the cached model, if it exists (see the sketch after this list).
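
For example, the two arguments can be used when creating the model. This is a minimal sketch; the "small" model size and the models download directory are placeholders, not defaults:

from faster_whisper import WhisperModel

# Download the model files (or reuse them) under ./models instead of the default cache.
model = WhisperModel("small", download_root="models")

# Do not download anything: only load the model if it is already in the cache.
offline_model = WhisperModel("small", local_files_only=True)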

Other changes

  • Improve the default VAD behavior to prevent some words from being assigned to the incorrect speech chunk in the original audio
  • Fix incorrect application of option condition_on_previous_text=False (note that the bug still exists in openai/whisper v20230314)
  • Fix segment timestamps that are sometimes inconsistent with the word timestamps after VAD
  • Extend the Segment structure with additional properties to match openai/whisper
  • Rename AudioInfo to TranscriptionInfo and add a new property options to summarize the transcription options that were used (see the sketch after this list)
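
A minimal sketch of the renamed structure, assuming model is an existing WhisperModel instance and that the language and language_probability properties carried over from AudioInfo:

segments, info = model.transcribe("audio.mp3")

# info is now a TranscriptionInfo instead of an AudioInfo.
print(info.language, info.language_probability)
print(info.options)  # summary of the transcription options that were used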

v0.4.1

Fix some IndexError exceptions:

  • when VAD is enabled and a predicted timestamp is after the last speech chunk
  • when word timestamps are enabled and the model predicts a token sequence that is decoded to invalid Unicode characters

v0.4.0

Integration of Silero VAD

The Silero VAD model is integrated to ignore parts of the audio without speech:

model.transcribe(..., vad_filter=True)

The default behavior is conservative and only removes silence longer than 2 seconds. See the README for how to customize the VAD parameters.
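
For example, a complete transcription with the VAD filter enabled could look like the following minimal sketch, where the "base" model size and the audio.mp3 file name are placeholders:

from faster_whisper import WhisperModel

model = WhisperModel("base")

# Non-speech parts of the audio are dropped by the Silero VAD before transcription.
segments, info = model.transcribe("audio.mp3", vad_filter=True)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))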

Note: the Silero model is executed with onnxruntime, which is currently not released for Python 3.11. The dependency is excluded for this Python version, so the VAD features cannot be used on Python 3.11.

Speaker diarization using stereo channels

The function decode_audio has a new argument split_stereo to split stereo audio into separate left and right channels:

from faster_whisper import decode_audio

left, right = decode_audio(audio_file, split_stereo=True)

# model.transcribe(left)
# model.transcribe(right)
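
Putting it together, each channel can be transcribed separately to produce a per-speaker transcript. This is a minimal sketch, where the call.wav recording (one speaker per channel) and the "base" model size are placeholders:

from faster_whisper import WhisperModel, decode_audio

model = WhisperModel("base")
left, right = decode_audio("call.wav", split_stereo=True)

# Transcribe each channel and tag the output with the channel it came from.
for speaker, channel in (("left", left), ("right", right)):
    segments, _ = model.transcribe(channel)
    for segment in segments:
        print("[%s] %s" % (speaker, segment.text))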

Other changes

  • Add Segment attributes avg_log_prob and no_speech_prob (same definition as openai/whisper)
  • Ignore audio frames raising an av.error.InvalidDataError exception during decoding
  • Fix option prefix to be passed only to the first 30-second window
  • Extend suppress_tokens with some special tokens that should always be suppressed (unless suppress_tokens is None)
  • Raise a more helpful error message when the selected model size is invalid
  • Disable the progress bar when the model to download is already in the cache

v0.3.0

  • Converted models are now available on the Hugging Face Hub and are automatically downloaded when creating a WhisperModel instance. The conversion step is no longer required for the original Whisper models.

from faster_whisper import WhisperModel

# Automatically download https://huggingface.co/guillaumekln/faster-whisper-large-v2
model = WhisperModel("large-v2")

  • Run the encoder only once for each 30-second window. Before this change, the same window could be encoded multiple times, for example during the temperature fallback or when word-level timestamps are enabled.

v0.2.0

Initial publication of the library on PyPI: https://pypi.org/project/faster-whisper/