whisper.cpp Release Notes

Port of OpenAI's Whisper model in C/C++

v1.5.5

1 week ago

Overview

Many small incremental updates, plus token-level timestamps with DTW by @denersc in https://github.com/ggerganov/whisper.cpp/pull/1485. Feedback is welcome!
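
For reference, a minimal sketch of enabling the DTW token-level timestamps via the C API. The field and enum names here (dtw_token_timestamps, dtw_aheads_preset, WHISPER_AHEADS_BASE_EN, t_dtw) are taken from the PR and should be checked against the whisper.h shipped with this release:

    #include <whisper.h>

    int main(void) {
        struct whisper_context_params cparams = whisper_context_default_params();
        cparams.dtw_token_timestamps = true;                   // enable DTW timestamps
        cparams.dtw_aheads_preset    = WHISPER_AHEADS_BASE_EN; // alignment heads for base.en

        struct whisper_context * ctx =
            whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
        if (!ctx) return 1;

        // ... run whisper_full() as usual; per-token DTW timestamps are then
        // available via whisper_full_get_token_data(ctx, i_segment, i_token).t_dtw

        whisper_free(ctx);
        return 0;
    }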

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.4...v1.5.5

v1.5.4

3 months ago

Overview

  • Faster Core ML ANE models (#1716)
  • Fix a CUDA bug causing random errors in the transcription
  • Fix SwiftUI example build

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.3...v1.5.4

v1.5.3

3 months ago

Overview

Minor maintenance release:

  • Fix CUDA issues where the transcription produces garbage
  • Fix quantized models to work with the CUDA backend
  • Allow whisper.cpp and llama.cpp to be used together in SwiftUI projects

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.2...v1.5.3

v1.5.2

4 months ago

Overview

Minor maintenance release:

  • Re-enable CPU BLAS processing after fixing a regression (#1583)

Add new example: wchess

https://github.com/ggerganov/whisper.cpp/assets/1991296/c2b2f03c-9684-49f3-8106-357d2d4e67fa

Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.1...v1.5.2

v1.5.1

5 months ago

Overview

Minor update:

  • With Metal, automatically fall back to the CPU if the device does not support the Apple7 GPU family
  • Add server example

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.0...v1.5.1

v1.5.0

5 months ago

Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since September 15. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional Encoder speed-up in this release:


Encoder performance on Apple M1 Max - before and after (plot by @dreness)

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make

Implementation: https://github.com/ggerganov/whisper.cpp/pull/1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp now supports efficient Beam Search decoding. The missing piece was an implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, decoding with 5 beams is a bit slower than with 1 beam, but it is significantly faster than the 5x slowdown observed with the old naive implementation.

Beam Search is now enabled by default in whisper.cpp to match the original OpenAI Whisper implementation. For more performance details, check out the Benchmarks section below.

Implementation: https://github.com/ggerganov/whisper.cpp/pull/1486
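
For reference, the beam size can be configured through the C API (the main example also exposes it via the -bs flag). A minimal sketch, assuming ctx holds an initialized context and pcm holds 16 kHz mono float samples:

    #include <whisper.h>

    void transcribe_beam(struct whisper_context * ctx, const float * pcm, int n_samples) {
        struct whisper_full_params wparams =
            whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
        wparams.beam_search.beam_size = 5; // the new default

        whisper_full(ctx, wparams, pcm, n_samples);
    }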

Quantization support

All ggml quantization types are now supported, and quantization mixtures for the Whisper model can be implemented. It is still unclear how quantization affects the quality - this is an interesting area to explore in the future.
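
As a concrete example, a ggml model can be quantized with the bundled quantize tool and then used as a drop-in replacement (the paths below assume the base.en model has already been downloaded):

# quantize to Q5_0 (q4_0, q4_1, q5_1, q8_0, etc. are also accepted)
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# use the quantized model exactly like the f16 one
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav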

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
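
For illustration, the grammar format is the same GBNF used by llama.cpp; a hypothetical grammar limiting the output to a handful of voice commands could look like:

root ::= ("start" | "stop" | "move " dir) "."
dir  ::= "left" | "right" | "up" | "down"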

https://github.com/ggerganov/whisper.cpp/assets/377495/d24716e2-5e9c-441b-8c6b-395922dccbf4

Implementation: https://github.com/ggerganov/whisper.cpp/pull/1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.
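
Once a distilled model has been converted to ggml format, it runs like any other model. A sketch (the file name here is hypothetical - convert the Hugging Face model to ggml first):

./main -m models/ggml-distil-medium.en.bin -f samples/jfk.wav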

Implementation: https://github.com/ggerganov/whisper.cpp/pull/1424

Whisper Large-v3

Recently, OpenAI released version 3 of the Large model: https://github.com/openai/whisper/pull/1761

Implementation: https://github.com/ggerganov/whisper.cpp/pull/1444
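
A sketch of downloading and running the new model with the bundled download script:

# download the large-v3 model and transcribe with it
bash ./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f samples/jfk.wav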

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.

| Hw       | Config | Model         | Th | Enc.   | Dec.  | Bch5 | PP   | Commit  |
|----------|--------|---------------|----|--------|-------|------|------|---------|
| M2 Ultra | METAL  | tiny          | 1  | 11.14  | 1.40  | 0.49 | 0.01 | ccc85b4 |
| M2 Ultra | METAL  | tiny-q5_0     | 1  | 11.51  | 1.41  | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL  | tiny-q5_1     | 1  | 12.21  | 1.41  | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL  | base          | 1  | 20.21  | 2.05  | 0.77 | 0.02 | ccc85b4 |
| M2 Ultra | METAL  | base-q5_0     | 1  | 19.89  | 1.96  | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL  | base-q5_1     | 1  | 20.14  | 2.02  | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL  | small         | 1  | 51.01  | 3.97  | 1.74 | 0.05 | ccc85b4 |
| M2 Ultra | METAL  | small-q5_0    | 1  | 56.86  | 4.09  | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL  | small-q5_1    | 1  | 56.81  | 4.14  | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL  | medium        | 1  | 141.21 | 8.47  | 3.98 | 0.13 | ccc85b4 |
| M2 Ultra | METAL  | medium-q5_0   | 1  | 160.56 | 8.27  | 4.18 | 0.14 | ccc85b4 |
| M2 Ultra | METAL  | medium-q5_1   | 1  | 160.52 | 8.40  | 4.15 | 0.14 | ccc85b4 |
| M2 Ultra | METAL  | medium-dis    | 1  | 128.14 | 1.13  | 0.43 | 0.02 | ccc85b4 |
| M2 Ultra | METAL  | large-v2      | 1  | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
| M2 Ultra | METAL  | large-v2-q5_0 | 1  | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
| M2 Ultra | METAL  | large-v2-q5_1 | 1  | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
| M2 Ultra | METAL  | large-v2-dis  | 1  | 224.31 | 1.26  | 0.49 | 0.02 | ccc85b4 |

| Hw       | Config       | Model    | Th | Enc.   | Dec.  | Bch5 | PP   | Commit  |
|----------|--------------|----------|----|--------|-------|------|------|---------|
| M2 Ultra | COREML METAL | tiny     | 1  | 7.60   | 1.41  | 0.50 | 0.01 | ccc85b4 |
| M2 Ultra | COREML METAL | base     | 1  | 11.90  | 2.07  | 0.78 | 0.02 | ccc85b4 |
| M2 Ultra | COREML METAL | small    | 1  | 32.19  | 4.10  | 1.78 | 0.05 | ccc85b4 |
| M2 Ultra | COREML METAL | medium   | 1  | 94.43  | 8.40  | 3.89 | 0.12 | ccc85b4 |
| M2 Ultra | COREML METAL | large-v2 | 1  | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |

| Hw          | Config    | Model         | Th | Enc.   | Dec.  | Bch5 | PP   | Commit  |
|-------------|-----------|---------------|----|--------|-------|------|------|---------|
| NVIDIA V100 | BLAS CUDA | tiny          | 1  | 8.84   | 1.62  | 0.33 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_0     | 1  | 8.43   | 1.19  | 0.31 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_1     | 1  | 8.41   | 1.19  | 0.29 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base          | 1  | 14.79  | 2.31  | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_0     | 1  | 15.05  | 1.66  | 0.44 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_1     | 1  | 15.01  | 1.68  | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small         | 1  | 40.30  | 4.37  | 0.88 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_0    | 1  | 41.17  | 3.11  | 0.94 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_1    | 1  | 41.12  | 3.11  | 0.82 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium        | 1  | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_0   | 1  | 107.11 | 6.13  | 2.07 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_1   | 1  | 107.91 | 6.21  | 1.77 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-dis    | 1  | 103.45 | 1.11  | 0.24 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2      | 1  | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1  | 176.27 | 8.61  | 3.17 | 0.19 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1  | 176.23 | 8.67  | 2.59 | 0.19 | ccc85b4 |

| Hw                | Config | Model         | Th | Enc.    | Dec.  | Bch5  | PP   | Commit  |
|-------------------|--------|---------------|----|---------|-------|-------|------|---------|
| AMD Ryzen 9 5950X | AVX2   | tiny          | 8  | 197.47  | 1.22  | 0.44  | 0.25 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | tiny-q5_0     | 8  | 222.92  | 0.87  | 0.45  | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | tiny-q5_1     | 8  | 221.25  | 0.89  | 0.45  | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | base          | 8  | 427.14  | 3.11  | 0.88  | 0.43 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | base-q5_0     | 8  | 474.96  | 1.41  | 0.72  | 0.51 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | base-q5_1     | 8  | 485.05  | 1.48  | 0.73  | 0.52 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | small         | 8  | 1470.51 | 11.70 | 2.89  | 1.21 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | small-q5_0    | 8  | 1700.43 | 5.48  | 1.98  | 1.41 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | small-q5_1    | 8  | 1719.03 | 5.79  | 2.02  | 1.42 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | medium        | 8  | 4417.70 | 35.13 | 8.14  | 3.24 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | medium-q5_0   | 8  | 5335.77 | 17.44 | 5.35  | 3.92 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | medium-q5_1   | 8  | 5372.26 | 18.36 | 5.42  | 3.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | medium-dis    | 8  | 4070.25 | 4.86  | 1.16  | 0.53 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | large-v2      | 8  | 8179.09 | 66.89 | 15.45 | 5.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2   | large-v2-dis  | 8  | 7490.45 | 7.06  | 1.63  | 0.70 | ccc85b4 |

API Changes

  • Add struct whisper_context_params

  • Add whisper_log_set

  • Deprecate:

    • whisper_init_from_file
    • whisper_init_from_buffer
    • whisper_init
    • whisper_init_from_file_no_state
    • whisper_init_from_buffer_no_state
    • whisper_init_no_state
  • Add:

    • whisper_init_from_file_with_params
    • whisper_init_from_buffer_with_params
    • whisper_init_with_params
    • whisper_init_from_file_with_params_no_state
    • whisper_init_from_buffer_with_params_no_state
    • whisper_init_with_params_no_state
  • Diff of struct whisper_full_params

     struct whisper_full_params {
         enum whisper_sampling_strategy strategy;
@@ -338,6 +435,7 @@ extern "C" {
 
         bool translate;
         bool no_context;        // do not use past transcription (if any) as initial prompt for the decoder
+        bool no_timestamps;     // do not generate timestamps
         bool single_segment;    // force single segment output (useful for streaming)
         bool print_special;     // print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
         bool print_progress;    // print progress information
@@ -355,8 +453,12 @@ extern "C" {
         // [EXPERIMENTAL] speed-up techniques
         // note: these can significantly reduce the quality of the output
         bool speed_up;          // speed-up the audio by 2x using Phase Vocoder
+        bool debug_mode;        // enable debug_mode provides extra info (eg. Dump log_mel)
         int  audio_ctx;         // overwrite the audio context size (0 = use default)
 
+        // [EXPERIMENTAL] [TDRZ] tinydiarize
+        bool tdrz_enable;       // enable tinydiarize speaker turn detection
+
         // tokens to provide to the whisper decoder as initial prompt
         // these are prepended to any existing text context from a previous call
         const char * initial_prompt;
@@ -365,6 +467,7 @@ extern "C" {
 
         // for auto-detection, set to nullptr, "" or "auto"
         const char * language;
+        bool detect_language;
 
         // common decoding parameters:
         bool suppress_blank;    // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L89
@@ -403,11 +506,24 @@ extern "C" {
         whisper_encoder_begin_callback encoder_begin_callback;
         void * encoder_begin_callback_user_data;
 
+        // called each time before ggml computation starts
+        whisper_abort_callback abort_callback;
+        void * abort_callback_user_data;
+
         // called by each decoder to filter obtained logits
         whisper_logits_filter_callback logits_filter_callback;
         void * logits_filter_callback_user_data;
+
+        const whisper_grammar_element ** grammar_rules;
+        size_t                           n_grammar_rules;
+        size_t                           i_start_rule;
+        float                            grammar_penalty;
     };
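
To illustrate the init changes, a minimal migration sketch - whisper_context_params carries a use_gpu flag in this release:

    #include <whisper.h>

    int main(void) {
        // before (now deprecated):
        //   struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");

        // after:
        struct whisper_context_params cparams = whisper_context_default_params();
        cparams.use_gpu = true; // allow Metal/CUDA offloading

        struct whisper_context * ctx =
            whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
        if (!ctx) return 1;

        // the new whisper_log_set can redirect all library logging:
        // whisper_log_set(my_log_callback, NULL); // my_log_callback is hypothetical

        whisper_free(ctx);
        return 0;
    }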
 

There might be some instability around the API, especially with the existing language bindings. I wasn't able to test everything, so expect some issues and feel free to submit PRs with any fixes that you find.

Highlights and what's next

A lot of the updates in this release are possible thanks to the many contributions to llama.cpp - a huge shoutout to all the contributors and collaborators there!

Regarding future updates to whisper.cpp, I'm looking forward to the following things:

  • Add server example similar to the one in llama.cpp
  • Try to improve Metal's batched decoding performance
  • Look for some interesting applications of the grammar sampling functionality

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.4.0...v1.5.0

v1.4.3

5 months ago

This is a minor release. The main reason for it is that there hasn't been an official release for a few months, and some small things have accumulated on the master branch that would be nice to upstream. I am planning a major v1.5.0 release with some new and long-awaited functionality soon:

  • Full CUDA offloading
  • Efficient Beam-Search implementation
  • Grammar support

The current version v1.4.3 should be considered in beta, as I haven't worked intensively on whisper.cpp recently and some issues might have made their way into the code. I'll try to polish things in the next few days and prepare a stable v1.5.0 release. In the meantime, any feedback will be highly appreciated.

Detailed API changes, features and new contributor recognitions will be included in the v1.5.0 release.

v1.4.0

11 months ago

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support.

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating-point weights to 4-, 5- or 8-bit integer weights. The resulting quantized models are smaller on disk, use less memory and can be processed faster on some architectures. The transcription quality is degraded to some extent - this has not been quantified at the moment.

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
|--------|-------------------|--------------------|----------------------|
| Q4_0   | 17.507            | 76                 | 1.53                 |
| Q4_1   | 17.187            | 72                 | 1.68                 |
| Q4_2   | 17.060            | 85                 | 1.53                 |
| Q5_0   | 16.194            | 78                 | 1.60                 |
| Q5_1   | 15.851            | 81                 | 1.68                 |
| Q8_0   | 15.652            | 89                 | 2.13                 |
| FP16   | 15.623            | 117                | 2.82                 |
| FP32   | 15.623            | 198                | 5.64                 |

ref: https://github.com/ggerganov/ggml/issues/89#issuecomment-1528781992

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS mainly improves the Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPUs compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.


This release remains in "beta" stage as I haven't verified that everything works as expected.

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.3.0...v1.4.0

v1.3.0

1 year ago

Overview

This release should be considered in beta stage, since I haven't done a lot of testing and I am not sure whether something got broken along the way. But overall, I believe both the performance and the quality are improved.

  • Added Core ML support (#566) - a build sketch follows this list
  • Restored decoding fallbacks, with a default size of 2 instead of 5 (f19e23fbd108ec3ac458c7a19b31c930719e7a94)
  • Pad the audio with zeros instead of padding the spectrogram (5108b30e6daf361c856abb6b86e5038500bdbeb1)
  • Added the talk-llama example
  • Added whisper_state, which allows parallel transcriptions with a single model in memory (#523)
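
A sketch of the Core ML workflow, following the steps documented in the README (requires Python with the coremltools package):

# generate a Core ML model of the encoder for base.en
./models/generate-coreml-model.sh base.en

# rebuild whisper.cpp with Core ML support
make clean
WHISPER_COREML=1 make -j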

The C-style API has been extended significantly to support the new whisper_state, but in general it should be backwards compatible. The only breaking change is in the callback signatures.
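
A minimal sketch of the new state API, assuming the function names from #523 - each state can be driven from its own thread while sharing one model:

    #include <whisper.h>

    int main(void) {
        struct whisper_context * ctx =
            whisper_init_from_file_no_state("models/ggml-base.en.bin");

        // two independent states sharing the same model weights
        struct whisper_state * st1 = whisper_init_state(ctx);
        struct whisper_state * st2 = whisper_init_state(ctx);

        struct whisper_full_params wparams =
            whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

        // e.g. from two threads, with separate PCM buffers:
        // whisper_full_with_state(ctx, st1, wparams, pcm_a, n_a);
        // whisper_full_with_state(ctx, st2, wparams, pcm_b, n_b);

        whisper_free_state(st1);
        whisper_free_state(st2);
        whisper_free(ctx);
        return 0;
    }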

Please provide feedback in the discussion if you observe any issues.

The next release v1.4.0 will follow up relatively soon and will provide 4-bit integer quantization support.

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.2.1...v1.3.0

v1.2.1

1 year ago

Overview

This is a minor release. The main reason for it is a fix for a critical bug that causes the software to crash randomly when the language auto-detect option (i.e. whisper_lang_auto_detect()) is used.

Other than that, the release includes refactoring of the examples, Ruby bindings and some minor changes to the C API.

You can provide feedback in the existing v1.2.0 discussion.

What's Changed

C-style API

  • Add whisper_pcm_to_mel_phase_vocoder()
  • Add whisper_logits_filter_callback
  • Change struct whisper_full_params
  • Add whisper_full_lang_id()
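
For example, after running whisper_full() with language auto-detection, the detected language can be read back. A small fragment (assumes ctx is an existing context and stdio.h is included):

    // assumes params.language was set to "auto" for the whisper_full() call
    const int lang_id = whisper_full_lang_id(ctx);
    printf("detected language: %s\n", whisper_lang_str(lang_id));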

Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.2.0...v1.2.1

Highlights

Recently, I have been making progress on adding integer quantization support to the ggml tensor library. This will eventually allow the use of quantized models, which require less memory and will hopefully run faster. I think the next major release, v1.3.0, will officially add quantization support. For now, you can keep track of the progress in #540