Port of OpenAI's Whisper model in C/C++
Many small incremental updates + Token-level timestamps with DTW by @denersc in https://github.com/ggerganov/whisper.cpp/pull/1485 (see the usage sketch below). Feedback is welcome!
Add `verbose_json` response and show examples on the home page by @JacobLinCool in https://github.com/ggerganov/whisper.cpp/pull/1802
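Regarding the DTW token-level timestamps mentioned above, here is a minimal sketch of enabling them through the context parameters (the model path and alignment-heads preset are illustrative and must match the model you load; field names follow PR 1485 - treat this as a sketch, not a verified snippet):

```c
#include <stdio.h>
#include "whisper.h"

int main(void) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.dtw_token_timestamps = true;                   // enable DTW token-level timestamps
    cparams.dtw_aheads_preset    = WHISPER_AHEADS_BASE_EN; // preset must match the loaded model

    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (!ctx) {
        return 1;
    }

    // ... run whisper_full() on 16 kHz mono float samples ...

    // per-token DTW timestamps are then available via:
    //   whisper_full_get_token_data(ctx, i_segment, i_token).t_dtw

    whisper_free(ctx);
    return 0;
}
```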
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.4...v1.5.5
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.3...v1.5.4
Minor maintenance release:
- Using `whisper.cpp` and `llama.cpp` together in SwiftUI projects
- Replace `WHISPER_PRINT_DEBUG` with `WHISPER_LOG_DEBUG` by @bobqianic in https://github.com/ggerganov/whisper.cpp/pull/1681
- Replace `tensor->n_dims` with `ggml_n_dims(tensor)` by @bobqianic in https://github.com/ggerganov/whisper.cpp/pull/1694
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.2...v1.5.3
Minor maintenance release:
Add new example: wchess
https://github.com/ggerganov/whisper.cpp/assets/1991296/c2b2f03c-9684-49f3-8106-357d2d4e67fa
Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!
Fix `ggml_metal_log` on Intel Macs by @Finnvoor in https://github.com/ggerganov/whisper.cpp/pull/1606
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.1...v1.5.2
Minor update:
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.5.0...v1.5.1
This major release includes the following changes:
Full quantization support of all available `ggml` quantization types, and more
On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:
Encoder performance on Apple M1 Max - before and after (plot by @dreness)
For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.
GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with `WHISPER_CUBLAS=1`:

```sh
# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make
```
Implementation: https://github.com/ggerganov/whisper.cpp/pull/1472
Special credits to: @FSSRepo, @slaren
At last, `whisper.cpp` now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, the speed with 5 beams is a bit slower compared to 1 beam, but it is significantly faster than the 5x single-batch time observed with the old naive implementation.

Beam Search is now enabled by default in `whisper.cpp` to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
Implementation: https://github.com/ggerganov/whisper.cpp/pull/1486
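As a usage sketch, selecting Beam Search explicitly through the C API looks like this (the wrapper function and beam size are illustrative; with the new defaults this strategy is already selected for you):

```c
#include "whisper.h"

// transcribe pcm (16 kHz mono float samples) using Beam Search decoding
int transcribe_beam(struct whisper_context * ctx, const float * pcm, int n_samples) {
    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    wparams.beam_search.beam_size = 5; // the "Bch5" setting from the benchmarks below

    return whisper_full(ctx, wparams, pcm, n_samples);
}
```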
All `ggml` quantization types are now supported. Quantization mixtures for the Whisper model can be implemented. It is still unclear how quality is affected by quantization - this is an interesting area that can be explored in the future.
The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving transcription quality in situations where the set of possible phrases is limited.
https://github.com/ggerganov/whisper.cpp/assets/377495/d24716e2-5e9c-441b-8c6b-395922dccbf4
Implementation: https://github.com/ggerganov/whisper.cpp/pull/1229
Special credits to @ejones
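To give an idea of the API surface (the new `grammar_rules` fields appear in the `whisper_full_params` diff below), here is a rough sketch that wires in a tiny hand-written grammar matching only "yes" or "no". Real grammars are normally produced from GBNF text by a parser; the element encoding shown here mirrors the llama.cpp grammar format and should be treated as an assumption:

```c
#include "whisper.h"

// single root rule: "yes" | "no"
// alternatives are separated by ALT; the rule definition ends with END
static const whisper_grammar_element k_root[] = {
    { WHISPER_GRETYPE_CHAR, 'y' }, { WHISPER_GRETYPE_CHAR, 'e' }, { WHISPER_GRETYPE_CHAR, 's' },
    { WHISPER_GRETYPE_ALT,  0   },
    { WHISPER_GRETYPE_CHAR, 'n' }, { WHISPER_GRETYPE_CHAR, 'o' },
    { WHISPER_GRETYPE_END,  0   },
};

static const whisper_grammar_element * k_rules[] = { k_root };

void apply_grammar(struct whisper_full_params * wparams) {
    wparams->grammar_rules   = k_rules;
    wparams->n_grammar_rules = 1;
    wparams->i_start_rule    = 0;      // index of the root rule
    wparams->grammar_penalty = 100.0f; // penalty for tokens outside the grammar (illustrative value)
}
```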
Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper
`whisper.cpp` offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.
Implementation: https://github.com/ggerganov/whisper.cpp/pull/1424
Recently, OpenAI released version 3 of the Large model: https://github.com/openai/whisper/pull/1761
Implementation: https://github.com/ggerganov/whisper.cpp/pull/1444
Below is a breakdown of the performance of `whisper.cpp` on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The `Dec.` column corresponds to batch size 1, the `Bch5` column to batch size 5, and the `PP` column to batch size 128.
For optimal Beam Search performance, the `Bch5` number should be 5 times smaller than `Dec.`
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14 | 3.24 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium-q5_0 | 8 | 5335.77 | 17.44 | 5.35 | 3.92 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium-q5_1 | 8 | 5372.26 | 18.36 | 5.42 | 3.88 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium-dis | 8 | 4070.25 | 4.86 | 1.16 | 0.53 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | large-v2 | 8 | 8179.09 | 66.89 | 15.45 | 5.88 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | large-v2-dis | 8 | 7490.45 | 7.06 | 1.63 | 0.70 | ccc85b4 |
- Add `struct whisper_context_params`
- Add `whisper_log_set`
- Deprecate:
  - `whisper_init_from_file`
  - `whisper_init_from_buffer`
  - `whisper_init`
  - `whisper_init_from_file_no_state`
  - `whisper_init_from_buffer_no_state`
  - `whisper_init_no_state`
- Add:
  - `whisper_init_from_file_with_params`
  - `whisper_init_from_buffer_with_params`
  - `whisper_init_with_params`
  - `whisper_init_from_file_with_params_no_state`
  - `whisper_init_from_buffer_with_params_no_state`
  - `whisper_init_with_params_no_state`
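Putting the new initialization API together - a minimal sketch (the model path and the callback body are illustrative):

```c
#include <stdio.h>
#include "whisper.h"

// forward all whisper/ggml log messages to stderr
static void log_cb(enum ggml_log_level level, const char * text, void * user_data) {
    (void) level; (void) user_data;
    fputs(text, stderr);
}

int main(void) {
    whisper_log_set(log_cb, NULL);

    // replaces the deprecated whisper_init_from_file()
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true;

    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (!ctx) {
        return 1;
    }

    whisper_free(ctx);
    return 0;
}
```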
Diff of `struct whisper_full_params`:

```diff
struct whisper_full_params {
enum whisper_sampling_strategy strategy;
@@ -338,6 +435,7 @@ extern "C" {
bool translate;
bool no_context; // do not use past transcription (if any) as initial prompt for the decoder
+ bool no_timestamps; // do not generate timestamps
bool single_segment; // force single segment output (useful for streaming)
bool print_special; // print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
bool print_progress; // print progress information
@@ -355,8 +453,12 @@ extern "C" {
// [EXPERIMENTAL] speed-up techniques
// note: these can significantly reduce the quality of the output
bool speed_up; // speed-up the audio by 2x using Phase Vocoder
+ bool debug_mode; // enable debug_mode provides extra info (eg. Dump log_mel)
int audio_ctx; // overwrite the audio context size (0 = use default)
+ // [EXPERIMENTAL] [TDRZ] tinydiarize
+ bool tdrz_enable; // enable tinydiarize speaker turn detection
+
// tokens to provide to the whisper decoder as initial prompt
// these are prepended to any existing text context from a previous call
const char * initial_prompt;
@@ -365,6 +467,7 @@ extern "C" {
// for auto-detection, set to nullptr, "" or "auto"
const char * language;
+ bool detect_language;
// common decoding parameters:
bool suppress_blank; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L89
@@ -403,11 +506,24 @@ extern "C" {
whisper_encoder_begin_callback encoder_begin_callback;
void * encoder_begin_callback_user_data;
+ // called each time before ggml computation starts
+ whisper_abort_callback abort_callback;
+ void * abort_callback_user_data;
+
// called by each decoder to filter obtained logits
whisper_logits_filter_callback logits_filter_callback;
void * logits_filter_callback_user_data;
+
+ const whisper_grammar_element ** grammar_rules;
+ size_t n_grammar_rules;
+ size_t i_start_rule;
+ float grammar_penalty;
};
```
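For example, the new `abort_callback` makes it possible to interrupt a long-running `whisper_full()` call - a minimal sketch, assuming the flag is set from another thread:

```c
#include <stdbool.h>
#include "whisper.h"

// set from another thread (e.g. a UI handler) to stop transcription
static volatile bool g_abort = false;

// returning true aborts the ongoing ggml computation
static bool abort_cb(void * user_data) {
    (void) user_data;
    return g_abort;
}

void install_abort_handler(struct whisper_full_params * wparams) {
    wparams->abort_callback           = abort_cb;
    wparams->abort_callback_user_data = NULL;
}
```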
There might be some instability around the API, especially with the existing language bindings. I wasn't able to test everything, so expect some issues and feel free to submit PRs with any kind of fixes that you find.
A lot of the updates in this release are possible thanks to the many contributions in llama.cpp - huge shoutout to all the contributors and collaborators there!
Regarding future updates to `whisper.cpp`, I'm looking forward to the following things:

- `llama.cpp`

Latest performance of the talk-llama example:
https://github.com/ggerganov/whisper.cpp/assets/1991296/d97a3788-bf2a-4756-9a43-60c6b391649e
- `whisper_params_parse` in `examples/main/main.cpp` by @faker2048 in https://github.com/ggerganov/whisper.cpp/pull/1002
- `split_on_word` no longer trims by @ggerganov in https://github.com/ggerganov/whisper.cpp/pull/1046
- `beam_search` sampling by @bobqianic in https://github.com/ggerganov/whisper.cpp/pull/1243
- `--speed-up` by @Sogl in https://github.com/ggerganov/whisper.cpp/pull/1306
- Fix `n_mel` mismatch in convert-whisper-to-openvino.py by @bobqianic in https://github.com/ggerganov/whisper.cpp/pull/1459
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.4.0...v1.5.0
This is a minor release, the main reason for which is that there hasn't been an official release for a few months now and some small things have accumulated on the `master` branch that would be nice to upstream. I am planning a major v1.5.0 release with some new and long-awaited functionality soon:
The current version v1.4.3 should be considered in beta, as I haven't worked intensively on `whisper.cpp` recently and there might be some issues that made their way into the code. I'll try to polish things in the next few days and prepare a stable v1.5.0 release. In the meantime, any feedback will be highly appreciated.

Detailed API changes, features and new contributor recognitions will be included in the v1.5.0 release.
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
This allows the `ggml` Whisper models to be converted from the default 16-bit floating point weights to 4, 5 or 8 bit integer weights.

The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
Supported quantization types: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`

`Q5` quantized models are available to try at: https://whisper.ggerganov.com
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results give an impression of the expected quality, size and speed of quantized Whisper models:
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q4_2 | 17.060 | 85 | 1.53 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
ref: https://github.com/ggerganov/ggml/issues/89#issuecomment-1528781992
This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.
This release remains in "beta" stage as I haven't verified that everything works as expected.
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.3.0...v1.4.0
This release should be considered in Beta stage, since I haven't done a lot of testing and I am not sure if I didn't break something. But overall, I believe both the performance and the quality are improved.
Add `whisper_state`, which allows parallel transcriptions with a single model in memory (#523). The C-style API has been extended significantly to support the new `whisper_state`, but in general it should be backwards compatible.
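For illustration, here is a sketch of a transcription function where each caller uses its own state (error handling trimmed; `pcm` is assumed to hold 16 kHz mono float samples):

```c
#include <stdio.h>
#include "whisper.h"

// Each caller gets its own whisper_state, so several threads can share a
// single whisper_context (i.e. one copy of the model weights in memory).
int transcribe(struct whisper_context * ctx, const float * pcm, int n_samples) {
    struct whisper_state * state = whisper_init_state(ctx);
    if (!state) {
        return -1;
    }

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    const int ret = whisper_full_with_state(ctx, state, wparams, pcm, n_samples);
    if (ret == 0) {
        // results are read back through the *_from_state accessors
        const int n_segments = whisper_full_n_segments_from_state(state);
        for (int i = 0; i < n_segments; ++i) {
            printf("%s\n", whisper_full_get_segment_text_from_state(state, i));
        }
    }

    whisper_free_state(state);
    return ret;
}
```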
The only breaking change is in the callback signatures.
Please provide feedback in the discussion if you observe any issues.
The next release v1.4.0 will follow up relatively soon and will provide 4-bit integer quantization support.
- Build with `-O3 -DNDEBUG` in release mode by @jhen0409 in https://github.com/ggerganov/whisper.cpp/pull/640
- Improve `log_mel_spectrogram` when single-threaded by @maxilevi in https://github.com/ggerganov/whisper.cpp/pull/763
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.2.1...v1.3.0
This is a minor release. The main reason for it is a critical bug fix that caused the software to crash randomly when the language auto-detect option is used (i.e. `whisper_lang_auto_detect()`).
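For reference, here is a usage sketch of the function in question (thread counts and buffer handling are illustrative; note that the mel spectrogram must be computed first):

```c
#include <stdio.h>
#include <stdlib.h>
#include "whisper.h"

// returns the detected language id, or a negative value on failure
int detect_language(struct whisper_context * ctx, const float * pcm, int n_samples) {
    // whisper_lang_auto_detect() operates on the mel spectrogram,
    // so it has to be computed before the call
    if (whisper_pcm_to_mel(ctx, pcm, n_samples, 4) != 0) {
        return -1;
    }

    const int n_langs = whisper_lang_max_id() + 1;
    float * probs = (float *) malloc(n_langs * sizeof(float));

    const int lang_id = whisper_lang_auto_detect(ctx, 0, 4, probs);
    if (lang_id >= 0) {
        printf("detected language: %s (p = %.3f)\n", whisper_lang_str(lang_id), probs[lang_id]);
    }

    free(probs);
    return lang_id;
}
```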
Other than that, the release includes refactoring of the examples, Ruby bindings and some minor changes to the C API.
You can provide feedback in the existing v1.2.0 discussion.
`ggml` / `whisper`:

- `whisper` : add "split_on_word" flag when using "max_len" option by @mightymatth in #455 and @boolemancer in https://github.com/ggerganov/whisper.cpp/pull/476
- `whisper` : add whisper_full_lang_id() for getting the context lang by @kamranjon in https://github.com/ggerganov/whisper.cpp/pull/461
- `whisper` : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in https://github.com/ggerganov/whisper.cpp/pull/474
- `whisper` : suppress non-speech-related token outputs by @shibukazu in https://github.com/ggerganov/whisper.cpp/pull/473
- `cmake` : install whisper.h header by @aviks in https://github.com/ggerganov/whisper.cpp/pull/485
- `whisper` : fix signedness compiler warning by @shikokuchuo in https://github.com/ggerganov/whisper.cpp/pull/506
- `whisper` : by default disable non-speech tokens suppression #473
- `whisper` : add API for applying custom logits filters during decoding 0d229163bbea769c7a3e0e500e45850c9a6e2e42
- `whisper` : fix uninitialized `exp_n_audio_ctx` by @Finnvoor in https://github.com/ggerganov/whisper.cpp/pull/520
- `bindings` : add Ruby by @taf2 in https://github.com/ggerganov/whisper.cpp/pull/500
- `readme` : add .NET repos (#303)
- `readme` : add cython bindings (#9)
- `readme` : add pybind11 bindings by @aarnphm in https://github.com/ggerganov/whisper.cpp/pull/538
- `ci` : add node addon test and optimize compilation configuration by @chenqianhe in https://github.com/ggerganov/whisper.cpp/pull/468
- `yt-wsp.sh` : add unique filename generation by @genevera in https://github.com/ggerganov/whisper.cpp/pull/495
- `examples` : refactor in order to reuse code and reduce duplication by @ggerganov in https://github.com/ggerganov/whisper.cpp/pull/482
- `main` : fix stdin pipe stream by @conradg in https://github.com/ggerganov/whisper.cpp/pull/503
- `make` : add "-mcpu=native" when building for aarch64 (#532)

C API changes:

- `whisper_pcm_to_mel_phase_vocoder()`
- `*(whisper_logits_filter_callback)()`
- `struct whisper_full_params`
- `whisper_full_lang_id()`
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.2.0...v1.2.1
Recently, I have been making progress on adding integer quantisation support to the `ggml` tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release v1.3.0 will officially add quantisation support. For now, you can keep track of the progress in #540
🎙️ MacWhisper by @jordibruin powered by whisper.cpp