An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
Transcripts can now be written to a file in the following formats: vtt, srt, json, tsv, txt. (Doc: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#write-transcripts-to-a-file)
lang_code or tasks can be passed as a single value instead of a list when all the audio files belong to the same language/task. https://github.com/shashikg/WhisperS2T/issues/27
Added the transcribe function by @shashikg in https://github.com/shashikg/WhisperS2T/pull/15 (Doc: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#run-without-vad-model)
Full Changelog: https://github.com/shashikg/WhisperS2T/compare/v1.3.0...v1.3.1
WhisperS2T now supports NVIDIA's TensorRT-LLM backend (https://github.com/NVIDIA/TensorRT-LLM), delivering a further twofold improvement in inference time compared to the CTranslate2 backend. The best current configuration on an A30 GPU transcribes a 1-hour file in approximately 18 seconds. Updated benchmarks are detailed below:
Fixed the ffmpeg resampling issue by adding an option to use the swr resampler in case soxr is not available.

Model Name | Acc. Overlapped | Acc. Within Collar (0.1s) | Acc. Within Collar (0.2s) | Acc. Within Collar (0.5s) | Acc. Within Collar (1.0s) | Total Word Hits | Inference Time |
---|---|---|---|---|---|---|---|
WhisperS2T (ASR: whisper-large-v2 - Aligner: whisper-tiny) | 66.21 | 38.67 | 60.8 | 76.06 | 85.82 | 64350 | 2.6x |
WhisperS2T (ASR: whisper-large-v2 - Aligner: whisper-large-v2) | 66.72 | 48.95 | 58.54 | 73.44 | 84.0 | 64350 | 1.6x |
WhisperX (ASR: whisper-large-v2 - Aligner: wav2vec) | 55.65 | 50.66 | 55.84 | 66.18 | 75.57 | 64307 | 1x |
We used the Whisper model for alignment. We observed that Whisper-based alignment and phoneme-level alignment (as in WhisperX) yield similar performance. However, using Whisper provides several advantages, including out-of-the-box support for all languages. Phoneme-level alignment needs a separate model for every new language, which we believe somewhat diminishes the advantages of using the Whisper model at all. Moreover, using the whisper-tiny model for word alignment incurs very little latency overhead without affecting alignment accuracy. We utilized the AMI-MIX-Headset-Test dataset for benchmarking.
There's no properly defined metric for estimating word alignment accuracy. Hence, we introduce a new metric to accurately estimate the performance of word alignment. Check this function: Word Alignment Metric Function.
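As a rough illustration of what such a metric could look like, the sketch below takes word pairs that are already matched between prediction and reference, counts how many predicted words overlap their reference interval and how many have both boundaries within a collar, and divides both counts by the number of detected words. The `Word` class, the `alignment_accuracy` helper, the way words are paired, and the exact overlap and collar rules are assumptions made for illustration; the project's own metric function may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def alignment_accuracy(pairs: List[Tuple[Word, Word]], collar: float) -> Tuple[float, float]:
    """Return (overlap accuracy, within-collar accuracy) for matched word pairs.

    `pairs` holds (predicted, reference) words that are assumed to be
    matched already; how matching is done is outside this sketch.
    """
    overlapped_words = 0
    words_within_collar = 0
    for pred, ref in pairs:
        # Assumed overlap rule: the two intervals share a non-empty span.
        if min(pred.end, ref.end) > max(pred.start, ref.start):
            overlapped_words += 1
        # Assumed collar rule: both boundaries lie within `collar` seconds.
        if abs(pred.start - ref.start) <= collar and abs(pred.end - ref.end) <= collar:
            words_within_collar += 1

    total = len(pairs)  # total number of detected (matched) words
    if total == 0:
        return 0.0, 0.0
    return overlapped_words / total, words_within_collar / total


# Hypothetical timestamps, not taken from the benchmark data.
pairs = [
    (Word("hello", 0.0, 0.5), Word("hello", 0.1, 0.6)),
    (Word("there", 1.0, 1.5), Word("there", 2.0, 2.5)),
    (Word("world", 3.0, 3.4), Word("world", 3.3, 4.0)),
]
acc_overlapped, acc_within_collar = alignment_accuracy(pairs, collar=0.2)
```

Running the helper with several collar values (0.1, 0.2, 0.5, 1.0 seconds) would give one accuracy per collar, matching the layout of the table columns.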
The proposed metric performs the following steps:
overlapped_words and words_within_collar (refer to the figure below). Finally, we divide both values by the total number of detected words.

Added support for Whisper Large v3 and Distil-Whisper Large v2. Below are the benchmarks.
WhisperS2T is an optimized, lightning-fast speech-to-text pipeline tailored for the Whisper model! It's designed to be exceptionally fast, boasting a 1.5X speed improvement over WhisperX and a 2X speed boost compared to the HuggingFace Pipeline with FlashAttention 2 (Insanely Fast Whisper). Moreover, it includes several heuristics to enhance transcription accuracy.
Whisper is a general-purpose speech recognition model developed by OpenAI. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Stay tuned for a technical report comparing WhisperS2T against other Whisper pipelines. Meanwhile, check some quick benchmarks on an A30 GPU.
Install audio packages required for resampling and loading audio files.
apt-get install -y libsndfile1 ffmpeg
To install or update to the latest released version of WhisperS2T use the following command:
pip install -U whisper-s2t
Or, to install from the latest commit in this repo:
pip install -U git+https://github.com/shashikg/WhisperS2T.git
import whisper_s2t
model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')
files = ['data/KINCAID46/audio/1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)
print(out[0][0])
"""
[Console Output]
{'text': "Let's bring in Phil Mackie who is there at the palace. We're looking at Teresa and Philip May. Philip, can you see how he's being transferred from the helicopters? It looks like, as you said, the beast. It's got its headlights on because the sun is beginning to set now, certainly sinking behind some clouds. It's about a quarter of a mile away down the Grand Drive",
'avg_logprob': -0.25426941679184695,
'no_speech_prob': 8.147954940795898e-05,
'start_time': 0.0,
'end_time': 24.8}
"""
Check this Documentation for more details.
This project is licensed under the MIT License. See the LICENSE file for details.