Streaming transcriber with whisper
Real-time transcription requires sufficient machine power.
This repository has been archived. There are some alternatives.
pip install -U git+https://github.com/shirayu/whispering.git
# If you use GPU, install proper torch and torchaudio
# Check https://pytorch.org/get-started/locally/
# Example : torch for CUDA 11.6
pip install -U torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
If you get "OSError: PortAudio library not found" on Linux, install PortAudio.
sudo apt -y install portaudio19-dev
# Run in English
# By default, it waits at least 30 seconds before transcribing
whispering --language en --model tiny
--help: shows full options
--model: sets the model name to use. Larger models are more accurate, but may not be able to transcribe in real time.
--language: sets the language to transcribe. The list of languages is shown with whispering -h
--no-progress: disables the progress message
-t: sets temperatures to decode. You can set several, like -t 0.0 -t 0.1 -t 0.5, but too many temperatures increase decoding time
--debug: outputs logs for debugging
--vad: sets the VAD (Voice Activity Detection) threshold. The default is 0.5. A value of 0 disables VAD and forces whisper to analyze periods of non-voice sound. Try --vad 0 if VAD prevents transcription.
--output: sets the output file (default: standard output)
--frame: the minimum number of mel spectrogram frames input to Whisper (default: 3000, i.e. 30 seconds)

By default, whispering performs VAD every 3.75 seconds.
This interval is determined by the value of -n, whose default is 20.
When an interval is predicted as "silence", it will not be passed to whisper.
If you want to disable VAD, set the VAD threshold to 0 by adding --vad 0.
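The relationship between -n and the VAD interval can be illustrated with a little arithmetic. This is only a sketch: it assumes the interval scales linearly with -n, using the single data point given above (-n 20 corresponds to 3.75 seconds). The function name is hypothetical and not part of whispering.

```python
# Assumption: the VAD interval is proportional to -n.
# The only value stated in this README is -n 20 -> 3.75 seconds.
SECONDS_PER_UNIT = 3.75 / 20  # 0.1875 seconds per unit of -n


def vad_interval_seconds(n: int = 20) -> float:
    """Estimate how often VAD runs for a given -n (hypothetical helper)."""
    return n * SECONDS_PER_UNIT


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"-n {n}: VAD about every {vad_interval_seconds(n)} seconds")
```

Under this assumption, a smaller -n means VAD runs more often on shorter chunks of audio.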
By default, whispering does not perform analysis until the total length of the segments that VAD judges to contain speech exceeds 30 seconds.
This is because the original Whisper assumes that its inputs are 30-second segments.
However, if silence segments appear 16 times (the default value of --max_nospeech_skip) after speech is detected, the analysis is performed anyway.
You can make the segments shorter with the --frame option (default: 3000), but this sacrifices accuracy because it is not the input Whisper expects.
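The buffering rule described above can be sketched as a small simulation. This is not whispering's actual implementation; it is a simplified model under stated assumptions: each VAD interval is 3.75 seconds, silence is counted cumulatively since the last analysis, silence before any speech is ignored, and reaching exactly 30 seconds is enough to trigger analysis. The function and parameter names are hypothetical.

```python
from typing import Iterable, List


def simulate(
    intervals: Iterable[bool],
    interval_seconds: float = 3.75,   # default VAD interval from this README
    min_seconds: float = 30.0,        # 3000 frames, i.e. 30 seconds
    max_nospeech_skip: int = 16,      # default of --max_nospeech_skip
) -> List[int]:
    """Return the interval indices at which analysis would be triggered.

    `intervals` holds one VAD decision per interval: True means speech.
    """
    triggers: List[int] = []
    speech = 0.0   # seconds of buffered speech
    skips = 0      # silent intervals seen after speech was buffered
    for i, is_speech in enumerate(intervals):
        if is_speech:
            speech += interval_seconds
        elif speech > 0:
            skips += 1  # assumption: silence before any speech is ignored
        enough_speech = speech >= min_seconds
        too_much_silence = speech > 0 and skips >= max_nospeech_skip
        if enough_speech or too_much_silence:
            triggers.append(i)
            speech, skips = 0.0, 0  # start a fresh buffer
    return triggers


if __name__ == "__main__":
    print(simulate([True] * 8))            # continuous speech
    print(simulate([True] + [False] * 16)) # one speech interval, then silence
```

In this model, a single short utterance followed by silence is analyzed only after 16 silent intervals, which is about 60 seconds at the default interval.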
⚠ There is no security mechanism. Securing the connection is your own responsibility.
On the server, run with --host and --port.
whispering --language en --model tiny --host 0.0.0.0 --port 8000
On the client, connect with --mode client.
whispering --host ADDRESS_OF_HOST --port 8000 --mode client
You can also set -n and other options.
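If you launch the server and client from a script, the command lines can be assembled programmatically. The sketch below only uses the options shown in this README (--language, --model, --host, --port, --mode client, -n). The helper names are hypothetical, and nothing here adds security, so the warning above still applies.

```python
import shlex
from typing import List, Optional


def server_command(language: str, model: str, host: str, port: int = 8000) -> List[str]:
    """Build the argument list for running whispering as a server."""
    return [
        "whispering",
        "--language", language,
        "--model", model,
        "--host", host,
        "--port", str(port),
    ]


def client_command(host: str, port: int = 8000, n: Optional[int] = None) -> List[str]:
    """Build the argument list for running whispering as a client."""
    cmd = ["whispering", "--host", host, "--port", str(port), "--mode", "client"]
    if n is not None:
        cmd += ["-n", str(n)]  # optional VAD interval setting
    return cmd


if __name__ == "__main__":
    # Print the commands; pass the lists to subprocess.run(...) to execute them.
    print(shlex.join(server_command("en", "tiny", "0.0.0.0")))
    print(shlex.join(client_command("ADDRESS_OF_HOST", n=20)))
```

Building the command as a list rather than a single string avoids shell quoting problems when it is passed to subprocess.run.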
Install poetry to use the poetry command
Clone and install libraries
# Clone
git clone https://github.com/shirayu/whispering.git
# With poetry
poetry config virtualenvs.in-project true
poetry install --all-extras
poetry run pip install -U torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
# With npm
npm install
Run the tests and check that no errors occur
poetry run make -j4
Make fancy updates
Make style
poetry run make style
Run the tests again and check that no errors occur
poetry run make -j4
Check for typos using typos. Just run the typos command in the root directory.
typos
Send pull requests!