VOSK Speech Recognition Toolkit
This is Vosk, the lifelong speech recognition system.
As of 2019, neural-network-based speech recognizers are limited in the amount of speech data they can use in training, and they require enormous computing power and time to train and tune. Neural networks struggle with human-like one-shot learning, and their decisions are not very robust to unseen conditions and are hard to understand and correct.
That is why we decided to build a system based on the concept of a large signal database. We apply an audio fingerprinting scheme: the audio is segmented into chunks, and the chunks are stored in the database keyed by their LSH (locality-sensitive hashing) hash value. During decoding we simply look up the chunks in the database to get an idea of which phones are possible. That helps us make a proper decision about the decoding results.
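The store-and-lookup idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Vosk: it assumes each chunk has already been turned into a fixed-length feature vector, and it uses random-hyperplane LSH as one possible hashing scheme. The names `ChunkIndex`, `lsh_hash` and `make_hyperplanes` are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (assumed scheme: random-projection LSH)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_hash(vector, hyperplanes):
    """Hash a vector to an integer: one bit per hyperplane, set by the sign of the projection."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


class ChunkIndex:
    """Hypothetical in-memory stand-in for the chunk database."""

    def __init__(self, dim, n_bits=16, seed=0):
        self.hyperplanes = make_hyperplanes(n_bits, dim, seed)
        self.buckets = defaultdict(list)  # hash value -> phone labels of stored chunks

    def add(self, vector, phone):
        """Store the phone label of a chunk under the chunk's LSH hash value."""
        self.buckets[lsh_hash(vector, self.hyperplanes)].append(phone)

    def lookup(self, vector):
        """Return {phone: count} for stored chunks that share the query's hash value."""
        counts = {}
        for phone in self.buckets.get(lsh_hash(vector, self.hyperplanes), []):
            counts[phone] = counts.get(phone, 0) + 1
        return counts


index = ChunkIndex(dim=4, n_bits=8)
index.add([1.0, 0.5, -0.2, 0.1], "AA")
candidates = index.lookup([1.0, 0.5, -0.2, 0.1])  # phones seen for similar chunks
```

Chunks whose feature vectors point in a similar direction tend to fall into the same bucket, so the lookup returns the phone labels previously seen for similar-sounding audio.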
The advantages of this approach are:
The disadvantages are:
The nice to have things in the future would be:
To install the requirements run
pip3 install -r requirements.txt
To prepare the training/verification data create the following two files:
wav.scp
a list that maps utterances to wav files in the filesystem
phones.txt
the CTM file with phonemes and timings. It can be a CTM file from the alignment or a CTM file from the decoding
You can create them with the Kaldi ASR toolkit.
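As a rough illustration of what the two files contain, the sketch below parses them in Python. It assumes the usual Kaldi layouts: one `utterance-id path` pair per line in wav.scp, and `utterance-id channel start duration phone [confidence]` per line in the CTM file. The helper names `parse_wav_scp`, `parse_ctm` and `Segment` are hypothetical and are not part of this toolkit.

```python
from collections import namedtuple

# One phone occurrence from the CTM file (times in seconds).
Segment = namedtuple("Segment", "utt start duration phone")


def parse_wav_scp(lines):
    """Map utterance ids to wav paths; assumes 'utt-id path' on each line."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt, path = line.split(None, 1)
        mapping[utt] = path
    return mapping


def parse_ctm(lines):
    """Read phone segments; assumes 'utt-id channel start duration phone [conf]'."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        utt, _channel, start, duration, phone = fields[:5]
        segments.append(Segment(utt, float(start), float(duration), phone))
    return segments


wavs = parse_wav_scp(["utt1 /data/utt1.wav"])
phones = parse_ctm(["utt1 1 0.00 0.12 SIL", "utt1 1 0.12 0.08 HH 0.98"])
```

Both functions accept any iterable of lines, so an open file object can be passed directly.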
To add the data to the database run
python3 index.py wavs-train.txt phones-train.txt data.idx
This adds the data to the database data.idx, or creates a new one if it does not exist.
To verify decoding results run
python3 verify.py wavs-test.txt phones-test.txt data.idx
The tool will search for segments in the index and report suspicious segments which you can additionally check and later add to the database to improve the accuracy of recognition.
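The exact criterion for a "suspicious" segment is not described here. One plausible reading, sketched below purely as an assumption, is to flag a segment when its labelled phone is rare or absent among the phones stored for similar chunks. The function `find_suspicious`, the `min_share` threshold and the `lookup` callable are hypothetical names for illustration.

```python
def find_suspicious(segments, lookup, min_share=0.5):
    """Flag segments whose phone label has little support in the index.

    segments: iterable of (features, phone) pairs.
    lookup:   callable returning {phone: count} for similar stored chunks
              (assumed interface, not the toolkit's actual API).
    """
    suspicious = []
    for features, phone in segments:
        counts = lookup(features)
        total = sum(counts.values())
        # Share of matching stored chunks that carry the same phone label.
        share = counts.get(phone, 0) / total if total else 0.0
        if share < min_share:
            suspicious.append((features, phone, share))
    return suspicious


# Toy index: feature key -> phone counts seen in training.
toy_index = {"f1": {"AA": 3, "AE": 1}, "f2": {"IY": 4}}
flagged = find_suspicious(
    [("f1", "AA"), ("f2", "EH"), ("f3", "SIL")],
    lambda key: toy_index.get(key, {}),
)
```

In this toy run the segment labelled "AA" is supported by the index, while the other two are flagged: one because the stored chunks disagree with its label, the other because nothing similar has been stored yet.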