voice to code on linux with kaldi
Silviux is a fork of the silvius voice to code project with many added features and tools for working with kaldi.
If you don't have any experience with a voice-to-code system I strongly recommend you try out other projects first. While this project is definitely usable, it is still mostly an experiment in using multiple small-vocabulary speech decoders instead of a single decoder with a more verbose grammar. The idea is that a smaller vocabulary gives a boost in recognition accuracy and lets the parsers use less verbose commands. This comes at the cost of having to manually manage which mode is active. In addition, whenever the grammar for a parser is modified, the decoder has to be rebuilt for kaldi to recognize the new words. There are scripts to ease this process, but it may be helpful to learn a bit about language models, if for no other reason than to understand the basic terminology used in the script comments.
Note that you can disable the sopare and vim code by not including them in the message pipeline:
#silviux/main.py
handler = Handler()
handler.use(Notify(context))
handler.use(Sleep(context))
handler.use(Mode(context))
handler.use(Hold(context))
#handler.use(Sopare(context))
#handler.use(Vim(context))
handler.use(History(context))
handler.use(Kaldi(context))
handler.use(Parse(context))
handler.use(Optimistic(context))
handler.use(Execute(context))
handler.init()
The notify middleware is required, but if you're trying to run the program without gnome you can simply comment out the part of the notify code that invokes the OS notification command.
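To give a sense of how the pipeline fits together, here is a minimal, hypothetical sketch of the middleware-chain pattern the handler uses. The real silviux base class, message shape, and init() behavior may differ; all class names besides Handler are illustrative.

```python
# Hypothetical sketch of the middleware chain; the real silviux classes differ.
class Middleware:
    def __init__(self, context):
        self.context = context  # shared state between middlewares
        self.next = None

    def handle(self, message):
        # Default behavior: pass the message down the chain.
        if self.next:
            self.next.handle(message)

class Uppercase(Middleware):
    """Illustrative transform stage: rewrites the message, then passes it on."""
    def handle(self, message):
        message["text"] = message["text"].upper()
        super().handle(message)

class Collect(Middleware):
    """Illustrative terminal stage: records whatever reaches the end."""
    def __init__(self, context):
        super().__init__(context)
        self.received = []

    def handle(self, message):
        self.received.append(message["text"])

class Handler:
    """Chains middlewares in the order they are use()d."""
    def __init__(self):
        self.chain = []

    def use(self, middleware):
        if self.chain:
            self.chain[-1].next = middleware
        self.chain.append(middleware)

    def handle(self, message):
        if self.chain:
            self.chain[0].handle(message)

context = {}
handler = Handler()
handler.use(Uppercase(context))
sink = Collect(context)
handler.use(sink)
handler.handle({"text": "hello world"})
print(sink.received)  # -> ['HELLO WORLD']
```

Under this model, commenting a use() call out of the chain, as with Sopare and Vim above, simply removes that stage; messages flow straight to the next one.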
The client code is now python3 and depends on the PyAudio and ws4py packages.
See the README in server/silviux-server for the server installation using docker. The client will connect to either the silvius server or the kaldi gstreamer server but only the silviux-server supports changing the decoder.
The server's silviux.yaml file contains the paths to the kaldi decoders. You will either need to create these yourself or download the ones I have prepared at https://silviux.weisman.dev. The model creation scripts place their output into the server/models directory. The server's yaml file expects the models to be at server/silviux-server/models, so that directory is a symlink to the models directory above it.
The program uses sopare for switching between modes. Install this program and train it on the sounds found in the middleware/sopare.py file. Place the sopare_plugin directory into the plugins directory of sopare.
See silviux/config/silviux.yaml for configuring which port the server is listening on, where the sopare program is installed, which microphone to use, and so on. The main program will still accept the same command line args as silvius (see silviux/stream.py) to override the options in the yaml file.
To verify the installation, run the tests and then start the client:
$ python3 test.py
$ python3 main.py
The contents of the lm_utils/ directory are not imported or used at all during the normal (main.py) execution of the client code. It contains scripts for creating the needed language model and lexicon files for creating a kaldi decoder. For the 'programming' language model, the goal is to create a corpus of text that is the spoken equivalent of some original code.
#somefile.js
function() {
return "hello";
}
We wish to map the above text into the spoken words we would say to produce that text.
#key config
{"space": "spay", "{": "brace", "}": "mace", "enter": "spike", ";": "sem"}
#somefile.js.modified
function sue shi spay brace spike
return spay quote hello quote sem spike
mace spike
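As a rough sketch of this mapping, here is a toy character-level translator built from a key config like the one above. The `"` to "quote" and newline to "spike" entries are assumptions inferred from the example; the actual lm_utils scripts and config format may differ.

```python
# Illustrative sketch of the code-to-spoken-words mapping; the real
# lm_utils scripts work on a different config format.
KEYS = {" ": "spay", "{": "brace", "}": "mace",
        "\n": "spike", ";": "sem", '"': "quote"}

def speakify(code):
    """Replace mapped characters with spoken words, keeping other runs intact."""
    words = []
    token = ""
    for ch in code:
        if ch in KEYS:
            if token:
                words.append(token)
                token = ""
            words.append(KEYS[ch])
        else:
            token += ch
    if token:
        words.append(token)
    return " ".join(words)

print(speakify('return "hello";\n'))
# -> return spay quote hello quote sem spike
```

This reproduces the second line of somefile.js.modified above; the "sue shi" spoken form for "()" would need its own multi-character entry.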
We can then train a language model on these files to help improve recognition accuracy. The workflow is something like this: map each source file to its spoken equivalent using a key config, train a language model on the resulting corpus, and rebuild the kaldi decoder so it recognizes the new words.
Note that when you first use a new decoder, it takes about a minute for kaldi to set the adaptation state. This means your first impression will often be that it isn't working well, and you may be tempted to change the volume or tone of your voice to "help" the recognition. Try not to do this; just speak naturally for the first minute or so before making a determination.
You should read the comments in the lm_utils/scripts/makelm.sh file. This file is a collection of commands used to create language models and to format lexicon files so they will work with kaldi. As an example, here I will walk through the commands for creating a new 'command' decoder, which will be a mix of the English model downloaded from the kaldi website and the terminal words from our 'command' grammar.
First you will need the lm and lexicon from the aspire chain model. If you ran the server's silviux.sh install script you already have these files within the docker container's /opt/kaldi/egs/aspire/s5/ directory. The LM can be found in data/local/lm/4gram-mincount/lm_unpruned.gz; extract the 'lm_unpruned' file into the lm_utils/exp/ directory. The lexicon file can be found at data/local/dict/lexicon2_raw.txt; copy it to the lm_utils/exp directory as well.
Now we need to create a small language model from our command grammar to mix with the English lm and lexicon. We're going to run a script that outputs the terminals from our grammar, then use that output to create a lexicon and a bigram corpus for the language model.
# From the silviux project root
$ python3 parser_utils.py command > lm_utils/exp/terminals
$ python3 lm_utils/ngrams.py lm_utils/exp/terminals > lm_utils/exp/bigrams
$ ngram-count -wbdiscount -text lm_utils/exp/bigrams -lm lm_utils/exp/command.lm
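The contents of the bigram corpus aren't shown here. One plausible sketch of what a generator like lm_utils/ngrams.py might produce is a tiny corpus containing every ordered pair of terminal words, so that ngram-count observes each bigram at least once; the real script may work differently.

```python
# Hypothetical sketch of a bigram-corpus generator; the actual
# lm_utils/ngrams.py may use a different scheme.
def bigram_corpus(terminals):
    """Emit one ordered pair of terminal words per line."""
    return "\n".join(f"{a} {b}" for a in terminals for b in terminals)

print(bigram_corpus(["up", "down"]))
# up up
# up down
# down up
# down down
```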
We now have a language model for our terminal words at lm_utils/exp/command.lm. We're also going to need a lexicon for these terminal words. Upload the lm_utils/exp/terminals file to http://www.speech.cs.cmu.edu/tools/lextool.html (use the 'word file' option, then hit 'compile'). Save the .dict file to lm_utils/exp/command.dict. We're now going to format this file, combine it with the English lexicon, and finish off by removing any duplicate lines (kaldi won't compile our decoder with duplicates).
$ cd lm_utils
$ python3 lexicon-format.py exp/command.dict exp/terminals exp/commandlexicon.txt
$ cat exp/commandlexicon.txt exp/lexicon2_raw.txt > exp/merged.txt
$ sort exp/merged.txt | uniq > exp/mylexicon.txt
We now have the lexicon file, exp/mylexicon.txt, which we will use with the decoder creation script. The final step is to mix our command bigram language model with the English language model.
$ ngram -lm exp/lm_unpruned -mix-lm exp/command.lm -lambda 0.2 -write-lm exp/mymodel.lm
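Mixing here means linear interpolation: each n-gram probability in the output is a weighted average of the two models. Per SRILM's ngram documentation, -lambda is the weight of the main -lm model (the English one above), so with -lambda 0.2 the command model gets weight 0.8. A toy unigram-level illustration with made-up probabilities:

```python
# Toy illustration of linear LM interpolation; real n-gram mixing in
# SRILM also handles backoff weights, which this sketch ignores.
def mix(p_main, p_other, lam=0.2):
    """Weighted average of two unigram distributions."""
    vocab = set(p_main) | set(p_other)
    return {w: lam * p_main.get(w, 0.0) + (1 - lam) * p_other.get(w, 0.0)
            for w in vocab}

english = {"the": 0.5, "scroll": 0.1}  # made-up probabilities
command = {"scroll": 0.6, "up": 0.4}   # made-up probabilities
mixed = mix(english, command, lam=0.2)
print(round(mixed["scroll"], 2))  # -> 0.5  (0.2*0.1 + 0.8*0.6)
```

A low lambda like 0.2 biases recognition toward the small command vocabulary while keeping the English model available for dictation-like words.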
The exp/mylexicon.txt and exp/mymodel.lm files can now be placed into server/ (or whatever directory the docker container is mounted to) for running the decoder creation scripts.