THEANO-KALDI-RNNs is a toolkit that implements various Recurrent Neural Networks (RNNs) for use in a Kaldi-based hybrid HMM/RNN speech recognizer. The Theano code is coupled with the Kaldi decoder.
Note: A new project called "pytorch-kaldi" (https://github.com/mravanelli/pytorch-kaldi) is now available. If you are interested, please take a look at it.
The current version supports the following standard architectures:
The code also considers some architectural variations:
The latter architectures have been explored in [1] (see reference). Please cite this paper if you use this toolkit or a part of it.
All the RNNs are based on a state-of-the-art technology which includes:
If not already done, install KALDI (http://kaldi-asr.org/) and make sure that your KALDI installation is working.
Run the original TIMIT Kaldi recipe (egs/timit/s5/run.sh) and check that everything works properly. This step is necessary to compute the features and labels that the Theano/Python part of this code will inherit.
Install THEANO (http://deeplearning.net/software/theano/install.html) and make sure your installation is working. Try, for instance, typing "import theano" in a Python shell and check that everything works fine.
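A quick way to verify the installation without failing hard is to check whether the module can be found before importing it. This is a minimal sketch; the helper name module_available is illustrative and not part of the toolkit:

```python
import importlib.util

def module_available(name):
    """Return True if the named module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

# Check that Theano is visible to Python before running any recipe
print("theano available:", module_available("theano"))
```

If this prints False, fix the Theano installation (or the active Python environment) before proceeding.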
The code has been tested with:
This step is necessary to derive the labels later used to train the RNN. In particular:
# Align the dev set with the fMLLR-adapted tri3 model
steps/align_fmllr.sh --nj "$train_nj" --cmd "$train_cmd" \
data/dev data/lang exp/tri3 exp/tri3_ali_dev

# Align the test set
steps/align_fmllr.sh --nj 24 --cmd "$train_cmd" \
data/test data/lang exp/tri3 exp/tri3_ali_test
# Inspect the GMM/HMM model; among other things, this reports the number
# of pdfs, which determines the output dimension of the RNN
am-info exp/tri3/final.mdl

# Pdf counts (used to normalize the RNN posteriors before decoding)
/exp/dnn4_pretrain-dbn_dnn/ali_train_pdf.counts
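In a hybrid HMM/RNN system, the counts file is typically used to convert the network's posteriors p(state|x) into scaled likelihoods p(x|state) for the HMM decoder, by dividing each posterior by the corresponding pdf prior (in the log domain). A minimal sketch of this standard step; the function name and the toy values are illustrative, not taken from the toolkit:

```python
import numpy as np

def posteriors_to_loglikes(posteriors, counts, eps=1e-10):
    """Convert posteriors p(state|x) into scaled log-likelihoods
    log p(x|state) = log p(state|x) - log p(state), with the priors
    p(state) estimated from the pdf counts."""
    priors = counts / counts.sum()
    return np.log(posteriors + eps) - np.log(priors + eps)

# Toy example: 4 pdfs, one frame of posteriors
counts = np.array([100.0, 300.0, 500.0, 100.0])
post = np.array([[0.1, 0.2, 0.6, 0.1]])
print(posteriors_to_loglikes(post, counts))
```

Note that when a posterior equals its prior (as for the first pdf above), the scaled log-likelihood is zero.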
After the training, forward, and decoding phases are finished, you can go into the kaldi_decoding_scripts folder and run ./RESULTS to check the system performance.
Note that the performance obtained can differ slightly from that reported in the paper due, for instance, to the randomness introduced by different initializations. To mitigate this source of randomness and perform a fair comparison across the various architectures, in [1] we ran multiple experiments with different seeds (i.e., setting a different seed in the cfg_file) and averaged the resulting error rates.
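Averaging over seeds can be done as in the following sketch; the per-seed error rates below are placeholders for illustration, not results from [1]:

```python
# Error rates (in %) from runs that differ only in the random seed
# (placeholder values, not actual results)
pers = [18.9, 19.2, 18.7, 19.0]

mean_per = sum(pers) / len(pers)
# Sample standard deviation, to also report the spread across seeds
var = sum((p - mean_per) ** 2 for p in pers) / (len(pers) - 1)
std_per = var ** 0.5

print(f"PER: {mean_per:.2f} +/- {std_per:.2f}")
```

Reporting the mean together with the standard deviation makes the comparison across architectures less sensitive to a single lucky (or unlucky) initialization.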
Please note that this is an ongoing project. Reporting any issues you find would be very helpful!
[1] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Improving speech recognition by revising Gated Recurrent Units", in Proceedings of Interspeech 2017