This code implements a basic MLP for speech recognition. The MLP is trained with pytorch, while feature extraction, alignments, and decoding are performed with Kaldi. The current implementation supports dropout and batch normalization. An example for phoneme recognition using the standard TIMIT dataset is provided.
This code implements a basic MLP for HMM-DNN speech recognition. The MLP is trained with pytorch, while feature extraction, alignments, and decoding are performed with Kaldi. The current implementation supports dropout and batch normalization. An example for phoneme recognition using the standard TIMIT dataset is provided.
Make sure that python is installed (the code is tested with python 2.7). Even though not mandatory, we suggest to use Anaconda (https://anaconda.org/anaconda/python).
If not already done, install pytorch (http://pytorch.org/) and make sure that the installation works. As a first test, type “python” and, once entered into the console, type “import torch”. Make sure everything is fine.
If not already done, install Kaldi (http://kaldi-asr.org/). As suggested during the installation, do not forget to add the path of the Kaldi binaries into $HOME/.bashrc. As a first test to check the installation, open a bash shell, type “copy-feats” and make sure no errors appear.
Install kaldi-io package from the kaldi-io-for-python project (https://github.com/vesis84/kaldi-io-for-python). It provides a simple interface between kaldi and python. To install it:
The code has been tested with:
This step is necessary to compute features and labels later used to train the pytorch MLP. In particular:
If you wanna use tri3 alignments, type:
steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev
steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test
If you wanna use dnn alignments (as suggested), type:
steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev
steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
Go to the pytorch_MLP_for_ASR folder. The create_chunks.sh script first shuffles or sorts (based on the sentence length) a kaldi feature list and then split it into a certain number of chunks. Shuffling a list could be good for feed-forward DNNs, while a sorted list can be useful for RNNs (not used here). The code also computes per-speaker and per-sentence CMVN.
For mfcc features run:
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data/train mfcc_lists 5 train 0
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data/dev mfcc_lists 1 dev 0
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data/test mfcc_lists 1 test 0
For fMLLR features run:
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3/train fmllr_lists 5 train 0
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3/dev fmllr_lists 1 dev 0
./create_chunks.sh $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3/test fmllr_lists 1 test 0
python MLP_ASR.py --cfg TIMIT_MLP_mfcc.cfg 2> log.log
or
python MLP_ASR.py --cfg TIMIT_MLP_fmllr.cfg 2> log.log
If everything is working fine, your output (for fMLLR features) should look like this:
epoch 1 training_cost=3.185270, training_error=0.690495, dev_error=0.549124, test_error=0.549172, learning_rate=0.080000, execution_time(s)=85.436095
epoch 2 training_cost=1.950891, training_error=0.533513, dev_error=0.498461, test_error=0.499940, learning_rate=0.080000, execution_time(s)=78.238582
epoch 3 training_cost=1.737371, training_error=0.489724, dev_error=0.474726, test_error=0.479321, learning_rate=0.080000, execution_time(s)=78.390865
epoch 4 training_cost=1.610313, training_error=0.461962, dev_error=0.464242, test_error=0.465437, learning_rate=0.080000, execution_time(s)=78.282750
epoch 5 training_cost=1.521487, training_error=0.442533, dev_error=0.455979, test_error=0.457632, learning_rate=0.080000, execution_time(s)=77.166886
epoch 6 training_cost=1.452035, training_error=0.426761, dev_error=0.451179, test_error=0.453436, learning_rate=0.080000, execution_time(s)=77.064029
epoch 7 training_cost=1.394820, training_error=0.413627, dev_error=0.443357, test_error=0.445354, learning_rate=0.080000, execution_time(s)=78.169549
epoch 8 training_cost=1.347145, training_error=0.402646, dev_error=0.441773, test_error=0.444214, learning_rate=0.080000, execution_time(s)=77.795720
epoch 9 training_cost=1.305390, training_error=0.392546, dev_error=0.437266, test_error=0.443972, learning_rate=0.080000, execution_time(s)=77.853706
epoch 10 training_cost=1.269022, training_error=0.383710, dev_error=0.434375, test_error=0.442246, learning_rate=0.080000, execution_time(s)=78.647257
epoch 11 training_cost=1.235830, training_error=0.375467, dev_error=0.431452, test_error=0.433421, learning_rate=0.080000, execution_time(s)=77.407901
epoch 12 training_cost=1.205622, training_error=0.368032, dev_error=0.432220, test_error=0.434976, learning_rate=0.080000, execution_time(s)=78.074666
epoch 13 training_cost=1.132139, training_error=0.348587, dev_error=0.419605, test_error=0.423768, learning_rate=0.040000, execution_time(s)=77.901673
epoch 14 training_cost=1.098085, training_error=0.339950, dev_error=0.418413, test_error=0.423302, learning_rate=0.040000, execution_time(s)=78.078263
epoch 15 training_cost=1.079947, training_error=0.335365, dev_error=0.418103, test_error=0.424390, learning_rate=0.040000, execution_time(s)=77.930880
epoch 16 training_cost=1.042311, training_error=0.325323, dev_error=0.412053, test_error=0.417673, learning_rate=0.020000, execution_time(s)=78.131381
epoch 17 training_cost=1.025281, training_error=0.320353, dev_error=0.413433, test_error=0.418553, learning_rate=0.020000, execution_time(s)=78.066443
epoch 18 training_cost=1.004788, training_error=0.314823, dev_error=0.408852, test_error=0.415479, learning_rate=0.010000, execution_time(s)=79.046657
epoch 19 training_cost=0.995931, training_error=0.312059, dev_error=0.409269, test_error=0.414081, learning_rate=0.010000, execution_time(s)=77.593635
epoch 20 training_cost=0.985299, training_error=0.309571, dev_error=0.407089, test_error=0.412613, learning_rate=0.005000, execution_time(s)=78.187198
epoch 21 training_cost=0.980231, training_error=0.308210, dev_error=0.406648, test_error=0.412423, learning_rate=0.005000, execution_time(s)=78.114028
epoch 22 training_cost=0.977153, training_error=0.307378, dev_error=0.406378, test_error=0.413494, learning_rate=0.005000, execution_time(s)=78.196592
epoch 23 training_cost=0.970285, training_error=0.305204, dev_error=0.405064, test_error=0.412578, learning_rate=0.002500, execution_time(s)=80.574895
epoch 24 training_cost=0.968486, training_error=0.304418, dev_error=0.406182, test_error=0.412734, learning_rate=0.002500, execution_time(s)=78.166218
During the last epoch, the training script creates a file pout_test.ark containing a set of likelihoods (i.e., normalized posterior probabilities) computed on the test sentences. These likelihoods can be used to feed the Kaldi decoder in this way:
cd kaldi_decoding_scripts
For mfcc features:
./decode_dnn_TIMIT.sh $KALDI_ROOT/egs/timit/s5/exp/tri3/graph \
$KALDI_ROOT/egs/timit/s5/data/test/ \
$KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali \
../TIMIT_MLP_mfcc/decoding_test \
"cat ../TIMIT_MLP_mfcc/pout_test.ark"
For fMLLR features
./decode_dnn_TIMIT.sh $KALDI_ROOT/egs/timit/s5/exp/tri3/graph \
$KALDI_ROOT/egs/timit/s5/data/test/ \
$KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali \
../TIMIT_MLP_fmllr/decoding_test \
"cat ../TIMIT_MLP_fmllr/pout_test.ark"
If everything if fine, you should obtain Phone Error Rates (PER%) similar to the following ones:
For reference purposes, you can take a look to our results here: TIMIT_MLP_fmllr_reference or TIMIT_MLP_mfcc_reference.
Note that, despite its simplicity, the performance obtained with this implementation is slightly better than that achieved with the kaldi baselines (even without pre-training or sMBR). For comparison purposes, see for instance the file $KALDI_ROOT/egs/timit/s5/RESULTS.
Note also that small variations of PER with respect to these reference values (e.g, +/- 0.5 %) are normal (since due to a different DNN initialization).
Please, cite my PhD thesis if you use this code:
[1] M. Ravanelli, "Deep Learning for Distant Speech Recognition", PhD Thesis, Unitn 2017