Speech recognition on the TIMIT (or any other) dataset
With this repo you can preprocess an audio dataset (modify phoneme classes, resample audio, etc.) and train LSTM networks for framewise phoneme classification. You can achieve 82% accuracy on the TIMIT dataset, similar to the results of Graves et al. (2013), although CTC is not used here. Instead, the network generates predictions in the middle of each phoneme interval, as specified by the labels. This simplifies things, but adding CTC shouldn't be too much trouble.
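The mid-interval targets mentioned above can be derived from the phoneme intervals like this (a minimal sketch; the repo's actual preprocessing scripts may differ, and the frame step of 160 samples / 10 ms is an assumption):

```python
def mid_frame_labels(intervals, frame_step=160):
    """For each (start_sample, end_sample, phoneme) interval, return the
    index of the frame closest to the interval's midpoint.
    frame_step: hop size in samples (160 samples = 10 ms at 16 kHz)."""
    labels = []
    for start, end, phoneme in intervals:
        mid = (start + end) // 2          # midpoint of the interval, in samples
        labels.append((mid // frame_step, phoneme))
    return labels

# Example: two phoneme intervals as read from a .phn file
print(mid_frame_labels([(0, 3200, "sil"), (3200, 6400, "ih")]))
# -> [(10, 'sil'), (30, 'ih')]
```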
In order to create and train a network on a dataset, you need to:
Install software. I recommend using Anaconda and a virtual environment:
conda env create -f environment.yml
Generate a binary file from the source dataset. It's easiest if you structure your data as in the TIMIT dataset, although that's not strictly required. Just make sure that each wav file and its corresponding phn file have the same path except for the extension; otherwise they won't get matched and your labels will be off.
dataset/TRAIN/speakerName/videoName/

Each videoName/ directory contains a videoName.wav and a videoName.phn. The phn file contains the audio sample numbers (at 16 kHz) where each phoneme starts and ends.
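In the TIMIT convention, each line of a .phn file has the form `start_sample end_sample phoneme`; a minimal parser:

```python
def read_phn(path):
    """Parse a .phn file: one 'start_sample end_sample phoneme' per line."""
    intervals = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:           # skip blank or malformed lines
                start, end, phoneme = parts
                intervals.append((int(start), int(end), phoneme))
    return intervals
```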
Remap the phoneme classes with transform.py; the result is stored under root/dataset/fixed(nbPhonemes)/:

transform.py phonemes -i dataRoot/TIMIT/original/ -o dataRoot/TIMIT/fixed
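The class reduction is just a many-to-one mapping over the phoneme labels; a sketch with a small illustrative mapping (these entries are examples of the standard 61-to-39 folding, not necessarily the exact mapping transform.py uses):

```python
# Illustrative subset of a 61 -> 39 phoneme folding; the real mapping
# applied by transform.py may differ.
FOLD = {"ux": "uw", "axr": "er", "em": "m", "eng": "ng"}

def fold_phonemes(intervals, fold=FOLD):
    """Replace each phoneme with its folded class; others pass through."""
    return [(s, e, fold.get(p, p)) for (s, e, p) in intervals]

print(fold_phonemes([(0, 100, "ux"), (100, 200, "sil")]))
# -> [(0, 100, 'uw'), (100, 200, 'sil')]
```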
Convert all file and directory names to lower case by running

find . -depth -print0 | xargs -0 rename '$_ = lc $_'
in the root dataset directory (change 'lc' to 'uc' to convert to upper case). Repeat until you get no more output.

Then run datasetToPkl.py to generate the binary file. It is stored as root/dataset/binary(nbPhonemes)/dataset/dataset_nbPhonemes_ch.pkl, e.g. root/TIMIT/binary39/TIMIT/TIMIT_39_ch.pkl. The training-set mean and standard deviation are stored as root/dataset/binary_nbPhonemes/dataset_MeanStd.pkl; they're useful for normalization when evaluating.

Use RNN.py to start training. Its functions are implemented in RNN_tools_lstm.py, but you can set the parameters from RNN.py:
set location of pkl generated by datasetToPkl.py
specify number of LSTM layers and number of units per layer
use bidirectional LSTM layers
add some dense layers (though it did not improve performance for me)
learning rate and decay (the LR is updated at the end of RNN_tools_lstm.py); the LR is decreased if the performance hasn't improved for some time
RNN.py automatically gives the model a name based on the specified parameters. A log file, the model parameters, and a pkl file containing training info (accuracy, error, etc. for each epoch) are stored as well. The storage location is root/dataset/results.
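The learning-rate schedule described above (decrease the LR when performance stalls) can be sketched as follows; the patience and decay factor values here are illustrative assumptions, not the repo's defaults:

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when the monitored metric
    hasn't improved for `patience` consecutive epochs."""
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.wait = float("-inf"), 0

    def step(self, val_acc):
        if val_acc > self.best:
            self.best, self.wait = val_acc, 0   # improvement: reset patience
        else:
            self.wait += 1
            if self.wait >= self.patience:      # stalled: decay the LR
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

Call step() once per epoch with the validation accuracy and use the returned LR for the next epoch.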
To evaluate a dataset, change the test_dataset variable in combinedSR/combinedNN.py to whatever you want (TIMIT/TCDTIMIT/combined).
On TIMIT, you should get about 82% accuracy using a 2-layer, 256 units/layer bidirectional LSTM network. You should get about 67% on TCD-TIMIT.
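Framewise accuracy here is simply the fraction of evaluated frames whose predicted phoneme class matches the target class:

```python
def framewise_accuracy(predicted, target):
    """Fraction of frames where the predicted class equals the target class."""
    assert len(predicted) == len(target)
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)

print(framewise_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # -> 0.75
```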
The TIMIT dataset is non-free and available from https://catalog.ldc.upenn.edu/LDC93S19.
The TCD-TIMIT dataset is free for research and available from https://sigmedia.tcd.ie/TCDTIMIT/.
If you want to use TCD-TIMIT, I recommend using my repo TCDTIMITprocessing to download and extract the database; it's quite a nasty job otherwise. You can use extractTCDTIMITaudio.py to get the phoneme and wav files.
If you want to do lipreading or audio-visual speech recognition, check out my other repository, MultimodalSR.