A fast CNN-based vocoder
NOTE: I'm no longer working on this project. See #9.
This work is inspired by the MCNN model described in Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks. The authors show that even a simple upsampling network is enough to synthesize a waveform from a spectrogram or mel-spectrogram.
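The upsampling idea can be illustrated with a toy numpy sketch. This is not the actual model: the bin count, hop length, and the fixed averaging kernel below are stand-ins for what the real network learns with transposed convolutions.

```python
import numpy as np

def upsample_frames(spec_frames, hop_length):
    """Nearest-neighbor upsampling: repeat each spectrogram frame
    hop_length times along the time axis, so the number of output
    samples matches the waveform length the frames cover."""
    return np.repeat(spec_frames, hop_length, axis=-1)

def smooth(signal, kernel):
    """1-d convolution; a trained vocoder would learn these filter
    weights, here a fixed averaging kernel stands in for them."""
    return np.convolve(signal, kernel, mode="same")

# 10 frames of an 80-bin spectrogram with a hop of 256 samples
frames = np.random.rand(80, 10)
up = upsample_frames(frames, 256)                 # shape (80, 2560)

# Collapse the frequency axis and smooth -- a crude stand-in for the
# learned projection from upsampled features down to a 1-d waveform.
waveform = smooth(up.mean(axis=0), np.ones(5) / 5)
```

The point is only the shape arithmetic: each spectrogram frame must expand into `hop_length` waveform samples, and the network's job is to learn filters that make that expansion sound right.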
In this repo, I use the spectrogram feature for training because it contains more information than the mel-spectrogram feature. However, because the transformation from spectrogram to mel-spectrogram is just a linear projection, you could in principle train a simple network to predict the spectrogram from the mel-spectrogram. You can also change the parameters to train a vocoder directly from mel-spectrogram features.
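The "just a linear projection" claim can be checked with a small numpy sketch. The filterbank here is a random stand-in for a real mel filter matrix (not the one used in this repo), and the pseudoinverse is only an illustration of the kind of inverse mapping a learned linear layer would approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 513 linear-frequency bins projected to 80 mel bins.
n_fft_bins, n_mels = 513, 80
mel_basis = rng.random((n_mels, n_fft_bins))  # stand-in for a mel filterbank

spec = rng.random((n_fft_bins, 100))          # linear spectrogram, 100 frames
mel = mel_basis @ spec                        # mel-spectrogram is a matrix product

# Linearity: projecting a sum equals the sum of projections.
a, b = rng.random((n_fft_bins, 100)), rng.random((n_fft_bins, 100))
assert np.allclose(mel_basis @ (a + b), mel_basis @ a + mel_basis @ b)

# An approximate inverse mapping via the pseudoinverse -- roughly what a
# trained linear layer recovering spectrogram from mel would learn.
spec_approx = np.linalg.pinv(mel_basis) @ mel
```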
Compared with MCNN, my proposed network has some differences:
$ pip install -r requirements.txt
I use the LJSpeech dataset for my experiments. If you don't have it yet, please download it and put it somewhere.
After that, run the following command to generate the dataset for the experiment:
$ python preprocessing.py --samples_per_audio 20 \
--out_dir ljspeech \
--data_dir path/to/ljspeech/dataset \
--n_workers 4
$ python train.py --out_dir ${output_directory}
For more training options, please run:
$ python train.py --help
$ python gen_spec.py -i sample.wav -o out.npz
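For context, the spectrogram that gen_spec.py saves could be sketched roughly as below. This is an assumption about the pipeline, not the repo's actual code: the FFT size, hop length, and the `spec` key in the `.npz` file are hypothetical parameters chosen for illustration.

```python
import numpy as np

def stft_magnitude(wav, n_fft=1024, hop=256):
    """Magnitude spectrogram via a simple framed FFT with a Hann window.
    Returns an array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1)).T

# A 1-second 440 Hz tone at 22050 Hz stands in for sample.wav.
sr = 22050
t = np.arange(sr) / sr
wav = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(wav)
# Save in an .npz container; the key name here is a guess, not
# necessarily what synthesis.py expects.
np.savez("out.npz", spec=spec)
```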
$ python synthesis.py --model_path path/to/checkpoint \
--spec_path out.npz \
--out_path out.wav
You can get my pre-trained model here.
This implementation uses code from NVIDIA, Ryuichi Yamamoto, and Keith Ito, as noted in the source code.