A PyTorch implementation of Tacotron2, described in "Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions": an end-to-end text-to-speech (TTS) neural network architecture that directly converts a character sequence to speech.
$ pip install -r requirements.txt
See egs/ljspeech/run.sh for example usage.
First, download the LJ Speech dataset (it is freely available). Then:
$ cd egs/ljspeech
# Modify wav_dir to your LJ Speech dir
$ bash run.sh
That's all.
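The only edit you typically need is pointing the wav_dir variable inside run.sh at your copy of the dataset. A sketch of that line (the path is illustrative, not the repo's default):

# In egs/ljspeech/run.sh -- adjust to where you unpacked LJ Speech
wav_dir=/path/to/LJSpeech-1.1/wavs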
You can change a parameter with $ bash run.sh --parameter_name parameter_value, e.g., $ bash run.sh --stage 2. The available parameter names are the variables defined in egs/ljspeech/run.sh before the line . utils/parse_options.sh.
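For context: if run.sh follows the usual Kaldi recipe convention (which its use of utils/parse_options.sh suggests), every variable defined before that line automatically becomes a --name value command-line option. A minimal sketch of the pattern, using the stage and batch_size parameters mentioned in this README (the default values here are made up):

#!/bin/bash
# Variables defined before sourcing parse_options.sh become CLI options.
stage=1          # override with: bash run.sh --stage 2
batch_size=32    # override with: bash run.sh --batch_size 16

. utils/parse_options.sh   # rewrites the variables above from --name value args

echo "stage=$stage batch_size=$batch_size"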
Workflow of egs/ljspeech/run.sh: set PATH and PYTHONPATH, train the model, then synthesise audio. The script provides example usage:
# Set PATH and PYTHONPATH
$ cd egs/ljspeech/; . ./path.sh
# Train:
$ train.py -h
# Synthesise audio:
$ synthesis.py -h
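path.sh is what lets you call train.py and synthesis.py without a full path. Its exact contents may differ, but in this kind of layout it usually amounts to something like:

# Hypothetical path.sh -- directory names are illustrative
MAIN_ROOT=$PWD/../..
export PATH=$MAIN_ROOT/src:$PATH           # so train.py / synthesis.py resolve
export PYTHONPATH=$MAIN_ROOT/src:$PYTHONPATH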
If you want to visualize your loss, you can use visdom:
$ visdom
$ bash run.sh --visdom 1 --visdom_id "<any-string>"
or $ train.py ... --visdom 1 --visdom_id "<any-string>"
Then open <your-remote-server-ip>:8097 (e.g., 127.0.0.1:8097 if training locally) in your browser and select <any-string> in the Environment dropdown to see your loss.
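If training runs on a remote server, one common way to reach the visdom page is plain ssh port forwarding (standard ssh, nothing specific to this repo):

# Forward visdom's default port 8097 to your local machine
$ ssh -N -L 8097:localhost:8097 <user>@<your-remote-server-ip>
# then open 127.0.0.1:8097 in your local browser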
To continue training from a saved model:
$ bash run.sh --continue_from <model-path>
To train on multiple GPUs, use a comma-separated gpu-id sequence, such as:
$ bash run.sh --id "0,1"
If you run out of GPU memory, lower batch_size or use more GPUs:
$ bash run.sh --batch_size <lower-value>
or
$ bash run.sh --id "0,1"
This is a work in progress and any contribution is welcome (the dev branch is the main development branch).
For now, I implement the feature-prediction network plus Griffin-Lim to synthesise speech.
Attention alignment and synthesised audio at 37k iterations: