A PyTorch Implementation of "Neural Speech Synthesis with Transformer Network"
pip install -r requirements.txt
You can check some generated samples below. All samples are from step 160k, so I think the model has not converged yet. The model seems to perform worse on long sentences.
The first plot is the predicted mel spectrogram, and the second is the ground truth.
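Side-by-side plots like these can be produced with a few lines of matplotlib. The sketch below is only an illustration of that kind of figure (the repository's own plotting code may differ); `plot_mel_pair` and its arguments are hypothetical names, and the inputs are assumed to be arrays of shape (frames, mel bins):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_mel_pair(predicted, ground_truth, path="mel_comparison.png"):
    """Stack the predicted mel spectrogram above the ground truth.

    Both inputs are (frames, mel_bins) arrays; names are illustrative.
    """
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    for ax, mel, title in zip(axes, (predicted, ground_truth),
                              ("Predicted mel", "Ground-truth mel")):
        # Transpose so time runs along x and mel bins along y.
        ax.imshow(mel.T, aspect="auto", origin="lower")
        ax.set_title(title)
        ax.set_ylabel("Mel bin")
    axes[-1].set_xlabel("Frame")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```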
File description:
- hyperparams.py: includes all hyperparameters that are needed.
- prepare_data.py: preprocesses wav files into mel and linear spectrograms and saves them for faster training. Preprocessing code for text is in the text/ directory.
- preprocess.py: includes all preprocessing code used when loading data.
- module.py: contains all building blocks, including attention, prenet, postnet and so on.
- network.py: contains the networks, including the encoder, decoder and post-processing network.
- train_transformer.py: trains the autoregressive attention network (text --> mel).
- train_postnet.py: trains the post network (mel --> linear).
- synthesis.py: generates a TTS sample.

Training:
1. Adjust the hyperparameters in hyperparams.py, especially 'data_path', the directory where you extracted the data files, and the others if necessary.
2. Run prepare_data.py.
3. Run train_transformer.py.
4. Run train_postnet.py.
5. Run synthesis.py. Make sure to set the restore step correctly.
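The mel/linear extraction that prepare_data.py performs can be sketched in plain NumPy. The FFT size, hop length, and 80 mel bands below are illustrative assumptions, not necessarily the repository's actual settings (those live in hyperparams.py):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            if c > l:
                fb[i - 1, k] = (k - l) / (c - l)  # rising slope
        for k in range(c, r):
            if r > c:
                fb[i - 1, k] = (r - k) / (r - c)  # falling slope
    return fb

def spectrograms(wav, n_fft=1024, hop=256):
    # Linear spectrogram: magnitude of the Hann-windowed STFT.
    win = np.hanning(n_fft)
    frames = [wav[s:s + n_fft] * win
              for s in range(0, len(wav) - n_fft, hop)]
    linear = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (T, n_fft//2 + 1)
    mel = linear @ mel_filterbank().T                        # (T, 80)
    return linear, mel

if __name__ == "__main__":
    t = np.linspace(0, 1, 22050, endpoint=False)
    wav = np.sin(2 * np.pi * 440.0 * t)  # 1 s of a 440 Hz tone
    linear, mel = spectrograms(wav)
    print(linear.shape, mel.shape)  # prints (83, 513) (83, 80)
```

In the actual pipeline, both arrays would be written to disk once so that training never recomputes them; the post network is then trained to invert the mel --> linear mapping.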