An implementation of the Glow-TTS model. Several modes are added: speaker embedding, prosody encoder (GST), and gradient reversal.
* Glow TTS
* GE2E speaker embedding
* Prosody encoder (Global style token layer)
* Gradient reversal layer
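The gradient reversal layer listed above can be sketched in PyTorch as follows. This is a minimal illustration of the technique, not the repository's exact module; the scale factor `lambda_` is an assumed parameter name.

```python
import torch


class GradientReversalFunction(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda_ in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient; no gradient for lambda_ itself.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    return GradientReversalFunction.apply(x, lambda_)
```

Placed between a shared encoder and an auxiliary classifier (e.g. a speaker classifier), this makes the encoder learn features that are adversarial to that classifier while the classifier itself trains normally.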
* torch >= 1.5.1
* tensorboardX >= 2.0
* librosa >= 0.7.2
* matplotlib >= 3.1.3
* Optional for loss flow
| Single | Multi | Dataset | Dataset address |
|---|---|---|---|
| O | O | LJSpeech | https://keithito.com/LJ-Speech-Dataset/ |
| X | X | BC2013 | http://www.cstr.ed.ac.uk/projects/blizzard/ |
| X | O | CMU Arctic | http://www.festvox.org/cmu_arctic/index.html |
| X | O | VCTK | https://datashare.is.ed.ac.uk/handle/10283/2651 |
| X | X | LibriTTS | https://openslr.org/60/ |
Before proceeding, set the pattern, inference, and checkpoint paths in `Hyper_Parameters.yaml` according to your environment.
* `Sound`
* `Use_Cython_Alignment`
    * If `true`, the cython implementation of the official code will be used.
    * If `false`, the python implementation will be used.
* `Encoder`
* `Decoder`
* `WaveNet`
    * If `null`, the model does not export wav files.
    * If not `null`, all parameters must match the pre-trained Parallel WaveGAN model.
* `Speaker_Embedding`
    * `Type`: you can select `null`, `'LUT'`, or `'GE2E'`.
        * `null`: no speaker embedding (single-speaker version).
        * `'LUT'`: the model generates a lookup table over the speakers.
        * `'GE2E'`: the model uses d-vectors generated by a pretrained GE2E model.
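The difference between the two multi-speaker options can be sketched as follows. The sizes are hypothetical, and the random tensor only stands in for a pretrained GE2E model's output:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
NUM_SPEAKERS, EMBEDDING_DIM = 4, 128

# 'LUT': a trainable lookup table with one learned vector per speaker ID,
# optimized jointly with the TTS model.
lut = nn.Embedding(NUM_SPEAKERS, EMBEDDING_DIM)
speaker_ids = torch.tensor([0, 3])
lut_embedding = lut(speaker_ids)  # shape: (2, EMBEDDING_DIM)

# 'GE2E': d-vectors come from a separate pretrained GE2E speaker encoder and
# are not trained here; a random tensor stands in for that encoder's output.
d_vector = torch.randn(2, EMBEDDING_DIM)
ge2e_embedding = d_vector / d_vector.norm(dim=1, keepdim=True)  # L2-normalized
```

The LUT is limited to speakers seen during training, while GE2E d-vectors can in principle represent unseen speakers, since they come from an external speaker encoder.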
* `Token path`
* `Train`
* `Inference_Batch_Size`
    * If `null`, it will be the same as `Train/Batch_Size`.
* `Inference_Path`
* `Checkpoint_Path`
* `Log_Path`
* `Use_Mixed_Precision`
    * If `true`, Nvidia apex must be installed in the environment.
* `Device`
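For orientation, the path-related entries might look like the fragment below. The nesting, paths, and values are placeholders for illustration, not the repository's defaults; check `Hyper_Parameters.yaml` itself for the authoritative layout.

```yaml
# Illustrative values only; set these to match your environment.
Use_Cython_Alignment: true
Speaker_Embedding:
    Type: 'LUT'                # null, 'LUT', or 'GE2E'
Train:
    Batch_Size: 32             # hypothetical value
Inference_Batch_Size: null     # null falls back to Train/Batch_Size
Inference_Path: ./results/inference
Checkpoint_Path: ./results/checkpoint
Log_Path: ./results/log
Use_Mixed_Precision: false     # if true, Nvidia apex is required
Device: '0'
```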
```
python Pattern_Generate.py [parameters]
```
At least one dataset must be used.
```
python Train.py -s <int>
```
* `-s <int>`
| Mode | Dataset | Trained steps | Link |
|---|---|---|---|
| Vanilla | LJ | 100000 | Link (broken) |
| SE & LUT | LJ + CMU Arctic | 100000 | Link |
| SE & LUT | LJ + VCTK | 100000 | Link |
| PE | LJ + CMU Arctic | 100000 | Link |
| PE | LJ + VCTK | 400000 | Link |
| GR & LUT | LJ + VCTK | 400000 | Link (failed) |