A Non-Autoregressive Transformer-based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best Non-AR TTS are welcome :)
One TTS Alignment To Rule Them All (Badlani et al., 2021): we are finally freed from external aligners such as MFA! As an example, validation alignments for LJ014-0329 up to 70K steps are shown below.
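One ingredient of that approach is a forward-sum alignment objective: the soft text-to-mel attention is scored with a CTC-style loss so that all monotonic alignments are summed over. The following is only an illustrative PyTorch sketch of that idea; the function name, tensor shapes, and the blank_logprob default are assumptions, not this repository's code.

```python
import torch
import torch.nn.functional as F

def forward_sum_loss(log_attn, text_lens, mel_lens, blank_logprob=-1.0):
    """Illustrative forward-sum alignment loss (not the repo's exact implementation).

    log_attn: (B, T_mel, T_text) unnormalized log-attention scores of each
    mel frame over the phoneme positions.
    """
    loss = 0.0
    for b in range(log_attn.size(0)):
        t_mel, t_text = int(mel_lens[b]), int(text_lens[b])
        # Prepend a "blank" column so the scores can be fed to CTC (blank index 0).
        emissions = F.pad(log_attn[b, :t_mel, :t_text], (1, 0), value=blank_logprob)
        emissions = F.log_softmax(emissions, dim=-1).unsqueeze(1)  # (T_mel, 1, T_text + 1)
        # The CTC "transcript" is simply 1..T_text: every phoneme must be visited
        # in order, so ctc_loss sums the probability of all monotonic alignments.
        target = torch.arange(1, t_text + 1).unsqueeze(0)
        loss = loss + F.ctc_loss(
            emissions, target,
            input_lengths=torch.tensor([t_mel]),
            target_lengths=torch.tensor([t_text]),
            blank=0,
        )
    return loss / log_attn.size(0)
```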
Model | Memory Usage (used / total) | Training Time (1K steps) |
---|---|---|
Fastformer (lucidrains') | 10531MiB / 24220MiB | 4m 25s |
Fastformer (wuch15's) | 10515MiB / 24220MiB | 4m 45s |
Long-Short Transformer | 10633MiB / 24220MiB | 5m 26s |
Conformer | 18903MiB / 24220MiB | 7m 4s |
Reformer | 10293MiB / 24220MiB | 10m 16s |
Transformer | 7909MiB / 24220MiB | 4m 51s |
Transformer_fs2 | 11571MiB / 24220MiB | 4m 53s |
Toggle the type of building blocks by

```yaml
# In the model.yaml
block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]
```
Toggle the type of prosody modeling by

```yaml
# In the model.yaml
prosody_modeling:
  model_type: "none" # ["none", "du2021", "liu2021"]
```
Toggle the type of duration modeling by

```yaml
# In the model.yaml
duration_modeling:
  learn_alignment: True # True for unsupervised modeling, False for supervised modeling
```
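As a quick sanity check, you can read the active settings back from the YAML config. This is only a minimal sketch; the config path below follows the usual config/DATASET/model.yaml layout and is an assumption, so adjust it to your setup.

```python
import yaml

# Minimal sketch: inspect which building block, prosody modeling, and duration
# modeling are active. The path is an assumption; point it at your model.yaml.
with open("config/LJSpeech/model.yaml") as f:
    model_config = yaml.safe_load(f)

print("block_type:", model_config["block_type"])
print("prosody model:", model_config["prosody_modeling"]["model_type"])
if model_config["duration_modeling"]["learn_alignment"]:
    print("Unsupervised duration modeling (alignments learned jointly, no MFA needed).")
else:
    print("Supervised duration modeling (durations taken from MFA alignments).")
```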
In the following sections, DATASET refers to the name of a dataset such as LJSpeech or VCTK.
You can install the Python dependencies with

```
pip3 install -r requirements.txt
```
Also, a Dockerfile is provided for Docker users.
You have to download the pretrained models and put them in output/ckpt/DATASET/. The provided models were trained under unsupervised duration modeling with the "transformer_fs2" building block.
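If you want to verify a downloaded checkpoint before running synthesis, a quick sketch follows. It assumes the checkpoints are standard PyTorch files named by training step; the file name 900000.pth.tar is only an example, so use the step you actually downloaded.

```python
import torch

# Quick sanity check (hypothetical file name): confirm the checkpoint can be
# deserialized on CPU before calling synthesize.py.
ckpt = torch.load("output/ckpt/LJSpeech/900000.pth.tar", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else "")
```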
For a single-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```
For a multi-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```
The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
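To pick a valid SPEAKER_ID, you can inspect that dictionary directly. This small sketch assumes the JSON maps speaker names to integer IDs and uses VCTK as the example dataset.

```python
import json

# List the learned speakers and their IDs (layout assumed: name -> integer ID).
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)

for name, speaker_id in list(speakers.items())[:5]:
    print(name, "->", speaker_id)
```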
Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```

to synthesize all utterances in preprocessed_data/DATASET/val.txt.
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
```

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
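These control arguments are ratios, in effect multiplicative factors applied to the model's predicted prosody values. The toy illustration below uses made-up numbers and is not the repository's internal code; it only shows why a duration ratio below 1.0 speeds up the speech.

```python
import numpy as np

# Toy example: a duration ratio below 1.0 shortens each phoneme's predicted
# duration (fewer mel frames), which speeds up the synthesized speech.
predicted_durations = np.array([4.0, 7.0, 3.0, 5.0])  # frames per phoneme (made up)
duration_control = 0.8
scaled_durations = np.clip(np.round(predicted_durations * duration_control), 1, None)
print(scaled_durations)  # [3. 6. 2. 4.]
```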
The supported datasets are LJSpeech (single-speaker) and VCTK (multi-speaker). Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted following here.
For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.
Run

```
python3 prepare_align.py --dataset DATASET
```

for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner by yourself.
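For reference, the supervised durations come from the phone intervals stored in these TextGrids. The rough sketch below converts intervals into per-phoneme frame counts; the tier name, audio parameters, and file path are assumptions for illustration, not the repository's actual preprocessing code.

```python
import tgt  # TextGridTools: pip install tgt

# Rough sketch: turn MFA phone intervals into frame durations.
# Sampling rate, hop length, tier name, and path are assumed for illustration.
sampling_rate, hop_length = 22050, 256
tg = tgt.io.read_textgrid("preprocessed_data/LJSpeech/TextGrid/LJ001-0001.TextGrid")
phones = tg.get_tier_by_name("phones")

for interval in phones.annotations:
    start_frame = int(round(interval.start_time * sampling_rate / hop_length))
    end_frame = int(round(interval.end_time * sampling_rate / hop_length))
    print(interval.text, end_frame - start_frame)
```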
After that, run the preprocessing script by

```
python3 preprocess.py --dataset DATASET
```
Train your model with

```
python3 train.py --dataset DATASET
```
Useful options:

- To use Automatic Mixed Precision, append the --use_amp argument to the above command.
- To train on specific GPUs, add CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.

Use

```
tensorboard --logdir output/log
```

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.
ID | Model | Block Type | Pitch Conditioning |
---|---|---|---|
1 | LJSpeech_transformer_fs2_cwt | transformer_fs2 | continuous wavelet transform |
2 | LJSpeech_transformer_cwt | transformer | continuous wavelet transform |
3 | LJSpeech_transformer_frame | transformer | frame-level f0 |
4 | LJSpeech_transformer_ph | transformer | phoneme-level f0 |
Notes:

- The ablation models above differ only in the building block and the pitch conditioning.
- The speaker embedder for a multi-speaker TTS can be toggled in the config between 'none' and 'DeepSpeaker'.
- Adjust var_start_steps for better model convergence, especially under unsupervised duration modeling.
Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).