PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
Branches: naive, main
You can install the Python dependencies with
pip3 install -r requirements.txt
You have to download the pretrained models and put them in output/ckpt/LibriTTS_meta_learner/.
For English multi-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --ref_audio path/to/reference_audio.wav --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
The generated utterances will be put in output/result/. Your synthesized speech will have the style of ref_audio.
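If you prefer to drive single-utterance synthesis from Python, the command above can be assembled as an argument list. This is only a minimal sketch: build_synthesize_cmd is a hypothetical helper that is not part of the repository, and it uses nothing beyond the flags shown in the command above.

```python
import shlex


def build_synthesize_cmd(text, ref_audio, restore_step=200000, dataset="LibriTTS"):
    """Assemble the single-mode synthesize.py command as an argument list.

    Hypothetical helper for illustration; it only mirrors the flags
    documented in this README.
    """
    config_dir = f"config/{dataset}"
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "-p", f"{config_dir}/preprocess.yaml",
        "-m", f"{config_dir}/model.yaml",
        "-t", f"{config_dir}/train.yaml",
    ]


cmd = build_synthesize_cmd("YOUR_DESIRED_TEXT", "path/to/reference_audio.wav")
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, so that the relative config paths resolve.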
Batch inference is also supported. Try
python3 synthesize.py --source preprocessed_data/LibriTTS/val.txt --restore_step 200000 --mode batch -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
to synthesize all utterances in preprocessed_data/LibriTTS/val.txt. This can be viewed as a reconstruction of the validation set, where each utterance serves as its own style reference.
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml --duration_control 0.8 --energy_control 0.8
Note that the controllability originates from FastSpeech2 and is not a central interest of StyleSpeech. Please refer to STYLER [demo, code] for the controllability of each style factor.
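To make the ratio semantics concrete, here is a minimal sketch of how such control ratios are commonly applied in FastSpeech2-style models: each ratio multiplies the corresponding predicted value, so a duration ratio below 1 shortens every phoneme and speeds up the speech. The function apply_controls and the sample values are hypothetical; the repository itself may apply the ratios differently (for example on tensors, or at a different point relative to rounding).

```python
def apply_controls(durations, energies, duration_control=1.0, energy_control=1.0):
    """Scale per-phoneme durations (in mel frames) and energies by control ratios.

    Illustrative assumption: ratios are simple multipliers, as in
    FastSpeech2-style variance control.
    """
    # Durations count mel frames, so keep them as non-negative integers.
    scaled_durations = [max(0, round(d * duration_control)) for d in durations]
    scaled_energies = [e * energy_control for e in energies]
    return scaled_durations, scaled_energies


# Made-up predictions for three phonemes.
durations = [5, 10, 3]
energies = [1.0, 0.5, 2.0]

new_durations, new_energies = apply_controls(durations, energies, 0.8, 0.8)
print("total frames:", sum(durations), "->", sum(new_durations))
```

With both ratios left at 1.0 the predictions pass through unchanged, which corresponds to running synthesize.py without the control flags.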
The supported datasets include LibriTTS, which all commands in this README assume.
Run
python3 prepare_align.py config/LibriTTS/preprocess.yaml
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files in preprocessed_data/LibriTTS/TextGrid/. Alternatively, you can run the aligner yourself.
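Before preprocessing, it can help to confirm that the alignment files landed where expected. The following is a small sketch, not part of the repository: count_textgrids is a hypothetical helper, and it assumes the alignments use the .TextGrid extension produced by MFA and may be nested in subdirectories.

```python
from pathlib import Path


def count_textgrids(root="preprocessed_data/LibriTTS/TextGrid"):
    """Return the number of .TextGrid files found anywhere under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # A missing directory means nothing has been unzipped yet.
        return 0
    # rglob searches recursively, so nested subfolders are covered.
    return sum(1 for _ in root_path.rglob("*.TextGrid"))


print(f"Found {count_textgrids()} TextGrid files")
```

A count of zero suggests the archive was unzipped to a different location or the aligner has not been run yet.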
After that, run the preprocessing script by
python3 preprocess.py config/LibriTTS/preprocess.yaml
Train your model with
python3 train.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
As described in the paper, the script first pre-trains the naive model for meta_learning_warmup steps and then meta-trains the model for additional steps via episodic training.
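The two-phase schedule can be pictured with a small sketch. It is an illustration rather than the repository's training loop: training_phase is a hypothetical function, the warmup value is made up, and treating the step equal to meta_learning_warmup as the first meta-training step is an assumption.

```python
def training_phase(step, meta_learning_warmup):
    """Return which phase a given global step belongs to.

    Assumption: steps strictly below the warmup threshold are naive
    pre-training; all later steps use episodic meta-training.
    """
    return "pretrain" if step < meta_learning_warmup else "meta"


# Hypothetical warmup value, chosen only for illustration.
warmup = 1000
for step in (0, 999, 1000, 5000):
    print(step, training_phase(step, warmup))
```

The actual threshold is whatever meta_learning_warmup is set to in the training configuration.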
Use
tensorboard --logdir output/log/LibriTTS
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
A 22050Hz sampling rate is used instead of 16kHz.
[...] 80 to 128.
[...] 28.197M.
A batch size of 16 is used for training instead of 48 or 20, mainly due to the limited memory capacity of a single 24GiB TITAN-RTX. This can be achieved with the following script, which filters out data longer than max_seq_len:
python3 filelist_filtering.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml
This will generate train_filtered.txt in the same location as train.txt.
@misc{lee2021stylespeech,
author = {Lee, Keon},
title = {StyleSpeech},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/keonlee9420/StyleSpeech}}
}