PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
Validation logs of the synthesized mel-spectrogram and alignment, up to 70K steps, are shown below (VCTK_val_p237-088).
In the following sections, DATASET refers to the name of a dataset, such as VCTK.
You can install the Python dependencies with
pip3 install -r requirements.txt
A Dockerfile is also provided for Docker users.
Download the pretrained models and put them in output/ckpt/DATASET/.
For a multi-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO
to synthesize speech in the style of the input audio at REF_AUDIO. The dictionary of learned speakers can be found at preprocessed_data/VCTK/speakers.json, and the generated utterances will be put in output/result/.
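To pick a value for SPEAKER_ID, the speaker dictionary can be queried by name. The following is a minimal sketch, assuming speakers.json is a flat JSON object mapping speaker names to integer IDs. That layout, the helper name speaker_id_from_json, and the example IDs are assumptions for illustration, not taken from the repository.

```python
import json


def speaker_id_from_json(json_text, speaker_name):
    """Return the ID stored for speaker_name.

    Assumes json_text is a flat JSON object of the form
    {"speaker_name": integer_id, ...} (hypothetical layout).
    """
    speakers = json.loads(json_text)
    if speaker_name not in speakers:
        raise KeyError(f"unknown speaker: {speaker_name}")
    return speakers[speaker_name]


# Hypothetical file content; the real IDs may differ.
example = '{"p225": 0, "p226": 1, "p237": 2}'
print(speaker_id_from_json(example, "p237"))
```

To query the real file, read preprocessed_data/VCTK/speakers.json into a string and pass it as json_text.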
Batch inference is also supported. Try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in preprocessed_data/DATASET/val.txt, using each utterance as its own reference audio.
The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, to speed up speech by scaling durations to 80 % and to decrease the volume by 20 %, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO --duration_control 0.8 --energy_control 0.8
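The effect of the control ratios can be illustrated with a small sketch. It assumes each ratio is a multiplicative factor applied to per-phoneme values, with durations rounded to whole frames. The function apply_controls and this exact rounding are hypothetical and may differ from what synthesize.py does internally.

```python
def apply_controls(durations, energies, duration_control=1.0, energy_control=1.0):
    """Scale per-phoneme frame counts and energy values by the given ratios.

    Illustrative only: assumes the controls are plain multiplicative factors.
    """
    # Durations are frame counts, so round to integers and never go below zero.
    scaled_durations = [max(0, round(d * duration_control)) for d in durations]
    scaled_energies = [e * energy_control for e in energies]
    return scaled_durations, scaled_energies


durations = [5, 10, 20]       # frames per phoneme (made-up values)
energies = [1.0, 0.5, 2.0]    # energy per phoneme (made-up values)
new_durations, new_energies = apply_controls(durations, energies, 0.8, 0.8)
print(sum(durations), sum(new_durations))
```

With these made-up values the total length drops from 35 to 28 frames, so a duration ratio below 1 yields faster speech and a ratio above 1 yields slower speech.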
The supported datasets are
For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.
Run
python3 prepare_align.py --dataset DATASET
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
Unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
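Before preprocessing, it can help to confirm that the alignments landed in the expected directory. The sketch below counts alignment files under a given root. It assumes the files carry the .TextGrid extension and may sit in subdirectories; the helper count_textgrids is hypothetical and not part of the repository.

```python
from pathlib import Path


def count_textgrids(root):
    """Count *.TextGrid files under root, searching subdirectories as well.

    Returns 0 when root does not exist or is not a directory.
    """
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("*.TextGrid"))


# Example call; replace VCTK with your DATASET name.
print(count_textgrids("preprocessed_data/VCTK/TextGrid"))
```

A result of 0 suggests the archive was unzipped into the wrong location or has not been extracted yet.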
After that, run the preprocessing script by
python3 preprocess.py --dataset DATASET
Train your model with
python3 train.py --dataset DATASET
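The preparation, preprocessing, and training commands can also be chained from a script. This is a minimal sketch that only builds the argument lists, using the script names and the --dataset flag shown in this document. The helper pipeline_commands is hypothetical, and the alignment files still have to be in place before the preprocessing step runs.

```python
def pipeline_commands(dataset):
    """Return the three documented commands, in order, as argument lists."""
    return [
        ["python3", "prepare_align.py", "--dataset", dataset],
        # TextGrid alignments must be available before this step.
        ["python3", "preprocess.py", "--dataset", dataset],
        ["python3", "train.py", "--dataset", dataset],
    ]


for command in pipeline_commands("VCTK"):
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) to execute the step and stop on the first failure.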
Use
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
A batch size of 16 is used instead of 48 due to memory issues.
'none' and 'DeepSpeaker').
@misc{lee2021daft_exprt,
author = {Lee, Keon},
title = {Daft-Exprt},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/keonlee9420/Daft-Exprt}}
}