PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis.
You can install the Python dependencies with
pip3 install -r requirements.txt
You have to download the pretrained models and put them in output/ckpt/LJSpeech/
.
For English single-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 600000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
The generated utterances will be put in output/result/
.
Batch inference is also supported, try
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 600000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
to synthesize all utterances in preprocessed_data/LJSpeech/val.txt
The pitch/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the pitch by 20 % by
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 600000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --pitch_control 0.8
The supported datasets are
First, run
python3 prepare_align.py config/LJSpeech/preprocess.yaml
for some preparations.
As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Alignments for the LJSpeech datasets are provided here.
You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/
.
After that, run the preprocessing script by
python3 preprocess.py config/LJSpeech/preprocess.yaml
Alternately, you can align the corpus by yourself. Download the official MFA package and run
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
or
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
to align the corpus and then run the preprocessing script.
python3 preprocess.py config/LJSpeech/preprocess.yaml
Train your model with
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
Use
tensorboard --logdir output/log/LJSpeech
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.
normalization
to False
in ./config/LJSpeech/preprocess.yaml
when you need to see more wide pitch range as the paper described.@misc{lee2021fastpitchformant,
author = {Lee, Keon},
title = {FastPitchFormant},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/keonlee9420/FastPitchFormant}}
}