PyTorch implementation of Generative Adversarial Network (GAN)-based text-to-speech (TTS) synthesis and voice conversion (VC).
Audio samples are available in the Jupyter notebooks included in the repository.
- `adversarial_streams`, which specifies the streams (mgc, lf0, vuv, bap) used to compute the adversarial loss, is a parameter that is very sensitive for speech quality. Computing the adversarial loss on mgc features (except for the first few dimensions) seems to work well.
- If `mask_nth_mgc_for_adv_loss` > 0, the first `mask_nth_mgc_for_adv_loss` dimensions of mgc are ignored when computing the adversarial loss. As described in [saito2017asja], I confirmed that using the 0-th (and 1-st) mgc coefficients for the adversarial loss degrades speech quality. In my experience, `mask_nth_mgc_for_adv_loss` = 1 for mgc order 25 and `mask_nth_mgc_for_adv_loss` = 2 for mgc order 59 work well (see the sketches after this list).
- Set `f0_interpolation_kind` to "slinear" if you want first-order spline interpolation, which is the same as Merlin's default.
- Set `use_harvest` to True if you want to use the Harvest F0 estimation algorithm. If False, Dio and StoneMask are used to estimate/refine F0.
- If you get `cuda runtime error (2) : out of memory`, try a smaller batch size. See https://github.com/r9y9/gantts/issues/3.
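To make the masking concrete, here is a minimal sketch of dropping the first N mgc dimensions before the adversarial loss; the tensor names and shapes are my assumptions, not the repository's actual code:

```python
import torch

def adversarial_input(mgc, mask_nth_mgc_for_adv_loss=2):
    # mgc: (batch, time, order) mel-generalized cepstrum features.
    # Ignore the first N dimensions (e.g., the 0th coefficient, which
    # correlates with power) when computing the adversarial loss.
    return mgc[:, :, mask_nth_mgc_for_adv_loss:]

mgc = torch.randn(8, 100, 59)        # mgc order 59
print(adversarial_input(mgc).shape)  # torch.Size([8, 100, 57])
```

And a sketch of the two F0 options and the "slinear" interpolation, assuming WORLD's Python bindings (pyworld) and SciPy; this is illustrative, not the exact feature extraction code:

```python
import numpy as np
import pyworld
from scipy.interpolate import interp1d

fs = 16000
x = np.random.randn(fs).astype(np.float64)  # stand-in for a real waveform

# use_harvest=True: Harvest. Otherwise: Dio, refined by StoneMask.
f0_h, t_h = pyworld.harvest(x, fs)
f0_d, t_d = pyworld.dio(x, fs)
f0_d = pyworld.stonemask(x, f0_d, t_d, fs)

# f0_interpolation_kind="slinear": first-order spline interpolation over
# voiced frames, filling unvoiced (f0 == 0) regions with continuous F0.
f0, t = f0_h, t_h
voiced = f0 > 0
if voiced.sum() >= 2:
    interp = interp1d(t[voiced], f0[voiced], kind="slinear",
                      bounds_error=False,
                      fill_value=(f0[voiced][0], f0[voiced][-1]))
    continuous_f0 = interp(t)
```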
Though I haven't gotten improvements over Saito's approach [1] yet, the GAN-based models described in [2] can be obtained with the following configuration:

- Set `generator_add_noise` to True. This enables the generator to take Gaussian noise as input; linguistic features are concatenated with the noise vector (see the sketch after this list).
- Set `discriminator_linguistic_condition` to True. The discriminator then uses linguistic features as a condition.
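As an illustration of `generator_add_noise`, here is a minimal sketch of feeding the generator linguistic features concatenated with a Gaussian noise vector; the dimensions and variable names are assumptions for illustration:

```python
import torch

batch, time = 8, 100
linguistic_dim, noise_dim = 425, 200  # hypothetical feature sizes

linguistic = torch.randn(batch, time, linguistic_dim)
z = torch.randn(batch, time, noise_dim)  # Gaussian noise input

# The generator consumes linguistic features concatenated with noise.
generator_input = torch.cat([linguistic, z], dim=-1)
print(generator_input.shape)  # torch.Size([8, 100, 625])
```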
Hyper parameters are managed with `tf.contrib.training.HParams` (TensorFlow is required just for this). Please install PyTorch, TensorFlow and SRU (if needed) first. Once you have those, then
git clone --recursive https://github.com/r9y9/gantts && cd gantts
pip install -e ".[train]"
should install all other dependencies.
Models are trained by `train.py`. Feature extraction scripts are written for the CMU ARCTIC dataset, but can be easily adapted to other datasets.
`vc_demo.sh` is a `clb` to `slt` voice conversion demo script. Before running the script, please download wav files for `clb` and `slt` from CMU ARCTIC and check that you have all data in a directory as follows:
> tree ~/data/cmu_arctic/ -d -L 1
/home/ryuichi/data/cmu_arctic/
├── cmu_us_awb_arctic
├── cmu_us_bdl_arctic
├── cmu_us_clb_arctic
├── cmu_us_jmk_arctic
├── cmu_us_ksp_arctic
├── cmu_us_rms_arctic
└── cmu_us_slt_arctic
Once you have downloaded the datasets, run:
./vc_demo.sh ${experimental_id} ${your_cmu_arctic_data_root}
e.g.,
./vc_demo.sh vc_gan_test ~/data/cmu_arctic/
Model checkpoints will be saved to `./checkpoints/${experimental_id}` and audio samples to `./generated/${experimental_id}`.
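If you want to inspect a saved model afterwards, the standard PyTorch loading pattern applies; the checkpoint file name below is hypothetical, since actual names depend on the training run:

```python
import torch

# Hypothetical path; look inside ./checkpoints/${experimental_id} for real names.
state = torch.load("checkpoints/vc_gan_test/checkpoint_G.pth", map_location="cpu")
print(type(state))  # typically a state dict, or a dict containing one
```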
`tts_demo.sh` is a self-contained TTS demo script. The usage is:
./tts_demo.sh ${experimental_id}
This will download `slt_arctic_full_data` used in Merlin's demo, perform feature extraction, train models and synthesize audio samples for the eval/test set. `${experimental_id}` can be an arbitrary string, for example,
./tts_demo.sh tts_test
Model checkpoints will be saved to `./checkpoints/${experimental_id}` and audio samples to `./generated/${experimental_id}`.
See `hparams.py`.
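Since hyper parameters are managed with `tf.contrib.training.HParams`, they can also be set programmatically; this is a sketch using the parameter names mentioned above, not the actual defaults from `hparams.py`:

```python
from tensorflow.contrib.training import HParams

# Illustrative values only; the real defaults live in hparams.py.
hparams = HParams(
    mask_nth_mgc_for_adv_loss=2,      # ignore first 2 mgc dims in adv. loss
    f0_interpolation_kind="slinear",  # first-order spline (Merlin's default)
    use_harvest=True,                 # Harvest F0; False -> Dio + StoneMask
    generator_add_noise=False,        # True for the models of [2]
    discriminator_linguistic_condition=False,
)
hparams.parse("use_harvest=False,mask_nth_mgc_for_adv_loss=1")
print(hparams.values())
```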
Training progress can be monitored with TensorBoard:

tensorboard --logdir=log
The repository doesn't try to reproduce the exact results reported in the papers because 1) the data is not publicly available and 2) hyper parameters depend highly on the data. Instead, I tried the same ideas on different data with different hyper parameters.