Implementation (PyTorch) of Google Brain's high-fidelity WaveGrad vocoder (paper). First implementation on GitHub with high-quality generation in 6 iterations.
(Generated audio samples are located in the `generated_samples` folder.) Number of parameters: 15,810,401
| Model | Stable | RTX 2080 Ti | Tesla K80 | Intel Xeon 2.3GHz* |
|---|---|---|---|---|
| 1000 iterations | + | 9.59 | - | - |
| 100 iterations | + | 0.94 | 5.85 | - |
| 50 iterations | + | 0.45 | 2.92 | - |
| 25 iterations | + | 0.22 | 1.45 | - |
| 12 iterations | + | 0.10 | 0.69 | 4.55 |
| 6 iterations | + | 0.04 | 0.33 | 2.09 |
*Note: an old version of the Intel Xeon CPU was used.
WaveGrad is a conditional model for waveform generation that estimates gradients of the data density, with sampling quality comparable to WaveNet. This vocoder is neither a GAN, nor a normalizing flow, nor a classical autoregressive model. It is based on Denoising Diffusion Probabilistic Models (DDPM), which utilize the Langevin dynamics and score matching frameworks. Furthermore, compared to a classic DDPM, WaveGrad converges extremely fast (6 iterations, and possibly fewer) with respect to the Langevin-dynamics iterative sampling scheme.
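The iterative refinement idea can be sketched as a DDPM-style sampling loop (a minimal, hypothetical illustration, not the repository's actual inference code; the `hop` value and the conditioning of the model on the square root of the cumulative noise level are assumptions):

```python
import torch

def ddpm_sample(eps_model, mel, betas, hop=256):
    """Hedged sketch of WaveGrad-style iterative refinement.

    eps_model: predicts the noise in the waveform, conditioned on the
               mel-spectrogram and a continuous noise level sqrt(alpha_cum).
    mel:       (batch, n_mels, frames) conditioning features.
    betas:     noise schedule; len(betas) == number of refinement iterations.
    hop:       assumed STFT hop size (must match the training config).
    """
    alphas = 1.0 - betas
    alphas_cum = torch.cumprod(alphas, dim=0)
    # Start from pure Gaussian noise of the target waveform length.
    y = torch.randn(mel.shape[0], mel.shape[-1] * hop)
    for n in reversed(range(len(betas))):
        eps = eps_model(y, mel, alphas_cum[n].sqrt())
        # Posterior mean of the reverse (denoising) step.
        y = (y - betas[n] / (1.0 - alphas_cum[n]).sqrt() * eps) / alphas[n].sqrt()
        if n > 0:  # add noise on all but the final step
            sigma = ((1.0 - alphas_cum[n - 1]) / (1.0 - alphas_cum[n]) * betas[n]).sqrt()
            y = y + sigma * torch.randn_like(y)
    return y
```

Because the loop length equals `len(betas)`, the whole speed/quality trade-off reduces to choosing a short, well-tuned noise schedule.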
```
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
pip install -r requirements.txt
```
Make filelists of your data and place them in the `filelists` folder. Set up your training configuration in the `configs` folder. *Note: if you are going to change `hop_length` for STFT, make sure that the product of the upsampling factors in your config is equal to your new `hop_length`.
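This constraint is easy to check up front (the factor values below are hypothetical; use the ones from your own config):

```python
import math

hop_length = 256            # assumed STFT hop size from the config
factors = [4, 4, 4, 2, 2]   # hypothetical upsampling factors from the config
# The model upsamples mel frames by the product of these factors,
# so it must equal hop_length exactly.
assert math.prod(factors) == hop_length
```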
Open the `runs/train.sh` script and specify visible GPU devices and the path to your configuration file. If you specify more than one GPU, training will run in distributed mode. Then launch training:

```
sh runs/train.sh
```
To track your training process, run TensorBoard with `tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER`. All logging information and checkpoints will be stored in `logs/YOUR_LOGDIR_FOLDER`. `logdir` is specified in the config file.
Once the model is trained, run a grid search for the best noise schedule* for the required number of iterations in `notebooks/inference.ipynb`. The code supports parallelism, so you can specify more than one job to accelerate the search.
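The schedule search can be sketched as follows (a simplified, hypothetical illustration; the notebook's actual search and scoring metric are more involved):

```python
from itertools import product

def grid_search_schedule(candidates_per_step, score_fn):
    """Try every combination of per-iteration beta candidates and keep the
    schedule with the lowest score (the scoring function, e.g. a spectral
    distance to reference audio, is an assumption here)."""
    best_schedule, best_score = None, float("inf")
    for schedule in product(*candidates_per_step):
        score = score_fn(list(schedule))
        if score < best_score:
            best_schedule, best_score = list(schedule), score
    return best_schedule, best_score
```

Each candidate schedule is scored independently, which is why the search parallelizes well across jobs.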
*Note: grid search is necessary only for a small number of iterations (such as 6 or 7). For a larger number, just try the Fibonacci-sequence initialization `benchmark.fibonacci(...)`: it was used for 25 iterations and works well. From a good 25-iteration schedule, for example, you can build a higher-order schedule by duplicating elements.
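A Fibonacci-style initialization might look like this (a hypothetical sketch of the idea behind `benchmark.fibonacci(...)`; the starting values and scale factor are assumptions):

```python
def fibonacci_schedule(n_iter, scale=1e-6):
    # Hypothetical sketch: the first n_iter Fibonacci numbers, scaled down
    # to serve as a monotonically non-decreasing beta (noise) schedule.
    betas = [1.0, 1.0]
    while len(betas) < n_iter:
        betas.append(betas[-1] + betas[-2])
    return [b * scale for b in betas[:n_iter]]
```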
Put your mel-spectrograms in some folder and make a filelist. Then run this command with your own arguments:
```
sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"
```
More inference details are provided in `notebooks/inference.ipynb`. There you can also find how to set a noise schedule for the model and run a grid search for the best scheme.
Examples of generated audio are provided in the `generated_samples` folder. Quality degradation between 1000-iteration and 6-iteration inference is not noticeable if the best schedule is found for the latter.
You can find a pretrained checkpoint file* on LJSpeech (22kHz) via this Google Drive link.

*Note: the uploaded checkpoint is a `dict` with a single key `'model'`.