Unofficial PyTorch Implementation of UnivNet Vocoder (https://arxiv.org/abs/2106.07889)
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
This is an unofficial PyTorch implementation of UnivNet by Jang et al. (Kakao).
Audio samples are uploaded!
Both UnivNet-c16 and c32 results and the pre-trained weights have been uploaded.
For both models, our implementation matches the objective scores (PESQ and RMSE) of the original paper.
According to the paper's authors, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and outperformed HiFi-GAN in a subjective evaluation. They also report that its inference is 1.5 times faster than HiFi-GAN's.
This repository uses the same mel-spectrogram function as the Official HiFi-GAN, which is compatible with NVIDIA/tacotron2.
Our default mel calculation hyperparameters, listed below, follow the original paper.
audio:
n_mel_channels: 100
filter_length: 1024
hop_length: 256 # WARNING: this can't be changed.
win_length: 1024
sampling_rate: 24000
mel_fmin: 0.0
mel_fmax: 12000.0
You can modify the hyperparameters to be compatible with your acoustic model.
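When matching these settings to an acoustic model, it can help to look at the quantities they imply and to compare the two configurations key by key. The sketch below is illustrative only: `AUDIO` mirrors the defaults above, while `derived_quantities` and `mismatched_keys` are hypothetical helper names that are not part of this repository.

```python
# Illustrative only: these helpers are not part of the repository.
AUDIO = {
    "n_mel_channels": 100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 24000,
    "mel_fmin": 0.0,
    "mel_fmax": 12000.0,
}


def derived_quantities(audio):
    """Return timing values implied by the mel hyperparameters."""
    sr = audio["sampling_rate"]
    return {
        "frames_per_second": sr / audio["hop_length"],
        "hop_ms": 1000.0 * audio["hop_length"] / sr,
        "window_ms": 1000.0 * audio["win_length"] / sr,
        "nyquist_hz": sr / 2,
    }


def mismatched_keys(vocoder_audio, acoustic_audio):
    """List keys whose values differ between the vocoder and an acoustic model."""
    return sorted(
        key for key in vocoder_audio
        if acoustic_audio.get(key) != vocoder_audio[key]
    )


if __name__ == "__main__":
    for name, value in derived_quantities(AUDIO).items():
        print(f"{name}: {value:.3f}")
    # Hypothetical acoustic model trained on 22.05 kHz audio with 80 mel bins.
    other = dict(AUDIO, sampling_rate=22050, n_mel_channels=80, mel_fmax=8000.0)
    print(mismatched_keys(AUDIO, other))
```

With the defaults, each mel frame advances by 256 samples (93.75 frames per second), and mel_fmax coincides with the Nyquist frequency of 24 kHz audio.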
The implementation requires the dependencies listed in requirements.txt. Install them with:
pip install -r requirements.txt
Preparing Data
datasets/LibriTTS/train-clean-360
Note: The mel-spectrograms calculated from the audio files are saved as **.mel the first time they are needed and are loaded from disk afterwards.
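The compute-once, load-afterwards behaviour described in the note can be sketched as a small file cache. This is a minimal illustration under stated assumptions, not the repository's code: `load_or_compute_mel` is a hypothetical name, the cache file is assumed to sit next to the wav file with the same stem, and `pickle` is only a stand-in for whatever serialization the repository actually uses.

```python
import os
import pickle


def load_or_compute_mel(wav_path, compute_mel, suffix=".mel"):
    """Return cached features for wav_path, computing them on first use.

    compute_mel is any callable that takes a wav path and returns the
    features. pickle is a stand-in serializer chosen for this sketch.
    """
    # Assumption: the cache file shares the wav file's directory and stem.
    cache_path = os.path.splitext(wav_path)[0] + suffix
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    mel = compute_mel(wav_path)
    with open(cache_path, "wb") as f:
        pickle.dump(mel, f)
    return mel
```

A cache keyed only by file path, like this one, does not notice when the mel hyperparameters change, so previously saved files would need to be deleted after editing the audio settings.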
Preparing Metadata
Following the format from NVIDIA/tacotron2, the metadata should be formatted as:
path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...
Train/validation metadata for the LibriTTS train-clean-360 split are already prepared in datasets/metadata.
5% of the train-clean-360 utterances were randomly sampled for validation.
Since this model is a vocoder, the transcripts are NOT used during training.
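To build metadata for another dataset, the format and the 5% hold-out described above can be sketched as follows. The names `Utterance`, `parse_metadata`, and `split_train_val` are hypothetical, and the fixed seed is an assumption made for reproducibility; how the provided split was actually sampled is not specified beyond being random.

```python
import random
from typing import List, NamedTuple, Tuple


class Utterance(NamedTuple):
    wav_path: str
    transcript: str
    speaker_id: str


def parse_metadata(lines) -> List[Utterance]:
    """Parse `path_to_wav|transcript|speaker_id` lines, skipping blank ones."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a '|' inside the transcript is tolerated.
        wav_path, rest = line.split("|", 1)
        transcript, speaker_id = rest.rsplit("|", 1)
        entries.append(Utterance(wav_path, transcript, speaker_id))
    return entries


def split_train_val(entries, val_fraction=0.05, seed=0) -> Tuple[list, list]:
    """Randomly hold out a fraction of the entries for validation."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Because the transcripts are not used during training, only the `wav_path` field matters to the vocoder; the other fields are kept so the files stay compatible with the NVIDIA/tacotron2 format.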
Preparing Configuration Files
Run cp config/default_c32.yaml config/config.yaml and then edit config.yaml.
Write the root paths of the train and validation data in the data section. The data loader parses the list of files within each path recursively.
data:
train_dir: 'datasets/' # root path of train data (either a relative or an absolute path is ok)
train_meta: 'metadata/libritts_train_clean_360_train.txt' # relative path of metadata file from train_dir
val_dir: 'datasets/' # root path of validation data
val_meta: 'metadata/libritts_train_clean_360_val.txt' # relative path of metadata file from val_dir
We provide the default metadata for the LibriTTS train-clean-360 split.
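Since each metadata path is given relative to its root directory, a quick sanity check before training can catch path mistakes early. The sketch below is not part of the repository: `resolve_meta_path` and `missing_wavs` are hypothetical helpers, and the check assumes that the wav paths inside the metadata file are relative to the same root directory.

```python
import os


def resolve_meta_path(data_cfg, split):
    """Join `<split>_dir` and `<split>_meta` into one metadata file path."""
    root = data_cfg[f"{split}_dir"]
    return os.path.normpath(os.path.join(root, data_cfg[f"{split}_meta"]))


def missing_wavs(meta_path, root):
    """Return wav paths listed in the metadata that do not exist under root.

    Assumption: wav paths in the metadata are relative to root.
    """
    missing = []
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            wav_rel = line.split("|", 1)[0]
            if not os.path.isfile(os.path.join(root, wav_rel)):
                missing.append(wav_rel)
    return missing


if __name__ == "__main__":
    data_cfg = {
        "train_dir": "datasets/",
        "train_meta": "metadata/libritts_train_clean_360_train.txt",
    }
    print(resolve_meta_path(data_cfg, "train"))
```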
Modify channel_size in gen to switch between UnivNet-c16 and c32.
gen:
noise_dim: 64
channel_size: 32 # 32 or 16
dilations: [1, 3, 9, 27]
strides: [8, 8, 4]
lReLU_slope: 0.2
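Note that the product of the default strides, 8 × 8 × 4 = 256, equals the default hop_length. Assuming the generator upsamples each mel frame by the product of its strides, as is typical for this kind of vocoder, this is a likely reason why hop_length cannot be changed on its own. The following sketch checks that relationship; the function names are hypothetical and not part of the repository.

```python
from math import prod

HOP_LENGTH = 256     # audio.hop_length from the default config
STRIDES = [8, 8, 4]  # gen.strides from the default config


def total_upsampling(strides):
    """Overall upsampling factor, assuming the strides multiply together."""
    return prod(strides)


def expected_num_samples(num_mel_frames, strides=STRIDES):
    """Waveform length produced from a given number of mel frames."""
    return num_mel_frames * total_upsampling(strides)


if __name__ == "__main__":
    factor = total_upsampling(STRIDES)
    if factor != HOP_LENGTH:
        raise ValueError(
            f"strides multiply to {factor}, but hop_length is {HOP_LENGTH}"
        )
    print(expected_num_samples(100))
```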
Training
python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN
Tensorboard
tensorboard --logdir logs/
If you are running TensorBoard on a remote machine, you can make the TensorBoard page reachable from other hosts by adding the --bind_all option.
python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH -o OUTPUT_WAV_PATH
You can download the pre-trained models from the Google Drive link below. The models were trained on LibriTTS train-clean-360 split.
See audio samples at https://mindslab-ai.github.io/univnet/
We evaluated our models on the validation set.
Model | PESQ(↑) | RMSE(↓) | Model Size |
---|---|---|---|
HiFi-GAN v1 | 3.54 | 0.423 | 14.01M |
Official UnivNet-c16 | 3.59 | 0.337 | 4.00M |
Our UnivNet-c16 | 3.60 | 0.317 | 4.00M |
Official UnivNet-c32 | 3.70 | 0.316 | 14.86M |
Our UnivNet-c32 | 3.68 | 0.304 | 14.87M |
The loss graphs of UnivNet are listed below.
The orange and blue graphs indicate c16 and c32, respectively.
Implementation authors are:
Contributors are:
Special thanks to
This code is licensed under the BSD 3-Clause License.
We referred to the following code and repositories.
Papers
Datasets