
VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim

SK Telecom, South Korea

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Demo: https://vits-2.github.io/demo/

Paper: https://arxiv.org/abs/2307.16430

Unofficial implementation of VITS2. This is a work in progress. Please refer to TODO for more details.

[Figures: Duration Predictor, Normalizing Flows, Text Encoder]

Audio Samples

[In progress]

Audio sample after 52,000 steps of training on 1 GPU for LJSpeech dataset: https://github.com/daniilrobnikov/vits2/assets/91742765/d769c77a-bd92-4732-96e7-ab53bf50d783

Installation:

Clone the repo

git clone git@github.com:daniilrobnikov/vits2.git
cd vits2

Setting up the conda env

This is assuming you have navigated to the vits2 root after cloning it.

NOTE: This is tested under Python 3.11 with a conda env. For other Python versions, you might encounter version conflicts.

Requires PyTorch 2.0. Please refer to requirements.txt.

# install required packages (for pytorch 2.0)
conda create -n vits2 python=3.11
conda activate vits2
pip install -r requirements.txt

conda env config vars set PYTHONPATH="/path/to/vits2"
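After activating the env, a quick sanity check can catch the most common setup problems. The following is a minimal sketch, not part of the repo: the helper name check_environment is hypothetical, and it only checks the Python version, whether the repo root is on PYTHONPATH, and whether the listed modules can be found.

```python
import importlib.util
import os
import sys


def check_environment(repo_root, required_python=(3, 11), modules=("torch",)):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only, not shipped with the repo.
    """
    problems = []

    # The README states the project is tested under Python 3.11.
    found = tuple(sys.version_info[:2])
    if found != tuple(required_python):
        problems.append(
            f"expected Python {required_python[0]}.{required_python[1]}, "
            f"found {found[0]}.{found[1]}"
        )

    # PYTHONPATH should contain the repo root (set via `conda env config vars set`).
    entries = os.environ.get("PYTHONPATH", "").split(os.pathsep)
    on_path = {os.path.abspath(entry) for entry in entries if entry}
    if os.path.abspath(repo_root) not in on_path:
        problems.append(f"{repo_root} is not on PYTHONPATH")

    # find_spec locates a top-level module without importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module {name!r} is not installed")

    return problems


if __name__ == "__main__":
    for problem in check_environment("/path/to/vits2"):
        print("WARNING:", problem)
```

Note that variables set with conda env config vars only take effect after the env is reactivated, so run the check in a fresh shell.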

Download datasets

There are three options you can choose from: LJ Speech, VCTK, or a custom dataset.

LJ Speech: LJ Speech dataset. Used for single-speaker TTS.
VCTK: VCTK dataset. Used for multi-speaker TTS.
Custom dataset: You can use your own dataset. Please refer to the Custom dataset section below.

LJ Speech dataset

download and extract the LJ Speech dataset

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvf LJSpeech-1.1.tar.bz2
cd LJSpeech-1.1/wavs
# deletes a nested "wavs" entry inside LJSpeech-1.1/wavs if one exists; otherwise does nothing
rm -rf wavs

preprocess mel-spectrograms. See mel_transform.py

python preprocess/mel_transform.py --data_dir /path/to/LJSpeech-1.1 -c datasets/ljs_base/config.yaml

preprocess text. See prepare/filelists.ipynb
rename or create a link to the dataset folder.

ln -s /path/to/LJSpeech-1.1 DUMMY1
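The text preprocessing itself lives in prepare/filelists.ipynb. As a rough illustration of what a filelist step typically does, here is a minimal sketch that reads the pipe-separated metadata.csv shipped with LJ Speech and splits it into train and validation lists. The function name build_filelists, the output line format "path|text", the wavs subfolder under DUMMY1, and the split size are all assumptions made for this example; the notebook defines the format the repo actually expects.

```python
import random


def build_filelists(metadata_path, link_name="DUMMY1", audio_subdir="wavs",
                    val_size=100, seed=1234):
    """Return (train_entries, val_entries) built from an LJ Speech metadata.csv.

    Each metadata line has the form: id|raw text|normalized text.
    The output format "link/subdir/id.wav|text" is an assumption for illustration.
    """
    entries = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("|")
            # First field is the clip id, last field is the normalized transcript.
            file_id, text = parts[0], parts[-1]
            entries.append(f"{link_name}/{audio_subdir}/{file_id}.wav|{text}")

    # Shuffle deterministically so the split is reproducible across runs.
    random.Random(seed).shuffle(entries)
    return entries[val_size:], entries[:val_size]


if __name__ == "__main__":
    train, val = build_filelists("/path/to/LJSpeech-1.1/metadata.csv")
    print(len(train), "training entries,", len(val), "validation entries")
```

Paths are written relative to the DUMMY1 link so the same filelist works regardless of where the dataset is stored.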

VCTK dataset

download and extract the VCTK dataset

wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip

(optional): downsample the audio files to 22050 Hz. See audio_resample.ipynb
preprocess mel-spectrograms. See mel_transform.py

python preprocess/mel_transform.py --data_dir /path/to/VCTK-Corpus-0.92 -c datasets/vctk_base/config.yaml

preprocess text. See prepare/filelists.ipynb
rename or create a link to the dataset folder.

ln -s /path/to/VCTK-Corpus-0.92 DUMMY2
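For the optional downsampling step, the repo provides audio_resample.ipynb. The sketch below only shows the arithmetic behind a rational resampling ratio. It assumes a 48 kHz source, which you should verify against your own files, and the helper names are hypothetical.

```python
from fractions import Fraction


def resample_factors(source_rate, target_rate=22050):
    """Return the smallest integer (up, down) pair with target = source * up / down."""
    ratio = Fraction(target_rate, source_rate)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples, source_rate, target_rate=22050):
    """Number of samples at the target rate, rounded up."""
    up, down = resample_factors(source_rate, target_rate)
    return -(-n_samples * up // down)  # integer ceiling division


if __name__ == "__main__":
    up, down = resample_factors(48000)
    print(f"48000 Hz -> 22050 Hz: upsample by {up}, downsample by {down}")
    # These factors are what a polyphase resampler takes as input,
    # for example scipy.signal.resample_poly(x, up, down).
```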

Custom dataset

create a folder with wav files
duplicate the ljs_base in datasets directory and rename it to custom_base
open custom_base and change the following fields in config.yaml:

data:
  training_files: datasets/custom_base/filelists/train.txt
  validation_files: datasets/custom_base/filelists/val.txt
  text_cleaners: # See text/cleaners.py
    - phonemize_text
    - tokenize_text
    - add_bos_eos
  cleaned_text: true # True if you ran step 6.
  language: en-us # language of your dataset. See espeak-ng
  sample_rate: 22050 # sample rate, based on your dataset
  ...
  n_speakers: 0 # 0 for single speaker, > 0 for multi-speaker

preprocess mel-spectrograms. See mel_transform.py

python preprocess/mel_transform.py --data_dir /path/to/custom_dataset -c datasets/custom_base/config.yaml

preprocess text. See prepare/filelists.ipynb

NOTE: You may need to install espeak-ng if you want to use the phonemize_text cleaner. Please refer to espeak-ng.

rename or create a link to the dataset folder.

ln -s /path/to/custom_dataset DUMMY3
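The sample_rate field in config.yaml has to match your audio. A minimal sketch for checking this with the Python standard library is shown below. The function name count_sample_rates is hypothetical, and the wave module only reads uncompressed PCM WAV files, so other encodings will raise wave.Error.

```python
import wave
from collections import Counter
from pathlib import Path


def count_sample_rates(wav_dir):
    """Count how many .wav files under wav_dir use each sample rate."""
    rates = Counter()
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        # wave.open reads only the header here; no audio data is loaded.
        with wave.open(str(path), "rb") as wav_file:
            rates[wav_file.getframerate()] += 1
    return rates


if __name__ == "__main__":
    for rate, count in sorted(count_sample_rates("/path/to/custom_dataset").items()):
        print(f"{rate} Hz: {count} files")
```

If more than one rate is reported, resample the files to a single rate before preprocessing mel-spectrograms.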

Training Examples

# LJ Speech
python train.py -c datasets/ljs_base/config.yaml -m ljs_base

# VCTK
python train_ms.py -c datasets/vctk_base/config.yaml -m vctk_base

# Custom dataset (multi-speaker)
python train_ms.py -c datasets/custom_base/config.yaml -m custom_base
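The examples above use train.py for the single-speaker dataset and train_ms.py for the multi-speaker ones. Assuming the choice follows the n_speakers field in config.yaml (0 for single speaker, greater than 0 for multi-speaker), the command can be built like this. The helper training_command is hypothetical and is not part of the repo.

```python
import shlex


def training_command(config_path, model_name, n_speakers):
    """Build the training command as an argument list.

    Assumption: n_speakers == 0 maps to train.py and n_speakers > 0 to train_ms.py,
    inferred from the README examples rather than confirmed by the code.
    """
    if n_speakers < 0:
        raise ValueError("n_speakers must be 0 or greater")
    script = "train_ms.py" if n_speakers > 0 else "train.py"
    return ["python", script, "-c", config_path, "-m", model_name]


if __name__ == "__main__":
    cmd = training_command("datasets/custom_base/config.yaml", "custom_base", 0)
    print(shlex.join(cmd))
```

The argument list can be passed directly to subprocess.run if you want to launch training from a script.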

Inference Examples

See inference.ipynb and inference_batch.ipynb

Pretrained Models

[In progress]

Todo

model (vits2)
- update TextEncoder to support speaker conditioning
- support for high-resolution mel-spectrograms in training. See mel_transform.py
- Monotonic Alignment Search with Gaussian noise
- Normalizing Flows using Transformer Block
- Stochastic Duration Predictor with Time Step-wise Conditional Discriminator
model (YourTTS)
- Language Conditioning
- Speaker Encoder
model (NaturalSpeech)
- KL Divergence Loss after Prior Enhancing
- GAN loss for e2e training
other
- support for batch inference
- special tokens in tokenizer
- test numba.jit and numba.cuda.jit implementations of MAS. See monotonic_align.py
- KL Divergence Loss between TextEncoder and Projection
- support for streaming inference. Please refer to vits_chinese
- use optuna for hyperparameter tuning
future work
- update model to vits2. Please refer to VITS2
- update model to YourTTS with zero-shot learning. See YourTTS
- update model to NaturalSpeech. Please refer to NaturalSpeech
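
The list above mentions Monotonic Alignment Search (MAS) several times. For readers unfamiliar with it, here is a minimal pure-Python sketch of the basic dynamic program, which assigns every mel frame to a text token so that the token index never decreases and never skips. It is an illustration only: it is not the repo's monotonic_align.py, it does not include the Gaussian noise variant, and the function name and list-of-lists input are assumptions made for readability.

```python
import math


def monotonic_alignment_search(log_p):
    """Return, for each frame, the index of the token it is aligned to.

    log_p[i][j] is the log-likelihood of frame j under token i.
    The path starts at token 0, ends at the last token, and at each frame
    either stays on the same token or advances by exactly one.
    """
    n_tokens, n_frames = len(log_p), len(log_p[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # q[i][j] = best cumulative score of any valid path ending at (token i, frame j).
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = log_p[i][j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


if __name__ == "__main__":
    scores = [[0, 0, -5, -5],
              [-5, -5, 0, 0]]
    print(monotonic_alignment_search(scores))
```

The nested Python loops are slow for real utterances, which is presumably why the list includes testing numba.jit and numba.cuda.jit implementations.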

Acknowledgements

This is an unofficial repo based on VITS2.
g2p for multiple languages is based on phonemizer.
We also thank ChatGPT for providing writing assistance.

References

VITS2

Open Source Agenda is not affiliated with "Vits2" Project. README Source: daniilrobnikov/vits2

Repository

daniilrobnikov/vits2

License

MIT

Homepage

https://vits-2.github.io/demo/
