Diffusion Singing Voice Conversion based on Grad-TTS from Huawei
This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs substantially from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so this project uses the features from whisper-vits-svc. By the way, Diff-VC, a diffusion-based any-to-any voice conversion model, is a follow-up to Grad-TTS.
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
The framework of grad-svc-v1
The framework of grad-svc-v2 & v3 (encoder: 768->512, diffusion: 64->96)
https://github.com/PlayVoice/Grad-SVC/assets/16432329/f9b66af7-b5b5-4efb-b73d-adb0dc84a0ae
Beautiful code from Grad-TTS, easy to read
Multi-speaker, based on a speaker encoder
No speaker leakage, thanks to perturbation, instance normalization, and GRL
No electronic sound
Integrated DPM-Solver-k for fewer sampling steps
Integrated Fast Maximum Likelihood Sampling Scheme, also for fewer steps
Conditional Flow Matching (V3), first used in SVC
Rectified Flow Matching (TODO)
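The Conditional Flow Matching objective can be sketched as a simple regression loss. The snippet below is a generic OT-path CFM loss in NumPy with a placeholder model, not the project's actual implementation; function names, shapes, and `sigma_min` are illustrative.

```python
import numpy as np

def cfm_loss(model, x1, sigma_min=1e-4, rng=None):
    """Generic conditional flow matching loss over an OT path.

    x1: a batch of clean targets (e.g. mel frames), shape (batch, dim).
    model(x_t, t): predicts the velocity field at time t.
    Illustrative sketch; not Grad-SVC's exact formulation.
    """
    rng = rng or np.random.default_rng()
    t = rng.uniform(size=(x1.shape[0], 1))     # random times in [0, 1]
    x0 = rng.standard_normal(x1.shape)         # Gaussian noise sample
    # Point on the probability path between noise and data
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the model should learn to predict at (x_t, t)
    v_target = x1 - (1.0 - sigma_min) * x0
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

At inference time, the learned velocity field is integrated from noise to data with an ODE solver, which is why so few steps are needed compared with score-based diffusion.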
Install project dependencies:
pip install -r requirements.txt
Download the timbre encoder (Speaker-Encoder by @mueller91), and put best_model.pth.tar into speaker_pretrain/.
Download the hubert_soft model, and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
Download the pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.
Performance bottleneck: the Generator and Discriminator together are 116 MB, while the Generator alone is only 22 MB.
Download the pretrained model gvc.pretrain.pth, and put it into grad_pretrain/.
python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
For this pretrained model, temperature is set to temperature=1.015 in gvc_inference.py to get good results.
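In Grad-TTS-style samplers, the temperature typically divides the Gaussian noise used as the diffusion starting point, so values slightly above 1.0 mildly shrink the variance of the terminal condition. The sketch below illustrates this effect; the names are illustrative and not the project's exact API.

```python
import numpy as np

def sample_prior(mu, temperature=1.015, rng=None):
    """Draw the diffusion starting point around the encoder output mu.

    Dividing the noise by the temperature (as Grad-TTS-style samplers
    do) slightly reduces the variance of the starting point.
    Illustrative sketch, not the project's exact code.
    """
    rng = rng or np.random.default_rng()
    return mu + rng.standard_normal(mu.shape) / temperature

mu = np.zeros((80, 200))  # e.g. an 80-bin mel-shaped prior mean
z = sample_prior(mu, temperature=1.015, rng=np.random.default_rng(0))
print(z.std())  # slightly below 1.0
```

Raising the temperature trades a little diversity for smoother, more conservative output, which is why the value is worth tuning per model.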
Put the dataset into the data_raw directory, following the structure below.
data_raw
├── speaker0
│   ├── 000001.wav
│   ├── ...
│   └── 000xxx.wav
└── speaker1
    ├── 000001.wav
    ├── ...
    └── 000xxx.wav
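A quick way to confirm the dataset matches the layout above is to scan it before preprocessing. This checker is purely illustrative (the preprocessing scripts do their own scanning):

```python
import os

def check_data_raw(root="data_raw"):
    """Verify the data_raw/<speaker>/<utterance>.wav layout.

    Returns a list of problems; an empty list means the layout looks
    correct. Illustrative helper, not part of the project.
    """
    problems = []
    for speaker in sorted(os.listdir(root)):
        spk_dir = os.path.join(root, speaker)
        if not os.path.isdir(spk_dir):
            problems.append(f"{speaker}: expected a directory")
            continue
        wavs = [f for f in os.listdir(spk_dir) if f.endswith(".wav")]
        if not wavs:
            problems.append(f"{speaker}: no .wav files found")
    return problems
```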
After preprocessing, you will get an output with the following structure.
data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
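Each utterance ends up with parallel feature files that share a stem across the directories above. A minimal sketch of gathering them back together (paths follow the tree above; the actual trainer's loading code may differ):

```python
import os
import numpy as np

def load_utterance(root, speaker, stem):
    """Collect the preprocessed features for one utterance, following
    the data_gvc layout. Illustrative sketch only.
    """
    return {
        "pitch":  np.load(os.path.join(root, "pitch",   speaker, stem + ".pit.npy")),
        "hubert": np.load(os.path.join(root, "hubert",  speaker, stem + ".vec.npy")),
        "spk":    np.load(os.path.join(root, "speaker", speaker, stem + ".spk.npy")),
        # The mel files are torch tensors (.mel.pt); load them with torch:
        # "mel": torch.load(os.path.join(root, "mel", speaker, stem + ".mel.pt")),
    }
```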
Re-sample the audio to 16 kHz, output to ./data_gvc/waves-16k:
python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
Re-sample the audio to 32 kHz, output to ./data_gvc/waves-32k:
python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
Extract pitch:
python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
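Pitch extraction produces a per-frame F0 track for each utterance. As a rough illustration of the idea (not the extractor preprocess_f0.py actually uses), a single frame's F0 can be estimated by autocorrelation:

```python
import numpy as np

def estimate_f0(frame, sr=16000, fmin=50.0, fmax=1100.0):
    """Crude autocorrelation pitch estimate for one analysis frame.

    Illustrative only; real extractors add voicing decisions,
    smoothing, and more robust peak picking.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(int(0.05 * sr)) / sr            # one 50 ms frame
frame = np.sin(2 * np.pi * 220.0 * t)         # 220 Hz test tone
print(estimate_f0(frame, sr))                 # close to 220
```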
Extract mel spectrograms:
python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
Extract hubert content features:
python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
Extract per-utterance speaker embeddings:
python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
Average the per-utterance embeddings into one embedding per singer:
python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
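Judging by its name and its inputs/outputs, preprocess_speaker_ave.py collapses the per-utterance speaker embeddings into a single vector per singer. A minimal sketch of that averaging step (illustrative, not the script's actual code):

```python
import os
import numpy as np

def average_speaker(spk_dir):
    """Average all per-utterance .spk.npy embeddings in one speaker's
    directory into a single singer embedding. Illustrative sketch.
    """
    embs = [np.load(os.path.join(spk_dir, f))
            for f in sorted(os.listdir(spk_dir)) if f.endswith(".spk.npy")]
    return np.mean(np.stack(embs), axis=0)
```

Averaging over many utterances smooths out per-recording variation, giving a more stable timbre target than any single utterance's embedding.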
Generate the training index:
python prepare/preprocess_train.py
Run the final preprocessing step:
python prepare/preprocess_zzz.py
Start training:
python gvc_trainer.py
Resume training from a checkpoint:
python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
View the training logs:
tensorboard --logdir logs/
Export the inference model:
python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth
Inference:
python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --temperature 1.015 --shift 0
temperature=1.015 needs to be adjusted to get good results; the recommended range is (1.001, 1.035).
Inference step by step:
Extract hubert content features:
python hubert/inference.py -w test.wav -v test.vec.npy
Extract pitch to a CSV file:
python pitch/inference.py -w test.wav -p test.csv
Run the conversion:
python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC
https://github.com/facebookresearch/speech-resynthesis
https://github.com/cantabile-kwok/VoiceFlow-TTS
https://github.com/shivammehta25/Matcha-TTS
https://github.com/shivammehta25/Diff-TTSG
https://github.com/majidAdibian77/ResGrad
https://github.com/LuChengTHU/dpm-solver
https://github.com/gmltmd789/UnitSpeech
https://github.com/zhenye234/CoMoSpeech
https://github.com/seahore/PPG-GradVC
https://github.com/thuhcsi/LightGrad
https://github.com/lmnt-com/wavegrad
https://github.com/naver-ai/facetts
https://github.com/jaywalnut310/vits
https://github.com/NVIDIA/BigVGAN
https://github.com/bshall/soft-vc
https://github.com/mozilla/TTS
https://github.com/ubisoft/ubisoft-laforge-daft-exprt
https://github.com/yl4579/StyleTTS-VC