An unofficial PyTorch implementation of VALL-E, utilizing the EnCodec encoder/decoder.
Note: Development on this is very sporadic. Gomen.

Note: Compatibility for existing models may break at any time while I feverishly try and work out the best way to crank out a model. Gomen.
Besides a working PyTorch environment, the only hard requirement is having `espeak`/`espeak-ng` installed, as it is used for phonemizing text. On Windows, you may also need to set the `PHONEMIZER_ESPEAK_LIBRARY` environment variable to specify the path to `libespeak-ng.dll`.
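If setting a system-wide environment variable is inconvenient, the same thing can be done from Python before anything imports the phonemizer backend. This is only a sketch, and the DLL path below is a hypothetical example; use wherever espeak-ng is installed on your machine:

```python
# Point the phonemizer at the espeak-ng library before it gets imported.
# The path here is only an example; adjust it to your espeak-ng install.
import os

os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
```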
Simply run `pip install git+https://git.ecker.tech/mrq/vall-e` or `pip install git+https://github.com/e-c-k-e-r/vall-e`.
I've tested this repo under Python versions 3.10.9, 3.11.3, and 3.12.3.
To quickly try it out, you can run `python -m vall_e.models.ar_nar --yaml="./data/config.yaml"`.
A small trainer will overfit a provided utterance to ensure a model configuration works.
Note: Pre-trained weights aren't up to par until I finally nail the best training methodologies and model code. Gomen.
My pre-trained weights can be acquired from here.
A script to set up a proper environment and download the weights can be invoked with `./scripts/setup.sh`.
Training is very dependent on your dataset.
Note: The provided dataset needs to be reprocessed to better suit a new training dataset format. Gomen.
A "libre" dataset utilizing EnCodec quantized audio can be found here under data.tar.gz
.
A script to set up a proper environment and train can be invoked with `./scripts/setup-training.sh`.
If you already have a dataset, for example your own large corpus or one for finetuning, you can use it instead of the provided one.
Set up a `venv` with https://github.com/m-bain/whisperX/. Using other variants like `faster-whisper` is an exercise left to the user at the moment. The following should work:

```
python3 -m venv venv-whisper
source ./venv-whisper/bin/activate
pip3 install torch torchvision torchaudio
pip3 install git+https://github.com/m-bain/whisperX/
```
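For reference, a minimal whisperX transcription with timestamps looks roughly like the sketch below. This only illustrates the library installed above; it is not necessarily how `./scripts/transcribe_dataset.py` does it, and the file name is hypothetical:

```python
# Rough sketch of a whisperX transcription pass (illustrative only).
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device)                    # load the ASR model
audio = whisperx.load_audio("./voices/group/speaker/sample.wav")   # hypothetical path
result = model.transcribe(audio, batch_size=16)                    # segments with timestamps
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```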
Populate your source voices under `./voices/{group name}/{speaker name}/`.

Run `python3 ./scripts/transcribe_dataset.py`. This will generate a transcription with timestamps for your dataset.
If needed, adjust the script's `model_name` and `batch_size` variables.

Run `python3 ./scripts/process_dataset.py`. This will phonemize the transcriptions and quantize the audio.
Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset_list.json`. Refer to `./vall_e/config.py` for additional configuration details.
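If it helps, the entries can be dumped out of `dataset_list.json` with a few lines of Python so they can be pasted into the `dataset.training` list. This is only a sketch and assumes the file is a flat JSON list of dataset names/paths:

```python
# Print dataset_list.json entries as YAML list items for dataset.training.
# Assumes the file is a flat JSON list; adjust if its structure differs.
import json

with open("./training/dataset_list.json") as f:
    entries = json.load(f)
for entry in entries:
    print(f"  - {entry}")
```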
Two dataset formats are supported:
* the standard way: quantized audio is stored under `./training/data/{group}/{speaker}/{id}.enc` as a NumPy file (EnCodec), or `./training/data/{group}/{speaker}/{id}.dac` as a NumPy file (Descript-Audio-Codec); see the sketch after this list. Metadata for dataset pre-load can be generated with `python3 -m vall_e.data --yaml="./training/config.yaml" --action=metadata`.
* an HDF5 dataset: convert from the standard layout with `python3 -m vall_e.data --yaml="./training/config.yaml"` (metadata for dataset pre-load is generated alongside HDF5 creation), and be sure to also set `use_hdf5` in your config YAML.
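As a quick sanity check of the standard layout described above, you can enumerate what's on disk before filling out the config. This is just a sketch built on the directory convention above; the group/speaker names are whatever you populated:

```python
# Walk ./training/data/{group}/{speaker}/ and count quantized utterances.
from pathlib import Path

data_root = Path("./training/data")
for group in sorted(p for p in data_root.iterdir() if p.is_dir()):
    for speaker in sorted(p for p in group.iterdir() if p.is_dir()):
        n_enc = len(list(speaker.glob("*.enc")))
        n_dac = len(list(speaker.glob("*.dac")))
        print(f"{group.name}/{speaker.name}: {n_enc} .enc, {n_dac} .dac")
```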
For single GPUs, simply run `python3 -m vall_e.train --yaml="./training/config.yaml"`.
For multiple GPUs, or exotic distributed training:
* With the `deepspeed` backend, simply running `deepspeed --module vall_e.train --yaml="./training/config.yaml"` should handle the gory details.
* With the `local` backend, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train --yaml="./training/config.yaml"`.
You can enter `save` to save the state at any time, or `quit` to save and quit training.
The `lr` command will also let you adjust the learning rate on the fly. For example, `lr 1.0e-3` will set the learning rate to `0.001`.
Included is a helper script to parse the training metrics. Simply invoke it with, for example, `python3 -m vall_e.plot --yaml="./training/config.yaml"`.

You can specify which X and Y labels you want to plot against by passing `--xs tokens_processed --ys loss stats.acc`.
As training with `deepspeed` under Windows is not (easily) supported, simply change `trainer.backend` to `local` in your `config.yaml` to use the local training backend.
Creature comforts like `float16`, `amp`, and multi-GPU training should work, but extensive testing still needs to be done to ensure it all functions.
Unfortunately, efforts to train a good foundational model seem entirely predicated on a good dataset. My dataset might be too fouled with:

* utterances whose `text` is too long.
* issues associating the `<s>` and `</s>` tokens with empty utterances.

As the core of VALL-E makes use of a language model, various LLM architectures can be supported and slotted in. Currently supported LLM architectures:
* `llama`: using HF transformers' LLaMA implementation for its attention-based transformer, boasting RoPE and other improvements.
* `mixtral`: using HF transformers' Mixtral implementation for its attention-based transformer, also utilizing its MoE implementation.
* `bitnet`: using this implementation of BitNet's transformer. Setting `cfg.optimizers.bitnet=True` will make use of BitNet's linear implementation.
* `transformer`: a basic attention-based transformer implementation, with attention heads + feed-forwards.
* `retnet`: using TorchScale's RetNet implementation, a retention-based approach can be used instead.
* `retnet-hf`: using syncdoth/RetNet with a HuggingFace-compatible RetNet model.
* `mamba`: using state-spaces/mamba (needs to mature).
For audio backends:

* `encodec`: a tried-and-tested EnCodec to encode/decode audio.
* `vocos`: a higher-quality EnCodec decoder. Encoding audio still goes through the `encodec` backend automagically, as there's no EnCodec encoder under `vocos`.
* `descript-audio-codec`: boasts better compression and quality, but:
  - `descript-audio-codec` at 24KHz + 8kbps will NOT converge in any manner.
  - `descript-audio-codec` at 44KHz + 8kbps seems harder to model its "language", but despite the loss being rather high, it sounds fine.
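For context on what the `encodec` backend does, below is a rough sketch of encoding a waveform to discrete codes with the upstream EnCodec library. This only illustrates the codec itself; it is not necessarily how this repo's dataset processing invokes it, and the file name is hypothetical:

```python
# Encode a waveform into EnCodec codebook indices (illustrative sketch).
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; controls how many codebooks are used

wav, sr = torchaudio.load("./voices/group/speaker/sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # [batch, n_q, time]
print(codes.shape)
```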
`llama`-based models also support different attention backends:
* `math`: torch SDPA's `math` implementation
* `mem_efficient`: torch SDPA's memory-efficient (`xformers`-adjacent) implementation
* `flash`: torch SDPA's flash attention implementation
* `xformers`: facebookresearch/xformers's memory-efficient attention
* `auto`: determine the best fit from the above
* `sdpa`: the integrated `LlamaSdpaAttention` attention model
* `flash_attention_2`: the integrated `LlamaFlashAttention2` attention model

The wide support for various backends is solely while I try and figure out which is the "best" for a core foundation model.
To export the models, run: `python -m vall_e.export --yaml=./training/config.yaml`.

This will export the latest checkpoints, for example under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used and training stats.
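If you want to verify an export before shipping it somewhere, the checkpoint can be opened with plain PyTorch. The exact contents beyond model weights (e.g. the symmap and training stats mentioned above) are whatever the exporter writes; this sketch just peeks at the top-level keys:

```python
# Inspect an exported checkpoint's top-level structure.
import torch

ckpt = torch.load("./training/ckpt/ar+nar-retnet-8/fp32.pth", map_location="cpu")
if isinstance(ckpt, dict):
    for key in ckpt:
        print(key)
```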
To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>`
Some additional flags you can pass are:

* `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
* `--max-ar-steps`: maximum steps for inferencing through the AR model. Each second is 75 steps.
* `--device`: device to use (default: `cuda`, examples: `cuda:0`, `cuda:1`, `cpu`).
* `--ar-temp`: sampling temperature to use for the AR pass. During experimentation, `0.95` provides the most consistent output, but values close to it work fine.
* `--nar-temp`: sampling temperature to use for the NAR pass. During experimentation, `0.2` provides clean output, but values upward of `0.6` seem fine too.

And some experimental sampling flags you can use too (your mileage will definitely vary):
* `--max-ar-context`: number of `resp` tokens to keep in the context when inferencing. This is akin to a "rolling context" in an effort to try and curb any context limitations, but currently does not seem fruitful.
* `--min-ar-temp` / `--min-nar-temp`: triggers the dynamic temperature pathway, adjusting the temperature based on the confidence of the best token. Acceptable values are between `[0.0, (n)ar-temp)`.
* `--top-p`: limits the sampling pool to the top tokens whose cumulative probability sums to `P`% of the distribution (see the sketch after this list).
* `--top-k`: limits the sampling pool to the top `K` values in the probability distribution.
* `--repetition-penalty`: modifies the probability of tokens if they have appeared before. In the context of audio generation, this is a very iffy parameter to use.
* `--repetition-penalty-decay`: modifies how much the above factor is scaled based on how far back a token appeared in the past sequence.
* `--length-penalty`: (AR only) modifies the probability of the stop token based on the current sequence length. This is very finicky due to the AR already being well correlated with the length.
* `--beam-width`: (AR only) specifies the number of branches (`B` sampling spaces) to search through for beam sampling.
* `--mirostat-tau`: (AR only) the "surprise value" when performing mirostat sampling.
* `--mirostat-eta`: (AR only) the "learning rate" during mirostat sampling applied to the maximum surprise.
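As a reference for what the `--top-p` / `--top-k` flags above do conceptually, here is a small, generic sketch of top-k and nucleus filtering over a logits vector. This is the standard technique, not a copy of this repo's sampler:

```python
# Generic top-k / top-p (nucleus) filtering of a logits vector.
import torch

def filter_logits(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        # Mask out everything below the K-th highest logit.
        kth_best = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        # Remove tokens once the cumulative probability *before* them reaches P,
        # so the smallest set of tokens summing to P% of the mass survives.
        remove = (probs.cumsum(dim=-1) - probs) >= top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return logits

probs = torch.softmax(filter_logits(torch.randn(32), top_k=10, top_p=0.9), dim=-1)
token = torch.multinomial(probs, 1)
```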
Unless otherwise credited/noted in this README or within the designated Python file, this repository is licensed under AGPLv3.
EnCodec is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
This implementation was originally based on enhuiz/vall-e, but has been heavily, heavily modified over time. Without it I would not have had a good basis to muck around and learn.
```bibtex
@article{wang2023neural,
  title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
  journal={arXiv preprint arXiv:2301.02111},
  year={2023}
}
```
```bibtex
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```