Pandrator aspires to be a user-friendly app with a graphical interface and a one-click installer that creates high-quality speech from text in multiple languages (audiobooks, speech synchronised with subtitles and more) using local models (XTTS, Silero or VoiceCraft), plus voice cloning, LLM pre-processing, RVC enhancement, and automatic evaluation.
Pandrator is a tool designed to transform text, PDF, EPUB and SRT files into spoken audio in multiple languages based on open source software, including voice cloning, LLM-based text preprocessing and the ability to directly save generated subtitle audio to a video file by mixing the synchronized output with the original audio track of the video. It aspires to be easy to use and install - it has a one-click installer and a graphical user interface.
It leverages the XTTS, Silero and VoiceCraft model(s) for text-to-speech conversion and voice cloning, enhanced by RVC_CLI for quality improvement and better voice cloning results, and NISQA for audio quality evaluation. Additionally, it incorporates Text Generation Webui's API for local LLM-based text pre-processing, enabling a wide range of text manipulations before audio generation.
It is still in an alpha stage, and I'm not an experienced developer (I'm a noob, in fact), so the code is far from perfect in terms of optimisation, features and reliability. Please keep this in mind, and contribute if you want to help me make it better.
The samples were generated using minimal settings: no LLM text processing, RVC or TTS evaluation, and no sentences were regenerated. Both the XTTS and Silero generations were faster than playback speed.
https://github.com/lukaszliniewicz/Pandrator/assets/75737665/76a97cf0-275d-4ea2-868e-95eecdc6f6ce
https://github.com/lukaszliniewicz/Pandrator/assets/75737665/bbb10512-79ed-43ea-bee3-e271b605580e
https://github.com/lukaszliniewicz/Pandrator/assets/75737665/118f5b9c-641b-4edd-8ef6-178dd924a883
I was able to run all functionalities on a laptop with a Ryzen 5600H CPU and a 3050 laptop GPU (4GB of VRAM). You will likely need at least 16GB of RAM, a reasonably modern CPU, and ideally an NVIDIA GPU with 4GB+ of VRAM for usable performance. Consult the requirements of the services listed below.
Silero runs on the CPU. It should perform well on almost all reasonably modern systems.
You can run VoiceCraft on a CPU, but generation will be very slow. To achieve meaningful acceleration with a GPU (NVIDIA), you need one with at least 8GB of VRAM.
This project relies on several APIs and services (running locally) and libraries, notably: a sentence-splitting library for dividing `.txt` files into sentences, `customtkinter` by TomSchimansky, `num2words` by savoirfairelinux for converting numbers to words (Silero requires this), `pysrt`, `pydub` and others (see `requirements.txt`).

Run `pandrator_start_minimal_xtts.exe`, `pandrator_start_minimal_silero.exe` or `pandrator_start_minimal_voicecraft.exe` with administrator privileges. You will find them under Releases. The executables were created with PyInstaller from `pandrator_start_minimal_xtts.py`, `pandrator_start_minimal_silero.py` and `pandrator_start_minimal_voicecraft.py` in the repository.
The file may be flagged as a threat by antivirus software, so you may have to add it as an exception.
On first use, the EXE creates the Pandrator folder, installs `curl`, `git`, `ffmpeg` (using Chocolatey, if not already installed) and Miniconda, clones the XTTS API Server repository, the Silero API Server repository or the VoiceCraft API repository as well as the Pandrator repository, creates conda environments, installs dependencies, and launches Pandrator and the server you chose. You may use the EXE to launch Pandrator later.
If you want to perform the setup again, remove the Pandrator folder it created. Please allow at least a couple of minutes for the initial setup process to download models and install dependencies (it takes about 7-10 minutes for me).
For additional functionality such as LLM-based text pre-processing, you need a running Text Generation Webui instance with its API enabled (add `--api` to `CMD_FLAGS.txt` in the main directory of the Webui before starting it). Please refer to the repositories linked under Dependencies for detailed installation instructions. Remember that the APIs must be running to make use of the functionalities they offer.
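To make the LLM pre-processing step concrete, here is a hedged sketch of how a client might talk to a locally running Text Generation Webui over HTTP. The endpoint path, port, and parameter names follow Webui's OpenAI-compatible API and are assumptions for illustration; this is not Pandrator's actual client code.

```python
import json
import urllib.request

def build_completion_payload(prompt, max_tokens=512, temperature=0.7):
    """Assemble a JSON payload for a completion-style request.

    Parameter names follow the OpenAI-compatible API exposed by recent
    versions of Text Generation Webui; treat them as assumptions.
    """
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def send_request(payload, url="http://127.0.0.1:5000/v1/completions"):
    """POST the payload to a locally running Webui (started with --api)
    and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A text-cleanup prompt would be passed as the `prompt` string before the result is handed to the TTS engine.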
You can also run Pandrator manually: clone the repository (`git clone https://github.com/lukaszliniewicz/Pandrator.git`), `cd` to the repository directory, install the dependencies with `pip install -r requirements.txt`, and launch it with `python pandrator.py`.
`openchat-3.5-0106.Q5_K_M.gguf` has given good results, as has, for example, Mistral 7B Instruct 0.2. Different models may perform different tasks well, so it's possible to choose a specific model for a specific prompt. You can work with `.txt`, `.srt` and `.pdf` files. If you don't want to use the additional functionalities, you have everything you need in the Session tab.
Your sessions and generated audio are saved in the `Outputs` folder.

Add a `.txt`, `.srt`, `.pdf` or `.epub` file. If you choose a PDF or EPUB file, a preview window will open with the extracted text; you may edit it (OCRed books often have poorly recognised text from the title page, for example). Very big PDF files can take a couple of minutes to load.

XTTS voices are `.wav` files (22050 Hz sample rate, mono) stored in the `tts_voices` directory. The XTTS model uses the audio to clone the voice. It doesn't matter what language the sample is in; you will be able to generate speech in all supported languages, but the quality will be best if you provide a sample in your target language. You may use the sample included in the repository or upload your own. Please make sure that the audio is between 6 and 12 s, mono, and that the sample rate is 22050 Hz. You may use a tool like Audacity to prepare the files. The less noise, the better. You may use a tool like Resemble AI on Hugging Face for denoising and/or enhancement of your samples.
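If you would rather verify a prepared sample programmatically than open it in Audacity, a minimal sketch using only the Python standard library (the file path and function name are hypothetical, and the 6-12 s / mono / 22050 Hz expectations come from the requirements above):

```python
import wave

def check_xtts_sample(path):
    """Return a list of problems with a candidate XTTS voice sample.

    Expected format, per the requirements above: mono, 22050 Hz,
    roughly 6-12 seconds long. An empty list means the sample passes.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getframerate() != 22050:
            problems.append(f"expected 22050 Hz, got {w.getframerate()} Hz")
        duration = w.getnframes() / w.getframerate()
        if not 6 <= duration <= 12:
            problems.append(f"expected 6-12 s, got {duration:.1f} s")
    return problems
```

Running it over every file in `tts_voices` before a long generation run can save you from a poorly cloned voice.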
VoiceCraft also clones voices from a `.wav` sample. However, it needs both a properly formatted `.wav` file (mono, 16000 Hz) and a `.txt` file with a transcription of what is said in the sample. The files must have the same name (apart from the extension, of course). You need to upload them to `tts_voices/VoiceCraft`, and you will then be able to select them in the GUI. Currently only English is supported. If you generate with a new voice for the first time, the server will perform the alignment procedure, so the first sentence will be generated with a delay; this won't happen when you use that voice again.
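Because VoiceCraft silently needs a matching `.wav`/`.txt` pair per voice, a quick sketch that lists the complete pairs in a folder can help spot a missing transcription (the folder layout is assumed from the description above; the function name is hypothetical):

```python
from pathlib import Path

def find_voice_pairs(voice_dir):
    """Return voice names that have both a .wav sample and a .txt
    transcription with the same stem in voice_dir."""
    d = Path(voice_dir)
    wavs = {p.stem for p in d.glob("*.wav")}
    txts = {p.stem for p in d.glob("*.txt")}
    return sorted(wavs & txts)
```

A voice that appears in the folder but not in this list is missing one half of its pair and will not work.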
If you chose an `.srt` file, you will be given the option to select a video file and one of its audio tracks to mix with the synchronised output, as well as whether you want to lower the volume of the original audio while subtitle audio is playing. By default, generated audio is saved as `.opus` at a 64k bitrate; you may change this in the Audio tab to `.wav` or `.mp3`.

Contributions, suggestions for improvements, and bug reports are most welcome!
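To see what timing information an `.srt` file provides for synchronising audio with video, here is a small illustration of the format using only the standard library. This is not Pandrator's parser (it uses `pysrt`); it only shows the start/end data that drives the mixing and volume-lowering described above.

```python
import re

# SRT timestamps look like: 00:01:02,345 --> 00:01:04,000
TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
CUE = re.compile(rf"{TIME} --> {TIME}")

def srt_spans(text):
    """Return (start, end) pairs in seconds for each subtitle cue."""
    spans = []
    for m in CUE.finditer(text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        spans.append((start, end))
    return spans
```

Each span marks a window in which generated speech plays and, if you enabled it, the original track is ducked.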
If your voice sample is in a different format, you can convert it to `.wav` using Audacity, for instance. Pandrator can also generate audio from `.srt` subtitle files.