A desktop application that transcribes audio from files, microphone input or YouTube videos with the option to translate the content and create subtitles.
Audiotext transcribes the audio from an audio file, video file, microphone input, or YouTube video into one of the 74 different languages it supports, along with some of their dialects. You can transcribe using the Google Speech-to-Text API or WhisperX, which can even translate the transcription or generate subtitles!
You can also choose the theme you like best: dark, light, or the system default.
```
│   .gitignore
│   audiotext.spec
│   LICENSE
│   README.md
│   requirements.txt
│
├───.github
│   │   CONTRIBUTING.md
│   │
│   ├───ISSUE_TEMPLATE
│   │       bug_report_template.md
│   │       feature_request_template.md
│   │
│   └───PULL_REQUEST_TEMPLATE
│           pull_request_template.md
│
├───res
│   ├───img
│   │       icon.ico
│   │
│   └───locales
│       │   main_controller.pot
│       │   main_window.pot
│       │
│       ├───en
│       │   └───LC_MESSAGES
│       │           app.mo
│       │           app.po
│       │           main_controller.po
│       │           main_window.po
│       │
│       └───es
│           └───LC_MESSAGES
│                   app.mo
│                   app.po
│                   main_controller.po
│                   main_window.po
│
└───src
    │   app.py
    │
    ├───controller
    │       __init__.py
    │       main_controller.py
    │
    ├───model
    │   │   __init__.py
    │   │   transcription.py
    │   │
    │   └───config
    │           __init__.py
    │           config_google_api.py
    │           config_subtitles.py
    │           config_whisperx.py
    │
    ├───utils
    │       __init__.py
    │       audio_utils.py
    │       config_manager.py
    │       constants.py
    │       dict_utils.py
    │       enums.py
    │       i18n.py
    │       path_helper.py
    │
    └───view
        │   __init__.py
        │   main_window.py
        │
        └───custom_widgets
                __init__.py
                ctk_scrollable_dropdown/
                ctk_input_dialog.py
```
Install FFmpeg to run the program. Otherwise, it won't be able to process the audio files. To check whether it's installed on your system, run `ffmpeg -version`. It should return output similar to this:
```
ffmpeg version 5.1.2-essentials_build-www.gyan.dev Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
libavutil      57. 28.100 / 57. 28.100
libavcodec     59. 37.100 / 59. 37.100
libavformat    59. 27.100 / 59. 27.100
libavdevice    59.  7.100 / 59.  7.100
libavfilter     8. 44.100 /  8. 44.100
libswscale      6.  7.100 /  6.  7.100
libswresample   4.  7.100 /  4.  7.100
```
If the output is an error, your system cannot find the `ffmpeg` executable, most likely because it isn't installed. To install `ffmpeg`, open a command prompt and run one of the following commands, depending on your operating system:
```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
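If you're scripting around Audiotext, the same check can be done programmatically. Here is a small sketch; the function names are illustrative, not part of the project:

```python
import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if the `ffmpeg` executable is on the system PATH."""
    return shutil.which("ffmpeg") is not None


def ffmpeg_version() -> str:
    """Return the first line of `ffmpeg -version`, or '' if unavailable."""
    if not ffmpeg_available():
        return ""
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]
```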
1. Go to releases and download the latest version.
2. Decompress the downloaded file.
3. Open the `audiotext` folder and double-click the `Audiotext` executable file.
1. Clone the repository by running `git clone https://github.com/HenestrosaDev/audiotext.git`.
2. Go to the project directory `audiotext` by running `cd audiotext`.
3. Create a virtual environment. For example, with `virtualenv`, you would run `virtualenv venv`.
4. Activate the virtual environment:

```
# on Windows
. venv/Scripts/activate
# if you get the error `FullyQualifiedErrorId : UnauthorizedAccess`, run this:
Set-ExecutionPolicy Unrestricted -Scope Process
# and then
. venv/Scripts/activate

# on macOS and Linux
source venv/bin/activate
```

5. Run `cat requirements.txt | xargs -n 1 pip install` to install the dependencies. For some reason, `pip install -r requirements.txt` throws the error "Could not find a version that satisfies the requirement [PACKAGE_NAME]==[PACKAGE_VERSION] (from version: none)".
6. Run `python src/app.py` to start the program.

Notes:

- You may get an error when installing the `pyaudio` package. Here is a StackOverflow post explaining how to solve this issue.
- The line `pprint(response_text, indent=4)` in the `recognize_google` function from the `__init__.py` file of the `SpeechRecognition` package has been commented out to avoid opening a command line along with the GUI. Otherwise, the program would not be able to use the Google API transcription method because `pprint` throws an error if it cannot print to the CLI, preventing the code from generating the transcription. The same applies to the lines using the `logger` package in the `moviepy/audio/io/ffmpeg_audiowriter` file from the `moviepy` package. There is also a change in line 169: `logger=logger` has been changed to `logger=None` to avoid more errors related to opening the console.

Once you open the `Audiotext` executable file (explained in the getting started section), you'll see something like this:
You can transcribe from three audio sources:
File (see image above): Click on the file explorer icon to select the file you want to transcribe, or manually enter the path to the file into the input field. You can transcribe audio from both audio and video files. Note that the file explorer has the `All supported files` option selected by default. To select only audio files or only video files, click the combo box in the lower right corner of the file explorer to change the file type, as marked in red in the following image:
Supported file types: `.mp3`, `.mpeg`, `.wav`, `.wma`, `.aac`, `.flac`, `.ogg`, `.oga`, `.opus`, `.mp4`, `.m4a`, `.m4v`, `.f4v`, `.f4a`, `.m4b`, `.m4r`, `.f4b`, `.mov`, `.avi`, `.webm`, `.flv`, `.mkv`, `.3gp`, `.3gp2`, `.3g2`, `.3gpp`, `.3gpp2`, `.ogv`, `.ogx`, `.wmv`, `.asf`
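A quick way to pre-check whether a file would pass the `All supported files` filter is to compare its extension against the list above. This is an illustrative sketch, not the app's actual code:

```python
from pathlib import Path

# All file extensions listed above. The grouping into "audio" and "video"
# in the real file explorer dialog may differ.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".mpeg", ".wav", ".wma", ".aac", ".flac", ".ogg", ".oga",
    ".opus", ".mp4", ".m4a", ".m4v", ".f4v", ".f4a", ".m4b", ".m4r",
    ".f4b", ".mov", ".avi", ".webm", ".flv", ".mkv", ".3gp", ".3gp2",
    ".3g2", ".3gpp", ".3gpp2", ".ogv", ".ogx", ".wmv", ".asf",
}


def is_supported(path: str) -> bool:
    """Check whether a file's extension is in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```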
Microphone: Click the `Start recording` button to begin recording. The text of the button will change to `Stop recording` and its color will change to red. Click it again to stop recording and generate the transcription.
Note that your operating system must recognize an input source; otherwise, an error will appear in the text box indicating that no input source was detected.
Here is a video demonstrating this feature:
https://github.com/HenestrosaDev/audiotext/assets/60482743/bd0323d7-ff54-4363-8b73-a2d56e7f783b
YouTube video: Enter the video URL in the upper input field. When finished, click on the Generate transcription
button.
Once the program has generated the transcription, you'll see a green `Save transcription` button below the text box. Click it to open a file explorer where you can name the file and select the path where you want to save it. The file extension is `.txt` by default, but you can change it to any other text file type.
If you used WhisperX to generate the transcription and checked the Generate subtitles
option, you'll notice that two files are also saved along with the text file: a .vtt
file and a .srt
file. Both contain the subtitles for the transcribed file, as explained in the Generate Subtitles section.
Before you start transcribing, it's important to understand what each transcription method offers:
The WhisperX options appear when the selected transcription method is WhisperX. You can choose whether to translate the audio into English and whether to generate subtitles from the transcription.
To translate the audio into English, simply check the Translate to English
checkbox before generating the transcription, as shown in the video below.
https://github.com/HenestrosaDev/audiotext/assets/60482743/0aeeaa17-432f-445c-b29a-d76839be489b
However, there is another unofficial way to translate audio into any supported language by setting the Audio language
to the target translation language. For example, if the audio is in English and you want to translate it into Spanish, you would set the Audio language
to "Spanish".
Here is a practical example using the microphone:
https://github.com/HenestrosaDev/audiotext/assets/60482743/b346290f-4654-48c4-bf5a-2dcb75b136e9
Make sure to double-check the generated translations.
To generate subtitles, simply check the `Generate subtitles` option before generating the transcription, as you would with the `Translate to English` option.

When you select this option, you'll see a `Subtitle options` frame like the one below with these three options:

- Highlight words: Underline each word as it's spoken in the `.srt` and `.vtt` subtitle files. Not checked by default.
- Max. line count: The maximum number of lines in a segment. `2` by default.
- Max. line width: The maximum number of characters in a line before breaking it. `42` by default.
To get the files after the audio is transcribed, click Save transcription
and select the path where you want to save them, as explained in the Save Transcription section.
The output formats are `.vtt` and `.srt`, two of the most common subtitle file formats. Unfortunately, there is currently no support for the `.ass` file type, but it will be added as soon as WhisperX fixes a bug that prevents it from being created correctly.
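For reference, the `.srt` format numbers each cue and uses comma-separated milliseconds in its timestamps (`.vtt` uses a period instead). A minimal formatter for illustration only; Audiotext itself relies on WhisperX to write these files:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 62.5 -> '00:01:02,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Render one numbered .srt subtitle block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```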
When you click the Show advanced options
button in the WhisperX options
frame, the Advanced options
frame appears, as shown in the figure below.
It's highly recommended that you keep the default configuration unless you're having problems with WhisperX or you know exactly what you're doing, especially for the Compute type and Batch size options. Change them at your own risk and be aware that you may experience problems, such as having to reboot your system if the GPU runs out of VRAM.
There are five main model sizes that offer tradeoffs between speed and accuracy. The larger the model size, the more VRAM it uses and the longer it takes to transcribe. Unfortunately, WhisperX hasn't provided specific performance data for each model, so the table below is based on the one detailed in OpenAI's Whisper README. According to WhisperX's README, the `large-v2` model requires less than 8 GB of GPU memory and batches inference for 70x real-time transcription.
Model | Parameters | Required VRAM |
---|---|---|
`tiny` | 39 M | ~1 GB |
`base` | 74 M | ~1 GB |
`small` | 244 M | ~2 GB |
`medium` | 769 M | ~5 GB |
`large` | 1550 M | <8 GB |
`large` is divided into three versions: `large-v1`, `large-v2`, and `large-v3`. The default model size is `large-v2`, since `large-v3` has some bugs that weren't as common in `large-v2`, such as hallucination and repetition, especially for certain languages like Japanese. There are also more prevalent problems with missing punctuation and capitalization. See the announcements for the `large-v2` and `large-v3` models for more insight into their differences and the issues encountered with each.
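The table above translates into a simple rule of thumb: pick the largest model whose approximate VRAM requirement fits your GPU. This sketch is illustrative only, using the approximate figures from the table:

```python
# Approximate VRAM requirements (GB) from the table above,
# ordered from smallest to largest model.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v2": 8}


def largest_model_for(vram_gb: float) -> str:
    """Pick the largest model whose approximate VRAM requirement fits."""
    candidates = [m for m, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    # Dict order goes from smallest to largest model, so take the last fit.
    return candidates[-1] if candidates else "tiny"
```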
The larger the model size, the lower the WER (Word Error Rate in %). The table below is taken from this Medium article, which analyzes the performance of pre-trained Whisper models on common Dutch speech.
Model | WER |
---|---|
tiny | 50.98 |
small | 17.90 |
large-v2 | 7.81 |
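WER is the word-level edit distance between the model's output and a reference transcript, divided by the number of reference words. A minimal implementation, for illustration only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```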
This term refers to different data types used in computing, particularly in the context of numerical representation. It determines how numbers are stored and represented in a computer's memory. The higher the precision, the more resources will be needed and the better the transcription will be.
There are three possible values for Audiotext:
- `int8`: Default if using the CPU. It represents whole numbers without a fractional part. Its size is 8 bits (1 byte), and it can represent integer values from -128 to 127 (signed) or 0 to 255 (unsigned). It's used in scenarios where memory efficiency is critical, such as in quantized neural networks or edge devices with limited computational resources.
- `float16`: Default if using a CUDA GPU. It's a half-precision type representing 16-bit floating-point numbers. Its size is 16 bits (2 bytes). It has a smaller range and precision compared to `float32`. It's often used in applications where memory is a critical resource, such as in deep learning models running on GPUs or TPUs.
- `float32`: Recommended for CUDA GPUs with more than 8 GB of VRAM. It's a single-precision type representing 32-bit floating-point numbers, which is a standard for representing real numbers in computers. Its size is 32 bits (4 bytes). It can represent a wide range of real numbers with a reasonable level of precision.

The batch size option determines how many audio samples are processed together in a single inference pass. It doesn't affect the quality of the transcription, only the generation speed (the smaller, the slower).
For simplicity, let's divide the possible batch size values into two groups:

- Low values (up to `8`): Slower transcription, but less chance of running out of memory. `8` is the default value.
- High values (above `8`): Faster transcription at the cost of more memory, such as `16`.

Use CPU: Checked by default if there is no CUDA GPU. WhisperX will use the CPU for transcription if checked.
As noted in the Compute Type section, the default compute type value for the CPU is int8
, since many CPUs don't support efficient float16
or float32
computation, which would result in an error. Change it at your own risk.
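The defaults described above can be summed up in a tiny helper. The function is illustrative, not Audiotext's actual code:

```python
def default_compute_type(has_cuda_gpu: bool, vram_gb: float = 0.0) -> str:
    """Mirror the defaults described above: int8 on CPU, float16 on a CUDA
    GPU, with float32 suggested for CUDA GPUs with more than 8 GB of VRAM."""
    if not has_cuda_gpu:
        return "int8"
    return "float32" if vram_gb > 8 else "float16"
```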
The Google API options
frame appears if the selected transcription method is Google API.
Since the program uses the free Google API tier by default, which allows you to transcribe up to 60 minutes of audio per month for free, you may need to add an API key if you want to make extensive use of this feature. To do so, click the Set API key
button. You'll be presented with a dialog box where you can enter your API key, which will only be used to make requests to the API.
Remember that WhisperX provides fast, unlimited audio transcription that supports translation and subtitle generation for free, unlike the Google API. Also note that Google charges for the use of the API key, for which Audiotext is not responsible.
The program supports three appearance modes: dark, light, and system.

If you get the error `RuntimeError: CUDA Out of memory` or want to reduce GPU/CPU memory requirements, try any of the following (2 and 3 can affect quality), as suggested in the WhisperX README:

1. Reduce the batch size, e.g. `4`.
2. Use a smaller ASR model, e.g. `base`.
3. Use a lighter compute type, e.g. `int8`.
- `.srt` and `.vtt` files for subtitles (only for WhisperX).
- Code quality enforced with `Black`, `isort`, and `mypy`.

You can propose a new feature by creating an issue.
See also the list of contributors who participated in this project.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please read the CONTRIBUTING.md file, where you can find more detailed information about how to contribute to the project.
I have made use of the following resources to make this project:
Distributed under the BSD-4-Clause license. See LICENSE
for more information.
Would you like to support the project? That's very kind of you! However, I would suggest that you consider supporting the packages I've used to build this project first. If you still want to support this particular project, you can go to my Ko-fi profile by clicking the button below!