Speech-to-Text Benchmark

Made in Vancouver, Canada by Picovoice

This repository provides a minimalist and extensible framework for benchmarking different speech-to-text engines.

Table of Contents

  • Data
  • Metrics
  • Engines
  • Usage
  • Results

Data

The benchmark uses the following datasets: Common Voice, LibriSpeech (test-clean and test-other), and TED-LIUM.

Metrics

Word Error Rate

Word error rate (WER) is the ratio of the word-level edit distance between a reference transcript and the speech-to-text engine's output to the number of words in the reference transcript.
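
For illustration, here is a minimal Python sketch of this computation (word-level Levenshtein distance divided by the reference length); it is not the repository's implementation:

def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# 2 edits (one substitution, one deletion) over 6 reference words ~ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))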

Core-Hour

The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.
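
For example, a Core-Hour of 0.5 means that transcribing one hour of audio takes half a CPU-core hour, i.e., about 30 minutes of compute on a single core.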

Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.

Engines

The following engines are benchmarked:

  • Amazon Transcribe
  • Azure Speech-to-Text
  • Google Speech-to-Text
  • IBM Watson Speech-to-Text
  • OpenAI Whisper
  • Picovoice Cheetah
  • Picovoice Leopard

Usage

This benchmark has been developed and tested on Ubuntu 22.04.

  • Install FFmpeg.
  • Download the datasets.
  • Install the Python requirements:
pip3 install -r requirements.txt

In the following sections, we provide instructions for running the benchmark with each engine. The supported datasets are COMMON_VOICE, LIBRI_SPEECH_TEST_CLEAN, LIBRI_SPEECH_TEST_OTHER, and TED_LIUM.

Amazon Transcribe Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AWS_PROFILE} with the name of the AWS profile you wish to use.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}
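
The AWS profile is a named set of credentials configured through the AWS CLI (for example, with aws configure --profile ${AWS_PROFILE}); this assumes you have the AWS CLI installed and authenticated.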

Azure Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AZURE_SPEECH_KEY} and ${AZURE_SPEECH_LOCATION} with the credentials from your Azure account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}
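
Both values come from a Speech resource in the Azure portal: the key is one of the resource's API keys, and the location is the resource's region identifier (e.g., westus).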

Google Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${GOOGLE_APPLICATION_CREDENTIALS} with the credentials file downloaded from Google Cloud Platform.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}
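
The flag takes the path to a service-account JSON key file exported from the Google Cloud console.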

IBM Watson Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${WATSON_SPEECH_TO_TEXT_API_KEY} and ${WATSON_SPEECH_TO_TEXT_URL} with the credentials from your IBM account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}

OpenAI Whisper Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${WHISPER_MODEL} with the Whisper model type (WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, or WHISPER_LARGE).

python3 benchmark.py \
--engine ${WHISPER_MODEL} \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER}
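
For example, to benchmark the base Whisper model on TED-LIUM (the dataset path below is only a placeholder):

python3 benchmark.py \
--engine WHISPER_BASE \
--dataset TED_LIUM \
--dataset-folder ${HOME}/datasets/TEDLIUM_release-3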

Picovoice Cheetah Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with your AccessKey obtained from the Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Picovoice Leopard Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with your AccessKey obtained from the Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Results

Word Error Rate (WER)

Engine                            LibriSpeech test-clean   LibriSpeech test-other   TED-LIUM   CommonVoice   Average
Amazon Transcribe                 2.6%                     5.6%                     3.8%       8.7%          5.2%
Azure Speech-to-Text              2.8%                     6.2%                     4.6%       8.9%          5.6%
Google Speech-to-Text             10.8%                    24.5%                    14.4%      31.9%         20.4%
Google Speech-to-Text Enhanced    6.2%                     13.0%                    6.1%       18.2%         10.9%
IBM Watson Speech-to-Text         10.9%                    26.2%                    11.7%      39.4%         22.0%
Whisper Large (Multilingual)      3.7%                     5.4%                     4.6%       9.0%          5.7%
Whisper Medium                    3.3%                     6.2%                     4.6%       10.2%         6.1%
Whisper Small                     3.3%                     7.2%                     4.8%       12.7%         7.0%
Whisper Base                      4.3%                     10.4%                    5.4%       17.9%         9.5%
Whisper Tiny                      5.9%                     13.8%                    6.5%       24.4%         12.7%
Picovoice Cheetah                 5.6%                     12.1%                    7.7%       17.5%         10.7%
Picovoice Leopard                 5.3%                     11.3%                    7.2%       16.2%         10.0%
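
The Average column is the unweighted mean of the four dataset WERs; for example, for Amazon Transcribe, (2.6% + 5.6% + 3.8% + 8.7%) / 4 ≈ 5.2%.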

Core-Hour & Model Size

To obtain these results, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time. The measurements were carried out on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X CPU (12 cores, 3.70 GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large (Multilingual) from this benchmark.
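
Concretely, Core-Hour can be computed as total CPU-core time divided by hours of audio. A minimal Python sketch (not the repository's code; the names are illustrative, and it assumes all cores stay busy for the full wall-clock time):

def core_hour(wall_clock_seconds, num_cores, audio_seconds):
    # CPU-core hours spent per hour of audio processed.
    cpu_core_hours = wall_clock_seconds * num_cores / 3600.0
    audio_hours = audio_seconds / 3600.0
    return cpu_core_hours / audio_hours

# Example: 10 cores busy for 9 minutes over 1 hour of audio -> 1.5 Core-Hour.
print(core_hour(wall_clock_seconds=540, num_cores=10, audio_seconds=3600))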

Engine              Core-Hour   Model Size (MB)
Whisper Medium      1.50        1457
Whisper Small       0.89        462
Whisper Base        0.28        139
Whisper Tiny        0.15        73
Picovoice Leopard   0.05        36
Picovoice Cheetah   0.09        31
