# TensorFlowASR :zap:

Almost State-of-the-art Automatic Speech Recognition in TensorFlow 2. Supports languages that use characters or subwords.
TensorFlowASR implements several automatic speech recognition architectures, such as DeepSpeech2, Jasper, RNN Transducer, ContextNet, and Conformer. These models can be converted to TFLite to reduce memory and computation for deployment :smile:
For training and testing, you should install from source with `git clone`, which installs the necessary packages from other authors (`ctc_decoders`, `rnnt_loss`, etc.):

```bash
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
# TensorFlow 2.x (with 2.x.x >= 2.5.1)
pip3 install -e ".[tf2.x]" # or ".[tf2.x-gpu]"
```
For anaconda3:

```bash
conda create -y -n tfasr tensorflow-gpu python=3.8 # use tensorflow instead if on CPU; this makes sure conda installs all dependencies for tensorflow
conda activate tfasr
pip install -U tensorflow-gpu # upgrade to the latest version of tensorflow
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
# TensorFlow 2.x (with 2.x.x >= 2.5.1)
pip3 install -e ".[tf2.x]" # or ".[tf2.x-gpu]"
```
To install the released package instead:

```bash
# TensorFlow 2.x (with 2.x >= 2.3)
pip3 install -U "TensorFlowASR[tf2.x]" # or pip3 install -U "TensorFlowASR[tf2.x-gpu]"
```
To run in a container:

```bash
docker-compose up -d
```
For datasets, see datasets
- For training, testing, and using CTC Models, run `./scripts/install_ctc_decoders.sh`
- For training Transducer Models with RNNT Loss in TF, make sure that warp-transducer is not installed (simply run `pip3 uninstall warprnnt-tensorflow`) (recommended)
- For training Transducer Models with RNNT Loss from warp-transducer, run `export CUDA_HOME=/usr/local/cuda && ./scripts/install_rnnt_loss.sh` (note: only export `CUDA_HOME` when you have CUDA)
- For mixed precision training, use the flag `--mxp` when running Python scripts from examples
- For enabling XLA, run `TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 $path_to_py_script`
- For hiding warnings, run `export TF_CPP_MIN_LOG_LEVEL=2` before running any examples
After converting to TFLite, the TFLite model acts like a function that maps an audio signal directly to Unicode code points, which we can then convert to a string.
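For illustration, decoding such code points back to text in Python is a one-liner (the code-point values below are made up for the example, not real model output):

```python
# A TFLite-converted model emits Unicode code points; decoding them to a
# Python string is just chr() applied to each point and joined.
code_points = [72, 101, 108, 108, 111]  # hypothetical model output
text = "".join(chr(cp) for cp in code_points)
print(text)  # Hello
```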
Install `tf-nightly` using `pip install tf-nightly`. Then add the `TFSpeechFeaturizer` and `TextFeaturizer` to the model using the function `add_featurizers`, and convert:
```python
import os
import tensorflow as tf

func = model.make_tflite_function(**options)  # options are the arguments of the function
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

if not os.path.exists(os.path.dirname(tflite_path)):
    os.makedirs(os.path.dirname(tflite_path))
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
```
Now the `.tflite` model is ready to be deployed.

For data augmentation, see augmentations.
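Loading the exported model back uses the standard `tf.lite.Interpreter` API. Below is a minimal sketch: the helper name `run_tflite` and the single-input assumption are ours, not part of the project's API, and TensorFlowASR's exported functions (especially streaming transducer models) may take additional state tensors.

```python
import numpy as np
import tensorflow as tf

def run_tflite(tflite_path, signal):
    """Run a converted .tflite model on a 1-D float32 audio signal.

    Assumes the exported function takes a single audio input tensor;
    streaming exports may require extra state inputs.
    """
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    input_details = interpreter.get_input_details()
    # Resize the dynamic-length audio input to this signal's shape, then allocate
    interpreter.resize_tensor_input(input_details[0]["index"], signal.shape)
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_details[0]["index"], signal.astype(np.float32))
    interpreter.invoke()
    output_details = interpreter.get_output_details()
    return interpreter.get_tensor(output_details[0]["index"])
```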
- See the `config.yml` files in the example folder for reference (you can copy and modify values such as parameters, paths, etc. to match your local machine configuration)
- Create `transcripts.tsv` files from your corpus (this is the general format used in this project, because each dataset has a different format). For more detail, see datasets. Note: make sure your data contains only characters in your language; for example, English has `a` to `z` and `'`. Do not use `cache` if your dataset does not fit in RAM.
- [Optional] Convert the dataset to `tf.data.TFRecordDataset` for better performance by using the script `create_tfrecords.py`
- Generate a vocabulary for `language.characters` using the scripts `generate_vocab_subwords.py` or `generate_vocab_sentencepiece.py`. There are predefined ones in `vocabularies`
- Record in `config.yml` the total number of elements in each dataset, for static-shape training and precalculated steps per epoch
- See the `train.py` files in the example folder for the training options
- See the `test.py` files in the example folder for the testing options. Note: testing is currently not supported on TPUs. It prints nothing other than the progress bar in the console, but it stores the predicted transcripts to the file `output.tsv` and then calculates the metrics from that file.
- FYI: Keras built-in training uses an infinite dataset, which avoids the potential last partial batch.
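As a sketch of how a `transcripts.tsv` could be consumed, assuming a tab-separated `PATH`/`DURATION`/`TRANSCRIPT` layout with a header row (the column names, paths, and durations below are illustrative assumptions; verify the exact format against the datasets documentation for your version):

```python
import csv
import io

# Hypothetical transcripts.tsv content (paths and durations are made up)
sample = (
    "PATH\tDURATION\tTRANSCRIPT\n"
    "/data/a1.wav\t2.5\thello world\n"
    "/data/a2.wav\t1.8\tgood morning\n"
)

def load_transcripts(tsv_text):
    """Parse transcripts.tsv text into (path, duration, transcript) tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(r["PATH"], float(r["DURATION"]), r["TRANSCRIPT"]) for r in reader]

entries = load_transcripts(sample)
print(entries[0])  # ('/data/a1.wav', 2.5, 'hello world')
```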
See examples for some predefined ASR models and results
For pretrained models, go to drive
**English**

| Name | Source | Hours |
| --- | --- | --- |
| LibriSpeech | LibriSpeech | 970h |
| Common Voice | https://commonvoice.mozilla.org | 1932h |
**Vietnamese**

| Name | Source | Hours |
| --- | --- | --- |
| Vivos | https://ailab.hcmus.edu.vn/vivos | 15h |
| InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h |
| InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h |
**German**

| Name | Source | Hours |
| --- | --- | --- |
| Common Voice | https://commonvoice.mozilla.org/ | 750h |
Huy Le Nguyen
Email: [email protected]