Noisy Student Training ASR

A PyTorch implementation of Noisy Student Training for the Automatic Speech Recognition (ASR) and Automatic Pronunciation Error Detection (APED) problems.


Introduction

Automatic Pronunciation Error Detection (APED).

An APED system first provides a predefined text (and, if necessary, a pre-recorded voiceover that learners can listen to as a reference). The learner's task is very simple: read this passage as correctly as possible. For example, suppose a learner wants to learn how to pronounce the word "apple" (whose phoneme sequence is "æ p l") but mispronounces it as "ə p l". In this case, we call "æ p l" the standard pronunciation string and "ə p l" the reader's string. The APED system predicts exactly at which position the learner mispronounced "apple" and feeds this back to the learner, so that the mistake can be corrected promptly and, gradually, the learner's pronunciation improves.

(Figure: APED system based on ASR)

Method

    1. Train a Conformer with the wav2vec 2.0 pre-training framework, combined with the Noisy Student Training self-training technique, for the phoneme recognition problem (a minimal training-loop sketch follows this list).
    2. Find the longest common subsequence (LCS) between the ground-truth phoneme sequence and the speaker phoneme sequence predicted by the model to detect mispronunciations.
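
The following is a minimal, self-contained sketch of the Noisy Student Training loop described in step 1. The tiny model, the dummy batches, and `add_noise` are hypothetical placeholders (the project itself uses a wav2vec 2.0 Conformer, SpecAugment-style noise, and a CTC loss over phoneme sequences); frame-level cross-entropy is used here only to keep the sketch short.

```python
import torch
import torch.nn as nn

# Hypothetical tiny acoustic model standing in for the wav2vec 2.0 Conformer.
# Input: (batch, time, feat) features; output: per-frame phoneme logits.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_feat=80, n_phonemes=72):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feat, 256), nn.ReLU(),
                                 nn.Linear(256, n_phonemes))

    def forward(self, x):
        return self.net(x)

def add_noise(features):
    # Stand-in for SpecAugment/dropout-style input noise applied to the student.
    return features + 0.1 * torch.randn_like(features)

teacher = TinyAcousticModel()
student = TinyAcousticModel()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Dummy batches: labeled (features, frame-level phoneme ids) and unlabeled features.
labeled_batch = (torch.randn(4, 100, 80), torch.randint(0, 72, (4, 100)))
unlabeled_batch = torch.randn(4, 100, 80)

# 1) The teacher (trained on labeled data) pseudo-labels the unlabeled audio.
with torch.no_grad():
    pseudo_labels = teacher(unlabeled_batch).argmax(dim=-1)

# 2) The student is trained on labeled + pseudo-labeled data with input noise.
for features, targets in [labeled_batch, (unlabeled_batch, pseudo_labels)]:
    logits = student(add_noise(features))
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 3) In full Noisy Student Training, the student then becomes the next teacher
#    and the pseudo-labeling/training cycle is repeated.
```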

(Figure: project structure)

Experiment

Dataset

Type of Label

Labels are phoneme sequences, extracted with bootphon/phonemizer.
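
As a small illustration, phoneme labels of this kind can be produced with the phonemizer package roughly as follows; the exact options used by the project may differ, and an espeak-ng backend is assumed to be installed.

```python
# Minimal sketch of extracting space-separated IPA phoneme labels with phonemizer.
from phonemizer import phonemize
from phonemizer.separator import Separator

text = ["why an ear a whirlpool fierce to draw creations in"]

# Separate phonemes with spaces and drop word boundaries so each label is a
# flat phoneme sequence such as "w aɪ ə n ɪ ɹ ə ...".
labels = phonemize(
    text,
    language="en-us",
    backend="espeak",
    separator=Separator(phone=" ", word=" ", syllable=""),
    strip=True,
)
print(labels[0])
```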

Speech Dataset

Training

  • Pre-training wav2vec 2.0 Conformer: uses the pretrained checkpoint from facebook/fairseq, Wav2Vec 2.0 Large Conformer - rel_pos (LV-60).
  • Unlabeled: LibriSpeech 360 hours with the labels removed.
  • Labeled: Libri-Light 10h + LibriSpeech dev-clean and dev-other, 25 hours in total.

Testing

  • Labeled: LibriSpeech test-clean, 5.4 hours.

The dataset can be downloaded from tuannguyenvananh/libri-phone.

Language Model

  • The language model is a 3-gram LM with Witten-Bell smoothing, trained on the labeled phoneme corpus (which is quite small). A toy training sketch follows the table below.

| Corpus | No. Sentences | Perplexity |
|---|---|---|
| Training (80%) | 8906 | 10.17 |
| Testing (20%) | 2241 | 10.47 |
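
A minimal sketch of how such a phoneme-level 3-gram Witten-Bell LM could be trained and evaluated, here with NLTK for illustration (the project's actual LM toolkit may differ); the corpus below is a hypothetical stand-in.

```python
# Train a 3-gram language model with Witten-Bell smoothing on phoneme sentences
# and report perplexity on a held-out split.
from nltk.lm import WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Hypothetical corpus: each "sentence" is a list of phoneme tokens.
train_corpus = [["w", "aɪ", "ə", "t", "ʌ", "ŋ"], ["æ", "p", "l"]]
test_corpus = [["ə", "p", "l"]]

order = 3
train_ngrams, vocab = padded_everygram_pipeline(order, train_corpus)
lm = WittenBellInterpolated(order)
lm.fit(train_ngrams, vocab)

# Perplexity on the held-out phoneme sentences.
test_ngrams = [ng for sent in test_corpus
               for ng in ngrams(pad_both_ends(sent, n=order), order)]
print(lm.perplexity(test_ngrams))
```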

Results

Phoneme Recognition

  • Metric: Phoneme Error Rate (%)
| Model | Greedy Decode | Beam Search (with LM) Decode |
|---|---|---|
| Teacher | 17.13 | 22.31 |
| Student | 12.66 | 25.45 |

The results show that:

  • The language model is not suitable for this problem because of the small amount of text data.
  • The student's PER with greedy decoding is 26% lower (relative) than the teacher's, showing that the self-training technique was successfully applied to the phoneme sequence recognition problem.
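
For reference, the relative reduction with greedy decoding is (17.13 − 12.66) / 17.13 ≈ 26.1%.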

Some examples of phoneme sequence prediction

| # | Ground Truth | Predict | PER (%) |
|---|---|---|---|
| 1 | w aɪ ə t ʌ ŋ ɪ m p ɹ ɛ s d w ɪ ð h ʌ n i f ɹ ʌ m ɛ v ɹ i w ɪ n d | w aɪ ə t t ɑː ŋ ɪ m p ɹ ɛ s t w ɪ ð h ʌ n i f ɹ ʌ m ɛ v ɹ i w ɪ n d ɪ | 12.5 |
| 2 | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 0 |
| 3 | ɔ l ɪ z s ɛ d w ɪ ð aʊ t ə w ə d | ɔ l w ɪ z s ɛ d w ɪ ð aʊ t ə w ə d | 6.25 |
| 4 | aɪ s ɪ t b ə n i θ ð aɪ l ʊ k s æ z t ʃ ɪ l d ɹ ə n d uː ɪ n ð ə n uː n s ʌ n w ɪ ð s oʊ l z ð æ t t ɹ ɛ m b l θ ɹ uː ð ɛ ɹ h æ p i aɪ l ɪ d z f ɹ ʌ m ə n ʌ n ə v ə d j ɛ t p ɹ ɑː d ɪ ɡ l ɪ n w ɚ d d ʒ ɔɪ | aɪ s ɪ t b ə n i θ aɪ l ʊ k s æ z t ʃ ɪ l d ɹ ə n d uː ɪ n ð ə n uː n s ʌ n w ɪ ð s oʊ l z ð æ t ɹ ɛ m b l θ ɹ uː ð ɛ ɹ h æ p i aɪ ə l ɪ z f ɹ ʌ m ə n ʌ n ʌ v ə d j ɛ t p ɹ ɑː n ɪ k l ɪ n w ɚ d d ʒ ɔɪ ə ɪ ɪ | 10.2 |
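
As a sanity check, PER here is the token error rate over space-separated phonemes, so it can be recomputed with jiwer (already listed in the installation requirements). The snippet below reproduces example 3, where a single inserted phoneme over 16 reference phonemes gives 1/16 = 6.25%.

```python
# jiwer treats each space-separated token (here a phoneme) as a "word",
# so WER over these strings equals the phoneme error rate.
import jiwer

ground_truth = "ɔ l ɪ z s ɛ d w ɪ ð aʊ t ə w ə d"
prediction   = "ɔ l w ɪ z s ɛ d w ɪ ð aʊ t ə w ə d"

per = jiwer.wer(ground_truth, prediction)
print(f"PER = {per:.4f}")  # -> 0.0625, i.e. 6.25%
```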

Error Detection

Setup

| Information | Value |
|---|---|
| Grapheme | why an ear a whirlpool fierce to draw creations in |
| IPA Phoneme | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n |
| Phoneme sequence length | 32 |
| Path in LibriSpeech test-clean | test-clean/908/157963/908-157963-0030.flac |

To test error detection, one word of the sample sentence is changed at a time, the IPA phoneme sequence of the modified sentence is regenerated, and its synthesized audio is transcribed by the student model; errors are then located with the longest common subsequence (LCS) algorithm.

The IPA is generated with bootphon/phonemizer and the voice is synthesized using NVIDIA FastPitch's text-to-speech API.
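
A rough sketch of this perturbation step, assuming phonemizer as above; `synthesize` stands in for whatever FastPitch text-to-speech call is used and is purely hypothetical here.

```python
# Sketch of building the perturbed test sentences: swap in one wrong word,
# regenerate the phoneme sequence, and (separately) synthesize audio for the
# student model to transcribe.
from phonemizer import phonemize
from phonemizer.separator import Separator

original = "why an ear a whirlpool fierce to draw creations in"
substitutions = {  # example name -> (word index, replacement), as in the table below
    "error_1": (4, "weir pool"),
    "error_2": (5, "fear"),
    "error_3": (4, "whirl pole"),
}

def to_phonemes(text):
    return phonemize(text, language="en-us", backend="espeak",
                     separator=Separator(phone=" ", word=" "), strip=True)

standard_phonemes = to_phonemes(original)

for name, (idx, wrong_word) in substitutions.items():
    words = original.split()
    words[idx] = wrong_word
    perturbed = " ".join(words)
    perturbed_phonemes = to_phonemes(perturbed)
    # audio = synthesize(perturbed)  # hypothetical FastPitch TTS call
    print(name, perturbed, "->", perturbed_phonemes)
```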

| Example Name | Grapheme | Original Phoneme Sequence | Phoneme Sequence Length |
|---|---|---|---|
| error_1 | why an ear a weir pool fierce to draw creations in | w aɪ ə n ɪ ɹ ə w ɪ ɹ p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 32 |
| error_2 | why an ear a whirlpool fear to draw creations in | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 31 |
| error_3 | why an ear a whirl pole fierce to draw creations in | w aɪ ə n ɪ ɹ ə w ə l p l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 32 |

The above dataset can be downloaded from tuannguyenvananh/aped-sample.

Error Detection Result

The LCS algorithm matches the longest common subsequence between the standard sample and the predicted sample; phonemes left unmatched are treated as pronunciation errors (see the sketch after the table below).

| Example Name | Predicted Phoneme Sequence | Phonemes Unmatched by LCS | Predicted Sequence Length | PER (%) | Result |
|---|---|---|---|---|---|
| error_1 | w aɪ ə n ɪ ɹ ə w ɪ ɹ p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | ɪ, ɹ | 32 | 6.25 | Correct |
| error_2 | w aɪ ə n ɪ ɹ ə w ə k l f ɪ ɹ t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | k | 29 | 12.5 | Wrong |
| error_3 | w aɪ ə n ɪ ɹ ə w ə l p l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | | 32 | 3.125 | Correct |
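
A minimal sketch of this LCS-based matching (textbook dynamic-programming LCS, not necessarily the project's exact implementation), reusing the standard sequence and the error_1 prediction from the tables above:

```python
# Longest common subsequence between the standard phoneme sequence and the
# predicted one; predicted phonemes that are not part of the LCS are flagged
# as pronunciation errors.
def lcs_unmatched(reference, predicted):
    n, m = len(reference), len(predicted)
    # dp[i][j] = LCS length of reference[:i] and predicted[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == predicted[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to find which predicted positions were matched.
    matched, i, j = set(), n, m
    while i > 0 and j > 0:
        if reference[i - 1] == predicted[j - 1]:
            matched.add(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return [p for k, p in enumerate(predicted) if k not in matched]

standard = ("w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə "
            "d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n").split()
predicted = ("w aɪ ə n ɪ ɹ ə w ɪ ɹ p uː l f ɪ ɹ s t ə "
             "d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n").split()
print(lcs_unmatched(standard, predicted))  # -> ['ɪ', 'ɹ'], as in the error_1 row
```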

Installation

Create Conda environment

```bash
conda create -n dev -c pytorch-nightly -c nvidia -c pytorch -c conda-forge python=3.9 pytorch torchaudio cudatoolkit pandas numpy
```

Install requirement packages

```bash
pip install fairseq transformers torchsummary datasets evaluate torch-summary jiwer wandb matplotlib
```

Login Wandb

```bash
wandb login <token>
```

References

Pretrained Checkpoint

The pretrained checkpoint file can be downloaded from tuannguyenvananh/nst-pretrained-model.

Report

The final report (Vietnamese version) can be downloaded here: graduation-thesis_v3.pdf.

Things to do

  • Restructure the code.
  • Publish pretrained checkpoints for the teacher and student models.

