PyTorch implementation of Noisy Student Training for the Automatic Speech Recognition and Automatic Pronunciation Error Detection tasks
Automatic Pronunciation Error Detection (APED).
An APED system first provides a predefined text (and, if necessary, a pre-recorded voiceover for learners to listen to as a reference). The learner's task is simple: read this passage as correctly as possible. For example, a learner wants to learn how to pronounce the word "apple" (its phoneme sequence is "æ p l"), but may mispronounce it as "ə p l". In this case, we call "æ p l" the standard pronunciation string and "ə p l" the reader's string. The APED system predicts exactly where the user mispronounced the word "apple", and gives feedback so that the learner can promptly correct the mistake; over time, the learner's pronunciation improves.
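As a minimal illustration of this comparison step, the sketch below aligns the standard pronunciation string against the reader's string and reports the positions where they diverge. It uses Python's standard `difflib` for brevity; the function name is illustrative, and the actual system matches sequences with the LCS algorithm described later in this README.

```python
import difflib

def find_mispronunciations(standard, spoken):
    """Align two phoneme sequences; return (position, expected, got) tuples."""
    errors = []
    matcher = difflib.SequenceMatcher(None, standard, spoken)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / insert / delete -> pronunciation error
            errors.append((i1, standard[i1:i2], spoken[j1:j2]))
    return errors

standard = "æ p l".split()  # canonical phonemes for "apple"
spoken = "ə p l".split()    # learner's pronunciation
print(find_mispronunciations(standard, spoken))
# -> [(0, ['æ'], ['ə'])]  i.e. the first phoneme was mispronounced
```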
Labels are phoneme sequences, extracted with bootphon/phonemizer.
Wav2Vec 2.0 Large conformer - rel_pos (LV-60)
The dataset can be downloaded from tuannguyenvananh/libri-phone.
Corpus | No. Sentences | Perplexity |
---|---|---|
Training: $80$% | 8906 | 10.17 |
Testing: $20$% | 2241 | 10.47 |
Model | Greedy Decode PER (%) | Beam Search (with LM) Decode PER (%) |
---|---|---|
Teacher | $17.13$ | $22.31$ |
Student | $12.66$ | $25.45$ |
The results show that:
No. | Ground Truth | Prediction | PER (%) |
---|---|---|---|
1 | w aɪ ə t ʌ ŋ ɪ m p ɹ ɛ s d w ɪ ð h ʌ n i f ɹ ʌ m ɛ v ɹ i w ɪ n d | w aɪ ə t t ɑː ŋ ɪ m p ɹ ɛ s t w ɪ ð h ʌ n i f ɹ ʌ m ɛ v ɹ i w ɪ n d ɪ | $12.5$ |
2 | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | $0$ |
3 | ɔ l ɪ z s ɛ d w ɪ ð aʊ t ə w ə d | ɔ l w ɪ z s ɛ d w ɪ ð aʊ t ə w ə d | $6.25$ |
4 | aɪ s ɪ t b ə n i θ ð aɪ l ʊ k s æ z t ʃ ɪ l d ɹ ə n d uː ɪ n ð ə n uː n s ʌ n w ɪ ð s oʊ l z ð æ t t ɹ ɛ m b l θ ɹ uː ð ɛ ɹ h æ p i aɪ l ɪ d z f ɹ ʌ m ə n ʌ n ə v ə d j ɛ t p ɹ ɑː d ɪ ɡ l ɪ n w ɚ d d ʒ ɔɪ | aɪ s ɪ t b ə n i θ aɪ l ʊ k s æ z t ʃ ɪ l d ɹ ə n d uː ɪ n ð ə n uː n s ʌ n w ɪ ð s oʊ l z ð æ t ɹ ɛ m b l θ ɹ uː ð ɛ ɹ h æ p i aɪ ə l ɪ z f ɹ ʌ m ə n ʌ n ʌ v ə d j ɛ t p ɹ ɑː n ɪ k l ɪ n w ɚ d d ʒ ɔɪ ə ɪ ɪ | $10.2$ |
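The PER column above is the phoneme error rate: the Levenshtein edit distance (substitutions, insertions, deletions) between the ground-truth and predicted phoneme sequences, divided by the reference length. A minimal sketch (function name is illustrative, not from the repository):

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = edit distance / reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic one-row dynamic-programming Levenshtein distance.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (or match)
            prev = cur
    return 100.0 * dp[-1] / len(ref)

gt = "ɔ l ɪ z s ɛ d w ɪ ð aʊ t ə w ə d"
pred = "ɔ l w ɪ z s ɛ d w ɪ ð aʊ t ə w ə d"
print(phoneme_error_rate(gt, pred))  # -> 6.25, matching example 3 above
```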
Information | Value |
---|---|
Grapheme | why an ear a whirlpool fierce to draw creations in |
IPA Phoneme | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n |
Phoneme sequence length | 32 |
Path to LibriSpeech test-clean | test-clean/908/157963/908-157963-0030.flac |
To build error-checking test cases, we replace one word of the sample sentence at a time, regenerate the IPA phoneme sequence, run the student model's prediction, and check for errors with the longest common subsequence (LCS) algorithm.
IPA sequences are generated with bootphon/phonemizer, and the audio is synthesized with NVIDIA FastPitch's text-to-speech API.
Example Name | Grapheme | Original Phoneme Sequence | Phoneme Sequence Length |
---|---|---|---|
error_1 | why an ear a weir pool fierce to draw creations in | w aɪ ə n ɪ ɹ ə w ɪ ɹ p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 32 |
error_2 | why an ear a whirlpool fear to draw creations in | w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 31 |
error_3 | why an ear a whirl pole fierce to draw creations in | w aɪ ə n ɪ ɹ ə w ə l p oʊ l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | 32 |
The above dataset can be downloaded from tuannguyenvananh/aped-sample.
The LCS algorithm finds the longest common subsequence between the reference sample and the predicted sample; phonemes left unmatched are flagged as pronunciation errors.
Example Name | Predicted Phoneme Sequence | Phonemes unmatched by LCS | Predicted Sequence Length | PER (%) | Result |
---|---|---|---|---|---|
error_1 | w aɪ ə n ɪ ɹ ə w ɪ ɹ p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | ɪ, ɹ | 32 | $6.25$ | Correct |
error_2 | w aɪ ə n ɪ ɹ ə w ə k l f ɪ ɹ t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | k | 29 | $12.5$ | Wrong |
error_3 | w aɪ ə n ɪ ɹ ə w ə l p oʊ l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n | oʊ | 32 | $3.125$ | Correct |
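The LCS matching step can be sketched as follows: compute the standard LCS dynamic-programming table, backtrack to mark which predicted phonemes belong to the longest common subsequence, and report the rest as errors. This is a minimal sketch of the technique, not the repository's exact implementation; the error_3 row above is reproduced as the usage example.

```python
def lcs_table(a, b):
    """Standard dynamic-programming LCS length table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp

def unmatched_phonemes(reference, predicted):
    """Predicted phonemes that are not part of the LCS with the reference."""
    a, b = reference.split(), predicted.split()
    dp = lcs_table(a, b)
    matched = set()
    i, j = len(a), len(b)
    while i > 0 and j > 0:  # backtrack through the table
        if a[i - 1] == b[j - 1]:
            matched.add(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return [p for k, p in enumerate(b) if k not in matched]

ref = "w aɪ ə n ɪ ɹ ə w ə l p uː l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n"
pred = "w aɪ ə n ɪ ɹ ə w ə l p oʊ l f ɪ ɹ s t ə d ɹ ɔ k ɹ i eɪ ʃ ə n z ɪ n"
print(unmatched_phonemes(ref, pred))  # -> ['oʊ'], as in the error_3 row
```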
```bash
conda create -n dev -c pytorch-nightly -c nvidia -c pytorch -c conda-forge python=3.9 pytorch torchaudio cudatoolkit pandas numpy
pip install fairseq transformers torchsummary datasets evaluate torch-summary jiwer wandb matplotlib
wandb login <token>
```
The pretrained checkpoint can be downloaded from tuannguyenvananh/nst-pretrained-model.
The final report (in Vietnamese) is available here: graduation-thesis_v3.pdf.