Pre-processing of the KsponSpeech corpus (a Korean speech dataset) provided by AI Hub.
It has been a while since KsponSpeech was released, but comparing performance across papers is difficult because there is no established pre-processing method. We are therefore releasing the pre-processing method we used in the KoSpeech project.
KsponSpeech-preprocess is a repository for pre-processing the KsponSpeech corpus provided by AI Hub.
The KsponSpeech corpus is a 1,000-hour Korean speech corpus provided by AI Hub in Korea.
Anyone can download the dataset simply by applying. The transcription rules can be seen here.
You can pre-process into various output units, such as character, subword, and grapheme.
We explain the details in the Output-Unit part below.
pip install pandas
(Refer here for problems installing pandas)
pip install sentencepiece
(Refer here for problems installing sentencepiece)
python main.py --dataset_path $DATASET_PATH --vocab_dest $VOCAB_DEST --output_unit $OUTPUT_UNIT --preprocess_mode $PREPROCESS_MODE --vocab_size $VOCAB_SIZE
$ ./run.sh
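A `run.sh` wrapping the command above might look like the following sketch. All values here are placeholders for illustration, not the repository's actual script; adjust the paths and options to your setup.

```shell
#!/bin/bash
# Placeholder values -- replace with your own paths and settings.
DATASET_PATH="/path/to/KsponSpeech"   # hypothetical dataset location
VOCAB_DEST="./vocab"                  # where the vocabulary file is written
OUTPUT_UNIT="character"               # character | subword | grapheme
PREPROCESS_MODE="phonetic"            # phonetic | spelling
VOCAB_SIZE=5000                       # used for subword training

python main.py \
  --dataset_path "$DATASET_PATH" \
  --vocab_dest "$VOCAB_DEST" \
  --output_unit "$OUTPUT_UNIT" \
  --preprocess_mode "$PREPROCESS_MODE" \
  --vocab_size "$VOCAB_SIZE"
```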
You can choose between phonetic transcription and spelling transcription when pre-processing.
Raw data:
b/ (70%)/(칠 십 퍼센트) 확률이라니 아/ (뭐+ 뭔)/(모+ 몬) 소리야 진짜 (100%)/(백 프로)가 왜 안돼? n/
After removing noise labels (b/, n/):
(70%)/(칠 십 퍼센트) 확률이라니 아/ (뭐+ 뭔)/(모+ 몬) 소리야 진짜 (100%)/(백 프로)가 왜 안돼?
After removing the interjection (/) and repetition (+) markers:
(70%)/(칠 십 퍼센트) 확률이라니 아 (뭐 뭔)/(모 몬) 소리야 진짜 (100%)/(백 프로)가 왜 안돼?
Phonetic transcript:
칠 십 퍼센트 확률이라니 아 모 몬 소리야 진짜 백 프로가 왜 안돼?
Spelling transcript:
70% 확률이라니 아 뭐 뭔 소리야 진짜 100%가 왜 안돼?
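The cleaning steps above can be sketched in a few lines of Python. This is a minimal sketch inferred from the example, not the project's actual implementation; in particular, the noise-label pattern and the set of special markers are assumptions. Dual transcriptions are assumed to take the form `(spelling)/(phonetic)`.

```python
import re

def preprocess(sentence: str, mode: str = "phonetic") -> str:
    """Clean a raw KsponSpeech transcript (sketch; rules are assumptions)."""
    # Remove noise labels such as "b/" and "n/" (assumed label format).
    sentence = re.sub(r"\b[a-z]/", "", sentence)
    # Resolve dual transcriptions "(spelling)/(phonetic)":
    # keep group 1 for spelling mode, group 2 for phonetic mode.
    group = 1 if mode == "spelling" else 2
    sentence = re.sub(r"\(([^()]*)\)/\(([^()]*)\)",
                      lambda m: m.group(group), sentence)
    # Drop remaining special markers: "/" after interjections, "+" for repeats.
    sentence = re.sub(r"[+*/]", "", sentence)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", sentence).strip()
```

Running it on the raw example above reproduces the phonetic and spelling transcripts shown.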
This project provides processing in character, subword, and grapheme units. For the sentence:
아 모 몬 소리야 칠 십 퍼센트 확률이라니
Subword unit:
▁아 ▁모 ▁ 몬 ▁소리 야 ▁ 칠 ▁ 십 ▁퍼 센트 ▁확 률 이라 니
Grapheme unit:
ㅇㅏ ㅁㅗ ㅁㅗㄴ ㅅㅗㄹㅣㅇㅑ ㅊㅣㄹ ㅅㅣㅂ ㅍㅓㅅㅔㄴㅌㅡ ㅎㅘㄱㄹㅠㄹㅇㅣㄹㅏㄴㅣ
Character unit (text and the corresponding vocabulary indices):
아 모 몬 소리야 칠 십 퍼센트 확률이라니
7 3 106 3 730 3 173 32 26 3 319 3 120 3 490 552 157 3 315 747 5 33 22
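Grapheme output decomposes each Hangul syllable into its jamo. A minimal sketch of such a decomposition using the standard Unicode arithmetic for precomposed Hangul syllables follows; the project's actual implementation may differ.

```python
# Compatibility jamo tables: 19 initials, 21 medials, and 27 finals
# (index 0 of the final table is empty, for syllables with no final consonant).
CHOSUNG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSUNG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def to_graphemes(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # 19 * 21 * 28 precomposed syllables
            cho, rest = divmod(code, 588)  # 588 = 21 * 28
            jung, jong = divmod(rest, 28)
            out.append(CHOSUNG[cho] + JUNGSUNG[jung] + JONGSUNG[jong])
        else:
            out.append(ch)  # keep spaces and non-Hangul characters as-is
    return "".join(out)
```

For example, `to_graphemes("아 모 몬 소리야")` yields `ㅇㅏ ㅁㅗ ㅁㅗㄴ ㅅㅗㄹㅣㅇㅑ`, matching the grapheme-unit output above. The subword output is produced with the SentencePiece library installed earlier.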
If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please visit our gitter channel or contact [email protected].
We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes or documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.