Open Korean NLP Dataset Curation for the Users All Around the Globe
We will participate in PACLIC 37 and will present our paper "Revisiting Korean Corpus Studies through Technological Advances". If you visit PolyU@HK 3 December, please drop by!
We will be in the live session and monitoring the slide chat during EMNLP 2020. If you have any questions or would simply want to drop by to say hello, please drop by!
Multiple government-funded institutions create datasets for the Korean language
Our work focuses on curating open Korean corpora under the following criteria:
For documentation status Docu. the following holds.
For License, we check the followings:
The table below describes the open Korean corpora investigated so far. To be updated along with our survey or PR. You can visit Here for the Korean description, and more information regarding government-driven database. We removed the hyperlink for pages not working.
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
BM-1 | KLUE | Benchmark studies | Industry | int'l | all | rd | 0 | DP - 15K (s), DST - 10K (d), MRC - 29K (p), NER - 31K (s), NLI - 30K (p), RE - 48K (s), STS - 13K (p), TC - 64K (s) | Eval | ko | 2021 |
BM-2 | KoBEST | Benchmark studies | Industry | int'l | all | rd | 0 | BoolQ - 6K (p), COPA - 5K (s), KB-WiC - 6K (s), KB-HellaSwag - 3K (p & s), SentiNeg - 4K (s) | Eval | ko | 2022 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
PT-1 | KAIST Morpho-Syntactically Annotated Corpus | Morphological analysis | Academia | paper | academic | no | 0 | 70M (w) | - | ko | 1999 |
PT-2 | OpenKorPOS | Morphological analysis | Academia | int'l | all | rd | 0 | 55M (w) | - | ko | 2022 |
PT-3 | KAIST Korean Tree-Tagging Corpus | Tree parsing | Academia | int'l | academic | no | 0 | 30K (s) | - | ko | 1994 |
PT-4 | UD Korean KAIST | Dependency parsing | Academia | int'l | academic | no | 1 | 27K (s) | - | ko | 2018 |
PT-5 | PKT-UD | Dependency parsing | Academia | int'l | academic | no | 0 | 5K (s) | - | ko | 2018 |
PT-6 | KMOU NER | NER | Academia | paper | academic | rd | 0 | 24K (s) | - | ko | 2019 |
PT-7 | AIR x NAVER NER | NER | Competition | docu | academic | no | 0 | 90K (s) | - | ko | 2018 |
PT-8 | AIR x NAVER SRL | SRL | Competition | docu | academic | no | 0 | 35K (s) | - | ko | 2018 |
PT-9 | KoNEC & KoNNEC | NER | Academia | docu | academic | no | 0 | 26K(s) | - | ko | 2022 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
ESP-1 | Question Pair | Paraphrase detection | Academia | docu | all | rd | 0 | 10K (p) | - | ko | 2018 |
ESP-2 | KorNLI | NLI | Industry | int'l | all | rd | 0 | 1,000K (p) | - | ko | 2020 |
ESP-3 | KorSTS | STS | Industry | int'l | all | rd | 0 | 8,500K (p) | - | ko | 2020 |
ESP-4 | ParaKQC | Paraphrase detection | Academia | int'l | all | rd | 0 | 540K (p) | - | ko | 2020 |
ESP-5 | StyleKQC | Paraphrase detection | Academia | int'l | all | rd | 0 | 30K(s) | - | ko | 2021 |
ESP-6 | Korean Smile Style Dataset | Paraphrase detection | Industry | docu | academic | rd | 0 | 2,5K(d) | - | ko | 2022 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
ISO-1 | 3i4K | Speech act classification | Academia | int'l | all | rd | 0 | 55K / 6K (s) | - | ko | 2019 |
ISO-2 | NSMC | Sentiment analysis | Academia | docu | all | rd | 0 | 150K / 50K (s) | - | ko | 2015 |
ISO-3 | Kocasm | Sentiment analysis | Academia | docu | all | rd | 0 | 9K (s) | - | ko | 2019 |
ISO-4 | BEEP! | Hate speech detection | Academia | int'l | all | rd | 0 | 8K / 0.5K / 1K (s) | - | ko | 2020 |
ISO-5 | APEACH | Hate speech detection | Academia | int'l | all | rd | 0 | 4K | Eval | ko | 2022 |
ISO-6 | Unsmile | Hate speech detection | Industry | docu | academic | rd | 0 | 19K | - | ko | 2022 |
ISO-7 | HateScore | Hate speech detection | Academia | int'l | academic | rd | 1 | 35K | - | ko | 2022 |
ISO-8 | KOLD | Hate speech detection | Academia | int'l | all | rd | 0 | 40K | - | ko | 2022 |
ISO-9 | K-MHaS | Hate speech detection | Academia | int'l | all | rd | 0 | 109K | - | ko | 2022 |
ISO-10 | KODOLI | Hate speech detection | Academia | int'l | all | rd | 0 | 38K | - | ko | 2023 |
ISO-11 | DKTC | Hate speech detection | Industry | docu | academic | rd | 0 | 1.5K | - | ko | 2022 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
QAD-1 | KorQuAD 1.0 | QA | Industry | int'l | all | rd | 1 | 60K / 5K / 4K (p) | - | ko | 2018 |
QAD-2 | KorQuAD 2.0 | QA | Industry | paper | all | rd | 1 | 80K / 10K / 10K (p) | - | ko | 2019 |
QAD-3 | HuLiC | Dialog | Industry | docu | academic | rd | 0 | 115K | - | ko | 2022 |
QAD-4 | OPELA | Dialog | Industry | int'l | academic | rd | 0 | 560K (d) | - | ko | 2022 |
QAD-5 | CareCall | Dialog | Industry | int'l | academic | rd | 0 | 10K | - | ko | 2022 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
STT-1 | Sci-news-sum-kr | Summarization | Academia | docu | academic | rd | 0 | 50K (p) | Eval | ko | 2016 |
STT-2 | sae4K | Summarization | Academia | int'l | all | rd | 0 | 50K (p) | - | ko | 2020 |
STT-3 | Korean Parallel Corpora | MT | Academia | int'l | all | rd | 1 | 100K (p) | - | ko, en | 2016 |
STT-4 | KAIST Translation Evaluation Set | MT | Academia | docu | academic | no | 0 | 3K (p) | Eval | ko, en | 2000 |
STT-5 | KAIST Chinese-Korean Multilingual Corpus | MT | Academia | docu | academic | no | 0 | 60K (p) | - | ko, zh | 2000 |
STT-6 | Transliteration Dataset | Transliteration | Academia | docu | all | rd | 0 | 35K (p) | - | ko, en | 2016 |
STT-7 | KAIST Transliteration Evaluation Set | Transliteration | Academia | docu | academic | no | 0 | 7K (p) | Eval | ko, en | 2000 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
KM-1 | SIGMORPHON G2P | G2P conversion | Competition | int'l | all | rd | 0 | 3.6K/ 0.45K / 0.45K (p) | - | ko, en, hy, bg, fr, ka, hi, hu, is, lt, el | 2020 |
KM-2 | PAWS-X | Paraphrase detection | Industry | int'l | all | rd | 0 | 5K / 2K / 2K (p) | - | ko, fr, es, de, zh, ja | 2019 |
KM-3 | TyDi-QA | QA | Industry | int'l | all | rd | 0 | 11K / 1,7K / 1,7K (p) | - | ko, en, ar, bn, fi, ja, id, sw, ru, te, th | 2020 |
KM-4 | XPersona | Dialog | Academia | int'l | all | rd | 0 | 0.3K(d) / 4.7K (s) | - | ko, en, it, fr, id, zh, ja | 2020 |
KM-5 | MultiCoNER | NER | Competition | int'l | all | rd | 0 | 178K / 2.6K (s) | - | ko, bn, de, en, es, fa, hi, nl, ru, tr, zh, multi, mix | 2022 |
KM-6 | Multilingual Tweet Intimacy Analysis | Sentiment analysis | Competition | int'l | unk | unk | 0 | 2K (instances) | - | ko, en, fr, es, it, pt, nl, zh, hi, ar | 2022 |
KM-7 | IWSLT 2023 | MT | Competition | int'l | all | rd | 0 | 3K (p) | - | ko, en, vi, pt, ru | 2023 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
SL-1 | KSS | ASR | Academia | docu | academic | rd | 0 | 12+ (h) / 13K (u) / 1 speaker | - | ko | 2018 |
SL-2 | Zeroth | ASR | Industry | docu | all | rd | 0 | 51+ (h) / 27K (s) / 46K (u) / 181 speakers | - | ko | 2017 |
SL-3 | ClovaCall | ASR | Industry | int'l | academic | no | 0 | 80+ (h) / 60K (u) / 11K speakers | - | ko | 2020 |
SL-4 | Pansori-TedXKR | ASR | Academia | int'l | academic | rd | 1 | 3+ (h) / 3K (u) / 41 speakers | - | ko | 2018 |
SL-5 | ProSem | SLU | Academia | int'l | all | rd | 0 | 6+ (h) / 3,500 (s) / 7K (u) / 2 speakers | - | ko | 2019 |
SL-6 | kosp2e | Speech translation | Industry | int'l | academic | rd | 0 | 39K(u) | - | ko, en | 2021 |
SL-7 | jejueo (JSS & JIT) | Other topics | Industry | int'l | all | rd | 0 | 10K (JSS), 170K (JIT) | - | ko | 2020 |
No | Dataset | Typical Usage | Provider | Docu. | License | Redist. | mod-x | Volume | Goal | Lang. | Year |
---|---|---|---|---|---|---|---|---|---|---|---|
ETC-1 | KoCHET | Other topics | Academia | int'l | academic | rd | 0 | NER 112K, RE 39K, ET 113K | - | ko | 2022 |
ETC-2 | KommonGen | Other topics | Academia | docu | all | rd | 0 | 79K(s) | - | ko | 2021 |
ETC-3 | LBox Open | Other topics | Academia | int'l | academic | rd | 0 | 150K | - | ko | 2022 |
ETC-4 | K2NLG | Other topics | Academia | int'l | academic | rd | 0 | 4K(s) | - | ko | 2020 |
ETC-5 | Korean Ambiguity Data | Other topics | Academia | int'l | all | rd | 0 | 35K | - | ko | 2023 |
ETC-6 | Korean GEC dataset | Other topics | Academia | int'l | academic | rd | 0 | 155K(s pair) | - | ko | 2022 |
To cite our work, please use the following: (Also available as cho-etal-2020-open
in anthology.bib)
@inproceedings{cho-etal-2020-open,
title = "Open {K}orean Corpora: A Practical Report",
author = "Cho, Won Ik and
Moon, Sangwhan and
Song, Youngsook",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.12",
pages = "85--93",
abstract = "Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterate through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.",
}
Please read the contributor guidelines before sending a pull request.