KcBERT Versions

🤗 Pretrained BERT model & WordPiece tokenizer trained on Korean comments, together with the pretraining dataset

v2022.3Q

1 year ago

Quarterly Aggregated Korean News Comments Dataset: v2022.3Q

Dataset Spec

  • v2022.3Q = 2022 Q3 release
  • Includes datasets v2019.1Q through v2022.3Q
  • Total lines (excluding blank lines): 345,452,030
  • Date range: 2019.01 ~ 2022.09

Differences from TrainData_v1

  • Comments and replies in the same thread are separated by a single linebreak (\n)
  • Different threads are separated by a blank line (\n\n)
  • Duplicate comments within the same day are removed (only the first one is kept)
  • No other cleaning is applied; texts are kept as raw as possible
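
The thread layout described above (one comment per line, a blank line between threads) can be read back with a small streaming parser. The following is only a minimal sketch: the function name iter_threads and the sample text are hypothetical, and it assumes that a comment never contains a newline and that thread separators are empty or whitespace-only lines.

```python
from typing import Iterable, Iterator, List


def iter_threads(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into threads; a blank line closes the current thread.

    Hypothetical helper, not part of the KcBERT repository.
    """
    thread: List[str] = []
    for line in lines:
        comment = line.rstrip("\r\n")
        if comment.strip():
            # Non-blank line: a comment or reply in the current thread.
            thread.append(comment)
        elif thread:
            # Blank line: the current thread is complete.
            yield thread
            thread = []
    if thread:
        # Last thread when the input does not end with a blank line.
        yield thread


if __name__ == "__main__":
    # Made-up sample in the documented layout, not real dataset content.
    sample = "first comment\nreply to first\n\nsecond thread\n"
    for comments in iter_threads(sample.splitlines(keepends=True)):
        print(len(comments), comments)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the full corpus does not need to be loaded into memory.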

TrainData_v1

3 years ago

To make the dataset previously published on Kaggle easier to download, it is released here as a split archive in three parts (2G / 2G / 0.6G) :)

( Pretrain dataset release: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments )

Download all three kcbert-train.tar.gz parts below (aa, ab, ac), then run the following command in the same folder to extract them.

cat kcbert-train.tar.gz* | tar -zxvpf -
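
Where cat and tar are not available (for example on a plain Windows shell), the same join-then-extract step can be approximated in Python. This is a sketch under assumptions, not the author's method: extract_split_archive and _ConcatReader are hypothetical names, the parts are assumed to sort correctly by file name (aa, ab, ac), and the "data" extraction filter, used when the running Python provides it, does not reproduce the -p permission-preserving behaviour of the command above.

```python
import glob
import os
import tarfile


class _ConcatReader:
    """Minimal file-like object that reads several files back to back."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._current = None

    def read(self, size=-1):
        # May return fewer bytes than requested; b"" only at the very end.
        while True:
            if self._current is None:
                if not self._paths:
                    return b""
                self._current = open(self._paths.pop(0), "rb")
            data = self._current.read(size)
            if data:
                return data
            self._current.close()
            self._current = None

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_split_archive(pattern, dest="."):
    """Join the parts matching `pattern` in name order and extract them."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no archive parts match {pattern!r}")
    os.makedirs(dest, exist_ok=True)
    reader = _ConcatReader(parts)
    try:
        # "r|gz" is tarfile's streaming mode, so no joined copy is written.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            # Use the safer "data" filter where this Python supports it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(dest, **kwargs)
    finally:
        reader.close()
    return parts


if __name__ == "__main__":
    extract_split_archive("kcbert-train.tar.gz*")
```

Streaming the parts avoids writing a joined copy of the archive to disk before extraction.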