Awesome Bert Japanese Save

📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

Project README

awesome-bert-japanese

日本語の学習済み BERT は文から単語への分かち書き，単語からサブワードへの分割の処理にいくつかの選択肢が存在します．また，単語をサブワードに分割する際に利用する語彙についても構築方法に数種類のバリエーションがあります．

本リポジトリでは，公開されている学習済み BERT モデルについて，分かち書き・サブワード分割・語彙構築アルゴリズムそれぞれどのアルゴリズムが採用されているかを表にまとめています．

A list of pre-trained BERT models for Japanese. Japanese is a complicated language; which doesn't have any word boundaries and has many kind of characters. Therefore, it requires word segmentation before tokenizing word into subwords. I summarize pretrained BERT models for Japanese by word segmentation algorithm, subword tokenization algorithm, and algorithm for constructing vocabulary used in subword tokenization.

Model

Model	Sentence -> Words	Word -> Subword	Algorithm for constructing vocabulary used in subword tokenization
Google (Multilingual BERT)	Whitespace	WordPiece	BPE?
Kikuta	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Hotto Link Inc.	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Kyoto University	Juman++ (JUMANDIC?)	WordPiece	subword-nmt (BPE)
Stockmark Inc. (a)	MeCab (mecab-ipadic-neologd)	--	--
Tohoku University (a)	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Tohoku University (b)	MeCab (mecab-ipadic)	Character	Sentencepiece (model_type=character)
NICT (a)	MeCab (mecab-jumandic)	WordPiece	subword-nmt (BPE)
NICT (b)	MeCab (mecab-jumandic)	---	---
akirakubo (a)	MeCab (unidic-cwj) for Wikipedia and Aozora bunko written in `新仮名` + MeCab (unidic_qkana) for Aozora bunko written in `旧仮名`	WordPiece	subword-nmt (BPE)
akirakubo (b)	SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora bunko written in `新仮名` + MeCab (unidic_qkana) for Aozora bunko written in `旧仮名`	WordPiece	subword-nmt (BPE)
The University of Tokyo	MeCab (mecab-ipadic-neologd + user dic (J-MeDic)	WordPiece	? (BPE)
Laboro.AI Inc.	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Bandai Namco Research Inc.	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Retrieva, Inc.	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Waseda University	Juman++ (JUMANDIC)	WordPiece	Sentencepiece (model_type=unigram)
LINE Corp.	MeCab (mecab-unidic)	WordPiece	Sentencepiece (model_type=bpe)
Stockmark Inc. (b)	MeCab (mecab-ipadic-neologd)	WordPiece	Sentencepiece (model_type=?)

NICT: National Institute of Information and Communications Technology
without word segmentation: 文を単語に分割せず直接サブワードへ分割する
For models by Tohoku University, MeCab+mecab-ipadic-neologd is used for sentence segmentation (thanks @ikuyamada san!)
For models by akirakubo, documents in Aozora bunko are classified into two categories. It is based on types of kana spelling. (thanks @kkadowa san and @akirakubo san!
- See also: https://github.com/akirakubo/bert-japanese-aozora/issues/1#issuecomment-667495267
For DistilBERT (by Bandai Namco Resean Inc.), the same word segmentation and algorithm for constructing vocabulary are used both for teacher/studen models.

Reference

Google (Multilingual BERT) (2018/11): https://github.com/google-research/bert/blob/master/multilingual.md
Kikuta (2019/01): https://yoheikikuta.github.io/bert-japanese/
Hotto Link Inc. (2019/03): https://www.hottolink.co.jp/blog/20190311_101674/
Kyoto University (2019/03): http://nlp.ist.i.kyoto-u.ac.jp/bert
Stockmark Inc. (a) (2019/04): https://qiita.com/mkt3/items/3c1278339ff1bcc0187f
Tohoku University (a,b) (2019/12): https://github.com/cl-tohoku/bert-japanese
NICT (2020/03): https://alaginrc.nict.go.jp/nict-bert/index.html
akirakubo (2020/03): https://github.com/akirakubo/bert-japanese-aozora
The University of Tokyo (2020/03): https://ai-health.m.u-tokyo.ac.jp/uth-ber
Laboro.AI Inc. (2020/04): https://laboro.ai/column/laboro-bert/
Bandai Namco Research Inc. (2020/04): https://github.com/BandaiNamcoResearchInc/DistilBERT-base-jp
Retrieva, Inc. (2021/04): https://tech.retrieva.jp/entry/2021/04/01/114943
Waseda University (2021/12): https://huggingface.co/nlp-waseda/roberta-base-japanese
LINE Corp. (2023/03): https://engineering.linecorp.com/ja/blog/line-distilbert-high-performance-fast-lightweight-japanese-language-model
Stockmark Inc. (b) (2020/02): https://qiita.com/mkt3/items/b41dcf0185e5873f5f75

Open Source Agenda is not affiliated with "Awesome Bert Japanese" Project. README Source: himkt/awesome-bert-japanese

Stars

129

Open Issues

Last Commit

1 year ago

Repository

himkt/awesome-bert-japanese

Open Source Agenda Badge

<a href="https://www.opensourceagenda.com/projects/awesome-bert-japanese"><img src="https://www.opensourceagenda.com/projects/awesome-bert-japanese/reviews/badge.svg" alt="Open Source Agenda"></a>

Submit Review Review Your Favorite Project

Submit Resource Articles, Courses, Videos

Submit Article Submit a post to our blog

From the blog

Dec 11, 2022

How to Choose Which Programming Language to Learn First?

From the blog

Dec 11, 2022