📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
Pre-trained BERT models for Japanese differ in how they split a sentence into words (word segmentation) and in how they split words into subwords (subword tokenization). The vocabulary used for subword tokenization can also be constructed in several ways.
Japanese text has no explicit word boundaries and mixes several scripts, so a word segmentation step is typically applied before words are split into subwords.
This repository summarizes, for each publicly available pre-trained BERT model for Japanese, the word segmentation algorithm, the subword tokenization algorithm, and the algorithm used to construct the subword vocabulary.
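
The two-stage pipeline (sentence to words, then words to subwords) can be sketched as follows. This is a minimal illustration of greedy longest-match-first WordPiece splitting, not the code of any model in the table. The segmented words and the toy vocabulary are made up for the example; in practice a segmenter such as MeCab or Juman++ would produce the words.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subwords."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(words: List[str], vocab: Set[str]) -> List[str]:
    """Apply WordPiece to words that a segmenter has already produced."""
    return [piece for word in words for piece in wordpiece_tokenize(word, vocab)]


# Hypothetical segmenter output for "東京都に住む" and a toy vocabulary.
words = ["東京", "都", "に", "住む"]
vocab = {"東", "##京", "都", "に", "住", "##む"}
print(tokenize(words, vocab))
```

Because subword splitting runs on each word separately, no subword can cross a word boundary chosen by the segmenter.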
Model | Sentence -> Words | Word -> Subword | Algorithm for constructing vocabulary used in subword tokenization |
---|---|---|---|
Google (Multilingual BERT) | Whitespace | WordPiece | BPE? |
Kikuta | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
Hotto Link Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
Kyoto University | Juman++ (JUMANDIC?) | WordPiece | subword-nmt (BPE) |
Stockmark Inc. (a) | MeCab (mecab-ipadic-neologd) | -- | -- |
Tohoku University (a) | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
Tohoku University (b) | MeCab (mecab-ipadic) | Character | Sentencepiece (model_type=character) |
NICT (a) | MeCab (mecab-jumandic) | WordPiece | subword-nmt (BPE) |
NICT (b) | MeCab (mecab-jumandic) | -- | -- |
akirakubo (a) | MeCab (unidic-cwj) for Wikipedia and Aozora Bunko texts in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora Bunko texts in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
akirakubo (b) | SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko texts in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora Bunko texts in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
The University of Tokyo | MeCab (mecab-ipadic-neologd + user dic (J-MeDic)) | WordPiece | ? (BPE) |
Laboro.AI Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
Bandai Namco Research Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
Retrieva, Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
Waseda University | Juman++ (JUMANDIC) | WordPiece | Sentencepiece (model_type=unigram) |
LINE Corp. | MeCab (mecab-unidic) | WordPiece | Sentencepiece (model_type=bpe) |
Stockmark Inc. (b) | MeCab (mecab-ipadic-neologd) | WordPiece | Sentencepiece (model_type=?) |
without word segmentation: the sentence is split directly into subwords, without first being segmented into words.
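
To illustrate what splitting a sentence directly into subwords can look like, here is a toy sketch of unigram-style segmentation: given a log-probability for each vocabulary piece, it picks the highest-scoring split of the raw sentence with dynamic programming (Viterbi). This is an assumption-laden illustration, not SentencePiece's implementation; the function name `unigram_segment` and the scores in `logp` are invented for the example.

```python
import math
from typing import Dict, List


def unigram_segment(text: str, logp: Dict[str, float]) -> List[str]:
    """Return the highest-scoring split of `text` into vocabulary pieces (Viterbi)."""
    n = len(text)
    max_len = max(len(piece) for piece in logp)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be covered by the vocabulary")
    pieces = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the sentence
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Toy log-probabilities, invented for illustration only.
logp = {
    "東京": -2.0, "東": -4.0, "京": -4.0, "都": -3.0,
    "京都": -2.5, "に": -1.5, "住": -3.5, "む": -3.0, "都に": -3.5,
}
print(unigram_segment("東京都に住む", logp))
```

With these toy scores the best split contains the piece "都に", which spans what a word segmenter would treat as two separate words. That is the practical difference from the models in the table that segment into words first.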