🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Model Files
You can use the following assets: https://github.com/daac-tools/vaporetto/releases/tag/v0.5.0
This release fixes the following bug:
This software contains the results of joint research with the National Institute for Japanese Language and Linguistics (NINJAL).
We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.
All of these models are trained with L1-regularization.
See below for license terms of each model.
(NOTE) Parts of BCCWJ are excluded from the training data for rights reasons.
We provide models containing UniDic. These models have the highest accuracy in our distributions.
bccwj-suw+unidic+tag.model.zst
: contains a tag prediction model. Tags are trained using BCCWJ only.

bccwj-suw+unidic+tag-huge.model.zst
: contains a tag prediction model. Tags are trained using BCCWJ and UniDic.

We also provide models that do not contain UniDic. These models come in four sizes and two word units; the Tiny size is available for SUW only.
| | Short unit words (SUW) | Long unit words (LUW) |
|---|---|---|
| Tiny (C=0.003) | bccwj-suw-tiny.model.zst | N/A |
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |
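
The table above maps a word unit and a model size to a file name and a C value. As a rough illustration, that lookup could be sketched in Rust as follows. The `WordUnit` and `Size` enums and the `c_value` and `model_file_name` functions are hypothetical names introduced here; they are not part of the Vaporetto API.

```rust
/// Word unit of a model. Hypothetical type, used only in this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WordUnit {
    Suw, // short unit words
    Luw, // long unit words
}

/// Model size as named in the table. Hypothetical type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Size {
    Tiny,
    Small,
    Middle,
    Large,
}

/// Returns the C value that the table lists next to each size.
pub fn c_value(size: Size) -> f64 {
    match size {
        Size::Tiny => 0.003,
        Size::Small => 0.1,
        Size::Middle => 0.5,
        Size::Large => 1.0,
    }
}

/// Returns the file name for a word unit and size, or `None` for the
/// combination the table marks as N/A (tiny, long unit words).
pub fn model_file_name(unit: WordUnit, size: Size) -> Option<String> {
    if let (WordUnit::Luw, Size::Tiny) = (unit, size) {
        return None;
    }
    let unit_part = match unit {
        WordUnit::Suw => "suw",
        WordUnit::Luw => "luw",
    };
    let size_part = match size {
        Size::Tiny => "tiny",
        Size::Small => "small",
        Size::Middle => "middle",
        Size::Large => "large",
    };
    Some(format!("bccwj-{}-{}.model.zst", unit_part, size_part))
}
```

Returning `Option<String>` keeps the N/A cell explicit, so a caller has to handle the missing tiny LUW model rather than building a file name that does not exist in the release.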
The following models are licensed under 3-Clause BSD License.
bccwj-suw+unidic+tag.model.zst
bccwj-suw+unidic+tag-huge.model.zst
The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.
bccwj-suw-small.model.zst
bccwj-suw-middle.model.zst
bccwj-suw-large.model.zst
bccwj-luw-small.model.zst
bccwj-luw-middle.model.zst
bccwj-luw-large.model.zst
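
If a tool needs to show the license that applies to a downloaded model, the two lists above can be encoded as a small lookup. The sketch below is illustrative only: the `License` enum and the `license_for` function are hypothetical names, and any file name that does not appear in the lists above (including `bccwj-suw-tiny.model.zst`) deliberately returns `None` rather than a guessed license.

```rust
/// License groups named in the lists above. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum License {
    /// 3-Clause BSD License
    Bsd3Clause,
    /// Apache License (Version 2.0) or MIT License, at the user's option
    ApacheOrMit,
}

/// Looks up the license for a model file name.
/// Returns `None` for names that the lists above do not mention.
pub fn license_for(file_name: &str) -> Option<License> {
    match file_name {
        "bccwj-suw+unidic+tag.model.zst"
        | "bccwj-suw+unidic+tag-huge.model.zst" => Some(License::Bsd3Clause),
        "bccwj-suw-small.model.zst"
        | "bccwj-suw-middle.model.zst"
        | "bccwj-suw-large.model.zst"
        | "bccwj-luw-small.model.zst"
        | "bccwj-luw-middle.model.zst"
        | "bccwj-luw-large.model.zst" => Some(License::ApacheOrMit),
        _ => None,
    }
}
```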
We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.
All of these models are trained with L1-regularization.
See below for license terms of each model.
(NOTE) Parts of BCCWJ are excluded from the training data for rights reasons.
We provide two models containing UniDic. These models have the highest accuracy in our distributions.
bccwj-suw+unidic+tag.model.zst
: contains a tag prediction model

bccwj-suw+unidic.model.zst
: does not contain a tag prediction model

We also provide models that do not contain UniDic. These models come in four sizes and two word units; the Tiny size is available for SUW only.
| | Short unit words (SUW) | Long unit words (LUW) |
|---|---|---|
| Tiny (C=0.003) | bccwj-suw-tiny.model.zst | N/A |
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |
The following models are licensed under 3-Clause BSD License.
bccwj-suw+unidic+tag.model.zst
bccwj-suw+unidic.model.zst
The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.
bccwj-suw-small.model.zst
bccwj-suw-middle.model.zst
bccwj-suw-large.model.zst
bccwj-luw-small.model.zst
bccwj-luw-middle.model.zst
bccwj-luw-large.model.zst
We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.
All of these models are trained with L1-regularization.
See below for license terms of each model.
(NOTE) Parts of BCCWJ are excluded from the training data for rights reasons.
We provide two models containing UniDic. These models have the highest accuracy in our distributions.
bccwj-suw+unidic+tag.model.zst
: contains a tag prediction model

bccwj-suw+unidic.model.zst
: does not contain a tag prediction model

We also provide models that do not contain UniDic. These models have been trained over three model sizes and two word units.
| | Short unit words (SUW) | Long unit words (LUW) |
|---|---|---|
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |
The following models are licensed under 3-Clause BSD License.
bccwj-suw+unidic+tag.model.zst
bccwj-suw+unidic.model.zst
The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.
bccwj-suw-small.model.zst
bccwj-suw-middle.model.zst
bccwj-suw-large.model.zst
bccwj-luw-small.model.zst
bccwj-luw-middle.model.zst
bccwj-luw-large.model.zst