A Japanese Tokenizer for Business
This is a support release for the 3.1.0 release of the Elasticsearch/OpenSearch integration.
Config.fromResource method for reading Configs via PathAnchor. (#212)
Release v0.7.2 contains a subset of the functionality of this release but did not contain crucial features. It is not a broken release, but there are no user-visible changes from v0.7.1.
This is a maintenance release
This is a maintenance release
Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.
Tokenizer.tokenize API returns MorphemeList instead of List<Morpheme>. This change is ABI-incompatible with previous versions, so applications which use Sudachi require recompilation. The change should be source-compatible: no changes are required in source code which uses Sudachi.
MorphemeList.split: resplit a C-mode token sequence to a lower level without re-analyzing the whole string.
Use the maxLength field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
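Because the limit is currently counted in UTF-8 bytes, the default of 32 covers far fewer Japanese characters than ASCII ones. The sketch below illustrates the difference between the two units using only the JDK. It does not call Sudachi, and the class and method names (Utf8LengthDemo, utf8Bytes, codePoints) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: Utf8LengthDemo and its methods are hypothetical helpers,
// not part of Sudachi.
final class Utf8LengthDemo {
    // Length in UTF-8 bytes, the unit the release notes give for maxLength today.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points, the unit announced for the future.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "www.example.com";
        // Three kanji (Tokyo-to), written as escapes to avoid source-encoding issues.
        String kanji = "\u6771\u4EAC\u90FD";
        for (String s : new String[] { ascii, kanji }) {
            System.out.println(utf8Bytes(s) + " bytes, " + codePoints(s) + " code points");
        }
    }
}
```

Each of the three kanji takes 3 bytes in UTF-8, so the string counts as 9 bytes but only 3 code points. Under a byte-based limit of 32, roughly ten such characters fit.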
Config)
In addition to the command-line interface for building dictionaries, Sudachi now supports building them through an API. See the DicBuilder class and the CLI for usage examples. No Javadocs here yet.
Introduced a new typed configuration API. See the Config class. It supports flexible path resolution with respect to the classpath (with customizable prefixes and classloaders) and the filesystem. The dictionary creation API which uses the old Settings is deprecated.
The new configuration framework allows specifying some resources (dictionaries, character tables) as preloaded and prebuilt. For details on usage, see the Javadoc for the Config class.
It is now possible to specify POS tags for OOV providers which are not present in the dictionary. In that case, you must add "userPOS": "allow" to the OOV plugin configuration. POS tags must still have a six-layer structure.
"oovProviderPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      "oovPOS" : [ "この", "たぐ", "は", "ぞんざい", "しない", "よ" ],
      "userPOS" : "allow",
      "leftId" : 8,
      "rightId" : 8,
      "cost" : 6000 }
]
Introduced a new OOV provider which matches a regular expression.
Recommendations:
Use non-capturing groups (?:like this), but not capturing groups (like this)
Caveats:
Example for matching URLs:
{
    "class": "com.worksap.nlp.sudachi.RegexOovProvider",
    "leftId": 5968,
    "rightId": 5968,
    "cost": 19000,
    "regex": "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#¯−]+",
    "pos": [ "補助記号", "一般", "URL", "*", "*", "*" ],
    "userPOS": "allow"
}
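To see what the URL pattern above accepts before putting it into a configuration, it can be tried with plain java.util.regex. This is only a sketch: it does not use RegexOovProvider, and applying the pattern at the start of the input with lookingAt is an assumption made for the demo, not a statement about how the plugin applies it. UrlRegexDemo and matchPrefix are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, not Sudachi's RegexOovProvider.
final class UrlRegexDemo {
    // Same pattern as in the configuration example. The two non-ASCII
    // characters at the end of the character class are written as
    // Unicode escapes (U+00AF and U+2212).
    static final Pattern URL = Pattern.compile(
            "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#\u00AF\u2212]+");

    // Returns the prefix of text matched by the pattern, or null if the
    // text does not start with a match.
    static String matchPrefix(String text) {
        Matcher m = URL.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match stops at the first character outside the class (the space).
        System.out.println(matchPrefix("https://example.com/a?b=c is a link"));
        // The pattern is anchored with ^, so a URL later in the text is not matched.
        System.out.println(matchPrefix("see https://example.com"));
        // A non-capturing group (?:...) does not add to the group count.
        System.out.println(URL.matcher("").groupCount());
    }
}
```

The group count stays at zero because the pattern uses only a non-capturing group, in line with the recommendation above.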
All deprecations in this section will be removed with the 1.0 release.
DictionaryFactory methods which use Settings
getPath method of Settings, use getResource instead.
Pre-release of 0.6.0