Fast and customizable text tokenization library with BPE and SentencePiece support
* Add `verbose` flag in file tokenization APIs to log progress every 100,000 lines (see the sketch below)
* Add `options` property to `Tokenizer` instances
* Add `pyonmttok.SentencePieceTokenizer` to help create a tokenizer compatible with SentencePiece
* Fix … of `Token` objects that was sometimes incorrect
* … `make install`
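A minimal sketch of these Python additions. The file paths, the `num_threads` value, the SentencePiece model path, and the exact `SentencePieceTokenizer` signature are assumptions, not part of the release notes:

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# New `verbose` flag: logs progress every 100,000 lines.
tokenizer.tokenize_file("corpus.txt", "corpus.tok", num_threads=4, verbose=True)

# New `options` property: inspect the options of an existing instance.
print(tokenizer.options)

# New helper to build a tokenizer compatible with a SentencePiece model.
sp = pyonmttok.SentencePieceTokenizer("model.sp")  # placeholder model path
tokens, _ = sp.tokenize("Hello world!")
```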
* `segment_alphabet_*` options now behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
* Fix `preserve_segmented_tokens` when the word is segmented by both a `segment_*` option and BPE (see the sketch below)
* Fix `support_prior_joiners` when some joiners are within protected sequences
* The subword encoder is now passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance
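For reference, a hedged sketch of the options involved in these fixes; the input string is made up and the exact output depends on the configured subword model:

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    segment_numbers=True,            # a segment_* option
    preserve_segmented_tokens=True,  # protect tokens split by segment_* options
    support_prior_joiners=True,      # respect joiners already present in the input
)
tokens, _ = tokenizer.tokenize("model2021 costs 1234 yen")
print(tokens)
```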
* Add `set_random_seed` function to make subword regularization reproducible (see the sketch below)
* … `Token` instances
* Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
* Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
* Fix `spacer` argument name in the `Token` constructor
* … `std::shared_ptr`
* … `support_prior_joiner`
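Subword regularization samples a random segmentation on each call, so fixing the seed makes runs repeatable. A sketch assuming a BPE model at a placeholder path and the `bpe_dropout` option:

```python
import pyonmttok

# Fix the seed so stochastic subword segmentations are reproducible.
pyonmttok.set_random_seed(42)

tokenizer = pyonmttok.Tokenizer(
    "conservative",
    bpe_model_path="model.bpe",  # placeholder path
    bpe_dropout=0.1,             # randomly skip BPE merges (subword regularization)
)
tokens, _ = tokenizer.tokenize("reproducible segmentations")
```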
* Fix `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation (see the sketch below)
* Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
* Compile in `Release` mode by default (set `-DBUILD_TESTS=ON` to compile the tests)
* … `segment_alphabet` option
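A sketch of both Python-side changes, assuming a `Token` can be constructed from its surface string:

```python
import pyonmttok

a = pyonmttok.Token("hello")
b = pyonmttok.Token("hello")

# __hash__ is now consistent with __eq__: equal tokens collapse in sets.
assert a == b and hash(a) == hash(b)
assert len({a, b}) == 1

# All Tokenizer arguments except `mode` are keyword-only:
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
# pyonmttok.Tokenizer("aggressive", True)  # now a TypeError
```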
* `Token` class: add a `__repr__` method
* `Tokenizer` class: accept a `None` value for the `segment_alphabet` argument; return `Token` objects instead of serialized strings
* Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes (see the sketch after this list)
* Fix … on `cli/tokenize` exit
* Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required

(This is the first release to be created on GitHub. See the release notes of previous tags in CHANGELOG.md.)
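A sketch of byte vs. Unicode ranges, assuming `detokenize_with_ranges` returns the detokenized text plus a mapping from token index to a (start, end) range:

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
tokens, _ = tokenizer.tokenize("héllo world")

# Default: ranges count UTF-8 bytes, so "é" spans two positions.
text, byte_ranges = tokenizer.detokenize_with_ranges(tokens)

# unicode_ranges=True: ranges count Unicode characters instead.
text, char_ranges = tokenizer.detokenize_with_ranges(tokens, unicode_ranges=True)
print(byte_ranges, char_ranges)
```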