Fast and customizable text tokenization library with BPE and SentencePiece support
- Remove the `SpaceTokenizer` class that is not meant to be public and can be confused with the "space" tokenization mode
- Add `tokens_delimiter` to configure how tokens are delimited in tokenized files (default is a space)
- Add `with_separators` in Python and CLI to include whitespace characters in the tokenized output
- Add `pyonmttok.__version__`
- Fix an issue when `with_separators` is enabled
- Python wheels are now built with `manylinux2010` and require `pip` >= 19.0 for installation
- Fix an issue with the `segment_alphabet` or `lang` options
- Add `training` flag in tokenization methods to disable subword regularization during inference
- Add `__len__` method in the `Token` class
- Raise an error when combining `case_markup` with incompatible tokenization modes "space" and "none"
- Improve performance when `Tokenizer.tokenize` is called from multiple Python threads (the Python GIL is now released)