Fast and customizable text tokenization library with BPE and SentencePiece support
allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions
BPELearner does not find any pairs of characters in the tokenized data
manylinux2014 and requires pip >= 19.3 for installation
pyonmttok is imported before torch
-DBUILD_SHARED_LIBS=OFF
pyonmttok.Vocab
pyonmttok.build_vocab_from_tokens
pyonmttok.build_vocab_from_lines
Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused: tokens = tokenizer(text)
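The call shortcut above can be illustrated with a small self-contained sketch. The WhitespaceTokenizer class below is a hypothetical stand-in written for this example, not the pyonmttok implementation. It assumes that tokenize() returns a (tokens, features) pair and that calling the object returns only the tokens, which is the behaviour the entry describes.

```python
from typing import List, Optional, Tuple


class WhitespaceTokenizer:
    """Hypothetical stand-in used only to illustrate the call shortcut."""

    def tokenize(self, text: str) -> Tuple[List[str], Optional[List[List[str]]]]:
        # Return the tokens together with a features slot (always None here).
        return text.split(), None

    def __call__(self, text: str) -> List[str]:
        # Delegate to tokenize() and drop the unused features.
        tokens, _ = self.tokenize(text)
        return tokens


tokenizer = WhitespaceTokenizer()

# Short form: only the tokens are returned.
tokens = tokenizer("Hello World !")

# Long form: the caller unpacks and discards the features itself.
tokens_long, _ = tokenizer.tokenize("Hello World !")
```

With the real library, the same two call styles would apply to a pyonmttok.Tokenizer instance. The stand-in only shows why the short form is convenient when features are not needed.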