OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.29.0

2 years ago

Changes

  • [Python] Drop support for Python 3.5

New features

  • [Python] Build wheels for Python 3.10
  • [Python] Add tokenization method Tokenizer.tokenize_batch
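
A minimal sketch of the new batch method, assuming tokenize_batch accepts a list of strings and mirrors the (tokens, features) return shape of Tokenizer.tokenize for each sentence; check the pyonmttok API reference for the exact signature.

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

    sentences = ["Hello world!", "How are you?"]

    # Assumption: tokenize_batch tokenizes every sentence in one call and
    # returns per-sentence results, mirroring Tokenizer.tokenize on a string.
    batch_tokens, batch_features = tokenizer.tokenize_batch(sentences)

    for tokens in batch_tokens:
        print(tokens)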

v1.28.1

2 years ago

Fixes and improvements

  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)

v1.28.0

2 years ago

Changes

  • [C++] Remove the SpaceTokenizer class, which was not meant to be public and could be confused with the "space" tokenization mode

New features

  • Build Python wheels for Windows
  • Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
  • Expose the with_separators option in Python and on the CLI to include whitespace characters in the tokenized output (see the example after this list)
  • [Python] Add package version information in pyonmttok.__version__
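
A short illustrative snippet for the Python-facing additions above; it assumes with_separators is passed like the other tokenization options on the Tokenizer constructor, while pyonmttok.__version__ is the new package version attribute.

    import pyonmttok

    # The installed package version is now exposed directly.
    print(pyonmttok.__version__)

    # Assumption: with_separators is set like the other tokenization options
    # on the constructor, keeping whitespace characters in the output.
    tokenizer = pyonmttok.Tokenizer("conservative", with_separators=True)
    tokens, _ = tokenizer.tokenize("Hello world!")
    print(tokens)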

Fixes and improvements

  • Fix detokenization when option with_separators is enabled

v1.27.0

2 years ago

Changes

  • Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
  • macOS Python wheels now require macOS >= 10.14

Fixes and improvements

  • Fix casing resolution when some letters do not have case information
  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
  • Improve error message when setting invalid segment_alphabet or lang options
  • Update SentencePiece to 0.1.96
  • [Python] Improve declaration of functions and classes for better type hints and checks
  • [Python] Update ICU to 69.1

v1.26.4

2 years ago

Fixes and improvements

  • Fix a regression introduced in the previous version for preserved tokens that are not segmented by BPE

v1.26.3

2 years ago

Fixes and improvements

  • Fix another divergence from the SentencePiece output when there is only one subword and the spacer is detached

v1.26.2

2 years ago

Fixes and improvements

  • Fix a divergence from the SentencePiece output when the spacer is detached from the word

v1.26.1

2 years ago

Fixes and improvements

  • Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without a joiner in the vocabulary
  • Fix compilation with ICU versions older than 60

v1.26.0

3 years ago

New features

  • Add lang tokenization option to apply language-specific case mappings
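
A hedged sketch of the new option, assuming lang takes an ISO 639 language code on the Tokenizer constructor so that language-specific case mappings (such as the Turkish dotless i) are applied during case handling.

    import pyonmttok

    # Assumption: "tr" selects Turkish case mappings, so lowercasing "I"
    # during case handling yields the dotless "ı" rather than "i".
    tokenizer = pyonmttok.Tokenizer("conservative", lang="tr", case_markup=True)
    tokens, _ = tokenizer.tokenize("ISTANBUL")
    print(tokens)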

Fixes and improvements

  • Use ICU to convert strings to Unicode values instead of a custom implementation

v1.25.0

3 years ago

New features

  • Add a training flag to the tokenization methods to disable subword regularization during inference (see the sketch after this list)
  • [Python] Implement __len__ method in the Token class
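
A sketch of the two additions above, assuming training is a keyword argument of the tokenization methods and that a Token can be constructed from a surface string; the SentencePiece model path and sampling parameters are illustrative.

    import pyonmttok

    # Illustrative SentencePiece model with subword sampling enabled
    # (model.sp is a hypothetical path).
    tokenizer = pyonmttok.Tokenizer(
        "none",
        sp_model_path="model.sp",
        sp_nbest_size=64,
        sp_alpha=0.1,
    )

    # training=True (the default) keeps subword regularization enabled;
    # training=False disables the sampling for deterministic inference output.
    tokens, _ = tokenizer.tokenize("Hello world!", training=False)

    # Assumption: Token can be built from a surface string; __len__ now
    # returns the length of that surface.
    token = pyonmttok.Token("Hello")
    print(len(token))  # 5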

Fixes and improvements

  • Raise an error when case_markup is enabled with the incompatible tokenization modes "space" or "none"
  • [Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads: the Python GIL is now released (see the sketch after this list)
  • [Python] Clean up some manual Python <-> C++ type conversions
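
Since the GIL is released inside Tokenizer.tokenize, calling the same tokenizer from several Python threads can now run in parallel. A small sketch using the standard library thread pool; the data and pool size are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
    sentences = ["Hello world!"] * 10000

    # With the GIL released during tokenization, these worker threads can
    # tokenize concurrently instead of serializing on the interpreter lock.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda s: tokenizer.tokenize(s)[0], sentences))

    print(results[0])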