OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.24.0

3 years ago

New features

  • Add verbose flag in file tokenization APIs to log progress every 100,000 lines
  • [Python] Add options property to Tokenizer instances
  • [Python] Add class pyonmttok.SentencePieceTokenizer to help create a tokenizer compatible with SentencePiece

Fixes and improvements

  • Fix deserialization into Token objects that was sometimes incorrect
  • Fix Windows compilation
  • Fix the Google Test integration, which was sometimes installed as part of make install
  • [Python] Update pybind11 to 2.6.2
  • [Python] Update ICU to 66.1
  • [Python] Compile ICU with optimization flags

v1.23.0

3 years ago

Changes

  • Drop Python 2 support

New features

  • Publish Python wheels for macOS

Fixes and improvements

  • Improve performance in all tokenization modes (up to 2x faster)
  • Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
  • Fix a regression introduced in 1.20 where segment_alphabet_* options behaved differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both the Hiragana and Katakana scripts and should not trigger a segmentation)
  • Fix a regression introduced in 1.21 where a joiner was incorrectly placed when using preserve_segmented_tokens and the word was segmented by both a segment_* option and BPE
  • Fix incorrect tokenization when using support_prior_joiners and some joiners are within protected sequences

v1.22.2

3 years ago

Fixes and improvements

  • Do not require "none" tokenization mode for SentencePiece vocabulary restriction

v1.22.1

3 years ago

Fixes and improvements

  • Fix error when enabling vocabulary restriction with SentencePiece and spacer_annotate is not explicitly set
  • Fix backward compatibility with Kangxi and Kanbun scripts (see segment_alphabet option)

v1.22.0

3 years ago

Changes

  • [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a std::shared_ptr to make it outlive the Tokenizer instance.

New features

  • Add set_random_seed function to make subword regularization reproducible
  • [Python] Support serialization of Token instances
  • [C++] Add Options structure to configure tokenization options (Flags can still be used for backward compatibility)

Fixes and improvements

  • Fix BPE vocabulary restriction when using joiner_new, spacer_annotate, or spacer_new (the previous implementation always assumed joiner_annotate was used)
  • [Python] Fix spacer argument name in Token constructor
  • [C++] Fix ambiguous subword encoder ownership by using a std::shared_ptr

v1.21.0

3 years ago

New features

  • Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
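
The accepted vocabulary format can be sketched as below; the file name, tokens, and frequencies are illustrative, not taken from the release notes:

```python
# Each line holds a token and an optional tab-separated frequency,
# the format produced by the SentencePiece trainer.
entries = [("▁the", 123456), ("▁of", 98765), ("ing", 54321)]
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, freq in entries:
        f.write(f"{token}\t{freq}\n")

# A file in this format can now be passed to the tokenizer's
# vocabulary restriction options unchanged.
with open("vocab.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```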

Fixes and improvements

  • Fix BPE vocabulary restriction when words have a leading or trailing joiner
  • Raise an error when using a multi-character joiner and support_prior_joiners
  • [Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
  • [Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
  • [Python] Improve compatibility with Python 3.9

v1.20.0

3 years ago

Changes

  • The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
    • ICU is now required to improve performance and Unicode support
    • SentencePiece is now integrated as a Git submodule and linked statically to the project
    • Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
    • The project is compiled in Release mode by default
    • Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

  • Accept any Unicode script aliases in the segment_alphabet option
  • Update SentencePiece to 0.1.92
  • [Python] Improve the capabilities of the Token class:
    • Implement the __repr__ method
    • Allow setting all attributes in the constructor
    • Add a copy constructor
  • [Python] Add a copy constructor for the Tokenizer class

Fixes and improvements

  • [Python] Accept None value for segment_alphabet argument

v1.19.0

3 years ago

New features

  • Add BPE dropout (Provilkov et al. 2019)
  • [Python] Introduce the "Token API": a set of methods that manipulate Token objects instead of serialized strings
  • [Python] Add unicode_ranges argument to the detokenize_with_ranges method to return ranges over Unicode characters instead of bytes

Fixes and improvements

  • Include "Half-width kana" in Katakana script detection

v1.18.5

3 years ago

Fixes and improvements

  • Fix possible crash when applying a case-insensitive BPE model on Unicode characters

v1.18.4

3 years ago

Fixes and improvements

  • Fix segmentation fault on cli/tokenize exit
  • Ignore empty tokens during detokenization
  • When writing to a file, avoid flushing the output stream on each line
  • Update cli/CMakeLists.txt to mark Boost.ProgramOptions as required

(This is the first release to be created on GitHub. See the release note of previous tags in CHANGELOG.md.)