OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.24.0

3 years ago

New features

  • Add verbose flag in file tokenization APIs to log progress every 100,000 lines
  • [Python] Add options property to Tokenizer instances
  • [Python] Add class pyonmttok.SentencePieceTokenizer to help create a tokenizer compatible with SentencePiece

Fixes and improvements

  • Fix deserialization into Token objects that was sometimes incorrect
  • Fix Windows compilation
  • Fix the Google Test integration, which was sometimes installed as part of make install
  • [Python] Update pybind11 to 2.6.2
  • [Python] Update ICU to 66.1
  • [Python] Compile ICU with optimization flags

v1.23.0

3 years ago

Changes

  • Drop Python 2 support

New features

  • Publish Python wheels for macOS

Fixes and improvements

  • Improve performance in all tokenization modes (up to 2x faster)
  • Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
  • Fix a regression introduced in 1.20 where segment_alphabet_* options behaved differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both the Hiragana and Katakana scripts and should not trigger a segmentation)
  • Fix a regression introduced in 1.21 where a joiner was incorrectly placed when using preserve_segmented_tokens and the word was segmented by both a segment_* option and BPE
  • Fix incorrect tokenization when using support_prior_joiners and some joiners are within protected sequences

v1.22.2

3 years ago

Fixes and improvements

  • Do not require "none" tokenization mode for SentencePiece vocabulary restriction

v1.22.1

3 years ago

Fixes and improvements

  • Fix error when enabling vocabulary restriction with SentencePiece and spacer_annotate is not explicitly set
  • Fix backward compatibility with Kangxi and Kanbun scripts (see segment_alphabet option)

v1.22.0

3 years ago

Changes

  • [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a std::shared_ptr to make it outlive the Tokenizer instance.

New features

  • Add set_random_seed function to make subword regularization reproducible
  • [Python] Support serialization of Token instances
  • [C++] Add Options structure to configure tokenization options (Flags can still be used for backward compatibility)

Fixes and improvements

  • Fix BPE vocabulary restriction when using joiner_new, spacer_annotate, or spacer_new (the previous implementation always assumed joiner_annotate was used)
  • [Python] Fix spacer argument name in Token constructor
  • [C++] Fix ambiguous subword encoder ownership by using a std::shared_ptr

v1.21.0

3 years ago

New features

  • Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
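
The accepted vocabulary format can be sketched as below; the file name, tokens, and frequencies are illustrative, not taken from the release notes:

```python
# Each line holds a token and an optional tab-separated frequency,
# the format produced by the SentencePiece trainer.
entries = [("▁the", 123456), ("▁of", 98765), ("ing", 54321)]
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, freq in entries:
        f.write(f"{token}\t{freq}\n")

# A file in this format can now be passed to the tokenizer's
# vocabulary restriction options unchanged.
with open("vocab.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```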

Fixes and improvements

  • Fix BPE vocabulary restriction when words have a leading or trailing joiner
  • Raise an error when using a multi-character joiner and support_prior_joiners
  • [Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
  • [Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
  • [Python] Improve compatibility with Python 3.9

v1.20.0

3 years ago

Changes

  • The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
    • ICU is now required to improve performance and Unicode support
    • SentencePiece is now integrated as a Git submodule and linked statically to the project
    • Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
    • The project is compiled in Release mode by default
    • Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

  • Accept any Unicode script aliases in the segment_alphabet option
  • Update SentencePiece to 0.1.92
  • [Python] Improve the capabilities of the Token class:
    • Implement the __repr__ method
    • Allow setting all attributes in the constructor
    • Add a copy constructor
  • [Python] Add a copy constructor for the Tokenizer class

Fixes and improvements

  • [Python] Accept None value for segment_alphabet argument

v1.19.0

3 years ago

New features

  • Add BPE dropout (Provilkov et al. 2019)
  • [Python] Introduce the "Token API": a set of methods that manipulate Token objects instead of serialized strings
  • [Python] Add unicode_ranges argument to the detokenize_with_ranges method to return ranges over Unicode characters instead of bytes

Fixes and improvements

  • Include "Half-width kana" in Katakana script detection

v1.18.5

3 years ago

Fixes and improvements

  • Fix possible crash when applying a case-insensitive BPE model on Unicode characters

v1.18.4

3 years ago

Fixes and improvements

  • Fix segmentation fault on cli/tokenize exit
  • Ignore empty tokens during detokenization
  • When writing to a file, avoid flushing the output stream on each line
  • Update cli/CMakeLists.txt to mark Boost.ProgramOptions as required

(This is the first release to be created on GitHub. See the release note of previous tags in CHANGELOG.md.)