OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.29.0

2 years ago

Changes

  • [Python] Drop support for Python 3.5

New features

  • [Python] Build wheels for Python 3.10
  • [Python] Add tokenization method Tokenizer.tokenize_batch
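
A minimal sketch of the new batch method, assuming tokenize_batch accepts a list of strings and mirrors the (tokens, features) return shape of Tokenizer.tokenize for each sentence; check the pyonmttok API reference for the exact signature.

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

    sentences = ["Hello world!", "How are you?"]

    # Assumption: tokenize_batch tokenizes every sentence in one call and
    # returns per-sentence results, mirroring Tokenizer.tokenize on a string.
    batch_tokens, batch_features = tokenizer.tokenize_batch(sentences)

    for tokens in batch_tokens:
        print(tokens)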

v1.28.1

2 years ago

Fixes and improvements

  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)

v1.28.0

2 years ago

Changes

  • [C++] Remove the SpaceTokenizer class, which was not meant to be public and could be confused with the "space" tokenization mode

New features

  • Build Python wheels for Windows
  • Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
  • Expose the with_separators option in Python and on the CLI to include whitespace characters in the tokenized output (see the example after this list)
  • [Python] Add package version information in pyonmttok.__version__
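
A short illustrative snippet for the Python-facing additions above; it assumes with_separators is passed like the other tokenization options on the Tokenizer constructor, while pyonmttok.__version__ is the new package version attribute.

    import pyonmttok

    # The installed package version is now exposed directly.
    print(pyonmttok.__version__)

    # Assumption: with_separators is set like the other tokenization options
    # on the constructor, keeping whitespace characters in the output.
    tokenizer = pyonmttok.Tokenizer("conservative", with_separators=True)
    tokens, _ = tokenizer.tokenize("Hello world!")
    print(tokens)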

Fixes and improvements

  • Fix detokenization when option with_separators is enabled

v1.27.0

2 years ago

Changes

  • Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
  • macOS Python wheels now require macOS >= 10.14

Fixes and improvements

  • Fix casing resolution when some letters do not have case information
  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
  • Improve error message when setting invalid segment_alphabet or lang options
  • Update SentencePiece to 0.1.96
  • [Python] Improve declaration of functions and classes for better type hints and checks
  • [Python] Update ICU to 69.1

v1.26.4

2 years ago

Fixes and improvements

  • Fix a regression introduced in the previous version for preserved tokens that are not segmented by BPE

v1.26.3

2 years ago

Fixes and improvements

  • Fix another divergence from the SentencePiece output when there is only one subword and the spacer is detached

v1.26.2

2 years ago

Fixes and improvements

  • Fix a divergence from the SentencePiece output when the spacer is detached from the word

v1.26.1

2 years ago

Fixes and improvements

  • Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without a joiner in the vocabulary
  • Fix compilation with ICU versions older than 60

v1.26.0

3 years ago

New features

  • Add lang tokenization option to apply language-specific case mappings
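
A hedged sketch of the new option, assuming lang takes an ISO 639 language code on the Tokenizer constructor so that language-specific case mappings (such as the Turkish dotless i) are applied during case handling.

    import pyonmttok

    # Assumption: "tr" selects Turkish case mappings, so lowercasing "I"
    # during case handling yields the dotless "ı" rather than "i".
    tokenizer = pyonmttok.Tokenizer("conservative", lang="tr", case_markup=True)
    tokens, _ = tokenizer.tokenize("ISTANBUL")
    print(tokens)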

Fixes and improvements

  • Use ICU to convert strings to Unicode values instead of a custom implementation

v1.25.0

3 years ago

New features

  • Add a training flag to the tokenization methods to disable subword regularization during inference (see the sketch after this list)
  • [Python] Implement __len__ method in the Token class
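
A sketch of the two additions above, assuming training is a keyword argument of the tokenization methods and that a Token can be constructed from a surface string; the SentencePiece model path and sampling parameters are illustrative.

    import pyonmttok

    # Illustrative SentencePiece model with subword sampling enabled
    # (model.sp is a hypothetical path).
    tokenizer = pyonmttok.Tokenizer(
        "none",
        sp_model_path="model.sp",
        sp_nbest_size=64,
        sp_alpha=0.1,
    )

    # training=True (the default) keeps subword regularization enabled;
    # training=False disables the sampling for deterministic inference output.
    tokens, _ = tokenizer.tokenize("Hello world!", training=False)

    # Assumption: Token can be built from a surface string; __len__ now
    # returns the length of that surface.
    token = pyonmttok.Token("Hello")
    print(len(token))  # 5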

Fixes and improvements

  • Raise an error when case_markup is enabled with the incompatible tokenization modes "space" or "none"
  • [Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads: the Python GIL is now released (see the sketch after this list)
  • [Python] Clean up some manual Python <-> C++ type conversions
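
Since the GIL is released inside Tokenizer.tokenize, calling the same tokenizer from several Python threads can now run in parallel. A small sketch using the standard library thread pool; the data and pool size are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
    sentences = ["Hello world!"] * 10000

    # With the GIL released during tokenization, these worker threads can
    # tokenize concurrently instead of serializing on the interpreter lock.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda s: tokenizer.tokenize(s)[0], sentences))

    print(results[0])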