OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.37.1

1 year ago

Fixes and improvements

  • Consider escaped characters as single characters in BPE
  • Ignore undefined scripts when resolving inherited or common scripts

v1.37.0

1 year ago

New features

  • Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

  • Fix infinite loop when the text contains an invalid Unicode character
  • Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
  • [Python] Update ICU to 72.1

v1.36.0

1 year ago

New features

  • [Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
  • [Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor

v1.35.0

1 year ago

New features

  • [Python] Add pickling support to pyonmttok.Vocab

Fixes and improvements

  • Update pybind11 to 2.10.1
  • Update cibuildwheel to 2.11.2

v1.34.0

1 year ago

Changes

  • [Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

  • [Python] Build wheels for Python 3.11

Fixes and improvements

  • Improve error handling when reading token frequencies in the vocabulary file
  • [Python] Fix possible crash when pyonmttok is imported before torch
  • [Python] Update ICU to 71.1
  • [C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
  • [C++] Fix CMake warning when compiling the tests

v1.33.0

1 year ago

New features

  • [Python] Build ARM64 wheels for macOS

Fixes and improvements

  • [CLI] Fix error when the option --segment_alphabet is not set
  • Fix SentencePiece build warning when compiling with Clang

v1.32.0

1 year ago

New features

  • Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token

Fixes and improvements

  • Update pybind11 to 2.10.0
  • Update cxxopts to 3.0.0

v1.31.0

2 years ago

New features

  • Add utilities to build and use vocabularies:
    • pyonmttok.Vocab
    • pyonmttok.build_vocab_from_tokens
    • pyonmttok.build_vocab_from_lines
  • Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:
tokens = tokenizer(text)

Fixes and improvements

  • Update pybind11 to 2.9.1

v1.30.1

2 years ago

Fixes and improvements

  • Fix deprecated language codes in ICU that were incorrectly considered invalid (e.g. "tl" for Tagalog)

v1.30.0

2 years ago

New features

  • [Python] Build wheels for AArch64 Linux

Fixes and improvements

  • [Python] Update ICU to 70.1