OpenNMT Tokenizer Versions

Fast and customizable text tokenization library with BPE and SentencePiece support

v1.37.1

1 year ago

Fixes and improvements

Consider escaped characters as single characters in BPE
Ignore undefined scripts when resolving inherited or common scripts

v1.37.0

1 year ago

New features

Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

Fix infinite loop when the text contains an invalid Unicode character
Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
[Python] Update ICU to 72.1

v1.36.0

1 year ago

New features

[Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
[Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor
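A minimal sketch of how the language check might be used before constructing a tokenizer, assuming pyonmttok is installed. The helper name make_tokenizer and the choice of the "aggressive" mode are illustrative assumptions, not part of the release notes.

```python
try:
    import pyonmttok
except ImportError:  # assumption: the package may be absent, in which case the sketch is skipped
    pyonmttok = None


def make_tokenizer(lang):
    """Hypothetical helper: validate the language code, then build a tokenizer."""
    if not pyonmttok.is_valid_language(lang):
        raise ValueError("Unsupported language code: %s" % lang)
    # The lang option enables language-specific tokenization rules.
    return pyonmttok.Tokenizer("aggressive", lang=lang)


is_english_valid = pyonmttok.is_valid_language("en") if pyonmttok is not None else None
tokenizer = make_tokenizer("en") if pyonmttok is not None else None
print(is_english_valid)
```

Checking the code first turns an invalid language into an explicit error at the call site instead of a failure inside the constructor.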

v1.35.0

1 year ago

New features

[Python] Add pickling support to pyonmttok.Vocab
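As a rough illustration of what pickling support allows, the sketch below serializes a vocabulary and restores it, assuming pyonmttok is installed. The token list and the roundtrip helper are made up for the example.

```python
import pickle

try:
    import pyonmttok
except ImportError:  # assumption: the package may be absent, in which case the sketch is skipped
    pyonmttok = None


def roundtrip(tokens):
    """Hypothetical helper: pickle a Vocab and return the token lists before and after."""
    vocab = pyonmttok.build_vocab_from_tokens(tokens, special_tokens=["<unk>"])
    restored = pickle.loads(pickle.dumps(vocab))
    return list(vocab.ids_to_tokens), list(restored.ids_to_tokens)


pair = roundtrip(["a", "b", "a"]) if pyonmttok is not None else None
print(pair)
```

This is mainly useful when a vocabulary has to cross a process boundary, for example with multiprocessing workers.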

Fixes and improvements

Update pybind11 to 2.10.1
Update cibuildwheel to 2.11.2

v1.34.0

1 year ago

Changes

[Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

[Python] Build wheels for Python 3.11

Fixes and improvements

Improve error handling when reading token frequencies in the vocabulary file
[Python] Fix possible crash when pyonmttok is imported before torch
[Python] Update ICU to 71.1
[C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
[C++] Fix CMake warning when compiling the tests

v1.33.0

1 year ago

New features

[Python] Build ARM64 wheels for macOS

Fixes and improvements

[CLI] Fix error when the option --segment_alphabet is not set
Fix SentencePiece build warning when compiling with Clang

v1.32.0

1 year ago

New features

Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token
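A short sketch of reading the counters, assuming pyonmttok is installed and assuming that counters is a list aligned with ids_to_tokens (one count per token ID). The sample tokens and the token_counts helper are illustrative.

```python
try:
    import pyonmttok
except ImportError:  # assumption: the package may be absent, in which case the sketch is skipped
    pyonmttok = None


def token_counts(tokens):
    """Hypothetical helper: map each vocabulary token to its number of occurrences."""
    vocab = pyonmttok.build_vocab_from_tokens(tokens)
    # Assumption: counters[i] is the count of ids_to_tokens[i].
    return dict(zip(vocab.ids_to_tokens, vocab.counters))


counts = token_counts(["a", "b", "a", "c", "a", "b"]) if pyonmttok is not None else None
print(counts)
```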

Fixes and improvements

Update pybind11 to 2.10.0
Update cxxopts to 3.0.0

v1.31.0

2 years ago

New features

Add utilities to build and use vocabularies:
- pyonmttok.Vocab
- pyonmttok.build_vocab_from_tokens
- pyonmttok.build_vocab_from_lines
Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:

tokens = tokenizer(text)
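The sketch below puts the two additions together, assuming pyonmttok is installed. It compares Tokenizer.__call__ with tokenize(), which returns a (tokens, features) pair, and builds a vocabulary from raw lines. The sample sentences, the demo helper, and the "<unk>" special token are illustrative choices.

```python
try:
    import pyonmttok
except ImportError:  # assumption: the package may be absent, in which case the sketch is skipped
    pyonmttok = None


def demo():
    """Hypothetical helper: tokenize via __call__ and build a vocabulary from lines."""
    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
    text = "Hello World!"

    # __call__ returns only the tokens; tokenize() also returns the features.
    tokens = tokenizer(text)
    tokens_from_tokenize, _features = tokenizer.tokenize(text)

    # Build a vocabulary from raw lines, tokenized with the same tokenizer.
    lines = ["Hello World!", "Hello again!"]  # illustrative data
    vocab = pyonmttok.build_vocab_from_lines(
        lines, tokenizer=tokenizer, special_tokens=["<unk>"]
    )
    ids = vocab(tokens)  # map the tokens to their integer IDs
    return tokens, tokens_from_tokenize, tokenizer.detokenize(tokens), ids


result = demo() if pyonmttok is not None else None
if result is not None:
    print(result[0], result[3])
```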

Fixes and improvements

Update pybind11 to 2.9.1

v1.30.1

2 years ago

Fixes and improvements

Fix deprecated language codes in ICU being incorrectly considered invalid (e.g. "tl" for Tagalog)

v1.30.0

2 years ago

New features

[Python] Build wheels for AArch64 Linux

Fixes and improvements

[Python] Update ICU to 70.1