💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
ignore_merges by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1504
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0
Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0
Big shoutout to @rlrs for the fast replace normalizers PR. This boosts the performance of the tokenizers.
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1
Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1
expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0
huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0
decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.1
expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.1rc1
⚠️ Reworks the release pipeline. Other breaking changes ⚠️ :
is_special_token renamed to special for consistency
OFF by default, and depends on hf-hub instead of cached_path (updated cache directory, better sync implementation)
decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.0