Tokenizers Versions

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

v0.19.1

2 weeks ago

What's Changed

add serialization for ignore_merges by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1504

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1

v0.19.0

2 weeks ago

What's Changed

chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped; everything uses the prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498
Fixing doc. by @Narsil in https://github.com/huggingface/tokenizers/pull/1499
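
The Metaspace breaking change above drops add_prefix_space in favor of the prepend_scheme enum. For configurations stored as plain dictionaries (for example a parsed tokenizer.json), a minimal migration sketch might look like the following. The helper names are hypothetical, and the mapping of True to "always" and False to "never" is an assumption about the intended equivalence, not something these notes state; check it against the library before relying on it.

```python
import copy


def migrate_metaspace(config):
    """Return a copy of a tokenizer.json-style structure without the legacy flag.

    Hypothetical helper: the "always"/"never" mapping is an assumption.
    """
    migrated = copy.deepcopy(config)  # leave the caller's data untouched
    _migrate_in_place(migrated)
    return migrated


def _migrate_in_place(node):
    if isinstance(node, dict):
        if node.get("type") == "Metaspace" and "add_prefix_space" in node:
            legacy = node.pop("add_prefix_space")
            # Keep an explicit prepend_scheme if one is already present.
            node.setdefault("prepend_scheme", "always" if legacy else "never")
        for value in node.values():
            _migrate_in_place(value)
    elif isinstance(node, list):
        for item in node:
            _migrate_in_place(item)
```

The walk is recursive so that Metaspace entries nested inside a Sequence are also rewritten, and an existing prepend_scheme value takes precedence over the legacy flag.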

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0

v0.19.0rc0

2 weeks ago

Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177

What's Changed

chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped; everything uses the prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0

v0.15.2

2 months ago

What's Changed

Big shoutout to @rlrs for the fast Replace normalizer PR, which boosts tokenizer performance:

chore: Update dependencies to latest supported versions by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1441
Convert word counts to u64 by @stephenroller in https://github.com/huggingface/tokenizers/pull/1433
Efficient Replace normalizer by @rlrs in https://github.com/huggingface/tokenizers/pull/1413

New Contributors

@bryantbiggs made their first contribution in https://github.com/huggingface/tokenizers/pull/1441
@stephenroller made their first contribution in https://github.com/huggingface/tokenizers/pull/1433
@rlrs made their first contribution in https://github.com/huggingface/tokenizers/pull/1413

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1

v0.15.1

3 months ago

What's Changed

update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
Derive Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386
Encode special tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1437
Update release for python3.12 windows by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1438

New Contributors

@steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1

v0.15.1.rc0

3 months ago

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355
fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
Allow huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
Derive Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341
@tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
@rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
@mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
@Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385
@steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0

v0.15.0

5 months ago

What's Changed

fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
Allow huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357

New Contributors

@tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
@rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
@mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
@Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0

v0.14.1

6 months ago

What's Changed

Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
Makes decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355

New Contributors

@csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
@chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
@sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
@boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
@hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
@bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
@kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
@SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
@jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.1

v0.14.1rc1

7 months ago

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.1rc1

v0.14.0

7 months ago

⚠️ Reworks the release pipeline. Other breaking changes ⚠️ :

#1335: AddedToken is reworked; is_special_token is renamed to special for consistency
The http feature is now OFF by default and depends on hf-hub instead of cached_path (updated cache directory, better sync implementation)
Removed the SSL link in the Python package; it calls huggingface_hub directly instead.
New dependency: huggingface_hub (while we deprecate Tokenizer.from_pretrained(...) in favor of Tokenizer.from_file(huggingface_hub.hf_hub_download(MODEL_ID, "tokenizer.json")))
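
The replacement for Tokenizer.from_pretrained(...) mentioned in the last item can be sketched as a small helper. The function name load_tokenizer_from_hub is hypothetical and not part of either library; it only wraps the two calls named in the note, and it assumes both tokenizers and huggingface_hub are installed when it is called.

```python
def load_tokenizer_from_hub(model_id, filename="tokenizer.json"):
    """Download `filename` from the Hub repo `model_id` and load it as a Tokenizer.

    Hypothetical convenience wrapper; calling it needs network access
    (or a warm local cache) plus both packages installed.
    """
    # Imports are deferred so that defining the helper does not itself
    # require either package.
    from huggingface_hub import hf_hub_download
    from tokenizers import Tokenizer

    # hf_hub_download returns the local path of the cached file.
    path = hf_hub_download(model_id, filename)
    return Tokenizer.from_file(path)
```

Calling load_tokenizer_from_hub(MODEL_ID) with a repository that ships a tokenizer.json should then give a Tokenizer instance equivalent to the one-liner in the note.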

What's Changed

Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
Makes decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335

New Contributors

@csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
@chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
@sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
@boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
@hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
@bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
@kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
@SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
@jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.0