Datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

2.19.0

2 weeks ago

Dataset Features

Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531

Convert to a Polars DataFrame using .to_polars():

```
import polars as pl
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
# Group by topic and sort so the most frequent topics come first
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)
```

Use Polars formatting to return Polars objects when accessing a dataset:
```
ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```

Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096

Save on HF in any file format:

```
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```

Add mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
- Set images to be read in a certain mode like "RGB"
```
from datasets import Image

dataset = dataset.cast_column("image", Image(mode="RGB"))
```
Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
- Run this command to open a PR on a script-based dataset that converts it to Parquet:
```
datasets-cli convert_to_parquet <dataset_id>
```
Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
- These behave like IterableDataset.take and IterableDataset.skip:
```
ds = ds.take(10)  # take only the first 10 examples
```

General improvements and bug fixes

Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in https://github.com/huggingface/datasets/pull/6713
fix CastError pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6712
Expand no-code dataset info with datasets-server info by @mariosasko in https://github.com/huggingface/datasets/pull/6714
Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in https://github.com/huggingface/datasets/pull/6715
Fix concurrent script loading with force_redownload by @lhoestq in https://github.com/huggingface/datasets/pull/6718
get_dataset_default_config_name docstring by @lhoestq in https://github.com/huggingface/datasets/pull/6723
Deprecate Beam API and download from HF GCS bucket by @mariosasko in https://github.com/huggingface/datasets/pull/6474
Deprecate Pandas builder by @mariosasko in https://github.com/huggingface/datasets/pull/6730
Using a registry instead of calling globals for fetching feature types by @psmyth94 in https://github.com/huggingface/datasets/pull/6727
Update torch_formatter.py by @VarunNSrivastava in https://github.com/huggingface/datasets/pull/6402
Improve default patterns resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6704
Transpose images with EXIF Orientation tag by @mariosasko in https://github.com/huggingface/datasets/pull/6739
Fix missing download_config in get_data_patterns by @lhoestq in https://github.com/huggingface/datasets/pull/6742
Allow null values in dict columns by @mariosasko in https://github.com/huggingface/datasets/pull/6743
Fix fsspec tqdm callback by @lhoestq in https://github.com/huggingface/datasets/pull/6749
chore(deps): bump fsspec by @shcheklein in https://github.com/huggingface/datasets/pull/6747
Fix offline mode with single config by @lhoestq in https://github.com/huggingface/datasets/pull/6741
Remove deprecated code by @Wauplin in https://github.com/huggingface/datasets/pull/6761
fixing the issue 6755(small typo) by @JINO-ROHIT in https://github.com/huggingface/datasets/pull/6767
remove_columns/rename_columns doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
Fix CI by @mariosasko in https://github.com/huggingface/datasets/pull/6780
rename datasets-server to dataset-viewer by @severo in https://github.com/huggingface/datasets/pull/6785
Install dependencies with uv in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
Fix cache conflict in _check_legacy_cache2 by @lhoestq in https://github.com/huggingface/datasets/pull/6792
Fix typo in docs (upload CLI) by @Wauplin in https://github.com/huggingface/datasets/pull/6802
fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
#6791 Improve type checking around FAISS by @Dref360 in https://github.com/huggingface/datasets/pull/6803
Fix --repo-type order in cli upload docs by @lhoestq in https://github.com/huggingface/datasets/pull/6804
Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in https://github.com/huggingface/datasets/pull/6806
Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in https://github.com/huggingface/datasets/pull/6754
Multithreaded downloads by @lhoestq in https://github.com/huggingface/datasets/pull/6794
Remove os.path.relpath in resolve_patterns by @mariosasko in https://github.com/huggingface/datasets/pull/6815
Extract data on the fly in packaged builders by @mariosasko in https://github.com/huggingface/datasets/pull/6784
add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in https://github.com/huggingface/datasets/pull/6811
Support indexable objects in Dataset.__getitem__ by @mariosasko in https://github.com/huggingface/datasets/pull/6817
Make convert_to_parquet CLI command create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6809
Fix parquet export infos by @lhoestq in https://github.com/huggingface/datasets/pull/6822

New Contributors

@VarunNSrivastava made their first contribution in https://github.com/huggingface/datasets/pull/6402
@shcheklein made their first contribution in https://github.com/huggingface/datasets/pull/6747
@JINO-ROHIT made their first contribution in https://github.com/huggingface/datasets/pull/6767
@JonasLoos made their first contribution in https://github.com/huggingface/datasets/pull/6799
@izhx made their first contribution in https://github.com/huggingface/datasets/pull/6754
@Modexus made their first contribution in https://github.com/huggingface/datasets/pull/6811

Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0

2.18.0

2 months ago

Dataset features

Make JSON builder support an array of strings by @albertvillanova in https://github.com/huggingface/datasets/pull/6696
Base parquet batch_size on parquet row group size by @lhoestq in https://github.com/huggingface/datasets/pull/6701
- Faster cold start for streaming
Change default compression argument for JsonDatasetWriter by @Rexhaif in https://github.com/huggingface/datasets/pull/6659
Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in https://github.com/huggingface/datasets/pull/6660
fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in https://github.com/huggingface/datasets/pull/6687
- Support latest fsspec up to 2024.2.0

General improvements and bug fixes

Fix for Incorrect ex_iterable used with multi num_worker by @kq-chen in https://github.com/huggingface/datasets/pull/6582
- Previously, using PyTorch DDP together with num_workers could assign shards to workers incorrectly and cause errors
Fix imagefolder dataset url by @mariosasko in https://github.com/huggingface/datasets/pull/6683
Improve error message for gated datasets on load by @lewtun in https://github.com/huggingface/datasets/pull/6684
Updated Quickstart Notebook link by @Codeblockz in https://github.com/huggingface/datasets/pull/6685
Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in https://github.com/huggingface/datasets/pull/6693
Faster xlistdir by @mariosasko in https://github.com/huggingface/datasets/pull/6698
Update GitHub Actions to Node 20 by @albertvillanova in https://github.com/huggingface/datasets/pull/6682
Update release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/6681
Pass through information about location of cache directory. by @stridge-cruxml in https://github.com/huggingface/datasets/pull/6677
Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in https://github.com/huggingface/datasets/pull/6665
Update ruff by @lhoestq in https://github.com/huggingface/datasets/pull/6706
Silence ruff deprecation messages by @mariosasko in https://github.com/huggingface/datasets/pull/6707
fix: show correct package name to install biopython by @BioGeek in https://github.com/huggingface/datasets/pull/6662
Fix data_files when passing data_dir by @lhoestq in https://github.com/huggingface/datasets/pull/6705
Release: 2.18.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6708

New Contributors

@Codeblockz made their first contribution in https://github.com/huggingface/datasets/pull/6685
@gzbfgjf2 made their first contribution in https://github.com/huggingface/datasets/pull/6693
@stridge-cruxml made their first contribution in https://github.com/huggingface/datasets/pull/6677
@pmrowla made their first contribution in https://github.com/huggingface/datasets/pull/6687
@BioGeek made their first contribution in https://github.com/huggingface/datasets/pull/6662
@Rexhaif made their first contribution in https://github.com/huggingface/datasets/pull/6659
@mohalisad made their first contribution in https://github.com/huggingface/datasets/pull/6660
@kq-chen made their first contribution in https://github.com/huggingface/datasets/pull/6582

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0

2.17.1

2 months ago

Bug Fixes

Revert the changes in arrow_writer.py from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
Remove deprecated verbose parameter from CSV builder by @albertvillanova in https://github.com/huggingface/datasets/pull/6672

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1

2.17.0

2 months ago

Dataset Features

[WebDataset] Audio support and bug fixes by @lhoestq in https://github.com/huggingface/datasets/pull/6573
Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in https://github.com/huggingface/datasets/pull/6464
Support data_dir parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6634
Support push_to_hub without org/user to default to logged-in user by @albertvillanova in https://github.com/huggingface/datasets/pull/6629
Allow concatenation of datasets with mixed structs by @Dref360 in https://github.com/huggingface/datasets/pull/6587

General improvements and bug fixes

Fix parallel downloads for datasets without scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6551
Fix imagefolder with one image by @lhoestq in https://github.com/huggingface/datasets/pull/6556
Fix tests based on datasets that used to have scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6574
remove eli5 test by @lhoestq in https://github.com/huggingface/datasets/pull/6583
[IterableDataset] Fix drop_last_batch in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
Support standalone yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6557
Drop redundant None guard. by @xkszltl in https://github.com/huggingface/datasets/pull/6596
fix os.listdir return name is empty string by @d710055071 in https://github.com/huggingface/datasets/pull/6581
Fix CI: pyarrow 15, pandas 2.2 and sqlachemy by @lhoestq in https://github.com/huggingface/datasets/pull/6617
Dedicated RNG object for fingerprinting by @mariosasko in https://github.com/huggingface/datasets/pull/6606
Migrate from setup.cfg to pyproject.toml by @mariosasko in https://github.com/huggingface/datasets/pull/6619
keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in https://github.com/huggingface/datasets/pull/6586
Read GeoParquet files using parquet reader by @weiji14 in https://github.com/huggingface/datasets/pull/6508
Use schema metadata only if it matches features by @lhoestq in https://github.com/huggingface/datasets/pull/6616
Raise error on bad split name by @lhoestq in https://github.com/huggingface/datasets/pull/6626
Disable tqdm bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
Add with_rank param to Dataset.filter by @mariosasko in https://github.com/huggingface/datasets/pull/6608
Bump max range of dill to 0.3.8 by @ringohoffman in https://github.com/huggingface/datasets/pull/6630
Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/6631
Faster webdataset streaming by @lhoestq in https://github.com/huggingface/datasets/pull/6578
Multi gpu docs by @lhoestq in https://github.com/huggingface/datasets/pull/6550
dataset viewer requires no-script by @severo in https://github.com/huggingface/datasets/pull/6633
Make split slicing consistent with list slicing by @mariosasko in https://github.com/huggingface/datasets/pull/5891
Do not use Parquet exports if revision is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/6555
Make CLI test support multi-processing by @albertvillanova in https://github.com/huggingface/datasets/pull/6628
Fix reload cache with data dir by @lhoestq in https://github.com/huggingface/datasets/pull/6632
Fix array cast/embed with null values by @mariosasko in https://github.com/huggingface/datasets/pull/6283
Faster column validation and reordering by @psmyth94 in https://github.com/huggingface/datasets/pull/6636
Better multi-gpu example by @lhoestq in https://github.com/huggingface/datasets/pull/6646
Fix missing info when loading some datasets from Parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6635
Minor multi gpu doc improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6649
Document usage of hfh cli instead of git by @lhoestq in https://github.com/huggingface/datasets/pull/6648

New Contributors

@xkszltl made their first contribution in https://github.com/huggingface/datasets/pull/6596
@kkoutini made their first contribution in https://github.com/huggingface/datasets/pull/6464
@JochenSiegWork made their first contribution in https://github.com/huggingface/datasets/pull/6586
@weiji14 made their first contribution in https://github.com/huggingface/datasets/pull/6508
@ringohoffman made their first contribution in https://github.com/huggingface/datasets/pull/6630
@psmyth94 made their first contribution in https://github.com/huggingface/datasets/pull/6636

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0

2.16.1

4 months ago

Bug fixes

Fix dl_manager.extract returning FileNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6543
- Fix a bug that caused FileNotFoundError when passing a relative directory as cache_dir to load_dataset
Fix custom configs from script by @lhoestq in https://github.com/huggingface/datasets/pull/6544
- Fix a bug where loading a dataset with a loading script using custom arguments would fail
- e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1

2.16.0

4 months ago

Security features

Add trust_remote_code argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
- Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in the future by passing the argument trust_remote_code=True.
- Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
- By setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0, you can already disable custom code by default without waiting for the next release of datasets.
Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
- This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
- You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
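As a rough sketch of the points above: the environment variable is set before datasets is imported, trust_remote_code=True is passed explicitly when custom code is acceptable, and the two URLs mentioned in the notes are built from a repo id. The helper names hub_urls and load_with_custom_code are hypothetical, introduced only for illustration, and "username/my_dataset" is a placeholder repo id.

```python
import os
from urllib.parse import quote

# Disable custom dataset code by default. Set this before importing
# `datasets`, since the library reads it at import time.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def hub_urls(repo_id: str) -> dict:
    """Hypothetical helper: URLs for inspecting a dataset repo and its Parquet export."""
    base = f"https://hf.co/datasets/{repo_id}"
    # The export lives on the refs/convert/parquet ref; the slashes are URL-encoded.
    ref = quote("refs/convert/parquet", safe="")
    return {"code": base, "parquet_export": f"{base}/tree/{ref}"}


def load_with_custom_code(repo_id: str):
    """Hypothetical helper: opt in to custom code for one dataset (needs network)."""
    from datasets import load_dataset  # imported lazily so the sketch runs offline

    return load_dataset(repo_id, trust_remote_code=True)


print(hub_urls("username/my_dataset")["parquet_export"])
```

Reviewing the code at the "code" URL before calling load_with_custom_code is the workflow the warning is meant to encourage.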

Features

Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
- This speeds up the load_dataset step that lists the data files of big repositories (up to 100x) but requires huggingface_hub 0.20 or newer
- Fix load_dataset, which used to reload data from cache even if the dataset had been updated on Hugging Face
- Reload a dataset from your cache even if you don't have an internet connection
- New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
- Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
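To make the new cache directory scheme concrete, here is a small sketch that assembles such a path from its components. The function cache_dir_for is a hypothetical helper, not part of datasets, and the argument values in the usage line are placeholders.

```python
from pathlib import Path
from typing import Optional


def cache_dir_for(
    username: str,
    dataset_name: str,
    config_name: str,
    version: str,
    commit_sha: str,
    root: Optional[str] = None,
) -> Path:
    """Hypothetical helper mirroring the scheme described in the release notes."""
    # Default root as given in the notes: ~/.cache/huggingface/datasets
    base = Path(root) if root is not None else Path.home() / ".cache" / "huggingface" / "datasets"
    # The user and dataset names are joined with three underscores.
    return base / f"{username}___{dataset_name}" / config_name / version / commit_sha


# Placeholder values, for illustration only.
print(cache_dir_for("username", "dataset_name", "default", "0.0.0", "abc123"))
```

Because the commit SHA is part of the path, an updated dataset on Hugging Face maps to a different directory than the previously cached revision.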

General improvements and bug fixes

Remove unused argument in _get_data_files_patterns by @lhoestq in https://github.com/huggingface/datasets/pull/6343
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
Use ruff for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in https://github.com/huggingface/datasets/pull/6431
Fix multi gpu map example by @lhoestq in https://github.com/huggingface/datasets/pull/6415
Better tqdm wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
Use filelock package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
Fix metadata file resolution when inferred pattern is ** by @mariosasko in https://github.com/huggingface/datasets/pull/6449
Update hub-docs reference by @mishig25 in https://github.com/huggingface/datasets/pull/6453
Refactor dill logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
Don't require trust_remote_code in inspect_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6456
[docs] troubleshooting guide by @MKhalusova in https://github.com/huggingface/datasets/pull/6424
Missing DatasetNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6462
Disable benchmarks in PRs by @lhoestq in https://github.com/huggingface/datasets/pull/6463
More robust temporary directory deletion by @mariosasko in https://github.com/huggingface/datasets/pull/6426
Fix shard retry mechanism in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6461
Use auth to get parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6468
Remove delete doc CI by @lhoestq in https://github.com/huggingface/datasets/pull/6471
Fix CI quality by @albertvillanova in https://github.com/huggingface/datasets/pull/6473
Fix PermissionError on Windows CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6477
More robust preupload retry mechanism by @mariosasko in https://github.com/huggingface/datasets/pull/6479
Add IterableDataset __repr__ by @lhoestq in https://github.com/huggingface/datasets/pull/6480
Fix max lock length on unix by @lhoestq in https://github.com/huggingface/datasets/pull/6482
Fix ArrayXD YAML conversion by @mariosasko in https://github.com/huggingface/datasets/pull/6168
Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6486
Fix deprecation warning when building conda package by @albertvillanova in https://github.com/huggingface/datasets/pull/6425
Make push_to_hub return CommitInfo by @albertvillanova in https://github.com/huggingface/datasets/pull/6492
docs: add reference Git over SSH by @severo in https://github.com/huggingface/datasets/pull/6499
Fallback on dataset script if user wants to load default config by @lhoestq in https://github.com/huggingface/datasets/pull/6498
Don't expand_info in HF glob by @lhoestq in https://github.com/huggingface/datasets/pull/6469
Fix streaming xnli by @lhoestq in https://github.com/huggingface/datasets/pull/6503
Pickle support for torch.Generator objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
Enable setting config as default when push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6500
Better cast error when generating dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6509
Replace list_files_info with list_repo_tree in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6510
Remove deprecated HfFolder by @lhoestq in https://github.com/huggingface/datasets/pull/6512
Support huggingface-hub pre-releases by @albertvillanova in https://github.com/huggingface/datasets/pull/6516
Support push_to_hub canonical datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6519
Support commit_description parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6520
fix get_metadata_patterns function args error by @d710055071 in https://github.com/huggingface/datasets/pull/6518
Fix metrics dead link by @qgallouedec in https://github.com/huggingface/datasets/pull/6491
fix tests by @lhoestq in https://github.com/huggingface/datasets/pull/6523
Cache backward compatibility with 2.15.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6514
Preserve order of configs and splits when using Parquet exports by @albertvillanova in https://github.com/huggingface/datasets/pull/6526

New Contributors

@LZHgrla made their first contribution in https://github.com/huggingface/datasets/pull/6444
@d710055071 made their first contribution in https://github.com/huggingface/datasets/pull/6518

Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0

2.15.0

5 months ago

What's Changed

Fix typo in Audio dataset documentation by @prassanna-ravishankar in https://github.com/huggingface/datasets/pull/6222
Add push_to_hub with multiple configs docs by @lhoestq in https://github.com/huggingface/datasets/pull/6226
Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in https://github.com/huggingface/datasets/pull/6228
Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6233
Don't skip hidden files in dl_manager.iter_files when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6223
Remove unused global variables in audio.py by @mariosasko in https://github.com/huggingface/datasets/pull/6241
Improve error message for missing function parameters by @suavemint in https://github.com/huggingface/datasets/pull/6232
Fix cast from fixed size list to variable size list by @mariosasko in https://github.com/huggingface/datasets/pull/6243
Update create_dataset.mdx by @EswarDivi in https://github.com/huggingface/datasets/pull/6247
[DOCS] Fix typo: Elasticsearch by @leemthompo in https://github.com/huggingface/datasets/pull/6258
Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in https://github.com/huggingface/datasets/pull/6251
Temporarily pin tensorflow < 2.14.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6264
Fix CI 404 errors by @albertvillanova in https://github.com/huggingface/datasets/pull/6262
Remove apache_beam import in BeamBasedBuilder._save_info by @mariosasko in https://github.com/huggingface/datasets/pull/6265
Improve documentation of dataset.from_generator by @hartmans in https://github.com/huggingface/datasets/pull/6281
Fix parquet columns argument in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/6295
Doc readme improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6298
Unpin tensorflow maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
Unpin jax maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
Fix ArrayXD cast by @mariosasko in https://github.com/huggingface/datasets/pull/6297
Reduce the number of commits in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6269
Fix typo in code example in docs by @bryant1410 in https://github.com/huggingface/datasets/pull/6307
Update README.md by @smty2018 in https://github.com/huggingface/datasets/pull/6304
Deterministic set hash by @lhoestq in https://github.com/huggingface/datasets/pull/6318
docs: resolving namespace conflict, refactored variable by @smty2018 in https://github.com/huggingface/datasets/pull/6312
Fix typos by @python273 in https://github.com/huggingface/datasets/pull/6321
Fix commit message formatting in multi-commit uploads by @qgallouedec in https://github.com/huggingface/datasets/pull/6313
Temporarily pin fsspec < 2023.10.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6331
Unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/6336
Fix use_dataset.mdx by @angel-luis in https://github.com/huggingface/datasets/pull/6351
Add fsspec version to the datasets-cli env command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
Expanduser in save_to_disk() by @Unknown3141592 in https://github.com/huggingface/datasets/pull/6098
Fix time measuring snippet in docs by @mariosasko in https://github.com/huggingface/datasets/pull/6367
Temporarily pin pyarrow < 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6375
Fix typo in Dataset.map docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
Avoid redundant warning when encoding NumPy array as Image by @mariosasko in https://github.com/huggingface/datasets/pull/6379
Replace deprecated license_file in setup.cfg by @albertvillanova in https://github.com/huggingface/datasets/pull/6332
Minor release step improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6339
Fix dependency conflict within CI build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/6411
Remove redundant condition in builders by @albertvillanova in https://github.com/huggingface/datasets/pull/6398
Handle future deprecation argument by @winglian in https://github.com/huggingface/datasets/pull/6390
Remove token value from warnings by @mariosasko in https://github.com/huggingface/datasets/pull/6418
Rename audio_classificiation.py to audio_classification.py by @carlthome in https://github.com/huggingface/datasets/pull/6416
Add pyarrow-hotfix to release docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6421
Simplify filesystem logic by @mariosasko in https://github.com/huggingface/datasets/pull/6362
Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/6423

New Contributors

@prassanna-ravishankar made their first contribution in https://github.com/huggingface/datasets/pull/6222
@NinoRisteski made their first contribution in https://github.com/huggingface/datasets/pull/6233
@suavemint made their first contribution in https://github.com/huggingface/datasets/pull/6232
@EswarDivi made their first contribution in https://github.com/huggingface/datasets/pull/6247
@leemthompo made their first contribution in https://github.com/huggingface/datasets/pull/6258
@hartmans made their first contribution in https://github.com/huggingface/datasets/pull/6281
@smty2018 made their first contribution in https://github.com/huggingface/datasets/pull/6304
@python273 made their first contribution in https://github.com/huggingface/datasets/pull/6321
@angel-luis made their first contribution in https://github.com/huggingface/datasets/pull/6351
@Unknown3141592 made their first contribution in https://github.com/huggingface/datasets/pull/6098
@winglian made their first contribution in https://github.com/huggingface/datasets/pull/6390
@carlthome made their first contribution in https://github.com/huggingface/datasets/pull/6416

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0

2.14.7

5 months ago

Bug Fixes

Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in https://github.com/huggingface/datasets/pull/6346
Fix python formatting for complex types in format_table by @mariosasko in https://github.com/huggingface/datasets/pull/6368
Support pyarrow 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6378
Do not try to download from HF GCS for generator by @yundai424 in https://github.com/huggingface/datasets/pull/6372
Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in https://github.com/huggingface/datasets/pull/6404

New Contributors

@cwallenwein made their first contribution in https://github.com/huggingface/datasets/pull/6346
@yundai424 made their first contribution in https://github.com/huggingface/datasets/pull/6372

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7

2.14.6

6 months ago

What's Changed

Ignore dataset_info.json in data files resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6224
Check builder cls default config name in inspect by @lhoestq in https://github.com/huggingface/datasets/pull/6253
Add support for fsspec>=2023.9.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6244
Create DefunctDatasetError by @albertvillanova in https://github.com/huggingface/datasets/pull/6286
Fix get_data_patterns for directories with the word data twice by @albertvillanova in https://github.com/huggingface/datasets/pull/6309
Fix loading Hub datasets with CSV metadata file by @albertvillanova in https://github.com/huggingface/datasets/pull/6316
datasets.filesystems: fix is_remote_filesystems by @ap-- in https://github.com/huggingface/datasets/pull/6334
Pin upper version of fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/6337
Fix regex get_data_files formatting for base paths by @ZachNagengast in https://github.com/huggingface/datasets/pull/6322

New Contributors

@ap-- made their first contribution in https://github.com/huggingface/datasets/pull/6334
@ZachNagengast made their first contribution in https://github.com/huggingface/datasets/pull/6322

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6

2.14.5

6 months ago

Bug fixes

Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6091
Minor fix in iter_files for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
Use yaml instead of get data patterns when possible by @lhoestq in https://github.com/huggingface/datasets/pull/6154
Fix Parquet loading with columns by @mariosasko in https://github.com/huggingface/datasets/pull/6160
Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
PyArrow 13 CI fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6175
Don't alter input in Features.from_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6189
Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/6165
Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6192
Temporarily pin pandas < 2.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6200
Preserve split order in DataFilesDict by @albertvillanova in https://github.com/huggingface/datasets/pull/6198
Add missing revision argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
Temporarily pin fsspec < 2023.9.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6210
Do not filter out .zip extensions from no-script datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6208
Fix empty splitinfo json by @lhoestq in https://github.com/huggingface/datasets/pull/6211
Fix to_json ValueError and remove pandas pin by @albertvillanova in https://github.com/huggingface/datasets/pull/6201
Fix checking patterns to infer packaged builder by @polinaeterna in https://github.com/huggingface/datasets/pull/6215
Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/6218

Other improvements

Deprecate Dataset.export by @mariosasko in https://github.com/huggingface/datasets/pull/6081
Deprecate download_custom by @mariosasko in https://github.com/huggingface/datasets/pull/6093
Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in https://github.com/huggingface/datasets/pull/6138
Remove unused allowed_extensions param by @albertvillanova in https://github.com/huggingface/datasets/pull/6135
Export to_iterable_dataset to document by @npuichigo in https://github.com/huggingface/datasets/pull/6145
[Docs] Add description of select_columns to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
Ignore parallel warning in map_nested by @lhoestq in https://github.com/huggingface/datasets/pull/6148
[docs] Complete to_iterable_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/6158
Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in https://github.com/huggingface/datasets/pull/6155
Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in https://github.com/huggingface/datasets/pull/6171
Document BUILDER_CONFIG_CLASS by @lhoestq in https://github.com/huggingface/datasets/pull/6166
Fix import in image_load doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
Use object detection images from huggingface/documentation-images by @mariosasko in https://github.com/huggingface/datasets/pull/6177
Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180

New Contributors

@npuichigo made their first contribution in https://github.com/huggingface/datasets/pull/6145
@unifyh made their first contribution in https://github.com/huggingface/datasets/pull/6119

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5