Datasets Versions

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

2.14.5

6 months ago

Bug fixes

Other improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5

2.13.2

8 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2

2.14.4

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4

2.14.3

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3

2.14.2

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2

2.14.1

9 months ago

Bug fixes

Other improvements

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1

2.14.0

9 months ago

Important: caching

  • Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets (and will therefore be re-downloaded).
  • Datasets that were already cached are still supported.
  • This affects datasets on Hugging Face without dataset scripts, e.g. those made only of Parquet, CSV, JSONL, etc. files.
  • This is because the default configuration name for those datasets was fixed (from "username--dataset_name" to "default") in https://github.com/huggingface/datasets/pull/5331 (see the sketch below).
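
A minimal sketch of the effect, assuming a hypothetical script-less dataset repository "username/dataset_name":

    from datasets import load_dataset

    # With datasets>=2.14.0 the default configuration of a script-less dataset
    # is named "default" rather than "username--dataset_name", so its cache
    # location changes and older versions of datasets cannot reuse it.
    ds = load_dataset("username/dataset_name", "default", split="train")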

Dataset Configuration

  • Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331

    • Configure your dataset using YAML at the top of your dataset card (see the documentation)
    • Choose which file goes into which split
      ---
      configs:
      - config_name: default
        data_files:
        - split: train
          path: data.csv
        - split: test
          path: holdout.csv
      ---
    
    • Define multiple dataset configurations (a loading sketch follows these examples)
      ---
      configs:
      - config_name: main_data
        data_files: main_data.csv
      - config_name: additional_data
        data_files: additional_data.csv
      ---
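
Once the configurations are declared, each one can be loaded by name. A minimal sketch, reusing the hypothetical repository and config names from the examples above:

    from datasets import load_dataset

    # Load a configuration declared in the dataset card's YAML; a single CSV
    # with no split specified is assigned to the "train" split by default
    main = load_dataset("username/dataset_name", "main_data", split="train")
    extra = load_dataset("username/dataset_name", "additional_data", split="train")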
    

Dataset Features

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0

2.13.1

10 months ago

General improvements and bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1

2.13.0

11 months ago

Dataset Features

  • Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

    • Stream the data from your Spark DataFrame directly to your training pipeline
    from datasets import IterableDataset
    from torch.utils.data import DataLoader

    # df is an existing pyspark.sql.DataFrame
    ids = IterableDataset.from_spark(df)
    ids = ids.map(...).filter(...).with_format("torch")
    for batch in DataLoader(ids, batch_size=16, num_workers=4):
        ...
    
  • IterableDataset formatting for PyTorch, TensorFlow, JAX, NumPy and Arrow:

    from datasets import load_dataset
    
    ids = load_dataset("c4", "en", split="train", streaming=True)
    ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
    
  • Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

    from datasets import IterableDataset
    
    ids = IterableDataset.from_file("path/to/data.arrow")
    
  • Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

    from datasets import load_dataset
    
    ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
    

Experimental

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0

2.12.0

1 year ago

Dataset Features

  • Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701

    • Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> # df is an existing pyspark.sql.DataFrame
    >>> ds = Dataset.from_spark(df)
    
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
    
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735

    • Use interleaved datasets in a distributed setup or with a DataLoader
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
    
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas (a short sketch of the new behavior follows)
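
A minimal sketch of the resulting formatting, assuming a small hypothetical dataset with a fixed-shape Array2D column:

    >>> from datasets import Dataset, Features, Array2D
    >>> features = Features({"matrix": Array2D(shape=(2, 2), dtype="int32")})
    >>> ds = Dataset.from_dict({"matrix": [[[1, 2], [3, 4]]]}, features=features)
    >>> ds[0]["matrix"]  # Python formatting now returns a list of lists
    [[1, 2], [3, 4]]
    >>> ds.with_format("numpy")[0]["matrix"]  # numeric NumPy array when shapes align
    array([[1, 2],
           [3, 4]], dtype=int32)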

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0