Datasets Versions

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

2.14.5

6 months ago

Bug fixes

Other improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5

2.13.2

8 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2

2.14.4

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4

2.14.3

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3

2.14.2

9 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2

2.14.1

9 months ago

Bug fixes

Other improvements

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1

2.14.0

9 months ago

Important: caching

  • Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets (and will therefore be re-downloaded).
  • Datasets that were already cached are still supported.
  • This affects datasets on Hugging Face without dataset scripts, e.g. those made only of Parquet, CSV, JSONL, etc. files.
  • This is because the default configuration name for those datasets was fixed (from "username--dataset_name" to "default") in https://github.com/huggingface/datasets/pull/5331 (see the sketch below).
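
A minimal sketch of the effect, assuming a hypothetical script-less dataset repository "username/dataset_name":

    from datasets import load_dataset

    # With datasets>=2.14.0 the default configuration of a script-less dataset
    # is named "default" rather than "username--dataset_name", so its cache
    # location changes and older versions of datasets cannot reuse it.
    ds = load_dataset("username/dataset_name", "default", split="train")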

Dataset Configuration

  • Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331

    • Configure your dataset using YAML at the top of your dataset card (see the documentation)
    • Choose which file goes into which split
      ---
      configs:
      - config_name: default
        data_files:
        - split: train
          path: data.csv
        - split: test
          path: holdout.csv
      ---
    
    • Define multiple dataset configurations (a loading sketch follows these examples)
      ---
      configs:
      - config_name: main_data
        data_files: main_data.csv
      - config_name: additional_data
        data_files: additional_data.csv
      ---
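
Once the configurations are declared, each one can be loaded by name. A minimal sketch, reusing the hypothetical repository and config names from the examples above:

    from datasets import load_dataset

    # Load a configuration declared in the dataset card's YAML; a single CSV
    # with no split specified is assigned to the "train" split by default
    main = load_dataset("username/dataset_name", "main_data", split="train")
    extra = load_dataset("username/dataset_name", "additional_data", split="train")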
    

Dataset Features

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0

2.13.1

10 months ago

General improvements and bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1

2.13.0

11 months ago

Dataset Features

  • Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

    • Stream the data from your Spark DataFrame directly to your training pipeline
    from datasets import IterableDataset
    from torch.utils.data import DataLoader

    # df is an existing pyspark.sql.DataFrame
    ids = IterableDataset.from_spark(df)
    ids = ids.map(...).filter(...).with_format("torch")
    for batch in DataLoader(ids, batch_size=16, num_workers=4):
        ...
    
  • IterableDataset formatting for PyTorch, TensorFlow, JAX, NumPy and Arrow:

    from datasets import load_dataset
    
    ids = load_dataset("c4", "en", split="train", streaming=True)
    ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
    
  • Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

    from datasets import IterableDataset
    
    ids = IterableDataset.from_file("path/to/data.arrow")
    
  • Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

    from datasets import load_dataset
    
    ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
    

Experimental

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0

2.12.0

1 year ago

Dataset Features

  • Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701

    • Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> # df is an existing pyspark.sql.DataFrame
    >>> ds = Dataset.from_spark(df)
    
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
    
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735

    • Use interleaved datasets in a distributed setup or with a DataLoader
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
    
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas (a short sketch of the new behavior follows)
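
A minimal sketch of the resulting formatting, assuming a small hypothetical dataset with a fixed-shape Array2D column:

    >>> from datasets import Dataset, Features, Array2D
    >>> features = Features({"matrix": Array2D(shape=(2, 2), dtype="int32")})
    >>> ds = Dataset.from_dict({"matrix": [[[1, 2], [3, 4]]]}, features=features)
    >>> ds[0]["matrix"]  # Python formatting now returns a list of lists
    [[1, 2], [3, 4]]
    >>> ds.with_format("numpy")[0]["matrix"]  # numeric NumPy array when shapes align
    array([[1, 2],
           [3, 4]], dtype=int32)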

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0