Datasets Versions

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

2.19.0

2 weeks ago

Dataset Features

  • Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531
    • Convert to a Polars dataframe using .to_polars():
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      # Count rows per topic, largest groups first
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
      
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
      
  • Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
    • Save to the Hugging Face Hub as JSON, CSV, or Parquet:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
      
  • Add mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
    • Set images to be read in a given mode, such as "RGB":
      from datasets import Image
      dataset = dataset.cast_column("image", Image(mode="RGB"))
      
  • Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
    • Run this command to open a PR on a script-based dataset that converts it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
      
  • Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
    • These behave the same as IterableDataset.take and IterableDataset.skip:
      ds = ds.take(10)  # take only the first 10 examples
      

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0

2.18.0

2 months ago

Dataset features

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0

2.17.1

2 months ago

Bug Fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1

2.17.0

2 months ago

Dataset Features

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0

2.16.1

4 months ago

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1

2.16.0

4 months ago

Security features

  • Add trust_remote_code argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
    • Some Hugging Face datasets contain custom code that must be executed to load the dataset correctly. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning now tells the user about the custom code; passing trust_remote_code=True silences it in the future.
    • Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
    • Setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 already disables custom code by default, without waiting for the next release of datasets.
  • Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
    • This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
    • You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
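
The opt-in flag, the environment variable, and the Parquet export location described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's internals: parquet_export_url and load_trusting_remote_code are hypothetical helper names, and username/my_dataset is a placeholder repository id. The environment variable is set before datasets is imported, on the assumption that the library reads it at import time.

```python
import os
from urllib.parse import quote

# Disable custom dataset code by default (set before importing `datasets`).
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def parquet_export_url(repo_id: str) -> str:
    """Build the URL of a dataset's Parquet export branch (hypothetical helper)."""
    # The branch name refs/convert/parquet is percent-encoded in the URL.
    revision = quote("refs/convert/parquet", safe="")
    return f"https://hf.co/datasets/{repo_id}/tree/{revision}"


def load_trusting_remote_code(repo_id: str):
    """Load a script-based dataset, explicitly opting in to its custom code."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, trust_remote_code=True)


# "username/my_dataset" is a placeholder, not a real repository.
print(parquet_export_url("username/my_dataset"))
```

Inspect the repository code before calling load_trusting_remote_code; when a Parquet export exists, reading it avoids executing any custom code.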

Features

  • Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
  • Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
  • Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
    • This speeds up the load_dataset step that lists the data files of big repositories (by up to 100x), but requires huggingface_hub 0.20 or newer
    • Fix load_dataset reloading data from the cache even when the dataset had been updated on Hugging Face
    • Reload a dataset from your cache even if you don't have an internet connection
    • New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
    • Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
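
To make the new cache directory scheme concrete, the sketch below composes a path from its five components. It only illustrates the layout listed above: no_script_cache_dir is a hypothetical helper, the values "default", "0.0.0", and "abc123" are made-up placeholders, and the actual cache root may differ if it has been overridden in your environment.

```python
from pathlib import Path


def no_script_cache_dir(
    username: str,
    dataset_name: str,
    config_name: str,
    version: str,
    commit_sha: str,
    cache_root: Path = Path("~/.cache/huggingface/datasets"),
) -> Path:
    """Compose a cache directory following the scheme above (hypothetical helper)."""
    # The repository owner and dataset name are joined with three underscores.
    repo_dir = f"{username}___{dataset_name}"
    return cache_root / repo_dir / config_name / version / commit_sha


# Placeholder values for illustration only.
example = no_script_cache_dir("username", "dataset_name", "default", "0.0.0", "abc123")
print(example)
```

Call .expanduser() on the result to resolve the leading ~ to the home directory before using the path on disk.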

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0

2.15.0

5 months ago

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0

2.14.7

5 months ago

Bug Fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7

2.14.6

6 months ago

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6

2.14.5

6 months ago

Bug fixes

Other improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5