🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
* `iter_files` for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
* `columns` by @mariosasko in https://github.com/huggingface/datasets/pull/6160
* `datasets_info.json` but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
* `revision` argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
* `Dataset.export` by @mariosasko in https://github.com/huggingface/datasets/pull/6081
* `download_custom` by @mariosasko in https://github.com/huggingface/datasets/pull/6093
* `select_columns` to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
* `to_iterable_dataset` by @stevhliu in https://github.com/huggingface/datasets/pull/6158
* `image_load` doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
* `huggingface/documentation-images` by @mariosasko in https://github.com/huggingface/datasets/pull/6177
* `hf-internal-testing` repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2
* `Overview.ipynb` & detach Jupyter Notebooks from `datasets` repository by @alvarobartt in https://github.com/huggingface/datasets/pull/5902
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1
Datasets cached with `datasets>=2.14.0` may not be reloaded from cache using an older version of `datasets` (and are therefore re-downloaded).

Support for multiple configs via metadata YAML info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data.csv
  - split: test
    path: holdout.csv
---
```
```yaml
---
configs:
- config_name: main_data
  data_files: main_data.csv
- config_name: additional_data
  data_files: additional_data.csv
---
```
Push additional dataset configurations with `push_to_hub()`:

```python
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
```
* Support returning dataframe in map transform by @mariosasko in https://github.com/huggingface/datasets/pull/5995
* Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in https://github.com/huggingface/datasets/pull/5974
* `huggingface_hub`'s RepoCard API by @mariosasko in https://github.com/huggingface/datasets/pull/5949
* `joblib` to avoid `joblibspark` test failures by @mariosasko in https://github.com/huggingface/datasets/pull/6000
* `column_names` type check with type hint in `sort` by @mariosasko in https://github.com/huggingface/datasets/pull/6001
* Deprecate `use_auth_token` in favor of `token` by @mariosasko in https://github.com/huggingface/datasets/pull/5996
* `ClassLabel` min max check for `None` values by @mariosasko in https://github.com/huggingface/datasets/pull/6023
* `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in https://github.com/huggingface/datasets/pull/6027
* `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in https://github.com/huggingface/datasets/pull/6052
* `Dataset.from_list` docstring by @mariosasko in https://github.com/huggingface/datasets/pull/6062
* `features` are specified by @mariosasko in https://github.com/huggingface/datasets/pull/6045
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0
* `list_datasets` by @mariosasko in https://github.com/huggingface/datasets/pull/5964
* Add `encoding` and `errors` params to JSON loader by @mariosasko in https://github.com/huggingface/datasets/pull/5969
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1
* Add `IterableDataset.from_spark` by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

```python
from datasets import IterableDataset
from torch.utils.data import DataLoader

# df is a PySpark DataFrame created beforehand
ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...
```
* `IterableDataset` formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:

```python
from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
```
* Add `IterableDataset.from_file` to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

```python
from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")
```
* Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

```python
from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
```
* `stopping_strategy` of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
* `BuilderConfig` by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
* Add `accelerate` as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
* Add `date_format` param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
* Add `fn_kwargs` to `map` and `filter` of `IterableDataset` and `IterableDatasetDict` by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
* `FixedSizeListArray` casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
* `DatasetBuilder.as_dataset` when `file_format` is not `"arrow"` by @mariosasko in https://github.com/huggingface/datasets/pull/5915
* Add `flatten_indices` to `DatasetDict` by @maximxlss in https://github.com/huggingface/datasets/pull/5907
* Make `batch_size` optional, and minor improvements in `Dataset.to_tf_dataset` by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
* `to_numpy` when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933
Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0
* Add `Dataset.from_spark` by @maddiedawson in https://github.com/huggingface/datasets/pull/5701

```python
>>> from datasets import Dataset
>>> ds = Dataset.from_spark(df)
```
* Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
```
* Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735

```python
>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)
```
Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0