🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
`.to_polars()`;
import polars as pl
from datasets import load_dataset
ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)
ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
`fsspec` support for `to_json`, `to_csv`, and `to_parquet` by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
`mode` parameter to `Image` feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
dataset = dataset.cast_column("image", Image(mode="RGB"))
datasets-cli convert_to_parquet <dataset_id>
ds = ds.take(10) # take only the first 10 examples
`remove_columns`/`rename_columns` doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
`uv` in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
`_check_legacy_cache2` by @lhoestq in https://github.com/huggingface/datasets/pull/6792
`DatasetBuilder._split_generators` incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
`CachedDatasetModuleFactory` and `Cache` by @izhx in https://github.com/huggingface/datasets/pull/6754
`os.path.relpath` in `resolve_patterns` by @mariosasko in https://github.com/huggingface/datasets/pull/6815
`Dataset.__getitem__` by @mariosasko in https://github.com/huggingface/datasets/pull/6817
Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0
`num_workers` could lead to incorrect shard assignments to workers and cause errors
`xlistdir` by @mariosasko in https://github.com/huggingface/datasets/pull/6698
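To illustrate the `num_workers` entry above: when a sharded dataset is read by several workers, each worker should receive a disjoint subset of the shards. The following is a minimal sketch that assumes a round-robin split; `shards_for_worker` is a hypothetical helper and not the library's implementation.

```python
from typing import List


def shards_for_worker(shard_ids: List[int], worker_id: int, num_workers: int) -> List[int]:
    """Return the shards handled by one worker under an assumed round-robin split."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Worker w takes shards w, w + num_workers, w + 2 * num_workers, ...
    return shard_ids[worker_id::num_workers]


shards = list(range(5))
assignments = [shards_for_worker(shards, w, num_workers=3) for w in range(3)]
```

With more workers than shards, some workers receive an empty list, which is one of the cases a loader has to handle.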
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0
`arrow_writer.py` from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1
`drop_last_batch` in `map` after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
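As a reminder of what `drop_last_batch` controls in a batched `map`, here is a plain-Python sketch. `iter_batches` is a hypothetical stand-in for illustration only; it does not reproduce the library's code or the fix in the pull request above.

```python
from typing import Iterator, List, Sequence


def iter_batches(examples: Sequence, batch_size: int, drop_last_batch: bool = False) -> Iterator[List]:
    """Yield consecutive batches; optionally skip a final batch that is not full."""
    for start in range(0, len(examples), batch_size):
        batch = list(examples[start:start + batch_size])
        if drop_last_batch and len(batch) < batch_size:
            break  # the incomplete trailing batch is dropped
        yield batch


all_batches = list(iter_batches(range(10), batch_size=4))
full_batches = list(iter_batches(range(10), batch_size=4, drop_last_batch=True))
```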
`setup.cfg` to `pyproject.toml` by @mariosasko in https://github.com/huggingface/datasets/pull/6619
`tqdm` bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
`with_rank` param to `Dataset.filter` by @mariosasko in https://github.com/huggingface/datasets/pull/6608
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0
`cache_dir` to `load_dataset`
load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1
`https://hf.co/datasets/<repo_id>`. A warning is shown to let the user know about the custom code, and they can avoid this message in the future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load these datasets from the next major release of `datasets`.
With `HF_DATASETS_TRUST_REMOTE_CODE=0`, you can already disable custom code by default without waiting for the next release of `datasets`.
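The two switches above can be combined: disable custom code globally through the environment variable, then opt in per call. This is a minimal sketch; it assumes the variable is read when `datasets` is imported, and `load_with_explicit_trust` is a hypothetical wrapper, not part of the library.

```python
import os

# Assumption: the variable is read at import time of `datasets`, so set it first.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def load_with_explicit_trust(repo_id: str, trust_remote_code: bool = False):
    """Hypothetical wrapper: custom code runs only when the caller opts in."""
    # Imported lazily so the environment variable above is already in place.
    from datasets import load_dataset

    return load_dataset(repo_id, trust_remote_code=trust_remote_code)
```

A call such as `load_with_explicit_trust("<repo_id>", trust_remote_code=True)` would then opt in for a single dataset.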
https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
`load_dataset` step that lists the data files of big repositories (up to x100) but requires `huggingface_hub` 0.20 or newer
`load_dataset` that used to reload data from cache even if the dataset was updated on Hugging Face
`~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha`
`datasets` 2.15 (using the old scheme) are still reloaded from cache
`_get_data_files_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6343
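The cache directory scheme listed above can be read as a path template. The sketch below only assembles a path of that shape; `cache_dir_for` and its arguments are hypothetical, and how the library derives each component is not covered here.

```python
from pathlib import PurePosixPath


def cache_dir_for(repo_id: str, config_name: str, version: str, commit_sha: str,
                  root: str = "~/.cache/huggingface/datasets") -> str:
    """Assemble <root>/<username>___<dataset_name>/<config_name>/<version>/<commit_sha>."""
    # "username/dataset_name" becomes "username___dataset_name" in the scheme above.
    folder = repo_id.replace("/", "___")
    return str(PurePosixPath(root) / folder / config_name / version / commit_sha)


example_path = cache_dir_for("username/dataset_name", "config_name", "version", "commit_sha")
```

Because the commit SHA is part of the path, an updated dataset resolves to a different directory instead of reusing stale cached data.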
`usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
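For context, `usedforsecurity=False` is a keyword accepted by `hashlib` constructors since Python 3.9. It declares that the digest is not used for security purposes, which lets algorithms such as MD5 stay usable in restricted (for example FIPS) environments. A minimal sketch, with `fingerprint` as a hypothetical helper:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hash data for caching or deduplication, not for any security purpose."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


digest = fingerprint(b"hello")
```

The flag does not change the resulting digest; it only changes whether the call is permitted on restricted builds.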
`ruff` for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
`tqdm` wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
`Table.__getstate__` and `Table.__setstate__` by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
`filelock` package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
`**` by @mariosasko in https://github.com/huggingface/datasets/pull/6449
`dill` logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
`push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6461
`__repr__` by @lhoestq in https://github.com/huggingface/datasets/pull/6480
`torch.Generator` objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
`list_files_info` with `list_repo_tree` in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6510
Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0
`dl_manager.iter_files` when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
`audio.py` by @mariosasko in https://github.com/huggingface/datasets/pull/6241
`apache_beam` import in `BeamBasedBuilder._save_info` by @mariosasko in https://github.com/huggingface/datasets/pull/6265
`tensorflow` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
`jax` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
`push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6269
`fsspec` version to the `datasets-cli env` command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
`Dataset.map` docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
`Image` by @mariosasko in https://github.com/huggingface/datasets/pull/6379
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6
`iter_files` for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
`columns` by @mariosasko in https://github.com/huggingface/datasets/pull/6160
`datasets_info.json` but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
`revision` argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
`Dataset.export` by @mariosasko in https://github.com/huggingface/datasets/pull/6081
`download_custom` by @mariosasko in https://github.com/huggingface/datasets/pull/6093
`select_columns` to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
`to_iterable_dataset` by @stevhliu in https://github.com/huggingface/datasets/pull/6158
`image_load` doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
`huggingface/documentation-images` by @mariosasko in https://github.com/huggingface/datasets/pull/6177
`hf-internal-testing` repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5