Instant search for and access to many datasets in PySpark.
Provides instant access to many datasets right from PySpark, loaded as Spark DataFrames.
Drop a star if you like the project. 😃 It motivates 💪 me to keep working on projects like this.
The idea is simple. There are various datasets available out there, but they are scattered across different places on the web. Is there a quick way (in PySpark) to access them instantly, without the hassle of searching, downloading, reading, etc.? SparkDataset tries to address that question :)
Start by importing data():

from sparkdataset import data

titanic = data('titanic')        # load the Titanic dataset as a Spark DataFrame
data('titanic', show_doc=True)   # show the dataset's documentation
data()                           # list the available datasets
data('ab')                       # a close-but-unknown name triggers suggestions
Did you mean:
crabs, abbey, Vocab
That's it.
Go to this notebook for a demonstration of the functionality.
In R, there is a very easy and immediate way to access many statistical datasets, with almost no effort. All it takes is one line:

> data(dataset_name)

This makes life easier for quick prototyping and testing. Well, I am jealous that PySpark does not have similar functionality. Thus, the aim of sparkdataset is to fill that gap.
Currently, sparkdataset has about 757 (mostly numerical) datasets, which are based on RDatasets.
In the future, I plan to scale it to include a larger set of datasets.
To install:

$ pip install sparkdataset

To uninstall the package and remove its local data directory:

$ pip uninstall sparkdataset
$ rm -rf $HOME/.sparkdataset
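The cleanup command above hints that downloaded data lives in a local cache under $HOME/.sparkdataset. Here is a minimal sketch of how such a cache lookup could work; the file layout, function names, and CSV format are assumptions for illustration, not the package's actual internals:

```python
# Hypothetical sketch: the cache layout and names below are assumptions.
from pathlib import Path

CACHE_DIR = Path.home() / ".sparkdataset"  # removed by `rm -rf $HOME/.sparkdataset`

def cache_path(name: str) -> Path:
    """Where a dataset's CSV would live once downloaded (assumed layout)."""
    return CACHE_DIR / f"{name}.csv"

def load(name: str, spark):
    """Read a cached dataset as a Spark DataFrame (sketch only).

    A real implementation would download the CSV from an RDatasets
    mirror on first use before reading it.
    """
    path = cache_path(name)
    if not path.exists():
        raise FileNotFoundError(f"{name!r} is not cached yet")
    # header=True and inferSchema=True mirror typical CSV loading in Spark
    return spark.read.csv(str(path), header=True, inferSchema=True)
```

Keeping a per-user cache directory is what lets repeated calls to data() return instantly after the first download.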
Latest version: 1.0.0
>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
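The "Did you mean" behavior above can be approximated with fuzzy string matching from the standard library. A sketch, assuming a simple in-memory index of dataset names; the sample list and the cutoff value are illustrative, not the package's actual logic:

```python
import difflib

# A small sample of dataset names; the real index has ~757 entries.
DATASETS = ["Wheat", "heart", "Heating", "Yeast", "titanic", "crabs", "abbey", "Vocab"]

def suggest(query, names=DATASETS, n=5, cutoff=0.5):
    """Return dataset names that closely match a misspelled query."""
    # match case-insensitively but report the original casing
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), lowered, n=n, cutoff=cutoff)
    return [lowered[h] for h in hits]

print(suggest("heat"))  # → ['Wheat', 'heart', 'Heating', 'Yeast']
```

With this sample index, suggest('ab') also surfaces crabs, abbey, and Vocab, matching the example shown earlier.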
The logo credit goes to Aleksandar Savic.