ByteHub: making feature stores simple
An easy-to-use feature store.
A feature store is a data storage system for data science and machine learning. It can store both raw data and transformed features, which can be fed straight into an ML model or training script.
Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.
The ByteHub Feature Store is designed to:
It is built on Dask to support large datasets and cluster compute environments.
Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.
See the ByteHub documentation and notebook tutorials to learn more and get started.
Install using pip:
pip install bytehub
Create a local SQLite feature store by running:
import bytehub as bh
import pandas as pd
fs = bh.FeatureStore()
Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:
fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)
Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:
fs.create_feature('tutorial/numbers', description='Timeseries of numbers')
Now save some data into the feature store:
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})
fs.save_dataframe(df, 'tutorial/numbers')
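Before saving, it can help to check the dataframe itself. The following is a minimal sketch using only pandas. The checks (row count, duplicate timestamps, ordering) are illustrative assumptions about what is worth verifying, not requirements stated by ByteHub:

```python
import pandas as pd

# Same construction as above: one row per calendar day.
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

# Illustrative sanity checks on the timeseries before it is saved.
n_rows = len(df)
has_duplicates = bool(df['time'].duplicated().any())
is_sorted = bool(df['time'].is_monotonic_increasing)
print(n_rows, has_duplicates, is_sorted)
```

The dataframe has two columns, time and value, which is the layout passed to fs.save_dataframe above.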
The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine learning models.
We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:
@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # This transform function receives dataframe input, and defines a transform operation
    return df ** 2  # Square the input
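To see what the transform body does on its own, here is a small pandas-only sketch. The input frame df_in is hypothetical: it assumes the function receives a dataframe indexed by time with a single numeric column, which may differ from what ByteHub actually passes in.

```python
import pandas as pd

# Hypothetical input: a small frame indexed by time with one numeric column.
df_in = pd.DataFrame(
    {'value': [0, 1, 2, 3]},
    index=pd.date_range('2021-01-01', periods=4, name='time'),
)

def squared_numbers(df):
    # Element-wise square of every column; the time index is left unchanged.
    return df ** 2

df_out = squared_numbers(df_in)
print(df_out)
```

Because the function is plain pandas, it can be tested like this before being registered with the decorator.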
Now both features are saved in the feature store, and can be queried using:
df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
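As a rough mental model of this query, the sketch below rebuilds both features in plain pandas and slices the same date window. It is an assumption-laden illustration: it treats both date bounds as inclusive and uses the feature names as column names, neither of which is confirmed here for ByteHub's actual output.

```python
import pandas as pd

dts = pd.date_range('2020-01-01', '2021-02-09')
numbers = pd.Series(range(len(dts)), index=dts, name='tutorial/numbers')
squared = (numbers ** 2).rename('tutorial/squared')

# Join the two series on their shared time index, then slice the window.
# Label slicing with .loc on a DatetimeIndex includes both endpoints.
combined = pd.concat([numbers, squared], axis=1)
window = combined.loc['2021-01-01':'2021-01-31']
print(window.head())
```

Comparing the result of fs.load_dataframe against a frame like window is one way to check that a transform behaves as intended.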
To connect to ByteHub Cloud, first register for an account, then use:
fs = bh.FeatureStore("https://api.bytehub.ai")
This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.