PyStore - Fast data store for Pandas timeseries data
====================================================

.. image:: https://img.shields.io/badge/python-2.7,%203.5+-blue.svg?style=flat
    :target: https://pypi.python.org/pypi/pystore
    :alt: Python version

.. image:: https://img.shields.io/pypi/v/pystore.svg?maxAge=60
    :target: https://pypi.python.org/pypi/pystore
    :alt: PyPi version

.. image:: https://img.shields.io/pypi/status/pystore.svg?maxAge=60
    :target: https://pypi.python.org/pypi/pystore
    :alt: PyPi status

.. image:: https://img.shields.io/travis/ranaroussi/pystore/master.svg?maxAge=1
    :target: https://travis-ci.com/ranaroussi/pystore
    :alt: Travis-CI build status

.. image:: https://www.codefactor.io/repository/github/ranaroussi/pystore/badge
    :target: https://www.codefactor.io/repository/github/ranaroussi/pystore
    :alt: CodeFactor

.. image:: https://img.shields.io/github/stars/ranaroussi/pystore.svg?style=social&label=Star&maxAge=60
    :target: https://github.com/ranaroussi/pystore
    :alt: Star this repo

.. image:: https://img.shields.io/twitter/follow/aroussi.svg?style=social&label=Follow&maxAge=60
    :target: https://twitter.com/aroussi
    :alt: Follow me on twitter

`PyStore <https://github.com/ranaroussi/pystore>`_ is a simple (yet powerful) datastore for Pandas dataframes, and while it can store any Pandas object, it was designed with storing timeseries data in mind.

It's built on top of `Pandas <http://pandas.pydata.org>`_, `Numpy <http://numpy.pydata.org>`_, `Dask <http://dask.pydata.org>`_, and `Parquet <http://parquet.apache.org>`_ (via `Fastparquet <https://github.com/dask/fastparquet>`_), to provide an easy-to-use datastore for Python developers that can easily query millions of rows per second per client.

==> Check out `this blog post <https://medium.com/@aroussi/fast-data-store-for-pandas-time-series-data-using-pystore-89d9caeef4e2>`_ for the reasoning and philosophy behind PyStore, as well as a detailed tutorial with code examples.

==> Follow `this PyStore tutorial <https://github.com/ranaroussi/pystore/blob/master/examples/pystore-tutorial.ipynb>`_ in Jupyter notebook format.

Quickstart
==========

Install PyStore
---------------

Install using pip:

.. code:: bash

    $ pip install pystore --upgrade --no-cache-dir

Install using conda:

.. code:: bash

    $ conda install -c ranaroussi pystore

INSTALLATION NOTE: If you don't have Snappy installed (compression/decompression library), you'll need to `install it first <https://github.com/ranaroussi/pystore#dependencies>`_.

Using PyStore
-------------

.. code:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import pystore
    import quandl

    # Set storage path (optional)
    # Defaults to `~/pystore` or `PYSTORE_PATH` environment variable (if set)
    pystore.set_path("~/pystore")

    # List stores
    pystore.list_stores()

    # Connect to datastore (create it if it doesn't exist)
    store = pystore.store('mydatastore')

    # List existing collections
    store.list_collections()

    # Access a collection (create it if it doesn't exist)
    collection = store.collection('NASDAQ')

    # List items in collection
    collection.list_items()

    # Load some data from Quandl
    aapl = quandl.get("WIKI/AAPL", authtoken="your token here")

    # Store the first 100 rows of the data in the collection under "AAPL"
    collection.write('AAPL', aapl[:100], metadata={'source': 'Quandl'})

    # Reading the item's data
    item = collection.item('AAPL')
    data = item.data  # <-- Dask dataframe (see dask.pydata.org)
    metadata = item.metadata
    df = item.to_pandas()

    # Append the rest of the rows to the "AAPL" item
    collection.append('AAPL', aapl[100:])

    # Reading the item's data
    item = collection.item('AAPL')
    data = item.data
    metadata = item.metadata
    df = item.to_pandas()


    # --- Query functionality ---

    # Query available symbols based on metadata
    collection.list_items(some_key='some_value', other_key='other_value')


    # --- Snapshot functionality ---

    # Snapshot a collection
    # (Point-in-time named reference for all current symbols in a collection)
    collection.create_snapshot('snapshot_name')

    # List available snapshots
    collection.list_snapshots()

    # Get a version of a symbol given a snapshot name
    collection.item('AAPL', snapshot='snapshot_name')

    # Delete a collection snapshot
    collection.delete_snapshot('snapshot_name')


    # ...


    # Delete the item from the current version
    collection.delete_item('AAPL')

    # Delete the collection
    store.delete_collection('NASDAQ')
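
The metadata query in the example above selects items by the key/value pairs stored with them. As a rough illustration of that matching rule, the sketch below filters an in-memory dictionary of metadata. It assumes an item matches only when every requested key/value pair is present; ``filter_items`` and the sample metadata are hypothetical and are not part of PyStore:

```python
from typing import Any, Dict, Set


def filter_items(metadata_by_item: Dict[str, Dict[str, Any]], **criteria: Any) -> Set[str]:
    """Return names of items whose metadata contains every key/value in criteria.

    Hypothetical helper for illustration only; it is not part of PyStore.
    """
    if not criteria:
        # No criteria given: every item qualifies.
        return set(metadata_by_item)
    return {
        name
        for name, meta in metadata_by_item.items()
        if all(key in meta and meta[key] == value for key, value in criteria.items())
    }


# Made-up metadata, standing in for what was passed to collection.write(...)
sample = {
    "AAPL": {"source": "Quandl", "exchange": "NASDAQ"},
    "MSFT": {"source": "Quandl", "exchange": "NASDAQ"},
    "TEST": {"source": "manual"},
}

quandl_items = filter_items(sample, source="Quandl")
manual_items = filter_items(sample, source="manual", exchange="NASDAQ")
```

Here ``quandl_items`` contains ``AAPL`` and ``MSFT``, while ``manual_items`` is empty because ``TEST`` has no ``exchange`` key.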

Using Dask schedulers
---------------------

PyStore 0.1.18+ supports using Dask distributed.

To use a local Dask scheduler, add this to your code:

.. code:: python

    from dask.distributed import LocalCluster
    pystore.set_client(LocalCluster())

To use a distributed Dask scheduler, add this to your code:

.. code:: python

    pystore.set_client("tcp://xxx.xxx.xxx.xxx:xxxx")
    pystore.set_path("/path/to/shared/volume/all/workers/can/access")
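
Since every worker needs to see the same storage path, it can be convenient to resolve that path in one place before passing it to ``pystore.set_path``. A minimal sketch, assuming the resolution order implied by the quickstart comments (an explicit path first, then the ``PYSTORE_PATH`` environment variable, then ``~/pystore``). ``resolve_store_path`` is a hypothetical helper, not a PyStore function, and the paths are placeholders:

```python
import os
from typing import Mapping, Optional


def resolve_store_path(explicit: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a storage path: explicit argument, then PYSTORE_PATH, then ~/pystore.

    Hypothetical helper; the precedence order is an assumption for illustration.
    """
    env = os.environ if env is None else env
    chosen = explicit or env.get("PYSTORE_PATH") or "~/pystore"
    # Expand "~" and normalise to an absolute path.
    return os.path.abspath(os.path.expanduser(chosen))


shared = resolve_store_path("/path/to/shared/volume", env={})
from_env = resolve_store_path(env={"PYSTORE_PATH": "/data/pystore"})
default = resolve_store_path(env={})
```

The resolved string could then be handed to ``pystore.set_path(...)`` in the driver script so that the same value is used everywhere.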

Concepts
========

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

A good practice is to create collections that look something like this:

* collection.EOD
* collection.ONEMINUTE
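
To make the collection-to-directory mapping concrete, here is a small sketch of how a store, a collection, and an item name could combine into a directory path. The exact on-disk layout (``<base>/<store>/<collection>/<item>``) is an assumption for illustration, and ``item_path`` is a hypothetical helper rather than part of PyStore's API:

```python
from pathlib import PurePosixPath


def item_path(base: str, store: str, collection: str, item: str) -> PurePosixPath:
    """Build the directory that would hold one item's partitioned Parquet files.

    Hypothetical helper; the layout is assumed, not taken from PyStore's code.
    """
    return PurePosixPath(base) / store / collection / item


# The same symbol stored at two frequencies lands in two separate collections.
eod = item_path("~/pystore", "mydatastore", "EOD", "AAPL")
minute = item_path("~/pystore", "mydatastore", "ONEMINUTE", "AAPL")

print(eod)     # ~/pystore/mydatastore/EOD/AAPL
print(minute)  # ~/pystore/mydatastore/ONEMINUTE/AAPL
```

Keeping frequencies in separate collections means the same symbol name can be reused without the two datasets colliding.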

Requirements
============

* Python 2.7 or Python >= 3.5
* Pandas
* Numpy
* Dask
* Fastparquet
* `Snappy <http://google.github.io/snappy/>`_ (Google's compression/decompression library)
* multitasking

PyStore was tested to work on \*nix-like systems, including macOS.

Dependencies:
-------------

PyStore uses `Snappy <http://google.github.io/snappy/>`_, a fast and efficient compression/decompression library from Google. You'll need to install Snappy on your system before installing PyStore.

* See the `python-snappy GitHub repo <https://github.com/andrix/python-snappy#dependencies>`_ for more information.

\*nix Systems:

* APT: ``sudo apt-get install libsnappy-dev``
* RPM: ``sudo yum install libsnappy-devel``

macOS:

First, install Snappy's C library using `Homebrew <https://brew.sh>`_:

.. code::

    $ brew install snappy

Then, install Python's snappy using conda:

.. code::

    $ conda install python-snappy -c conda-forge

...or, using pip:

.. code::

    $ CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

Windows:

Windows users should check out `Snappy for Windows <https://snappy.machinezoo.com>`_ and `this Stack Overflow post <https://stackoverflow.com/a/43756412/1783569>`_ for help on installing Snappy and python-snappy.
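
Once Snappy and python-snappy are installed, a quick check can confirm that Python can find the binding. A minimal sketch using only the standard library; it assumes python-snappy is importable under the module name ``snappy``, and ``has_module`` is a hypothetical helper:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


if has_module("snappy"):
    print("python-snappy is available")
else:
    print("python-snappy is missing; install Snappy and python-snappy first")
```

The check only locates the module; it does not verify that the underlying C library loads correctly.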

Roadmap
=======

PyStore currently offers support for local filesystem (including attached network drives). I plan on adding support for Amazon S3 (via `s3fs <http://s3fs.readthedocs.io/>`_), Google Cloud Storage (via `gcsfs <https://github.com/dask/gcsfs/>`_) and Hadoop Distributed File System (via `hdfs3 <http://hdfs3.readthedocs.io/>`_) in the future.

Acknowledgements
================

PyStore is hugely inspired by `Man AHL <http://www.ahl.com/>`_'s `Arctic <https://github.com/manahl/arctic>`_, which uses MongoDB for storage and allows for versioning and other features. I highly recommend you check it out.

License
=======

PyStore is licensed under the Apache License, Version 2.0. A copy is included in LICENSE.txt.

I'm very interested in your experience with PyStore. Please drop me a note with any feedback you have.

Contributions welcome!

- Ran Aroussi

Open Source Agenda is not affiliated with "Pystore" Project. README Source: ranaroussi/pystore
