# A Rust DataFrame implementation, built on Apache Arrow
A dataframe is a 2-dimensional, tabular data structure that is often used for computations and other data transformations. Each column of a dataframe typically holds values of a single data type, similar to a SQL table.
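As a rough illustration of that structure, the toy types below (not the crate's actual API, which uses Arrow arrays) show named columns that each hold one data type and share the same length:

```rust
// A toy illustration of the shape of a dataframe; these are not the
// crate's actual types. Each column holds values of a single type,
// and all columns have the same length.
#[derive(Debug)]
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int64(v) => v.len(),
            Column::Utf8(v) => v.len(),
        }
    }
}

/// A dataframe is then a set of named, equal-length columns.
struct DataFrame {
    columns: Vec<(String, Column)>,
}

impl DataFrame {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, c)| c.len())
    }
}

fn main() {
    let df = DataFrame {
        columns: vec![
            ("id".to_string(), Column::Int64(vec![1, 2, 3])),
            (
                "city".to_string(),
                Column::Utf8(vec![
                    "Lagos".to_string(),
                    "Nairobi".to_string(),
                    "Cairo".to_string(),
                ]),
            ),
        ],
    };
    assert_eq!(df.num_rows(), 3);
}
```

In the real library, each column is backed by an Arrow array, which gives the same "one type per column" guarantee with a columnar memory layout.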
This project is inspired by Pandas and other dataframe libraries, but its function set is currently borrowed from Apache Spark.
It mainly focuses on computation. As a point of reference, we use Apache Spark's Python functions for function parity, and aim to be compatible with Apache Spark functions.
The initial experiments in this project were to see whether it is possible to create some form of dataframe in Rust. We're happy that this goal has been met; however, the initial version relied on eager evaluation, which made it slow and difficult to use in a REPL fashion.
We are mainly focusing on creating a process for lazy evaluation (the current `LazyFrame`), which involves reading an input's schema, then applying transformations to that schema until a materialising action is required.
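The idea can be sketched with plain Rust: the schema is tracked eagerly while the data transformations are only recorded, to be replayed when the frame is materialised. The `Field`, `Op`, and `LazyFrame` names below are illustrative, not the crate's actual API:

```rust
// A minimal sketch of schema-driven lazy evaluation, using only std.
// These type names are illustrative, not the crate's actual API.

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String, // stand-in for an Arrow DataType
}

#[derive(Clone, Debug)]
enum Op {
    Select(Vec<String>),
}

#[derive(Clone, Debug)]
struct LazyFrame {
    schema: Vec<Field>, // the schema is updated eagerly...
    ops: Vec<Op>,       // ...while data transformations are deferred
}

impl LazyFrame {
    fn new(schema: Vec<Field>) -> Self {
        LazyFrame { schema, ops: vec![] }
    }

    /// Record a projection; only the schema changes immediately.
    fn select(mut self, columns: &[&str]) -> Self {
        self.schema.retain(|f| columns.contains(&f.name.as_str()));
        self.ops
            .push(Op::Select(columns.iter().map(|s| s.to_string()).collect()));
        self
    }

    /// A materialising action would replay `ops` against the input
    /// here; as a placeholder, report how many ops were deferred.
    fn collect(&self) -> usize {
        self.ops.len()
    }
}

fn main() {
    let frame = LazyFrame::new(vec![
        Field { name: "city".into(), data_type: "Utf8".into() },
        Field { name: "lat".into(), data_type: "Float64".into() },
        Field { name: "lng".into(), data_type: "Float64".into() },
    ]);
    // No data is read yet, but the output schema is already known.
    let selected = frame.select(&["city", "lat"]);
    assert_eq!(selected.schema.len(), 2);
    assert_eq!(selected.collect(), 1);
}
```

Because the schema is always known, errors such as selecting a missing column can be reported before any data is read, which is what makes this style pleasant in a REPL.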
While we are still figuring this out, there might not be much visible progress on the surface, as most of this exercise is happening offline.
The plan is to provide a reasonable API for lazily transforming data, and the ability to apply some optimisations on the computation graph (e.g. predicate pushdown, rearranging computations).
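To make the predicate-pushdown idea concrete, here is a sketch of one rewrite rule on a toy logical plan. The plan nodes and the simplified predicate (a single column name) are assumptions for illustration, not the crate's optimiser:

```rust
// A sketch of a predicate-pushdown rewrite on a toy logical plan.
// Node names and the single rewrite rule are illustrative only.
#[derive(Clone, Debug, PartialEq)]
enum Plan {
    Scan { table: String },
    Project { columns: Vec<String>, input: Box<Plan> },
    // The predicate is simplified to a single referenced column.
    Filter { column: String, input: Box<Plan> },
}

/// Move a filter beneath a projection when the filtered column
/// survives the projection, so the filter runs closer to the source.
fn push_down(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { column, input } => match *input {
            Plan::Project { columns, input } if columns.contains(&column) => Plan::Project {
                columns,
                input: Box::new(push_down(Plan::Filter { column, input })),
            },
            other => Plan::Filter { column, input: Box::new(other) },
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Filter {
        column: "lat".to_string(),
        input: Box::new(Plan::Project {
            columns: vec!["city".to_string(), "lat".to_string()],
            input: Box::new(Plan::Scan { table: "cities".to_string() }),
        }),
    };
    let optimised = push_down(plan);
    // The filter now sits beneath the projection, next to the scan.
    assert!(matches!(optimised, Plan::Project { .. }));
}
```

Rearranging computations like this is only possible because the lazy design keeps the whole plan around before anything executes.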
In the future, `LazyFrame` will probably be renamed to `DataFrame`, and the current eagerly evaluated `DataFrame` will be removed or made private.
The ongoing experiments on lazy evaluation are in the `master` branch, and we would appreciate some help 🙏🏾.
Although we use Apache Spark as a reference, we do not intend to support distributed computation beyond a single machine. Spark is a convenient reference that reduces bikeshedding, but we will probably provide a more idiomatic Rust API in the future.
A low-level API can already be used for simple tasks that do not require aggregations, joins or sorts. A simpler API is currently not a priority until we have more capabilities to transform data.
One good immediate use of the library would be copying data from one supported data source to another (e.g. PostgreSQL to Arrow or CSV, with minimal transformations).
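The copy use case can be pictured as streaming batches from a reader to a writer. The `Source` and `Sink` traits below are hypothetical stand-ins; the library's real readers and writers exchange Arrow record batches rather than the simple string rows used here:

```rust
// A sketch of source-to-sink copying with hypothetical traits.
// A real pipeline would stream Arrow `RecordBatch`es instead of rows.
trait Source {
    /// Returns the next batch of rows, or None when exhausted.
    fn next_batch(&mut self) -> Option<Vec<String>>;
}

trait Sink {
    fn write_batch(&mut self, batch: &[String]);
}

/// Stream batches from a source to a sink with no transformation,
/// returning the number of rows copied.
fn copy(source: &mut dyn Source, sink: &mut dyn Sink) -> usize {
    let mut rows = 0;
    while let Some(batch) = source.next_batch() {
        rows += batch.len();
        sink.write_batch(&batch);
    }
    rows
}

// In-memory stand-ins for e.g. a PostgreSQL reader and a CSV writer.
struct VecSource {
    batches: Vec<Vec<String>>,
}
impl Source for VecSource {
    fn next_batch(&mut self) -> Option<Vec<String>> {
        if self.batches.is_empty() {
            None
        } else {
            Some(self.batches.remove(0))
        }
    }
}

struct VecSink {
    rows: Vec<String>,
}
impl Sink for VecSink {
    fn write_batch(&mut self, batch: &[String]) {
        self.rows.extend_from_slice(batch);
    }
}

fn main() {
    let mut source = VecSource {
        batches: vec![
            vec!["1,Lagos".to_string(), "2,Nairobi".to_string()],
            vec!["3,Cairo".to_string()],
        ],
    };
    let mut sink = VecSink { rows: vec![] };
    assert_eq!(copy(&mut source, &mut sink), 3);
    assert_eq!(sink.rows.len(), 3);
}
```

Batch-at-a-time streaming keeps memory bounded, which is what makes source-to-source copies viable without aggregations, joins, or sorts.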
## `fn`s (H1 2020)

We are working on IO support, with priority for SQL read and write. PostgreSQL IO is supported using the binary protocol, although not all data types are supported yet (lists, structs, `numeric`, and a few other non-primitive types).
- DataFrame Operations (including output to `Vec<RecordBatch>` as well as an iterator)
- Scalar Functions (using the `num` crate where possible, `regex`, and the Arrow `cast` kernel)
- Aggregate Functions
- Window Functions
- Array Functions
We plan to provide simple benchmarks in the near future. The current blockers are: