Geni Versions

A Clojure dataframe library that runs on Spark

v0.0.31

3 years ago
  • Spark Doc Scraper: scripts/scrape-spark-docs.clj is able to scrape the relevant docs for the four modules.
  • Partial Docstrings: docstrings are available for the core.column and ml.regression namespaces.

v0.0.30

3 years ago
  • Basic Spark Streaming functionalities: added some low-hanging fruit among the JavaDStream and JavaStreamingContext methods.
  • More robust Spark Streaming testing function: now expects an :expected key and automatically retries to make the test less flaky.

v0.0.29

3 years ago
  • DStream Testing Function: a more reliable and repeatable way to test Spark Streaming's StreamingContext and DStream methods.
  • Automated Version Bump: done with Babashka.
  • Updated Contributing Guide: thanks to @erp12 for pointing out certain gotchas in the guide.

v0.0.27

3 years ago
  • Excel Support: basic functions read-xlsx! and write-xlsx! are now available, backed by zero.one/fxl (see the sketch after this list).
  • Version Bumps: Spark and nrepl bumped to their latest versions.
  • Install CI steps: Dockerless installs are now tested on Ubuntu and macOS.
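
A minimal usage sketch of the new Excel helpers. It assumes read-xlsx! and write-xlsx! are exposed through zero-one.geni.core and take a plain file path; the paths below are placeholders:

(require '[zero-one.geni.core :as g])

;; Read an Excel sheet into a dataframe (path is a placeholder).
(def orders (g/read-xlsx! "data/orders.xlsx"))

;; Write the dataframe back out as an Excel file.
(g/write-xlsx! orders "data/orders-copy.xlsx")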

v0.0.26

3 years ago
  • Schema option for read functions: all read functions now support a :schema option, which can be an actual Spark schema or its data-oriented equivalent.
  • Basic support for EDN: read-edn! and write-edn! are now available with an added dependency on metosin/jsonista. The functions may not be performant, but can come in handy for small-data compositions. A sketch of both the :schema option and the EDN helpers follows this list.
  • More RDD functions: this closes the RDD function gaps relative to sparkplug and adds variadicity to functions that take more than one RDD.
  • RDD name unmangling: this follows sparkplug's model of unmangling RDD names after each transformation.
  • Version bump for dependencies: nrepl bumped to 0.8.1.
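
A hedged sketch of the :schema option and the new EDN helpers. The map-of-column-to-type schema shape shown here is an assumption (an actual Spark schema should also work, per the notes above), as are the exact option keys and paths:

(require '[zero-one.geni.core :as g])

;; Read a CSV with an explicit schema instead of relying on inference.
;; The data-oriented schema form below is assumed for illustration.
(def events
  (g/read-csv! "data/events.csv" {:schema {:id :int :name :string}}))

;; Round-trip a small dataframe through EDN; intended for small data only.
(g/write-edn! events "data/events.edn")
(def events-again (g/read-edn! "data/events.edn"))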

v0.0.25

3 years ago
  • RDD Function Serialisation Model: changed from the sparkling model to the sparkplug model. Slack user @G on clojurians/geni mentioned that the sparkplug model results in fewer serialisation problems than the sparkling one.
  • More RDD Methods: added methods related to partitioners and more JavaSparkContext methods.
  • Community Guidelines: added a code of conduct and an issue template.
  • Design Goals Docs: first draft of the design goals outlining some of the project's main focuses.

v0.0.24

3 years ago
  • RDD and PairRDD Support: basic actions and transformations are supported, but passing serialisable functions to RDD higher-order functions requires AOT compilation. As a result, the RDD REPL experience is rather poor.
  • Isolated Docker Runs: all Docker operations in the Makefile now run in a temporary directory, so that there are no race conditions when writing to the target directory. This means that make ci --jobs 3 is now possible on a single machine.

v0.0.23

3 years ago

Preliminary RDD support, with only certain transformations completed, and two newly completed Spark ML parts of the cookbook.

  • Basic RDD support: mainly basic transformations such as map, reduce, map-to-pair and reduce-by-key (a usage sketch follows this list). The main challenge has been function serialisation, with the approach mainly taken from Sparkling and sparkplug.
  • Spark ML cookbook: added two chapters on Spark ML pipelines and ported a customer-segmentation blog post using non-negative matrix factorisation.
  • Better Geni CLI: new --submit command-line argument to emulate spark-submit.
  • Better CI steps: automated Geni CLI tests to avoid manual testing of the Geni REPL.
  • Completed benchmark results: added results from dplyr, data.table, tablecloth and tech.ml.dataset.
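
A sketch of the basic RDD transformations named in the first bullet above. The zero-one.geni.rdd namespace and the text-file and tuple helpers are assumptions here, and (per the v0.0.24 notes) passing Clojure functions to RDD higher-order functions may require AOT compilation:

(require '[clojure.string :as string]
         '[zero-one.geni.rdd :as rdd]) ;; namespace name assumed

;; Word count: map-to-pair builds (word, 1) pairs, reduce-by-key sums them.
(-> (rdd/text-file "data/words.txt")           ;; reader helper assumed
    (rdd/map string/lower-case)                ;; normalise each entry
    (rdd/map-to-pair (fn [w] (rdd/tuple w 1))) ;; tuple helper assumed
    (rdd/reduce-by-key +)
    rdd/collect)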

v0.0.22

3 years ago

Better getting-started experience with the new geni command and better alignment of Geni namespaces with Spark packages.

  • New geni script with install instructions and a new asciinema screencast. This will be the main way to use Geni for small, one-off analyses and throwaway scripts.
  • Created another layer of namespaces with zero-one.geni.core and zero-one.geni.ml. The idea is that the core namespaces refer only to Spark SQL and the ml namespaces only to Spark ML. This makes it easier to map Geni functions to the original Spark functions.
  • Added a simple benchmark piece that compares the performance of Pandas vs. Geni on a particular problem.
  • An asciinema screencast for downloading the uberjar and interacting with the Geni REPL.

v0.0.21

3 years ago

Initial alpha release documented here on cljdoc.

The release includes an uberjar that should provide a Geni REPL (i.e. a Clojure spark-shell) within seconds. Download the uberjar, and simply try out the REPL with java -jar geni-repl-uberjar-0.0.21.jar! An nREPL server is automatically started with an .nrepl-port file, so that common Clojure text editors should be able to jack in automatically.

The initial namespace automatically requires:

(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])

so that functions such as g/read-csv! and ml/logistic-regression are immediately available.

The Spark session is available as a Clojure Future object, which can be dereferenced with @spark. To see the full default Spark config, invoke (g/spark-conf @spark)!
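
For example, a first session in that REPL might look like the following; the CSV path is a placeholder, g/show is assumed here for printing the dataframe, and the logistic-regression option keys are an assumption:

;; Inspect the default config of the running Spark session.
(g/spark-conf @spark)

;; Read a CSV into a dataframe and print it (path is a placeholder).
(def df (g/read-csv! "data/example.csv"))
(g/show df)

;; Set up a simple estimator; the option map is illustrative only.
(def estimator (ml/logistic-regression {:max-iter 10}))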