A Clojure dataframe library that runs on Spark
- `scripts/scrape-spark-docs.clj` is able to scrape the relevant docs for the four modules.
- Docstrings for the `core.column` and `ml.regression` namespaces.
- Wrappers for `JavaDStream` and `JavaStreamingContext` methods.
- The test helper now takes an `:expected` key and automatically retries to make the test less flaky.
- `read-xlsx!` and `write-xlsx!` are now available, backed by `zero.one/fxl`.
- Added a `:schema` option, which can be an actual Spark schema or its data-oriented version.
- `read-edn!` and `write-edn!` are now available, with an added dependency on `metosin/jsonista`. The functions may not be performant, but they can come in handy for small-data compositions.
- `nrepl` bumped to 0.8.1.
- The Makefile now runs on a temporary directory, so that there are no race conditions when writing to the target directory. This means that `make ci --jobs 3` is now possible on a single machine.
- Preliminary RDD support, with only certain transformations completed, plus two finished parts of the cookbook for Spark ML.
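For a feel of the small-data EDN round trip, here is a minimal sketch; the dataframe constructor `g/table->dataset` and the exact signatures of `write-edn!` and `read-edn!` are assumptions for illustration, not confirmed API:

```clojure
(require '[zero-one.geni.core :as g])

;; Build a small dataframe from plain Clojure data (signature assumed).
(def df (g/table->dataset @spark [[1 "a"] [2 "b"]] [:id :label]))

;; Round-trip through EDN: convenient for small data, not tuned for speed.
(g/write-edn! df "/tmp/geni-demo.edn")
(g/show (g/read-edn! "/tmp/geni-demo.edn"))
```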
- Supported RDD transformations include `map`, `reduce`, `map-to-pair` and `reduce-by-key`. The main challenge has been the serialisation of functions, an approach mainly taken from Sparkling and sparkplug.
- Added a `--submit` command-line argument to emulate `spark-submit`.
- Better getting-started experience with the new `geni` command and better alignment of Geni namespaces with Spark packages.
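The listed RDD transformations can be sketched with a toy character count and a keyed sum; the `zero-one.geni.rdd` alias, the file path and the exact function signatures below are assumptions based on the transformation names, not confirmed API:

```clojure
(require '[zero-one.geni.rdd :as rdd]
         '[clojure.string :as string])

;; Total character count across a text file: map + reduce.
(-> (rdd/text-file "data/lines.txt")  ;; path is illustrative
    (rdd/map count)
    (rdd/reduce +))

;; Occurrences of each line's first word: map-to-pair + reduce-by-key.
(-> (rdd/text-file "data/lines.txt")
    (rdd/map-to-pair (fn [line] [(first (string/split line #"\s+")) 1]))
    (rdd/reduce-by-key +)
    rdd/collect)
```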
- New `geni` script with install instructions and a new asciinema screencast. This will be the main way to use Geni for small, one-off analyses and throwaway scripts.
- Geni namespaces are split into `zero-one.geni.core` and `zero-one.geni.ml`. The idea is that the `core` namespaces should refer only to Spark SQL and the `ml` namespaces only to Spark ML. This will help with mapping Geni functions to the original Spark functions.

Initial alpha release, documented here on cljdoc.
The release includes an uberjar that should provide a Geni REPL (i.e. a Clojure `spark-shell`) within seconds. Download the uberjar and simply try out the REPL with `java -jar geni-repl-uberjar-0.0.21.jar`! An nREPL server is automatically started along with an `.nrepl-port` file, so that common Clojure text editors should be able to jack in automatically.
The initial namespace automatically requires:

```clojure
(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])
```

so that functions such as `g/read-csv!` and `ml/logistic-regression` are immediately available.
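As an illustration of the two aliases working together, a small ML sketch might look like the following; the CSV path, column names and option maps are made up for illustration, and the exact Geni signatures are assumptions:

```clojure
(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])

;; Read training data (path and columns are hypothetical).
(def df (g/read-csv! "data/train.csv"))

;; Assemble features and fit a logistic regression.
(def pipeline
  (ml/pipeline
    (ml/vector-assembler {:input-cols [:x1 :x2] :output-col :features})
    (ml/logistic-regression {:label-col :label})))

(def model (ml/fit df pipeline))
```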
The Spark session is available as a Clojure future, which can be dereferenced with `@spark`. To see the full default Spark config, invoke `(g/spark-conf @spark)`!
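A minimal session-inspection sketch; the exact keys in the returned config map are assumptions:

```clojure
(require '[zero-one.geni.core :as g])

;; Dereferencing blocks only until the Spark session finishes starting up.
(def conf (g/spark-conf @spark))

;; Peek at a couple of common entries (key spelling assumed).
(select-keys conf [:spark.master :spark.app.name])
```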