RumbleDB Releases

⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

v1.21.0

11 months ago

NEW! The jar for Spark 3.5 was added and is available for download.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported (they were dropped in RumbleDB 1.20), as they are no longer officially supported by the Spark team. Spark 3.4 is newly supported.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.21.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.21.0-standalone.jar, using Java 8 or 11.
  • rumbledb-1.21.0-for-spark-3.X.jar (3.2, 3.3, 3.4) is smaller, does not contain Spark, and runs in a corresponding, existing Spark environment, either locally (you need to download and install Spark yourself) or on a cluster (e.g., EMR in just a few clicks), with spark-submit rumbledb-1.21.0-for-spark-3.X.jar

Improvements

  • Range expressions with more than a million items are now automatically parallelized, with no need to call parallelize() anymore (see the sketch after this list).
  • Some simple map expressions on homogeneous input are now faster (native SQL behind the scenes).
  • General comparisons on equality are now considerably faster.
  • reverse() is now more efficient and faster on homogeneous sequences.
  • Fixed a bug in equijoins involving homogeneous sequences.
  • Added two functions, jn:cosh and jn:sinh.
  • General comparisons are now automatically optimized to value comparisons when it is detected that the sequences have at most one item (can be deactivated with --optimize-general-comparison-to-value-comparison no).
  • Better static type detection.
  • It is now possible to force a sequential execution (without Spark) with --parallel-execution no. This also works with queries containing calls to parallelize() (which will be ineffective), json-doc(), and json-file() (which will simply stream-read from disk). Other I/O functions (csv-file(), etc.) will still involve Spark for reading, but immediately materialize for the rest of the execution.
  • It is now possible to deactivate native Spark SQL execution (forcing a fallback to RumbleDB's UDFs) with --native-execution no.
  • The annotate expression (with syntax similar to the validate expression) allows directly annotating an item with a type without checking for validity.
  • More static types are detected.
  • Non-recursive functions are now automatically inlined for faster execution. This can be deactivated with --function-inlining no (reverting to the behavior of previous versions).
  • Typeswitch expressions now support DataFrame execution.
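
For illustration, a minimal JSONiq sketch of two of these improvements: the automatic parallelization of large ranges and the new hyperbolic functions (the filter predicate is made up for the example):

    (: a range with more than a million items is parallelized automatically :)
    count(for $i in 1 to 10000000 where $i mod 7 eq 0 return $i),

    (: the new hyperbolic functions :)
    jn:cosh(1.0) + jn:sinh(1.0)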

Bugfixes

  • Fixed a bug when reading longs from DataFrames.
  • Fixed an issue with projection pushdowns in join queries.
  • Fixed a few bugs with queries that navigate JSON in for clauses; they are compiled to native SQL whenever possible, but some chains were throwing errors (e.g., an array unboxing followed by an object lookup; a hypothetical example follows this list).
  • Fixed a bug in which calling count() on a grouping variable did not return 1 when native SQL execution was activated.
  • hexBinary and base64Binary values can now be used in order by clauses with parallel execution.
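
As an illustration of the navigation chains in for clauses that are now handled, a small hypothetical query (the file and field names are made up):

    (: array unboxing ([]) followed by an object lookup, inside a for clause :)
    for $order in json-file("orders.json")
    return $order.items[].name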

v1.20.0

1 year ago

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.20, as they are no longer supported officially by the Spark team.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.20.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.20.0-standalone.jar, using Java 8 or 11.
  • rumbledb-1.20.0-for-spark-3.X.jar (3.2, 3.3) is smaller, does not contain Spark, and runs in a corresponding, existing Spark environment, either locally (you need to download and install Spark yourself) or on a cluster (e.g., EMR in just a few clicks), with spark-submit rumbledb-1.20.0-for-spark-3.X.jar

New features:

  • Open and query YAML files (including files with multiple documents) with yaml-doc() (see the sketch after this list).
  • Serialize the output of your queries to YAML with --output-format yaml.
  • General comparisons (which perform existential quantification) now work on very big sequences and are automatically pushed down to Spark.
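
A minimal sketch of the new YAML support; the file name and keys are hypothetical:

    (: read a YAML document and navigate it like JSON :)
    yaml-doc("config.yaml").server.port

The result can in turn be serialized as YAML by running the query with --output-format yaml.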

Bugfixes:

  • Fixed an issue preventing reading Decimal types from Parquet with some precisions and ranges
  • Fixed a few bugs in static typing
  • Fixed a bug where no error was thrown when using the concatenation operator || on sequences with more than one item.

v1.19.0

1 year ago

RumbleDB allows you to query data that does not fit in DataFrames with JSONiq.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.19.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.19.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.19.0-for-spark-3.X.jar (3.0, 3.1, 3.2, 3.3) is smaller, does not contain Spark, and runs in a corresponding, existing Spark environment, either locally (you need to download and install Spark yourself) or on a cluster (e.g., EMR in just a few clicks), with spark-submit rumbledb-1.19.0-for-spark-3.X.jar

Release notes:

  • Fixed a bug with spaces in paths.
  • Various fixes and enhancements.
  • New function repartition#2 to change the number of physical partitions, and new functions binary-classification-metrics#3 and binary-classification-metrics#4 for preparing ROC and PR curves to evaluate the output of ML pipelines (see the sketch after this list).
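
A sketch of the new repartition function; the argument order (the sequence first, then the number of partitions) is our assumption, and the file name is made up:

    (: redistribute a large sequence over 16 physical partitions :)
    repartition(json-file("large-dataset.json"), 16)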

v1.18.0

2 years ago

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.18.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.18.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.18.0-for-spark-3.X.jar (3.0, 3.1, 3.2) is smaller, does not contain Spark, and runs in a corresponding, existing Spark environment, either locally (you need to download and install Spark yourself) or on a cluster (e.g., EMR in just a few clicks), with spark-submit rumbledb-1.18.0-for-spark-3.X.jar

Release notes:

  • FLWOR expressions starting with a series of let clauses are now better optimized and faster.
  • A warning with advice is issued on the console if a group by clause is used in a FLWOR expression that starts with a let clause.
  • The shell no longer exits when an error is thrown.
  • When a query cannot be executed in parallel, a more informative error message is output, inviting the user to rewrite their query instead of showing the raw Spark error.
  • When launching in shell or server mode, instructions for next steps are printed on the screen.
  • Fixed a crash in the execution of some where clauses when a join was not successfully detected and execution fell back to linear execution.
  • Support for context item declarations and for passing an external context item value on the command line (see the sketch after this list).
  • By default, the date type no longer supports timezones (which are rarely used for this type, although supported by ISO 8601). This enables more optimizations (e.g., internal conversion to DataFrame DateType columns and export of datasets with dates to Parquet). Timezones on dates can be activated for those users who need them with a simple CLI argument (--dates-with-timezone yes).
  • Ctrl+C now gracefully exits the shell.
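
A minimal sketch of a context item declaration; it assumes the context item is bound to a number via the corresponding command-line option:

    (: the context item (.) is supplied externally :)
    declare context item external;
    . * 2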

v1.17.0

2 years ago

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

  • The CLI was extended with verbs (run, serve, repl) and single-dash shortcuts (-f for --output-format, etc.). This is backward compatible.
  • Automatic internal conversion to DataFrames for FLWOR expressions executed in parallel, when the statically inferred type is DataFrame-compatible.
  • Fixed a bug that prevented naming a variable $type or looking up a field called "type" without quotes.
  • Fixed a bug in projecting a sequence internally stored as a DataFrame to dynamically defined keys.
  • Fixed some bugs with post-grouping count optimizations on let variables.
  • Support for Spark 2.4, which is no longer maintained by the Spark team, is now dropped but available on request. RumbleDB 1.17 supports Spark 3.0, 3.1, and 3.2.
  • Plenty of smaller bug fixes.
  • [Experimental] We also provide a jar that embeds Spark and does not require its installation (rumbledb-1.17.0-standalone.jar). It is for use on a local machine only (not a cluster) and works with java -jar rumbledb-1.17.0-standalone.jar run -q '1+1' rather than with spark-submit. Feedback is welcome! This is just experimental at this point and we will take it from there.

v1.16.2

2 years ago

Interim release.

  • Fix recursive view "input" issue.
  • Nicer message for out of memory errors and hint to use CLI parameters.
  • Reverted to Kryo 4 for Spark 3.2, which depends on Twitter Chill 0.10.0; that version of Chill uses Kryo 4 in a way incompatible with Kryo 5.

v1.16.1

2 years ago

Interim release.

  • Fixed a race condition in min() and max() when called multiple times, which could lead to incorrect output.
  • The sum() and count() functions are now able to stream locally on very large (non-parallelized) sequences.
  • Range expressions now support 64-bit integers as well (previously, an overflow occurred).
  • The arrow syntax works for dynamic function calls too, so RumbleDB ML pipelines can also be invoked with a pipelining syntax: $training-set=>$my-transformer($params)=>my-estimator($params) (see the sketch after this list).
  • substring() was fixed to follow standard behavior even with exotic parameters (mostly returning an empty string in these cases).
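
A tiny sketch of the arrow syntax with a dynamic function call; the function itself is made up for the example:

    (: the arrow passes the left-hand side as the first argument :)
    let $double := function($x) { 2 * $x }
    return 5 => $double()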

v1.16.0

2 years ago
  • New --query parameter for directly passing a query rather than a query path.
  • Fixed a bug occurring with group by clauses on native DataFrames with complex aggregations.
  • New --shell-filter parameter for modifying the way the output is shown in shell mode (e.g., --shell-filter 'jq . -S -C' for pretty printing).
  • New output formats: json (top-level strings will be quoted) and tyson, as well as the possibility to indent with --output-format-options:indent yes.
  • New JSound validator page at /jsound-validator.html on localhost when running in server mode.
  • Support for user-defined atomic types with the JSound verbose syntax.
  • fn:concat is now correctly in the fn namespace.
  • When the materialization cap is reached and the count is unknown, it is no longer shown as the maximum long value.

v1.15.0

2 years ago
  • Fixed jn:intersect#1 to always run locally.
  • General performance improvements for many expressions and iterators that return at most one item.
  • New builtin functions supported: fn:min#2, fn:max#2, fn:unordered#1, fn:distinct-values#2, fn:index-of#3, fn:deep-equal#3, fn:string#0, fn:string#1, fn:substring-before#3, fn:substring-after#3, fn:string-length#0, fn:resolve-uri#1, fn:resolve-uri#2, fn:ends-with#3, fn:starts-with#3, fn:contains#3, fn:normalize-space#0, fn:default-collation#0, fn:number#0, fn:implicit-timezone#0, fn:not#0, fn:static-base-uri#1, fn:dateTime#2, fn:false#0, fn:true#0.
  • All JSONiq builtin types are now supported; newly supported are byte, dateTimeStamp, gDay, gMonth, gYear, gYearMonth, gMonthDay, int, long, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger, unsignedInt, unsignedLong, unsignedByte, unsignedShort, and short.
  • ceiling, floor, round, abs, and round-half-to-even are now correctly in the fn namespace (not math), all accept numeric values (instead of converting everything to doubles), and a few bugs have been fixed (see the examples after this list).
  • Support for open object types via the JSound verbose syntax (they are, of course, not implemented as DataFrames, but this makes no difference at the syntactic level, except that they cannot be used with ML estimators and transformers).
  • Support for user-defined array types via the JSound verbose syntax, including subtypes.
  • Validation of atomic values is now correctly done by casting the lexical value (not the typed value) to the expected type.
  • Fixed serialization of NaN, double/float infinity, dates, etc. (the quotes are now correctly included to make them JSON strings).
  • Positive and negative zero (for double, float) now compare as equal in value/general comparisons.
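
For example, with the rounding functions now in the fn namespace and one of the newly supported builtin types:

    fn:round(2.5),                (: 3: ties round toward positive infinity :)
    fn:round-half-to-even(2.5),   (: 2: ties round to the nearest even value :)
    255 instance of unsignedByte  (: true :)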

Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.15.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.15.0.jar package.

v1.14.0

2 years ago
  • Rumble now outputs error messages displaying the faulty line of code and pointing to the location of the error.
  • Machine Learning estimators and models can now run at scale (in parallel) on very large amounts of data. This is detected automatically.
  • Many stability improvements in the Machine Learning library.
  • Machine Learning pipelines are now supported, with stages given as function items.
  • Static typing is now always performed and is used to optimize even more.
  • Initial (experimental) support for user-defined types with the JSound compact syntax. Types can be used everywhere builtin types can be used (instance of, treat as, type annotations for variables...).
  • New validate type expression to validate against user-defined types and (if the type is DataFrame-compatible) to create object* instances as optimized DataFrames.
  • Features must be assembled with the VectorAssembler transformer prior to being used with an estimator or transformer (for example, at the start of a pipeline). featuresCol and inputCol must specify the name (as a string) of the assembled feature vector field. This is now fully consistent with the Spark ML framework (see the sketch after this list).
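
As an illustration of this convention, a minimal hypothetical pipeline sketch; it assumes RumbleDB's get-transformer and get-estimator functions, and the file and field names are made up:

    (: assemble raw numeric fields into a single feature vector :)
    let $data := json-file("training-set.json")
    let $assemble := get-transformer("VectorAssembler")
    let $assembled := $assemble($data, { "inputCols": [ "age", "income" ], "outputCol": "features" })

    (: train a model on the assembled features and apply it :)
    let $train := get-estimator("LogisticRegression")
    let $model := $train($assembled, { "featuresCol": "features" })
    return $model($assembled, { "featuresCol": "features" })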

Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.14.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.14.0.jar package.