High performance distributed data processing engine
This is a major feature release. Install it with npm.
sc.require(): this will considerably ease the integration of various connectors to data sources, databases, etc.
save(): now supports output to CSV format
save(), textFile(): automatic forwarding of AWS environment and credentials to workers
sample()
aggregateByKey()
This is a stability and bug fix release.
This is a major release. It brings new features:
... (textFile) and writing (save)
... aggregateByKey, reduceByKey, or coGroup, join, etc., have increased considerably vs. the 0.6 branch.
Despite the new major version, this release remains backward compatible with the previous 0.6.x branch.
Also available, as always, through npm.
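For readers unfamiliar with the key-based operations named in this release (aggregateByKey, reduceByKey), the following is a minimal plain JavaScript sketch of what such an operation computes in general: values are first folded per key inside each partition, then the partial results are merged across partitions. It deliberately does not call the engine; the function name aggregateByKeySketch, its argument order, and the array-of-arrays partition layout are assumptions made for illustration only.

```javascript
// Plain JavaScript sketch of aggregateByKey semantics; this is NOT the engine's API.
// aggregateByKeySketch and the partition layout are hypothetical, for illustration only.
function aggregateByKeySketch(partitions, reducer, combiner, init) {
  // Step 1: inside each partition, fold values into one accumulator per key.
  const partials = partitions.map((partition) => {
    const acc = new Map();
    for (const [key, value] of partition) {
      acc.set(key, reducer(acc.has(key) ? acc.get(key) : init(), value));
    }
    return acc;
  });
  // Step 2: merge the per-partition accumulators key by key.
  const merged = new Map();
  for (const acc of partials) {
    for (const [key, partial] of acc) {
      merged.set(key, merged.has(key) ? combiner(merged.get(key), partial) : partial);
    }
  }
  return [...merged.entries()];
}

// Two partitions of [key, value] pairs; compute the sum and count per key.
const partitions = [
  [['a', 1], ['b', 2], ['a', 3]],
  [['b', 4], ['a', 5]],
];
const result = aggregateByKeySketch(
  partitions,
  (acc, v) => ({ sum: acc.sum + v, count: acc.count + 1 }),
  (x, y) => ({ sum: x.sum + y.sum, count: x.count + y.count }),
  () => ({ sum: 0, count: 0 })
);
console.log(result);
```

Separating the per-partition fold from the cross-partition merge is what lets a distributed engine do most of the work locally before any data moves between workers.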
This is a stability and bug fix release. Documentation is improved, distributed mode is better: handling of tmp files and environment has been fixed.
Performance and scalability improvement release.
In distributed mode, a direct peer-to-peer shuffle data transfer between workers has been implemented. It improves scalability on large clusters when running with hundreds of simultaneous workers.
Standalone and distributed modes are now described. Debug traces are improved.
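The peer-to-peer shuffle mentioned in this release can be pictured with a small single-process simulation. The release notes do not describe the actual protocol, so everything here is an assumption: the hash function, the modulo routing, and the function names hashKey, partitionLocally, and shuffle are hypothetical. The sketch only shows the general idea that each worker splits its records into one bucket per destination, and each destination collects its bucket from every source directly rather than through a central relay.

```javascript
// Single-process simulation of a peer-to-peer shuffle.
// Hypothetical sketch; this is NOT the engine's actual protocol or API.

// Simple deterministic string hash used to route a key to a worker (an assumption).
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h;
}

// Each source worker splits its local records into one bucket per destination worker.
function partitionLocally(records, workerCount) {
  const buckets = Array.from({ length: workerCount }, () => []);
  for (const [key, value] of records) {
    buckets[hashKey(key) % workerCount].push([key, value]);
  }
  return buckets;
}

// Each destination worker gathers its bucket from every source worker directly,
// so no central node has to relay the shuffled data.
function shuffle(workersData) {
  const n = workersData.length;
  const outgoing = workersData.map((records) => partitionLocally(records, n));
  return outgoing.map((_, dest) => outgoing.flatMap((buckets) => buckets[dest]));
}

// Three simulated workers, each holding a few [key, value] records.
const shuffled = shuffle([
  [['a', 1], ['b', 2]],
  [['a', 3], ['c', 4]],
  [['b', 5], ['c', 6]],
]);
console.log(shuffled);
```

After the shuffle, all records sharing a key sit on the same simulated worker, which is the precondition for key-based operations to finish locally.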
This is a stability and performance improvement release.
Memory efficiency has been improved for large datasets (thousands of partitions) and complex jobs (hundreds of stages/steps).
S3 support has been fixed, both for input and output.
Multi-machine communications and debugging traces have been improved.