Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam 2.1.0 release notes for additional change information.
- Known issue: When running in batch mode, Gauge metrics are not reported.
- Added Metrics support for the DataflowRunner in streaming mode.
- Added OnTimeBehavior to WindowingStrategy to control emitting of ON_TIME panes.
- Added a default file name policy for FileBasedSinks which consume windowed input.
- Fixed an issue in which DatastoreIO failed to make progress when Datastore was slow to respond.
- Fixed an issue in which bzip2 files were being partially read; added support for concatenated bzip2 files.
- Known issue: CompressedSources with compression type set to BZIP2 are potentially losing data during processing. For more information, see Issue #596.

The Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Dataflow SDK for Java, based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.
This is a new major version, and therefore comes with a number of caveats, including the breaking changes from the 1.x releases described later in these notes.
Version 2.0.0 is based on a subset of Apache Beam 2.0.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Added new API in BigQueryIO for writing into multiple tables, possibly with different schemas, based on data. See BigQueryIO.Write.to(SerializableFunction) and BigQueryIO.Write.to(DynamicDestinations).
- Added new API for writing windowed and unbounded collections to TextIO and AvroIO. For example, see TextIO.Write.withWindowedWrites() and TextIO.Write.withFilenamePolicy(FilenamePolicy).
- Added TFRecordIO to read and write TensorFlow TFRecord files.
- Added the ability to automatically register CoderProviders in the default CoderRegistry. CoderProviders are registered by a ServiceLoader via concrete implementations of a CoderProviderRegistrar.
- Changed the order of parameters for ParDo with side inputs and outputs.
- Changed the order of parameters for the MapElements and FlatMapElements transforms when specifying an output type.
- Additional changes affected PubsubIO and KafkaIO.
- Additional changes affected TextIO, AvroIO, TFRecordIO, KinesisIO, and BigQueryIO.
- Changed how to set the WindowFn itself using the Window transform.
- Consolidated XmlSource and XmlSink into XmlIO.
- Renamed CountingInput to GenerateSequence and unified the syntax for producing bounded and unbounded sequences.
- Renamed BoundedSource#splitIntoBundles to #split.
- Renamed UnboundedSource#generateInitialSplits to #split.
- Output from @StartBundle is no longer possible. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type StartBundleContext to access PipelineOptions (see the sketch after this list).
- Output from @FinishBundle now always requires an explicit timestamp and window. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type FinishBundleContext to access PipelineOptions and emit output to specific windows.
- XmlIO is no longer part of the SDK core. It must be added manually using the new xml-io package.

Please see the Cloud Dataflow documentation and release notes for version 2.0.
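As referenced above, the following is a minimal sketch of the new bundle-method signatures; the element types Foo and Baz match the examples later in this document, and the processing logic is illustrative only:

new DoFn<Foo, Baz>() {
  @StartBundle
  public void startBundle(StartBundleContext context) {
    // Emitting output here is no longer possible; the context only exposes PipelineOptions.
    PipelineOptions options = context.getPipelineOptions();
  }

  @ProcessElement
  public void processElement(ProcessContext c) { … }

  @FinishBundle
  public void finishBundle(FinishBundleContext context) {
    // Any output emitted here must specify an explicit timestamp and window, for example:
    // context.output(someBaz, someTimestamp, someWindow);
  }
}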
The Dataflow SDK for Java 2.0.0-beta3 is the third 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
This is a Beta release of the Dataflow SDK 2.x for Java and comes with a number of caveats.
Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Changed TextIO to only operate on strings.
- Changed KafkaIO to specify type parameters explicitly.
- Additional changes affected ToString and the Count, Latest, Sample, and SortValues transforms.
- Renamed Write.Bound to Write.
- Renamed the Flatten transform classes.
- Split the GroupByKey.create method into create and createWithFewKeys methods.

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
The Dataflow SDK for Java 2.0.0-beta2 is the second 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
This is a Beta release of the Dataflow SDK 2.x for Java and comes with a number of caveats.
This release is based on a subset of Apache Beam 0.5.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Improved PubsubIO functionality: Read and Write now provide access to Cloud Pubsub message attributes.
- Additional changes in this release relate to the DirectRunner.
- Changed PubsubIO construction: PubsubIO.Read and PubsubIO.Write must now be instantiated using PubsubIO.<T>read() and PubsubIO.<T>write() instead of the static factory methods such as PubsubIO.Read.topic(String). Specifying a coder via .withCoder(Coder) for the output type is required for Read. Specifying a coder for the input type or specifying a format function via .withAttributes(SimpleFunction<T, PubsubMessage>) is required for Write. (A sketch follows this list.)

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
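For illustration, a sketch of that construction pattern under 2.0.0-beta2; apart from the methods quoted above, details such as the .topic(...) setter and the topic path are assumptions:

PCollection<String> messages = p.apply(
    PubsubIO.<String>read()
        .topic("projects/my-project/topics/my-topic")   // assumed setter name and illustrative topic
        .withCoder(StringUtf8Coder.of()));              // a coder for the output type is required for Read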
The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
This is a Beta release of the Dataflow SDK 2.x for Java and comes with a number of caveats.
This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:
- CompressedSource supports reading ZIP-compressed files. TextIO.Write and AvroIO.Write support compressed output.
- Improved AvroIO functionality: Write supports the addition of custom user metadata.
- Improved BigQueryIO functionality: Write now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, enabling it to handle very large datasets.
- Improved BigtableIO functionality: Write supports unbounded PCollections and so can be used in the DataflowRunner in streaming mode.

For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.
Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.
This release has a number of significant changes from the 1.x series of releases.
All users need to read and understand these changes in order to upgrade to 2.x versions.
As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.
Renamed com.google.cloud.dataflow to org.apache.beam

The SDK is now declared in the package org.apache.beam instead of com.google.cloud.dataflow. You will need to update all of your import statements to reflect this change.
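For example, a typical pair of 1.x imports:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.values.PCollection;

becomes the following in 2.x:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.values.PCollection;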
New subpackages: runners.dataflow, runners.direct, and io.gcp

Runners have been reorganized into their own packages, so many things from com.google.cloud.dataflow.sdk.runners have moved into either org.apache.beam.runners.direct or org.apache.beam.runners.dataflow.
Pipeline options specific to running on the Dataflow service have moved from com.google.cloud.dataflow.sdk.options to org.apache.beam.runners.dataflow.options.
Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, BigQueryIO has moved from com.google.cloud.dataflow.sdk.io to org.apache.beam.sdk.io.gcp.bigquery.
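In terms of imports, that move looks like this:

import com.google.cloud.dataflow.sdk.io.BigQueryIO;      // 1.x
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;   // 2.x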
Most IDEs will be able to help identify the new locations. To verify the new location for specific files, you can use the t keyboard shortcut to search the code on GitHub. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.
Removed Pipeline from Runner names

The names of all Runners have been shortened by removing Pipeline from the names. For example, DirectPipelineRunner is now DirectRunner, and DataflowPipelineRunner is now DataflowRunner.
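For example, to select the renamed runner in code (a minimal sketch; you can equivalently pass --runner=DataflowRunner on the command line):

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
options.setRunner(DataflowRunner.class);   // was DataflowPipelineRunner.class in 1.x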
Require setting --tempLocation to a Google Cloud Storage path

Instead of allowing you to specify only one of --stagingLocation or --tempLocation and having Dataflow infer the other, the Dataflow Service now requires --gcpTempLocation to be set to a Google Cloud Storage path, but it can be inferred from the more general --tempLocation. Unless overridden, this will also be used for the --stagingLocation.
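For example, a minimal sketch of supplying the temp location in code (the bucket path is illustrative):

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setTempLocation("gs://my-bucket/temp");   // also used to infer --gcpTempLocation and --stagingLocation unless overridden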
DirectRunner supports unbounded PCollections

The DirectRunner continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded PCollections, and triggers for speculative and late outputs. It more closely aligns to the documented Beam model, and may (correctly) cause additional unit test failures.
As this functionality is now in the DirectRunner, the InProcessPipelineRunner (Dataflow SDK 1.6+ for Java) has been removed.
Replaced BlockingDataflowPipelineRunner with PipelineResult.waitUntilFinish()

The BlockingDataflowPipelineRunner is now removed. If your code programmatically expects to run a pipeline and wait until it has terminated, then it should use the DataflowRunner and explicitly call pipeline.run().waitUntilFinish().

If you used --runner BlockingDataflowPipelineRunner on the command line to interactively induce your main program to block until the pipeline has terminated, then this is a concern of the main program; it should provide an option such as --blockOnRun that will induce it to call waitUntilFinish().
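For example, a minimal sketch of that pattern (--blockOnRun and its getter are application-defined, not SDK options):

PipelineResult result = p.run();
if (options.getBlockOnRun()) {      // hypothetical getter backing an application-defined --blockOnRun flag
  result.waitUntilFinish();
}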
ParDo and DoFn

DoFns use annotations instead of method overrides

In order to allow for more flexibility and customization, DoFn now uses method annotations to customize processing instead of requiring users to override specific methods.
The differences between the new DoFn and the old are demonstrated in the following code sample. Previously (with Dataflow SDK 1.x for Java), your code would look like this:
new DoFn<Foo, Baz>() {
@Override
public void processElement(ProcessContext c) { … }
}
Now (with Dataflow SDK 2.x for Java), your code will look like this:
new DoFn<Foo, Baz>() {
@ProcessElement // <-- This is the only difference
public void processElement(ProcessContext c) { … }
}
If your DoFn accessed ProcessContext#window(), then there is a further change. Instead of this:
public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
@Override
public void processElement(ProcessContext c) {
… (MyWindowType) c.window() …
}
}
you will write this:
public class MyDoFn extends DoFn<Foo, Baz> {
@ProcessElement
public void processElement(ProcessContext c, MyWindowType window) {
… window …
}
}
or:
return new DoFn<Foo, Baz>() {
@ProcessElement
public void processElement(ProcessContext c, MyWindowType window) {
… window …
}
}
The runtime will automatically provide the window to your DoFn.
DoFns are reused across multiple bundles

In order to allow for performance improvements, the same DoFn may now be reused to process multiple bundles of elements, instead of guaranteeing a fresh instance per bundle. Any DoFn that keeps local state (e.g. instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.
To manage the lifecycle, new @Setup and @Teardown methods have been added. The full lifecycle is as follows (while a failure may truncate the lifecycle at any point):

- @Setup: Per-instance initialization of the DoFn, such as opening reusable connections.
- @StartBundle: Per-bundle initialization, such as resetting the state of the DoFn.
- @ProcessElement: The usual element processing.
- @FinishBundle: Per-bundle concluding steps, such as flushing side effects.
- @Teardown: Per-instance teardown of the resources held by the DoFn, such as closing reusable connections.

Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.
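A minimal sketch of a DoFn that uses the full lifecycle; the connection type and helper methods are illustrative only:

new DoFn<Foo, Baz>() {
  private transient Connection connection;   // hypothetical reusable resource

  @Setup
  public void setup() {
    connection = openConnection();           // per-instance: open reusable connections
  }

  @StartBundle
  public void startBundle() {
    // per-bundle: reset any state kept in instance variables
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(transform(c.element()));        // the usual element processing; transform(...) is illustrative
  }

  @FinishBundle
  public void finishBundle() {
    connection.flush();                      // per-bundle: flush side effects
  }

  @Teardown
  public void teardown() {
    connection.close();                      // per-instance: close reusable connections
  }
}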
Removed .named()

Removed the .named() methods from PTransforms and sub-classes. Instead, use PCollection.apply("name", PTransform).
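For example, assuming words is a PCollection<String>:

// In 1.x the step name was attached to the transform via .named(...); in 2.x it is passed to apply().
PCollection<KV<String, Long>> counts = words.apply("CountWords", Count.perElement());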
Renamed PTransform.apply() to PTransform.expand()

PTransform.apply() was renamed to PTransform.expand() to avoid confusion with PCollection.apply(). All user-written composite transforms will need to rename the overridden apply() method to expand(). There is no change to how pipelines are constructed.
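For example, a minimal composite transform now overrides expand() where it previously overrode apply():

public class CountPerElement<T> extends PTransform<PCollection<T>, PCollection<KV<T, Long>>> {
  @Override
  public PCollection<KV<T, Long>> expand(PCollection<T> input) {   // was apply() in the 1.x SDK
    return input.apply(Count.perElement());
  }
}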
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Earlier Dataflow SDK 1.x for Java releases included the following changes:

- Added the ValueProvider interface for use in pipeline options. Making an option of type ValueProvider<T> instead of T allows its value to be supplied at runtime (rather than pipeline construction time) and enables Dataflow templates. Support for ValueProvider has been added to TextIO, PubSubIO, and BigQueryIO and can be added to arbitrary PTransforms as well. (A sketch follows this list.)
- Added the --saveProfilesToGcs pipeline option. For more information on profiling pipelines executed by the DataflowPipelineRunner, see issue #72.
- Removed the --enableProfilingAgent pipeline option that saved profiles to the individual worker disks. For more information on profiling pipelines executed by the DataflowPipelineRunner, see issue #72.
- Modified FileBasedSource to throw an exception when reading from a file pattern that has no matches. Pipelines will now fail at runtime rather than silently reading no data in this case. This change affects TextIO.Read or AvroIO.Read when configured withoutValidation.
- Improved Coder validation in the DirectPipelineRunner to catch coders that cannot properly encode and decode their input.
- Additional changes affected PipelineOptions, the DataflowPipelineRunner in streaming mode, the InProcessRunner (enabling testing with larger datasets), TextIO, AvroIO, and other FileBasedSource implementations, and the DataflowPipelineRunner.
- Added support in BigQueryIO.Read for queries in the new BigQuery Standard SQL dialect using .withStandardSQL().
- Added support in BigQueryIO for the new BYTES, TIME, DATE, and DATETIME types.
- Added support in BigtableIO.Read for reading from a restricted key range using .withKeyRange(ByteKeyRange).
- Improved CompressedSource, leading to better performance when executing batch pipelines that use TextIO.Read on the Cloud Dataflow service.
- Additional changes affected BigQueryIO.Write in streaming mode.
- Added support for Cloud Datastore API v1 in the new com.google.cloud.dataflow.sdk.io.datastore.DatastoreIO. Deprecated the old DatastoreIO class that supported only the deprecated Cloud Datastore API v1beta2.
- Improved DatastoreIO.Read to support dynamic work rebalancing, and added an option to control the number of query splits using withNumQuerySplits.
- Improved DatastoreIO.Write to work with an unbounded PCollection, supporting writing to Cloud Datastore when using the DataflowPipelineRunner in streaming mode.
- Added the ability to delete Entity objects directly using Datastore.v1().deleteEntity or to delete entities by key using Datastore.v1().deleteKey.
- Added support for reading from a BoundedSource to the DataflowPipelineRunner in streaming mode. This enables the use of TextIO.Read, AvroIO.Read and other bounded sources in these pipelines.
- Additional changes affected TextIO.Write and Sink.
- Added TestStream to enable testing of triggers with multiple panes and late data with the InProcessPipelineRunner.
- Added the ability to control the rate at which UnboundedCountingInput produces elements using withRate(long, Duration).
- Additional changes affected the DataflowPipelineRunner in streaming mode.
- In support of TestStream, reimplemented DataflowAssert to use GroupByKey instead of sideInputs to check assertions. This is an update-incompatible change to DataflowAssert for pipelines run on the DataflowPipelineRunner in streaming mode.
- Fixed an issue in which FileBasedSink would produce no files when writing an empty PCollection.
- Fixed an issue in which BigQueryIO.Read could not query a table in a non-US region when using the DirectPipelineRunner or the InProcessPipelineRunner.
- Fixed an issue in which allowedLateness could cause an IllegalStateException for pipelines run in the DirectPipelineRunner.
- Fixed a NullPointerException that could be thrown during pipeline submission when using an AfterWatermark trigger with no late firings.
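As referenced above, here is a minimal sketch of a ValueProvider-typed pipeline option, written with the Beam 2.x class names (the options interface, option name, and file path are illustrative; the 1.x SDK offers an equivalent ValueProvider under com.google.cloud.dataflow.sdk.options):

public interface MyOptions extends PipelineOptions {
  @Description("Path of the file to read from")
  ValueProvider<String> getInputFile();             // value may be supplied at runtime, e.g. when a template is executed
  void setInputFile(ValueProvider<String> value);
}

MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from(options.getInputFile()));   // TextIO.read().from(...) accepts a ValueProvider<String>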