DataflowJavaSDK Versions Save

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

v2.1.0

6 years ago

Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam 2.1.0 release notes for additional change information.

Issues

Known issue: When running in batch mode, Gauge metrics are not reported.

Updates and improvements

  • Added Metrics support forDataflowRunner in streaming mode.
  • Added OnTimeBehavior to WindowinStrategy to control emitting of ON_TIME panes.
  • Added default file name policy for windowed file FileBasedSinks which consume windowed input.
  • Fixed an issue in which processing time timers for expired windows were ignored.
  • Fixed an issue in which DatastoreIO failed to make progress when Datastore was slow to respond.
  • Fixed an issue in which bzip2 files were being partially read; added support for concatenated bzip2 files.
  • Improved several stability, performance, and documentation issues.

v1.9.1

6 years ago
  • Fixed an issue with Dataflow jobs that read from CompressedSources with compression type set to BZIP2 are potentially losing data during processing. For more information, see Issue #596.

v2.0.0

6 years ago

The Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Dataflow SDK for Java, based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.

Note for users upgrading from version 1.x

This is a new major version, and therefore comes with the following caveats:

  • Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases.
  • Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.

Updates and improvements since 2.0.0-beta3

Version 2.0.0 is based on a subset of Apache Beam 2.0.0. The most relevant changes in this release for Cloud Dataflow customers include:

  • Added new API in BigQueryIO for writing into multiple tables, possibly with different schemas, based on data. See BigQueryIO.Write.to(SerializableFunction) and BigQueryIO.Write.to(DynamicDestinations).
  • Added new API for writing windowed and unbounded collections to TextIO and AvroIO. For example, see TextIO.Write.withWindowedWrites() and TextIO.Write.withFilenamePolicy(FilenamePolicy).
  • Added TFRecordIO to read and write TensorFlow TFRecord files.
  • Added the ability to automatically register CoderProviders in the default CoderRegistry. CoderProviders are registered by a ServiceLoader via concrete implementations of a CoderProviderRegistrar.
  • Changed order of parameters for ParDo with side inputs and outputs.
  • Changed order of parameters for MapElements and FlatMapElements transforms when specifying an output type.
  • Changed the pattern for reading and writing custom types to PubsubIO and KafkaIO.
  • Changed the syntax for reading to and writing from TextIO, AvroIO, TFRecordIO, KinesisIO, BigQueryIO.
  • Changed syntax for configuring windowing parameters other than the WindowFn itself using the Window transform.
  • Consolidated XmlSource and XmlSink into XmlIO.
  • Renamed CountingInput to GenerateSequence and unified the syntax for producing bounded and unbounded sequences.
  • Renamed BoundedSource#splitIntoBundles to #split.
  • Renamed UnboundedSource#generateInitialSplits to #split.
  • Output from @StartBundle is no longer possible. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type StartBundleContext to access PipelineOptions.
  • Output from @FinishBundle now always requires an explicit timestamp and window. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type FinishBundleContext to access PipelineOptions and emit output to specific windows.
  • XmlIO is no longer part of the SDK core. It must be added manually using the new xml-io package.

More information

Please see Cloud Dataflow documentation and release notes for version 2.0.

v2.0.0-beta3

7 years ago

The Dataflow SDK for Java 2.0.0-beta3 is the third 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.

  • Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
  • Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.

Beta

This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:

  • No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
  • Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series as convenient for you. Beta release are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
  • Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.

Updates since 2.0.0-beta2

Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0. The most relevant changes in this release for Cloud Dataflow customers include:

  • Changed TextIO to only operate on strings.
  • Changed KafkaIO to specify type parameters explicitly.
  • Renamed factory functions of ToString.
  • Changed Count, Latest, Sample, SortValues transforms.
  • Renamed Write.Bound to Write.
  • Renamed Flatten transform classes.
  • Split GroupByKey.create method into create and createWithFewKeys methods.

Additional breaking changes

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.

v2.0.0-beta2

7 years ago

The Dataflow SDK for Java 2.0.0-beta2 is the second 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.

  • Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
  • Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.

Beta

This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:

  • No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
  • Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series as convenient for you. Beta release are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
  • Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.

Updates since 2.0.0-beta1

This release is based on a subset of Apache Beam 0.5.0. The most relevant changes in this release for Cloud Dataflow customers include:

  • PubsubIO functionality: Read and Write now provide access to Cloud Pubsub message attributes.
  • New scenario: support for stateful pipelines via the new State API.
  • New scenario: support for timer via the new Timer API (limited to the DirectRunner in this release).
  • Change to PubsubIO construction: PubsubIO.Read and PubsubIO.Write must now be instantiated using PubsubIO.<T>read() and PubsubIO.<T>write() instead of the static factory methods such as PubsubIO.Read.topic(String). Specifying a coder via .withCoder(Coder) for the output type is required for Read. Specifying a coder for the input type or specifying a format function via .withAttributes(SimpleFunction<T, PubsubMessage>) is required for Write.

Additional breaking changes

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.

v2.0.0-beta1

7 years ago

The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.

  • Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
  • Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.

Beta

This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:

  • No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
  • Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series as convenient for you. Beta release are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
  • Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.

New Functionality

This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:

  • Improvements to compression: CompressedSource supports reading ZIP-compressed files. TextIO.Write and AvroIO.Write support compressed output.
  • AvroIO functionality: Write supports the addition of custom user metadata.
  • BigQueryIO functionality: Write now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs enabling it to handle very large datasets.
  • BigtableIO functionality: Write supports unbounded PCollections and so can be used in the DataflowRunner in the streaming mode.

For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.

Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.

Major changes from Dataflow SDK 1.x for Java

This release has a number of significant changes from the 1.x series of releases.

All users need to read and understand these changes in order to upgrade to 2.x versions.

Package rename and restructuring

As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.

  • Renamed com.google.cloud.dataflow to org.apache.beam

    The SDK is now declared in the package org.apache.beam instead of com.google.cloud.dataflow. You need to update all your import statements with this change.

  • New subpackages: runners.dataflow, runners.direct, and io.gcp

    Runners have been reorganized into their own packages, so many things from com.google.cloud.dataflow.sdk.runners have moved into either org.apache.beam.runners.direct or org.apache.beam.runners.dataflow.

    Pipeline options specific to running on the Dataflow service have moved from com.google.cloud.dataflow.sdk.options to org.apache.beam.runners.dataflow.options.

    Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, BigQueryIO has moved from com.google.cloud.dataflow.sdk.io to org.apache.beam.sdk.io.gcp.bigquery.

    Most IDEs will be able to help identify the new locations. To verify the new location for specific files, you can use t to search the code on GitHub. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.

Runners

  • Removed Pipeline from Runner names

    The names of all Runners have been shortened by removing Pipeline from the names. For example, DirectPipelineRunner is now DirectRunner, and DataflowPipelineRunner is now DataflowRunner.

  • Require setting --tempLocation to a Google Cloud Storage path

    Instead of allowing allowing you to specify only one of --stagingLocation or --tempLocation and then Dataflow inferring the other, the Dataflow Service now requires --gcpTempLocation to be set to a Google Cloud Storage path, but it can be inferred from the more general --tempLocation. Unless overridden, this will also be used for the --stagingLocation.

  • DirectRunner supports unbounded PCollections

    The DirectRunner continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded PCollections, and triggers for speculative and late outputs. It more closely aligns to the documented Beam model, and may (correctly) cause additional unit tests failures.

    As this functionality is now in the DirectRunner, the InProcessPipelineRunner (Dataflow SDK 1.6+ for Java) has been removed.

  • Replaced BlockingDataflowPipelineRunner with PipelineResult.waitToFinish()

    The BlockingDataflowPipelineRunner is now removed. If your code programmatically expects to run a pipeline and wait until it has terminated, then it should use the DataflowRunner and explicitly call pipeline.run().waitToFinish().

    If you used --runner BlockingDataflowPipelineRunner on the command line to interactively induce your main program to block until the pipeline has terminated, then this is a concern of the main program; it should provide an option such as --blockOnRun that will induce it to call waitToFinish().

ParDo and DoFn

  • DoFns use annotations instead of method overrides

    In order to allow for more flexibility and customization, DoFn now uses method annotations to customize processing instead of requiring users to override specific methods.

    The differences between the new DoFn and the old are demonstrated in the following code sample. Previously (with Dataflow SDK 1.x for Java), your code would look like this:

    new DoFn<Foo, Baz>() {
      @Override
      public void processElement(ProcessContext c) { … }
    }
    

    Now (with Dataflow SDK 2.x for Java), your code will look like this:

    new DoFn<Foo, Baz>() {
      @ProcessElement   // <-- This is the only difference
      public void processElement(ProcessContext c) { … }
    }
    

    If your DoFn accessed ProcessContext#window(), then there is a further change. Instead of this:

    public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
      @Override
      public void processElement(ProcessContext c) {
        … (MyWindowType) c.window() …
      }
    }
    

    you will write this:

    public class MyDoFn extends DoFn<Foo, Baz> {
      @ProcessElement
      public void processElement(ProcessContext c, MyWindowType window) {
        … window …
      }
    }
    

    or:

    return new DoFn<Foo, Baz>() {
      @ProcessElement
      public void processElement(ProcessContext c, MyWindowType window) {
        … window …
      }
    }
    

    The runtime will automatically provide the window to your DoFn.

  • DoFns are reused across multiple bundles

    In order to allow for performance improvements, the same DoFn may now be reused to process multiple bundles of elements, instead of guaranteeing a fresh instance per bundle. Any DoFn that keeps local state (e.g. instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.

    To manage the lifecycle, new @Setup and @Teardown methods have been added. The full lifecycle is as follows (while a failure may truncate the lifecycle at any point):

    • @Setup: Per-instance initialization of the DoFn, such as opening reusable connections.
    • Any number of the sequence:
      • @StartBundle: Per-bundle initialization, such as resetting the state of the DoFn.
      • @ProcessElement: The usual element processing.
      • @FinishBundle: Per-bundle concluding steps, such as flushing side effects.
    • @Teardown: Per-instance teardown of the resources held by the DoFn, such as closing reusable connections.

    Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.

PTransforms

  • Removed .named()

    Remove the .named() methods from PTransforms and sub-classes. Instead use PCollection.apply(“name”, PTransform).

  • Renamed PTransform.apply() to PTransform.expand()

    PTransform.apply() was renamed to PTransform.expand() to avoid confusion with PCollection.apply(). All user-written composite transforms will need to rename the overridden apply() method to expand(). There is no change to how pipelines are constructed.

Additional breaking changes

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.

v1.9.0

7 years ago
  • Added the ValueProvider interface for use in pipeline options. Making an option of type ValueProvider<T> instead of T allows its value to be supplied at runtime (rather than pipeline construction time) and enables Dataflow templates. Support for ValueProvider has been added to TextIO, PubSubIO, and BigQueryIO and can be added to arbitrary PTransforms as well.
  • Added the ability to automatically save profiling information to Google Cloud Storage using the --saveProfilesToGcs pipeline option. For more information on profiling pipelines executed by the DataflowPipelineRunner, see issue #72.
  • Deprecated the --enableProfilingAgent pipeline option that saved profiles to the individual worker disks. For more information on profiling pipelines executed by the DataflowPipelineRunner, see issue #72.
  • Changed FileBasedSource to throw an exception when reading from a file pattern that has no matches. Pipelines will now fail at runtime rather than silently reading no data in this case. This change affects TextIO.Read or AvroIO.Read when configured withoutValidation.
  • Enhanced Coder validation in the DirectPipelineRunner to catch coders that cannot properly encode and decode their input.
  • Improved display data throughout core transforms, including properly handling arrays in PipelineOptions.
  • Improved performance for pipelines using the DataflowPipelineRunner in streaming mode.
  • Improved scalability of the InProcessRunner, enabling testing with larger datasets.
  • Improved the cleanup of temporary files created by TextIO, AvroIO, and other FileBasedSource implementations.
  • Modified the default version range in the archetypes to exclude beta releases of Dataflow SDK for Java, version 2.0.0 and later.

v1.8.1

7 years ago
  • Improved the performance of bounded side inputs in the DataflowPipelineRunner.

v1.8.0

7 years ago
  • Added support to BigQueryIO.Read for queries in the new BigQuery Standard SQL dialect using .withStandardSQL().
  • Added support in BigQueryIO for the new BYTES, TIME, DATE, and DATETIME types.
  • Added support to BigtableIO.Read for reading from a restricted key range using .withKeyRange(ByteKeyRange).
  • Improved initial splitting of large uncompressed files in CompressedSource, leading to better performance when executing batch pipelines that use TextIO.Read on the Cloud Dataflow service.
  • Fixed a performance regression when using BigQueryIO.Write in streaming mode.

v1.7.0

7 years ago
  • Added support for Cloud Datastore API v1 in the new com.google.cloud.dataflow.sdk.io.datastore.DatastoreIO. Deprecated the old DatastoreIO class that supported only the deprecated Cloud Datastore API v1beta2.
  • Improved DatastoreIO.Read to support dynamic work rebalancing, and added an option to control the number of query splits using withNumQuerySplits.
  • Improved DatastoreIO.Write to work with an unbounded PCollection, supporting writing to Cloud Datastore when using the DataflowPipelineRunner in streaming mode.
  • Added the ability to delete Cloud Datastore Entity objects directly using Datastore.v1().deleteEntity or to delete entities by key using Datastore.v1().deleteKey.
  • Added support for reading from a BoundedSource to the DataflowPipelineRunner in streaming mode. This enables the use of TextIO.Read, AvroIO.Read and other bounded sources in these pipelines.
  • Added support for optionally writing a header and/or footer to text files produced with TextIO.Write.
  • Added the ability to control the number of output shards produced when using a Sink.
  • Added TestStream to enable testing of triggers with multiple panes and late data with the InProcessPipelineRunner.
  • Added the ability to control the rate at which UnboundedCountingInput produces elements using withRate(long, Duration).
  • Improved performance and stability for pipelines using the DataflowPipelineRunner in streaming mode.
  • To support TestStream, reimplemented DataflowAssert to use GroupByKey instead of sideInputs to check assertions. This is an update-incompatible change to DataflowAssert for pipelines run on the DataflowPipelineRunner in streaming mode.
  • Fixed an issue in which a FileBasedSink would produce no files when writing an empty PCollection.
  • Fixed an issue in which BigQueryIO.Read could not query a table in a non-US region when using the DirectPipelineRunner or the InProcessPipelineRunner.
  • Fixed an issue in which the combination of timestamps near the end of the global window and a large allowedLateness could cause an IllegalStateException for pipelines run in the DirectPipelineRunner.
  • Fixed a NullPointerException that could be thrown during pipeline submission when using an AfterWatermark trigger with no late firings.