Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
- Fixed an issue with TextIO when the compression type is set to GZIP or BZIP2. For more information, see Issue #356.
- Added InProcessPipelineRunner, an improvement over the DirectPipelineRunner that better implements the Dataflow model. InProcessPipelineRunner runs on a user's local machine and supports multithreaded execution, unbounded PCollections, and triggers for speculative and late outputs.
- Added display data, which allows annotating user functions (DoFn, CombineFn, and WindowFn), Sources, and Sinks with static metadata to be displayed in the Dataflow Monitoring Interface. Display data has been implemented for core components and is automatically applied to all PipelineOptions.
- Added the ability to compose multiple CombineFns into a single CombineFn using CombineFns.compose or CombineFns.composeKeyed.
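The composition idea above, running several combine functions over the input in a single pass, can be illustrated with a plain-Java sketch. The names here (ComposeSketch, CombineFnLike, SUM, MAX) are hypothetical stand-ins, not Dataflow SDK APIs:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch of composing two combine functions so the input is
// traversed once. All names are illustrative, not part of the Dataflow SDK.
public class ComposeSketch {
    interface CombineFnLike<A> {
        A identity();
        A add(A acc, int input);
    }

    static final CombineFnLike<Integer> SUM = new CombineFnLike<Integer>() {
        public Integer identity() { return 0; }
        public Integer add(Integer acc, int input) { return acc + input; }
    };

    static final CombineFnLike<Integer> MAX = new CombineFnLike<Integer>() {
        public Integer identity() { return Integer.MIN_VALUE; }
        public Integer add(Integer acc, int input) { return Math.max(acc, input); }
    };

    // "Compose": advance both accumulators side by side in one pass.
    static int[] combineBoth(List<Integer> inputs) {
        Integer sum = SUM.identity();
        Integer max = MAX.identity();
        for (int v : inputs) {
            sum = SUM.add(sum, v);
            max = MAX.add(max, v);
        }
        return new int[] { sum, max };
    }

    public static void main(String[] args) {
        int[] result = combineBoth(Arrays.asList(3, 1, 4, 1, 5));
        System.out.println(result[0] + " " + result[1]); // 14 5
    }
}
```

The benefit mirrors the SDK feature: one traversal of the data produces both aggregates instead of two separate passes.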
- Added getSplitPointsConsumed and getSplitPointsRemaining to the BoundedReader API to improve Dataflow's ability to automatically scale a job reading from these sources. Default implementations of these functions have been provided, but reader implementers should override them to provide better information when available.
- Added CombineFnWithContext, which allows a combine function to access pipeline options and side inputs through a context.
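The split-point counters described above can be sketched in plain Java. This is a conceptual illustration only; SplitPointsSketch is a hypothetical reader, not the SDK's BoundedReader, and it assumes every record is a split point:

```java
import java.util.List;

// Illustrative sketch (not SDK code) of split-point progress reporting:
// the reader tracks how many split points it has consumed and how many remain,
// which an autoscaler could use to estimate remaining work.
public class SplitPointsSketch {
    private final List<String> records;
    private int index = -1; // positioned before the first record

    public SplitPointsSketch(List<String> records) { this.records = records; }

    public boolean advance() {
        if (index + 1 >= records.size()) return false;
        index++;
        return true;
    }

    // Split points fully read so far (every record is a split point here).
    public long getSplitPointsConsumed() { return index + 1; }

    // Split points not yet read.
    public long getSplitPointsRemaining() { return records.size() - (index + 1); }
}
```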
- Added BigtableIO.Read.withRowFilter, which allows Cloud Bigtable rows to be filtered in the Read transform.
- Added support for using Write.to with merging windows.

With this release, we have begun preparing the Dataflow SDK for Java for an eventual move to Apache Beam (incubating). Specifically, we have refactored a number of internal APIs and removed from the SDK those classes used only within the worker; these will now be provided by the Google Cloud Dataflow service during job execution. This refactoring should not affect any user code.
Additionally, the 1.5.0 release includes the following changes:
- Enabled an indexed side input format for batch pipelines. Indexed side inputs significantly improve performance for View.asList, View.asMap, View.asMultimap, and any non-globally-windowed PCollectionViews.
- Upgraded to Protocol Buffers version 3.0.0-beta-1. If you use custom Protocol Buffers, you should recompile them with the corresponding version of the protoc compiler. You can continue using both version 2 and 3 of the Protocol Buffers syntax, and no user pipeline code needs to change.
- Added ProtoCoder, a Coder for Protocol Buffers messages that supports both version 2 and 3 of the Protocol Buffers syntax. This coder can detect when messages can be encoded deterministically. Proto2Coder is now deprecated; we recommend that all users switch to ProtoCoder.
- Added withoutResultFlattening to BigQueryIO.Read to disable flattening query results when reading from BigQuery.
- Added BigtableIO, enabling support for reading from and writing to Google Cloud Bigtable.
- Improved CompressedSource to detect the compression format from the file extension, and added support for reading .gz files that are transparently decompressed by the underlying transport logic.
- Added the ability for Combine functions to access pipeline options and side inputs through a context. See GlobalCombineFn and PerKeyCombineFn for further details.
- Modified ParDo.withSideInputs() such that successive calls are cumulative.
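The "successive calls are cumulative" behavior is a small builder-pattern change: each call appends to the accumulated list instead of replacing it. A minimal plain-Java sketch (CumulativeBuilder and its methods are hypothetical, not the SDK's ParDo API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch (not SDK code) of a cumulative builder method: each
// withSideInputs(...) call appends rather than overwriting earlier calls.
public class CumulativeBuilder {
    private final List<String> sideInputs = new ArrayList<>();

    public CumulativeBuilder withSideInputs(String... inputs) {
        sideInputs.addAll(Arrays.asList(inputs)); // append, do not replace
        return this;
    }

    public List<String> getSideInputs() { return sideInputs; }

    public static void main(String[] args) {
        CumulativeBuilder b = new CumulativeBuilder()
            .withSideInputs("a")
            .withSideInputs("b", "c");
        System.out.println(b.getSideInputs()); // [a, b, c]
    }
}
```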
- Added the ability to limit the number of results returned by DatastoreIO.Source. However, when this limit is set, the operation that reads from Cloud Datastore is performed by a single worker rather than executing in parallel across the worker pool.
- Redefined the semantics of PaneInfo.{EARLY, ON_TIME, LATE} so that panes containing only late data are always LATE, and an ON_TIME pane can never cause a later computation to yield a LATE pane.
- Modified GroupByKey to drop late data when that late data arrives for a window that has expired. A window has expired when the watermark has passed the end of the window by more than the allowed lateness.
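The expiry rule above reduces to a one-line check. A plain-Java sketch, assuming millisecond timestamps (WindowExpirySketch is illustrative, not SDK code):

```java
// Sketch (not SDK code) of the window-expiry rule: data for a window is
// dropped once the watermark passes the end of the window by more than
// the allowed lateness. All times are epoch milliseconds.
public class WindowExpirySketch {
    static boolean isExpired(long windowEndMillis, long allowedLatenessMillis,
                             long watermarkMillis) {
        return watermarkMillis > windowEndMillis + allowedLatenessMillis;
    }

    public static void main(String[] args) {
        long windowEnd = 10_000, allowedLateness = 2_000;
        // Late but within the allowed lateness: kept.
        System.out.println(isExpired(windowEnd, allowedLateness, 11_000)); // false
        // Past windowEnd + allowedLateness: window expired, data dropped.
        System.out.println(isExpired(windowEnd, allowedLateness, 13_000)); // true
    }
}
```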
- When using GlobalWindows, you are no longer required to specify withAllowedLateness(), since no data is ever dropped.
- The default Google Cloud project is now read from the default project configuration generated by newer versions of the gcloud utility. If the default project configuration does not exist, Dataflow reverts to using the old project configuration generated by older versions of the gcloud utility.
- Improved IterableLikeCoder to efficiently encode small values. This change is backward compatible; however, if you have a running pipeline that was constructed with SDK version 1.3.0 or later, it may not be possible to "update" that pipeline with a replacement constructed using SDK version 1.2.1 or older. Updating a running pipeline with a pipeline constructed using a newer SDK version should be successful.
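One plausible reading of "efficiently encode small values" is writing a single length prefix when the element count is known up front, rather than streaming chunked output with per-chunk headers. The following plain-Java sketch illustrates that idea only; it is not the SDK's IterableLikeCoder implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Sketch (not the SDK's actual IterableLikeCoder) of length-prefixed
// encoding for an iterable whose size is known: one 4-byte count,
// followed directly by the encoded elements.
public class IterableEncodingSketch {
    static byte[] encodeKnownSize(List<Integer> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            out.writeInt(values.size());  // single length prefix
            for (int v : values) {
                out.writeInt(v);          // elements follow directly
            }
            out.flush();
        } catch (IOException e) {
            throw new AssertionError(e);  // in-memory stream does not throw
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeKnownSize(java.util.Arrays.asList(1, 2, 3));
        System.out.println(encoded.length); // 16: 4-byte count + 3 * 4-byte ints
    }
}
```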
- When TextIO.Write or AvroIO.Write outputs to a fixed number of files, added a reshard (shuffle) step immediately prior to the write step. The cost of this reshard is often exceeded by the additional parallelism available to the preceding stage.
- Added support for RFC 3339 timestamps in PubsubIO. This allows reading from Cloud Pub/Sub topics published by Cloud Logging without losing timestamp information.
- Added the ability to control the timestamp of values in GroupByKey results using Window.Bound.withOutputTime().
- Simplified the API for the AfterWatermark trigger, as follows: AfterWatermark.pastEndOfWindow().withEarlyFirings(...).withLateFirings(...).
- Fixed an issue in BigQueryIO that unnecessarily printed a large number of messages when executed using the DirectPipelineRunner.
- Added MapElements and FlatMapElements transforms that accept Java 8 lambdas, for those cases when the full power of ParDo is not required. Filter and Partition accept lambdas as well. Java 8 functionality is demonstrated in a new MinimalWordCountJava8 example.
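The lambda-friendly style these transforms enable is loosely analogous to java.util.stream operations. A plain-Java analogy, not Dataflow SDK code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain java.util.stream analogy (not Dataflow SDK code) for the lambda
// style that MapElements, FlatMapElements, and Filter enable.
public class LambdaStyleSketch {
    // FlatMapElements analogue: one input line expands to many words.
    static List<String> toWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .collect(Collectors.toList());
    }

    // Filter analogue: keep only words longer than five characters.
    static List<String> longWords(List<String> words) {
        return words.stream().filter(w -> w.length() > 5).collect(Collectors.toList());
    }

    // MapElements analogue: map each word to its length.
    static List<Integer> lengths(List<String> words) {
        return words.stream().map(String::length).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = toWords(Arrays.asList("hello world", "hello dataflow"));
        System.out.println(words);                     // [hello, world, hello, dataflow]
        System.out.println(lengths(longWords(words))); // [8]
    }
}
```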
- Enabled @DefaultCoder annotations for generic types. Previously, a @DefaultCoder annotation on a generic type was ignored, resulting in diminished functionality and confusing error messages. It now works as expected.
- DatastoreIO now supports (parallel) reads within namespaces. Entities can be written to namespaces by setting the namespace in the Entity key.
- Moved the slf4j-jdk14 dependency to the test scope. When a Dataflow job is executing, the slf4j-api, slf4j-jdk14, jcl-over-slf4j, log4j-over-slf4j, and log4j-to-slf4j dependencies will be provided by the system.
- Added a coder for Set<T> to the coder registry, usable when type T has its own registered coder.
- Added NullableCoder, which can be used in conjunction with other coders to encode a PCollection whose elements may contain null values.
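A common way to build a nullable coder, and a plausible sketch of the idea behind NullableCoder, is to prefix the wrapped encoding with a one-byte null flag. This is a plain-Java illustration only, not the SDK implementation:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Conceptual sketch (not the SDK's NullableCoder) of wrapping another
// encoding with a one-byte null flag: 0 means null, 1 means a value follows.
public class NullableEncodingSketch {
    static byte[] encodeNullableString(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            out.write(0); // null marker, nothing else written
        } else {
            out.write(1); // value marker, then the wrapped encoding
            byte[] payload = value.getBytes(StandardCharsets.UTF_8);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeNullableString(null).length); // 1
        System.out.println(encodeNullableString("ok").length); // 3
    }
}
```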
- Added Filter as a composite PTransform. Deprecated the static methods in the old Filter implementation that return ParDo transforms.
- Added SourceTestUtils, a set of helper functions and test harnesses for testing the correctness of Source implementations.
- Removed the default values for numWorkers, maxNumWorkers, and similar settings. If these are unspecified, the Dataflow service will pick an appropriate value.
- Added checks to the DirectPipelineRunner to help ensure that DoFns obey the existing requirement that inputs and outputs must not be modified.
- Added support in AvroCoder for @Nullable fields with deterministic encoding.
- CustomCoder subclasses should now override the getEncodingId method.
- Converted Source.Reader, BoundedSource.BoundedReader, and UnboundedSource.UnboundedReader to abstract classes instead of interfaces. AbstractBoundedReader has been merged into BoundedSource.BoundedReader.
- Renamed ByteOffsetBasedSource and ByteOffsetBasedReader to OffsetBasedSource and OffsetBasedReader, introducing getBytesPerOffset as a translation layer.
- Changed OffsetBasedReader such that the subclass now has to override startImpl and advanceImpl, rather than start and advance. The protected variable rangeTracker is now hidden and updated by the base class automatically; to indicate split points, use the method isAtSplitPoint.
- TimeTrigger.
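The startImpl/advanceImpl change is an instance of the template-method pattern: the base class owns the public entry points and the bookkeeping, and subclasses supply only the record-reading logic. A plain-Java sketch, with a simple position counter standing in for the hidden range tracker (ReaderTemplateSketch is hypothetical, not SDK code):

```java
// Sketch (not SDK code) of the template-method pattern: the base class
// owns start()/advance() plus bookkeeping, while subclasses override only
// startImpl()/advanceImpl().
public abstract class ReaderTemplateSketch {
    private long position = 0; // managed by the base class, hidden from subclasses

    public final boolean start() {
        boolean hasRecord = startImpl();
        if (hasRecord) position++;
        return hasRecord;
    }

    public final boolean advance() {
        boolean hasRecord = advanceImpl();
        if (hasRecord) position++;
        return hasRecord;
    }

    public final long getPosition() { return position; }

    // Subclasses implement only the record-reading logic.
    protected abstract boolean startImpl();
    protected abstract boolean advanceImpl();

    public static void main(String[] args) {
        ReaderTemplateSketch reader = new ReaderTemplateSketch() {
            private int remaining = 2; // pretend the source has two records
            protected boolean startImpl() { return remaining-- > 0; }
            protected boolean advanceImpl() { return remaining-- > 0; }
        };
        reader.start();
        reader.advance();
        System.out.println(reader.getPosition()); // 2
    }
}
```

Because start() and advance() are final, subclasses cannot forget to update the shared bookkeeping, which is the motivation for hiding rangeTracker in the base class.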