Python Stream Processing
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.19.0...v0.19.1
Multiple operators have been reworked to avoid taking and releasing
Python's global interpreter lock while iterating over multiple items.
Windowing operators, stateful operators and operators like branch
will see significant performance improvements.
Thanks to @damiondoesthings for helping us track this down!
Breaking change FixedPartitionedSource.build_part
,
DynamicSource.build
, FixedPartitionedSink.build_part
and DynamicSink.build
now take an additional step_id
argument. This argument can be used when
labeling custom Python metrics.
Custom Python metrics can now be collected using the prometheus-client
library.
Breaking change The schema registry interface has been removed.
You can still use schema registries, but you need to instantiate
the (de)serializers on your own. This allows for more flexibility.
See the confluent_serde
and redpanda_serde
examples for how
to use the new interface.
Fixes bug where items would be incorrectly marked as late in sliding
and tumbling windows in cases where the timestamps are very far from
the align_to
parameter of the windower.
Adds stateful_flat_map
operator.
Breaking change Removes builder
argument from stateful_map
.
Instead, the initial state value is always None
and you can call
your previous builder by hand in the mapper
.
Breaking change Improves performance by removing the now: datetime
argument from FixedPartitionedSource.build_part
,
DynamicSource.build
, and UnaryLogic.on_item
. If you need the
current time, use:
from datetime import datetime, timezone
now = datetime.now(timezone.utc)
sched: datetime
argument from StatefulSourcePartition.next_batch
,
StatelessSourcePartition.next_batch
, UnaryLogic.on_notify
. You
should already have the scheduled next awake time in whatever
instance variable you returned in
{Stateful,Stateless}SourcePartition.next_awake
or
UnaryLogic.notify_at
.collect
and branch
operator test file names by @davidselassie in https://github.com/bytewax/bytewax/pull/385
stateful_map
builder
function and adds stateful_flat_map
by @davidselassie in https://github.com/bytewax/bytewax/pull/387
now
and sched
arguments in input partitions and unary logic by @davidselassie in https://github.com/bytewax/bytewax/pull/391
time_for
twice by @whoahbot in https://github.com/bytewax/bytewax/pull/414
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.2...v0.19.0
next_awake
is set far in the future.Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.1...v0.18.2
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.1...v0.18.2
KafkaSource
from 1 to 1000 to match the Kafka input operator.count_window
operator: https://github.com/bytewax/bytewax/issues/364.reduce_final
by @davidselassie in https://github.com/bytewax/bytewax/pull/361
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.0...v0.18.1
Support for schema registries, through bytewax.connectors.kafka.registry.RedpandaSchemaRegistry
and bytewax.connectors.kafka.registry.ConfluentSchemaRegistry
.
Custom Kafka operators in bytewax.connectors.kafka.operators
:
input
, output
, deserialize_key
, deserialize_value
, deserialize
,
serialize_key
, serialize_value
and serialize
.
Breaking change KafkaSource
now emits a special KafkaSourceMessage
to allow access to all data on consumed messages. KafkaSink
now consumes KafkaSinkMessage
to allow setting additional fields on produced messages.
Non-linear dataflows are now possible. Each operator method returns
a handle to the Stream
s it produces; add further steps via calling
operator functions on those returned handles, not the root
Dataflow
. See the migration guide for more info.
Auto-complete and type hinting on operators, inputs, outputs, streams, and logic functions now works.
A ton of new operators: collect_final
, count_final
,
count_window
, flatten
, inspect_debug
, join
, join_named
,
max_final
, max_window
, merge
, min_final
, min_window
,
key_on
, key_assert
, key_split
, merge
, unary
. Documentation
for all operators are in bytewax.operators
now.
New operators can be added in Python, made by grouping existing
operators. See bytewax.dataflow
module docstring for more info.
Breaking change Operators are now stand-alone functions; import bytewax.operators as op
and use e.g. op.map("step_id", upstream, lambda x: x + 1)
.
Breaking change All operators must take a step_id
argument now.
Breaking change fold
and reduce
operators have been renamed to
fold_final
and reduce_final
. They now only emit on EOF and are
only for use in batch contexts.
Breaking change batch
operator renamed to collect
, so as to
not be confused with runtime batching. Behavior is unchanged.
Breaking change output
operator does not forward downstream its
items. Add operators on the upstream handle instead.
next_batch
on input partitions can now return any Iterable
, not
just a List
.
inspect
operator now has a default inspector that prints out items
with the step ID.
collect_window
operator now can collect into set
s and dict
s.
Adds a get_fs_id
argument to {Dir,File}Source
to allow handling
non-identical files per worker.
Adds a TestingSource.EOF
and TestingSource.ABORT
sentinel values
you can use to test recovery.
Breaking change Adds a datetime
argument to
FixedPartitionSource.build_part
, DynamicSource.build_part
,
StatefulSourcePartition.next_batch
, and
StatelessSourcePartition.next_batch
. You can now use this to
update your next_awake
time easily.
Breaking change Window operators now emit WindowMetadata
objects
downstream. These objects can be used to introspect the open_time
and close_time of windows. This changes the output type of windowing
operators from: (key, values)
to (key, (metadata, values))
.
Breaking change IO classes and connectors have been renamed to better reflect their semantics and match up with documentation.
Moves the ability to start multiple Python processes with the
-p
or --processes
to the bytewax.testing
module.
Breaking change SimplePollingSource
moved from
bytewax.connectors.periodic
to bytewax.inputs
since it is an
input helper.
SimplePollingSource
's align_to
argument now works.
now
argument to build_part
and next_batch
by @davidselassie in https://github.com/bytewax/bytewax/pull/316
TestingSource.{ABORT, EOF}
by @davidselassie in https://github.com/bytewax/bytewax/pull/317
get_fs_id
argument to {Dir,File}Source
by @davidselassie in https://github.com/bytewax/bytewax/pull/320
SimplePollingSource
by @davidselassie in https://github.com/bytewax/bytewax/pull/324
key_split
operator; fixes type annotations by @davidselassie in https://github.com/bytewax/bytewax/pull/331
Stream
arguments to operators by @davidselassie in https://github.com/bytewax/bytewax/pull/334
Optional
and more complex types in operator signatures by @davidselassie in https://github.com/bytewax/bytewax/pull/338
flat_map_batch
operator by @davidselassie in https://github.com/bytewax/bytewax/pull/341
branch
operator so if it gets a TypeGuard
, the output streams are typed correctly by @davidselassie in https://github.com/bytewax/bytewax/pull/342
rusqlite
deps by @davidselassie in https://github.com/bytewax/bytewax/pull/345
ruff format
instead of black
by @davidselassie in https://github.com/bytewax/bytewax/pull/346
batch
operator to collect
by @davidselassie in https://github.com/bytewax/bytewax/pull/351
join_window
fixes by @davidselassie in https://github.com/bytewax/bytewax/pull/344
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.18.0
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.17.2
Adds the batch
operator to Dataflows. Calling Dataflow.batch
will batch incoming items until either a batch size has been reached
or a timeout has passed.
Adds the SimplePollingInput
source. Subclass this input source to
periodically source new input for a dataflow.
Re-adds GLIBC 2.27 builds to support older linux distributions.
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.0...v0.17.1
Breaking change Recovery system re-worked. Kafka-based recovery
removed. SQLite recovery file format changed; existing recovery DB
files can not be used. See the module docstring for
bytewax.recovery
for how to use the new recovery system.
Dataflow execution supports rescaling over resumes. You can now change the number of workers and still get proper execution and recovery.
epoch-interval
has been renamed to snapshot-interval
The list-parts
method of PartitionedInput
has been changed to
return a List[str]
and should only reflect the available
inputs that a given worker has access to. You no longer need
to return the complete set of partitions for all workers.
The next
method of StatefulSource
and StatelessSource
has
been changed to next_batch
and should return a List
of elements,
or the empty list if there are no elements to return.
Added new cli parameter backup-interval
, to configure the length of
time to wait before "garbage collecting" older recovery snapshots.
Added next_awake
to input classes, which can be used to schedule
when the next call to next_batch
should occur. Use next_awake
instead of time.sleep
.
Added bytewax.inputs.batcher_async
to bridge async Python libraries
in Bytewax input sources.
Added support for linux/aarch64 and linux/armv7 platforms.
KafkaRecoveryConfig
has been removed as a recovery store.ymd
deprecation warnings. by @whoahbot in https://github.com/bytewax/bytewax/pull/262
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.2...v0.17.0
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.1...v0.16.2
Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.0...v0.16.1