Bytewax Versions

Python Stream Processing

v0.19.1

1 month ago

Overview

  • Fixes a bug where using a system clock on certain architectures causes items to be dropped from windows.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.19.0...v0.19.1

v0.19.0

1 month ago

Overview

  • Multiple operators have been reworked to avoid taking and releasing Python's global interpreter lock while iterating over multiple items. Windowing operators, stateful operators and operators like branch will see significant performance improvements.

    Thanks to @damiondoesthings for helping us track this down!

  • Breaking change FixedPartitionedSource.build_part, DynamicSource.build, FixedPartitionedSink.build_part and DynamicSink.build now take an additional step_id argument. This argument can be used when labeling custom Python metrics.

  • Custom Python metrics can now be collected using the prometheus-client library.

  • Breaking change The schema registry interface has been removed. You can still use schema registries, but you need to instantiate the (de)serializers on your own. This allows for more flexibility. See the confluent_serde and redpanda_serde examples for how to use the new interface.

  • Fixes bug where items would be incorrectly marked as late in sliding and tumbling windows in cases where the timestamps are very far from the align_to parameter of the windower.

  • Adds stateful_flat_map operator.

  • Breaking change Removes builder argument from stateful_map. Instead, the initial state value is always None and you can call your previous builder by hand in the mapper.

  • Breaking change Improves performance by removing the now: datetime argument from FixedPartitionedSource.build_part, DynamicSource.build, and UnaryLogic.on_item. If you need the current time, use:

    from datetime import datetime, timezone

    now = datetime.now(timezone.utc)

  • Breaking change Improves performance by removing the sched: datetime argument from StatefulSourcePartition.next_batch, StatelessSourcePartition.next_batch, and UnaryLogic.on_notify. You should already have the scheduled next awake time in whatever instance variable you returned in {Stateful,Stateless}SourcePartition.next_awake or UnaryLogic.notify_at.
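
For the stateful_map change above, here is a minimal migration sketch. It assumes the mapper takes (state, value) and returns (updated_state, output), and that state is None the first time a key is seen. The names build_state and running_mean are hypothetical and only for illustration. The mapper is driven by hand here so the snippet runs without Bytewax installed.

```python
from typing import Optional, Tuple


def build_state() -> dict:
    # Hypothetical: the function you previously passed as `builder`.
    return {"count": 0, "total": 0.0}


def running_mean(state: Optional[dict], value: float) -> Tuple[dict, float]:
    # The initial state is now always None, so call the old builder by hand.
    if state is None:
        state = build_state()
    state["count"] += 1
    state["total"] += value
    return state, state["total"] / state["count"]


# Drive the mapper manually for a single key to show the state handling.
state = None
means = []
for value in [2.0, 4.0, 6.0]:
    state, mean = running_mean(state, value)
    means.append(mean)

print(means)  # [2.0, 3.0, 4.0]
```

In a dataflow, this mapper would presumably be passed as something like op.stateful_map("running_mean", keyed_stream, running_mean); check the operator's documentation for the exact signature in your version.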

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.2...v0.19.0

v0.18.2

3 months ago

Overview

  • Fixes a bug that prevented the deletion of old state in recovery stores.
  • Better error messages on invalid epoch and backup interval parameters.
  • Fixes bug where dataflow will hang if a source's next_awake is set far in the future.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.1...v0.18.2

v0.18.1

4 months ago

Overview

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.0...v0.18.1

v0.18.0

4 months ago

Overview

  • Support for schema registries, through bytewax.connectors.kafka.registry.RedpandaSchemaRegistry and bytewax.connectors.kafka.registry.ConfluentSchemaRegistry.

  • Custom Kafka operators in bytewax.connectors.kafka.operators: input, output, deserialize_key, deserialize_value, deserialize, serialize_key, serialize_value and serialize.

  • Breaking change KafkaSource now emits a special KafkaSourceMessage to allow access to all data on consumed messages. KafkaSink now consumes KafkaSinkMessage to allow setting additional fields on produced messages.

  • Non-linear dataflows are now possible. Each operator method returns a handle to the Streams it produces; add further steps by calling operator functions on those returned handles, not on the root Dataflow. See the migration guide for more info.

  • Auto-complete and type hinting on operators, inputs, outputs, streams, and logic functions now works.

  • A ton of new operators: collect_final, count_final, count_window, flatten, inspect_debug, join, join_named, max_final, max_window, merge, min_final, min_window, key_on, key_assert, key_split, unary. Documentation for all operators is in bytewax.operators now.

  • New operators can be added in Python, made by grouping existing operators. See bytewax.dataflow module docstring for more info.

  • Breaking change Operators are now stand-alone functions; import bytewax.operators as op and use e.g. op.map("step_id", upstream, lambda x: x + 1).

  • Breaking change All operators must take a step_id argument now.

  • Breaking change fold and reduce operators have been renamed to fold_final and reduce_final. They now only emit on EOF and are only for use in batch contexts.

  • Breaking change batch operator renamed to collect, so as to not be confused with runtime batching. Behavior is unchanged.

  • Breaking change output operator no longer forwards its items downstream. Add operators on the upstream handle instead.

  • next_batch on input partitions can now return any Iterable, not just a List.

  • inspect operator now has a default inspector that prints out items with the step ID.

  • collect_window operator now can collect into sets and dicts.

  • Adds a get_fs_id argument to {Dir,File}Source to allow handling non-identical files per worker.

  • Adds TestingSource.EOF and TestingSource.ABORT sentinel values you can use to test recovery.

  • Breaking change Adds a datetime argument to FixedPartitionedSource.build_part, DynamicSource.build, StatefulSourcePartition.next_batch, and StatelessSourcePartition.next_batch. You can now use this to update your next_awake time easily.

  • Breaking change Window operators now emit WindowMetadata objects downstream. These objects can be used to introspect the open_time and close_time of windows. This changes the output type of windowing operators from (key, values) to (key, (metadata, values)).

  • Breaking change IO classes and connectors have been renamed to better reflect their semantics and match up with documentation.

  • Moves the ability to start multiple Python processes with the -p or --processes option to the bytewax.testing module.

  • Breaking change SimplePollingSource moved from bytewax.connectors.periodic to bytewax.inputs since it is an input helper.

  • SimplePollingSource's align_to argument now works.
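
As an illustration of the new windowing output shape described above, the sketch below unpacks a (key, (metadata, values)) item. FakeWindowMetadata is a hypothetical stand-in that models only the open_time and close_time attributes named in these notes, so the snippet runs without Bytewax installed; summarize is likewise an illustrative name, not part of the library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass
class FakeWindowMetadata:
    # Hypothetical stand-in for WindowMetadata; only these two fields are modeled.
    open_time: datetime
    close_time: datetime


def summarize(
    item: Tuple[str, Tuple[FakeWindowMetadata, List[int]]]
) -> Tuple[str, float, int]:
    # Old shape: key, values = item
    # New shape: the values are nested next to the window metadata.
    key, (metadata, values) = item
    length_s = (metadata.close_time - metadata.open_time).total_seconds()
    return key, length_s, sum(values)


open_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
meta = FakeWindowMetadata(open_time, open_time + timedelta(minutes=1))
result = summarize(("user_a", (meta, [1, 2, 3])))
print(result)  # ('user_a', 60.0, 6)
```

Downstream steps that previously unpacked (key, values) need the extra level of unpacking shown in summarize.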

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.18.0

v0.17.2

7 months ago

Overview

  • Fixes error message creation, and updates error messages when creating recovery partitions.

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.17.2

v0.17.1

8 months ago

Overview

  • Adds the batch operator to Dataflows. Calling Dataflow.batch will batch incoming items until either a batch size has been reached or a timeout has passed.

  • Adds the SimplePollingInput source. Subclass this input source to periodically source new input for a dataflow.

  • Re-adds GLIBC 2.27 builds to support older Linux distributions.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.0...v0.17.1

v0.17.0

8 months ago

Changed

  • Breaking change Recovery system re-worked. Kafka-based recovery removed. SQLite recovery file format changed; existing recovery DB files cannot be used. See the module docstring for bytewax.recovery for how to use the new recovery system.

  • Dataflow execution supports rescaling over resumes. You can now change the number of workers and still get proper execution and recovery.

  • The epoch-interval parameter has been renamed to snapshot-interval.

  • The list_parts method of PartitionedInput has been changed to return a List[str] and should only reflect the available inputs that a given worker has access to. You no longer need to return the complete set of partitions for all workers.

  • The next method of StatefulSource and StatelessSource has been changed to next_batch and should return a List of elements, or the empty list if there are no elements to return.

Added

  • Added a new CLI parameter, backup-interval, to configure the length of time to wait before "garbage collecting" older recovery snapshots.

  • Added next_awake to input classes, which can be used to schedule when the next call to next_batch should occur. Use next_awake instead of time.sleep.

  • Added bytewax.inputs.batcher_async to bridge async Python libraries in Bytewax input sources.

  • Added support for linux/aarch64 and linux/armv7 platforms.
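
To illustrate the next_awake item above, here is a rough sketch of a source that schedules its own wake-ups instead of calling time.sleep. CounterSource is a hypothetical, duck-typed class that does not subclass any Bytewax base class, so it runs standalone. It assumes next_batch takes no arguments and returns a list, and that next_awake returns the datetime before which the runtime should not call next_batch again.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


class CounterSource:
    """Hypothetical source that emits one integer per interval."""

    def __init__(self, interval: timedelta, start: datetime):
        self._interval = interval
        self._next_awake = start
        self._count = 0

    def next_batch(self) -> List[int]:
        # Emit one item, then push the wake-up time forward rather than
        # blocking the worker with time.sleep.
        item = self._count
        self._count += 1
        self._next_awake += self._interval
        return [item]

    def next_awake(self) -> Optional[datetime]:
        # The scheduled time is kept in an instance variable so it can be
        # read back here (and inside next_batch, if needed).
        return self._next_awake


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
source = CounterSource(timedelta(seconds=5), start)
first = source.next_batch()
wake = source.next_awake()
```

After the first batch, wake is five seconds past start. Keeping the scheduled time in an instance variable is also what later releases rely on once the datetime arguments are removed from next_batch.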

Removed

  • KafkaRecoveryConfig has been removed as a recovery store.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.2...v0.17.0

v0.16.2

11 months ago

Overview

  • Adds support for Windows builds. Thanks @zzl221000!
  • Adds a CSVInput subclass of FileInput.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.1...v0.16.2

v0.16.1

1 year ago

Overview

  • Add a cooldown for activating workers to reduce CPU consumption.
  • Add support for Python 3.11.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.0...v0.16.1