# Bytewax Changelog

All notable changes to this project will be documented in this file.
For help with updating to new Bytewax versions, please see the
[migration guide](https://bytewax.io/docs/reference/migration).

## Latest

__Add any extra change notes here and we'll put them in the release
notes on GitHub when we make a new release.__

## v0.21.1

- `join_window` operator now supports using stream-order via the
  `ordered` parameter.

- Windowing operators now correctly respect `now_getter` and
  `to_system_utc` in `EventClock`.

- Fixes an issue where the runtime would not properly report the
  correct exception being raised by dataflow code.

- Fixes a bug where window closing will be delayed if using event time
  and all values for a key fall into a single window and event
  timestamps are within `wait_for_system_duration` of each other.

- Fixes a bug where using `EventClock` on macOS systems would randomly
  assert.

## v0.21.0

- {py:obj}`~bytewax.inputs.SimplePollingSource` now allows you to
  retain state to support at-least-once delivery.

- Fixes a bug when using
  {py:obj}`~bytewax.operators.windowing.SlidingWindower` where values
  would be assigned to an extra window if their timestamps were near
  the end of the correct window.

- Upstream type hints on
  {py:obj}`bytewax.connectors.kafka.operators.serialize_key`,
  {py:obj}`~bytewax.connectors.kafka.operators.serialize_value`, and
  {py:obj}`~bytewax.connectors.kafka.operators.serialize` have been
  made more broad to support all Kafka serializers, like
  {py:obj}`confluent_kafka.serialization.StringSerializer`.

- *Breaking change* - Fixes a bug which caused two of the same types
  of windowing operators in a dataflow to spuriously result in a
  `ValueError`. This fix invalidates any recovery data for all
  windowing operators; it is recommended to delete and re-create
  the recovery store if you are using windowing operators.

- Fixes a performance issue where
  {py:obj}`bytewax.operators.StatefulBatchLogic.notify_at` (and thus
  many of the other stateful operators' `notify_at` derived from it)
  was being called superfluously.

## v0.20.1

- Fixes a bug when using
  {py:obj}`~bytewax.operators.windowing.EventClock` where in-order but
  "slow" data results in watermark assertion errors.

## v0.20.0

- Adds a dataflow structure visualizer. Run `python -m
  bytewax.visualize`.

- *Breaking change* The internal format of recovery databases has been
  changed from using `JsonPickle` to Python's built-in {py:obj}`pickle`.
  Recovery stores that used the old format will not be usable after
  upgrading.

- *Breaking change* The `unary` operator and `UnaryLogic` have been
  renamed to `stateful` and `StatefulLogic` respectively.

- Adds a `stateful_batch` operator to allow for lower-level batch
  control while managing state.

- `StatefulLogic.on_notify`, `StatefulLogic.on_eof`, and
  `StatefulLogic.notify_at` are now optional overrides. The defaults
  retain the state and emit nothing.

- *Breaking change* Windowing operators have been moved from
  `bytewax.operators.window` into `bytewax.operators.windowing`.

- *Breaking change* `ClockConfig`s have had `Config` dropped from
  their name and are just `Clock`s. E.g. If you previously `from
  bytewax.operators.window import SystemClockConfig` now `from
  bytewax.operators.windowing import SystemClock`.

- *Breaking change* `WindowConfig`s have been renamed to `Windower`s.
  E.g. If you previously `from bytewax.operators.window import
  SessionWindow` now `from bytewax.operators.windowing import
  SessionWindower`.

- *Breaking change* All windowing operators now return a set of
  streams {py:obj}`~bytewax.operators.windowing.WindowOut`.
  {py:obj}`~bytewax.operators.windowing.WindowMetadata` now is
  branched into its own stream and is no longer part of the single
  downstream. All window operator emitted items are labeled with the
  unique window ID they came from to facilitate joining the data
  later.

- *Breaking change* {py:obj}`~bytewax.operators.windowing.fold_window`
  now requires a `merge` argument. This handles whenever the session
  windower determines that two windows must be merged because a new
  item bridged a gap.

- *Breaking change* The `join_named` and `join_window_named` operators
  have been removed because they did not support returning proper type
  information. Use {py:obj}`~bytewax.operators.join` or
  {py:obj}`~bytewax.operators.windowing.join_window` instead, which
  have been enhanced to properly type their downstream values.

- *Breaking change* {py:obj}`~bytewax.operators.join` and
  {py:obj}`~bytewax.operators.windowing.join_window` have had their
  `product` argument replaced with `insert_mode`. You now can specify
  more nuanced kinds of join modes.

- Python interfaces are now provided for custom clocks and windowers.
  Subclass {py:obj}`~bytewax.operators.windowing.Clock` (and a
  corresponding {py:obj}`~bytewax.operators.windowing.ClockLogic`) or
  {py:obj}`~bytewax.operators.windowing.Windower` (and a corresponding
  {py:obj}`~bytewax.operators.windowing.WindowerLogic`) to define your
  own senses of time and window definitions.

- Adds a {py:obj}`~bytewax.operators.windowing.window` operator to
  allow you to write more flexible custom windowing operators.

- Session windows now work correctly with out-of-order data and joins.

- {py:obj}`~bytewax.operators.windowing.WindowMetadata` now contains a
  {py:obj}`~bytewax.operators.windowing.WindowMetadata.merged_ids`
  field with any window IDs that were merged into this window.

- All windowing operators now process items in timestamp order. The
  most visible change that this results in is that the
  {py:obj}`~bytewax.operators.windowing.collect_window` operator now
  emits collections with values in timestamp order.

- Adds a {py:obj}`~bytewax.operators.filter_map_value` operator.

- Adds a {py:obj}`~bytewax.operators.enrich_cached` operator for
  easier joining with an external data source.

- Adds a {py:obj}`~bytewax.operators.key_rm` convenience operator to
  remove keys from a {py:obj}`~bytewax.operators.KeyedStream`.

## v0.19.1

- Fixes a bug where using a system clock on certain architectures
  causes items to be dropped from windows.

## v0.19.0

- Multiple operators have been reworked to avoid taking and releasing
  Python's global interpreter lock while iterating over multiple items.
  Windowing operators, stateful operators and operators like `branch`
  will see significant performance improvements.

  Thanks to @damiondoesthings for helping us track this down!

- *Breaking change* `FixedPartitionedSource.build_part`,
  `DynamicSource.build`, `FixedPartitionedSink.build_part` and `DynamicSink.build`
  now take an additional `step_id` argument. This argument can be used when
  labeling custom Python metrics.

- Custom Python metrics can now be collected using the `prometheus-client`
  library.

- *Breaking change* The schema registry interface has been removed.
  You can still use schema registries, but you need to instantiate
  the (de)serializers on your own. This allows for more flexibility.
  See the `confluent_serde` and `redpanda_serde` examples for how
  to use the new interface.

- Fixes bug where items would be incorrectly marked as late in sliding
  and tumbling windows in cases where the timestamps are very far from
  the `align_to` parameter of the windower.
- Adds `stateful_flat_map` operator.

- *Breaking change* Removes `builder` argument from `stateful_map`.
  Instead, the initial state value is always `None` and you can call
  your previous builder by hand in the `mapper`.

- *Breaking change* Improves performance by removing the `now:
  datetime` argument from `FixedPartitionedSource.build_part`,
  `DynamicSource.build`, and `UnaryLogic.on_item`. If you need the
  current time, use:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
```

- *Breaking change* Improves performance by removing the `sched:
  datetime` argument from `StatefulSourcePartition.next_batch`,
  `StatelessSourcePartition.next_batch`, `UnaryLogic.on_notify`. You
  should already have the scheduled next awake time in whatever
  instance variable you returned in
  `{Stateful,Stateless}SourcePartition.next_awake` or
  `UnaryLogic.notify_at`.

## v0.18.2

- Fixes a bug that prevented the deletion of old state in recovery stores.

- Better error messages on invalid epoch and backup interval
  parameters.

- Fixes bug where dataflow will hang if a source's `next_awake` is set
  far in the future.

## v0.18.1

- Changes the default batch size for `KafkaSource` from 1 to 1000 to match
  the Kafka input operator.

- Fixes an issue with the `count_window` operator:
  https://github.com/bytewax/bytewax/issues/364.

## v0.18.0

- Support for schema registries, through
  `bytewax.connectors.kafka.registry.RedpandaSchemaRegistry` and
  `bytewax.connectors.kafka.registry.ConfluentSchemaRegistry`.

- Custom Kafka operators in `bytewax.connectors.kafka.operators`:
  `input`, `output`, `deserialize_key`, `deserialize_value`,
  `deserialize`, `serialize_key`, `serialize_value` and `serialize`.

- *Breaking change* `KafkaSource` now emits a special
  `KafkaSourceMessage` to allow access to all data on consumed
  messages. `KafkaSink` now consumes `KafkaSinkMessage` to allow
  setting additional fields on produced messages.

- Non-linear dataflows are now possible. Each operator method returns
  a handle to the `Stream`s it produces; add further steps via calling
  operator functions on those returned handles, not the root
  `Dataflow`. See the migration guide for more info.

- Auto-complete and type hinting on operators, inputs, outputs,
  streams, and logic functions now works.

- A ton of new operators: `collect_final`, `count_final`,
  `count_window`, `flatten`, `inspect_debug`, `join`, `join_named`,
  `max_final`, `max_window`, `merge`, `min_final`, `min_window`,
  `key_on`, `key_assert`, `key_split`, `merge`, `unary`. Documentation
  for all operators are in `bytewax.operators` now.

- New operators can be added in Python, made by grouping existing
  operators. See `bytewax.dataflow` module docstring for more info.

- *Breaking change* Operators are now stand-alone functions; `import
  bytewax.operators as op` and use e.g. `op.map("step_id", upstream,
  lambda x: x + 1)`.

- *Breaking change* All operators must take a `step_id` argument now.

- *Breaking change* `fold` and `reduce` operators have been renamed to
  `fold_final` and `reduce_final`. They now only emit on EOF and are
  only for use in batch contexts.

- *Breaking change* `batch` operator renamed to `collect`, so as to
  not be confused with runtime batching. Behavior is unchanged.

- *Breaking change* `output` operator does not forward downstream its
  items. Add operators on the upstream handle instead.

- `next_batch` on input partitions can now return any `Iterable`, not
  just a `List`.

- `inspect` operator now has a default inspector that prints out items
  with the step ID.

- `collect_window` operator now can collect into `set`s and `dict`s.

- Adds a `get_fs_id` argument to `{Dir,File}Source` to allow handling
  non-identical files per worker.

- Adds a `TestingSource.EOF` and `TestingSource.ABORT` sentinel values
  you can use to test recovery.

- *Breaking change* Adds a `datetime` argument to
  `FixedPartitionSource.build_part`, `DynamicSource.build_part`,
  `StatefulSourcePartition.next_batch`, and
  `StatelessSourcePartition.next_batch`. You can now use this to
  update your `next_awake` time easily.

- *Breaking change* Window operators now emit `WindowMetadata` objects
  downstream. These objects can be used to introspect the open_time
  and close_time of windows. This changes the output type of windowing
  operators from: `(key, values)` to `(key, (metadata, values))`.

- *Breaking change* IO classes and connectors have been renamed to
  better reflect their semantics and match up with documentation.

- Moves the ability to start multiple Python processes with the
  `-p` or `--processes` to the `bytewax.testing` module.

- *Breaking change* `SimplePollingSource` moved from
  `bytewax.connectors.periodic` to `bytewax.inputs` since it is an
  input helper.

- `SimplePollingSource`'s `align_to` argument now works.

## v0.17.1

- Adds the `batch` operator to Dataflows. Calling `Dataflow.batch`
  will batch incoming items until either a batch size has been reached
  or a timeout has passed.

- Adds the `SimplePollingInput` source. Subclass this input source to
  periodically source new input for a dataflow.

- Re-adds GLIBC 2.27 builds to support older linux distributions.

## v0.17.0

### Changed

- *Breaking change* Recovery system re-worked. Kafka-based recovery
  removed. SQLite recovery file format changed; existing recovery DB
  files can not be used. See the module docstring for
  `bytewax.recovery` for how to use the new recovery system.

- Dataflow execution supports rescaling over resumes. You can now
  change the number of workers and still get proper execution and
  recovery.

- `epoch-interval` has been renamed to `snapshot-interval`

- The `list-parts` method of `PartitionedInput` has been changed to
  return a `List[str]` and should only reflect the available
  inputs that a given worker has access to. You no longer need
  to return the complete set of partitions for all workers.

- The `next` method of `StatefulSource` and `StatelessSource` has
  been changed to `next_batch` and should return a `List` of elements,
  or the empty list if there are no elements to return.

### Added

- Added new cli parameter `backup-interval`, to configure the length of
  time to wait before "garbage collecting" older recovery snapshots.

- Added `next_awake` to input classes, which can be used to schedule
  when the next call to `next_batch` should occur. Use `next_awake`
  instead of `time.sleep`.

- Added `bytewax.inputs.batcher_async` to bridge async Python libraries
  in Bytewax input sources.

- Added support for linux/aarch64 and linux/armv7 platforms.

### Removed

- `KafkaRecoveryConfig` has been removed as a recovery store.

## v0.16.2

- Add support for Windows builds - thanks @zzl221000!
- Adds a CSVInput subclass of FileInput

## v0.16.1

- Add a cooldown for activating workers to reduce CPU consumption.
- Add support for Python 3.11.

## v0.16.0

- *Breaking change* Reworked the execution model. `run_main` and `cluster_main`
  have been moved to `bytewax.testing` as they are only supposed to be used
  when testing or prototyping.
  Production dataflows should be ran by calling the `bytewax.run`
  module with `python -m bytewax.run <dataflow-path>:<dataflow-name>`.
  See `python -m bytewax.run -h` for all the possible options.
  The functionality offered by `spawn_cluster` are now only offered by the
  `bytewax.run` script, so `spawn_cluster` was removed.

- *Breaking change* `{Sliding,Tumbling}Window.start_at` has been
  renamed to `align_to` and both now require that argument. It's not
  possible to recover windowing operators without it.

- Fixes bugs with windows not closing properly.

- Fixes an issue with SQLite-based recovery. Previously you'd always
  get an "interleaved executions" panic whenever you resumed a cluster
  after the first time.

- Add `SessionWindow` for windowing operators.

- Add `SlidingWindow` for windowing operators.

- *Breaking change* Rename `TumblingWindowConfig` to `TumblingWindow`

- Add `filter_map` operator.

- *Breaking change* New partition-based input and output API. This
  removes `ManualInputConfig` and `ManualOutputConfig`. See
  `bytewax.inputs` and `bytewax.outputs` for more info.

- *Breaking change* `Dataflow.capture` operator is renamed to
  `Dataflow.output`.

- *Breaking change* `KafkaInputConfig` and `KafkaOutputConfig` have
  been moved to `bytewax.connectors.kafka.KafkaInput` and
  `bytewax.connectors.kafka.KafkaOutput`.

- *Deprecation warning* The `KafkaRecovery` store is being deprecated
  in favor of `SqliteRecoveryConfig`, and will be removed in a future
  release.

## 0.15.0

- *Breaking change* Fixes issue with multi-worker recovery. If the
  cluster crashed before all workers had completed their first epoch,
  the cluster would resume from the incorrect position. This requires
  a change to the recovery store. You cannot resume from recovery data
  written with an older version.

## 0.14.0

- Dataflow continuation now works. If you run a dataflow over a finite
  input, all state will be persisted via recovery so if you re-run the
  same dataflow pointing at the same input, but with more data
  appended at the end, it will correctly continue processing from the
  previous end-of-stream.

- Fixes issue with multi-worker recovery. Previously resume data was
  being routed to the wrong worker so state would be missing.

- *Breaking change* The above two changes require that the recovery
  format has been changed for all recovery stores. You cannot resume
  from recovery data written with an older version.

- Adds an introspection web server to dataflow workers.

- Adds `collect_window` operator.

## 0.13.1

- Added Google Colab support.

## 0.13.0

- Added tracing instrumentation and configurations for tracing backends.

## 0.12.0

- Fixes bug where window is never closed if recovery occurs after last
  item but before window close.

- Recovery logging is reduced.

- *Breaking change* Recovery format has been changed for all recovery stores.
  You cannot resume from recovery data written with an older version.

- Adds a `DynamoDB` and `Bigquery` output connector.

## 0.11.2

- Performance improvements.

- Support SASL and SSL for `bytewax.inputs.KafkaInputConfig`.


## 0.11.1

- KafkaInputConfig now accepts additional properties. See
  `bytewax.inputs.KafkaInputConfig`.

- Support for a pre-built Kafka output component. See
  `bytewax.outputs.KafkaOutputConfig`.

## 0.11.0

- Added the `fold_window` operator, works like `reduce_window` but allows
  the user to build the initial accumulator for each key in a `builder` function.

- Output is no longer specified using an `output_builder` for the
  entire dataflow, but you supply an "output config" per capture. See
  `bytewax.outputs` for more info.

- Input is no longer specified on the execution entry point (like
  `run_main`), it is instead using the `Dataflow.input` operator.

- Epochs are no longer user-facing as part of the input system. Any
  custom Python-based input components you write just need to be
  iterators and emit items. Recovery snapshots and backups now happen
  periodically, defaulting to every 10 seconds.

- Recovery format has been changed for all recovery stores. You cannot
  resume from recovery data written with an older version.

- The `reduce_epoch` operator has been replaced with
  `reduce_window`. It takes a "clock" and a "windower" to define the
  kind of aggregation you want to do.

- `run` and `run_cluster` have been removed and the remaining
  execution entry points moved into `bytewax.execution`. You can now
  get similar prototyping functionality with
  `bytewax.execution.run_main` and `bytewax.execution.spawn_cluster`
  using `Testing{Input,Output}Config`s.

- `Dataflow` has been moved into `bytewax.dataflow.Dataflow`.

## 0.10.0

- Input is no longer specified using an `input_builder`, but now an
  `input_config` which allows you to use pre-built input
  components. See `bytewax.inputs` for more info.

- Preliminary support for a pre-built Kafka input component. See
  `bytewax.inputs.KafkaInputConfig`.

- Keys used in the `(key, value)` 2-tuples to route data for stateful
  operators (like `stateful_map` and `reduce_epoch`) must now be
  strings. Because of this `bytewax.exhash` is no longer necessary and
  has been removed.

- Recovery format has been changed for all recovery stores. You cannot
  resume from recovery data written with an older version.

- Slight changes to `bytewax.recovery.RecoveryConfig` config options
  due to recovery system changes.

- `bytewax.run()` and `bytewax.run_cluster()` no longer take
  `recovery_config` as they don't support recovery.


## 0.9.0

- Adds `bytewax.AdvanceTo` and `bytewax.Emit` to control when processing
  happens.

- Adds `bytewax.run_main()` as a way to test input and output builders
  without starting a cluster.

- Adds a `bytewax.testing` module with helpers for testing.

- `bytewax.run_cluster()` and `bytewax.spawn_cluster()` now take a
  `mp_ctx` argument to allow you to change the multiprocessing
  behavior. E.g. from "fork" to "spawn". Defaults now to "spawn".

- Adds dataflow recovery capabilities. See `bytewax.recovery`.

- Stateful operators `bytewax.Dataflow.reduce()` and
  `bytewax.Dataflow.stateful_map()` now require a `step_id` argument
  to handle recovery.

- Execution entry points now take configuration arguments as kwargs.

## 0.8.0

- Capture operator no longer takes arguments. Items that flow through
  those points in the dataflow graph will be processed by the output
  handlers setup by each execution entry point. Every dataflow
  requires at least one capture.

- `Executor.build_and_run()` is replaced with four entry points for
  specific use cases:

  - `run()` for exeuction in the current process. It returns all
    captured items to the calling process for you. Use this for
    prototyping in notebooks and basic tests.

  - `run_cluster()` for execution on a temporary machine-local cluster
    that Bytewax coordinates for you. It returns all captured items to
    the calling process for you. Use this for notebook analysis where
    you need parallelism.

  - `spawn_cluster()` for starting a machine-local cluster with more
    control over input and output. Use this for standalone scripts
    where you might need partitioned input and output.

  - `cluster_main()` for starting a process that will participate in a
    cluster you are coordinating manually. Use this when starting a
    Kubernetes cluster.

- Adds `bytewax.parse` module to help with reading command line
  arguments and environment variables for the above entrypoints.

- Renames `bytewax.inp` to `bytewax.inputs`.