# Changelog

### v0.x.x - Unreleased

#### Breaking

- The experimental `Pledge` for dataflow parallelism has been renamed `FlowEvent` to be in line with:
  - `AsyncEvent` in Nim async frameworks
  - `cudaEvent_t` in CUDA
  - `cl_event` in OpenCL

  Renaming changes (a usage sketch follows at the end of this section):
  - `newPledge()` becomes `newFlowEvent()`
  - `fulfill()` becomes `trigger()`
  - `spawnDelayed()` becomes `spawnOnEvents()`
  - the `dependsOn` clause in `parallelFor` becomes `dependsOnEvent`

#### Features

- Added `isReady(Flowvar)`, which returns true if the result is immediately available, i.e. if `sync` would not block on that Flowvar.
- Added `syncScope:` to block until all tasks and their (recursive) descendants are completed.
- Dataflow parallelism can now be used with the C++ target.
- Weave as a background service (experimental). Weave can now be started on a dedicated thread and handle **jobs** from any thread. To do this, start Weave with `thr.runInBackground(Weave)`. Job-providing threads should call `setupSubmitterThread(Weave)`, and can then use `submit function(args...)` and `waitFor(PendingResult)` to have Weave work as a job system, as sketched below. Jobs are handled in FIFO order. Within a job, tasks can be spawned.
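To make the renaming concrete, here is a minimal sketch of the new dataflow names in use. Only the identifiers in the renaming table above are taken from this changelog; the `hello` task is hypothetical, the `spawnOnEvents ev, call` shape is assumed from the old `spawnDelayed` form, and `init`/`exit`/`syncRoot` are the usual Weave entry points.

```nim
import weave

proc hello(msg: string) =
  ## Hypothetical task used only for illustration.
  echo msg

init(Weave)

let ev = newFlowEvent()                  # was: newPledge()
# The task is created eagerly but only starts once `ev` fires.
spawnOnEvents ev, hello("event fired!")  # was: spawnDelayed
ev.trigger()                             # was: fulfill()

syncRoot(Weave)                          # wait for all tasks on the root thread
exit(Weave)
```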
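The background service can be sketched with the same caveats: the job `double` is hypothetical, the `Thread[void]` declaration is an assumption, and any readiness or teardown handshake the experimental API may require is elided.

```nim
# Build with --threads:on, as usual for Weave.
import weave

proc double(x: int): int =
  ## Hypothetical job used only for illustration.
  2 * x

# Run the Weave service on a dedicated thread.
var executor: Thread[void]
executor.runInBackground(Weave)

# Register the current thread as a job provider...
setupSubmitterThread(Weave)

# ...then submit jobs (handled in FIFO order) and wait for results.
let pending = submit double(21)
echo waitFor(pending)  # prints 42
```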
### v0.4.0 - April 2020 - "Bespoke"

#### Compatibility

Weave now targets Nim 1.2.0 instead of `devel`. This is the first Nim release that supports all requirements of Weave.

#### Features

Weave now provides an experimental "dataflow parallelism" mode. Dataflow parallelism is also known under the following names:
- Graph parallelism
- Stream parallelism
- Pipeline parallelism
- Data-driven parallelism

Concretely, this allows delaying tasks until a condition is met. This condition is called a `Pledge`. Programs can now create a "computation graph" or a pipeline of tasks ahead of time that depends on one or more `Pledge`s. For example, a game engine might want to associate a pipeline of transformations with each frame and, once the frame prerequisites are met, set the `Pledge` to `fulfilled`.

A `Pledge` can be combined with parallel loops, and programs can wait on specific iterations or even iteration ranges, for example to implement parallel video processing as soon as a subset of the frame is ready instead of waiting for the whole frame. This exposes significantly more parallelism opportunities.

Dataflow parallelism cannot be used with the C++ backend at the moment.

Weave now provides the 3 main parallelism models:
- Task Parallelism (spawn/sync)
- Data Parallelism (parallel for loop)
- Dataflow Parallelism (delayed tasks)

#### Performance

Weave's scalability has been carefully measured and improved. On matrix multiplication, the traditional benchmark used to rank the top 500 supercomputers of the world, Weave reaches a 17.5x speedup on an 18-core CPU, while the state-of-the-art Intel implementation using OpenMP reaches a 15.5x-16x speedup.

### v0.3.0 - January 2020 - "Beam me up!"

#### Breaking

- `sync(Weave)` has been renamed `syncRoot(Weave)` to highlight that it is only valid on the root task in the main thread. In particular, a procedure that uses `syncRoot` should not be called in a multithreaded section. In the future such changes will have a deprecation path, but the library is only 2 weeks old at the moment.

#### Features

- `parallelFor`, `parallelForStrided`, `parallelForStaged` and `parallelForStagedStrided` now support an "awaitable" statement to allow fine-grained sync.
  Fine-grained data dependencies are under research (for example, launching a task when the first 50 iterations of a 100-iteration loop are done), so "awaitable" may change to offer a unified syntax for delayed tasks that depend on a task, a whole loop, or a subset of it.
  When possible, it is recommended to use "awaitable" instead of `syncRoot()` to allow composable parallelism; `syncRoot()` can only be called in a serial section of the code.
- Weave can now be compiled with Microsoft Visual Studio in C++ mode.
- "LastVictim" and "LastThief" `WV_Target` policies have been added. The default is still "Random"; pass `-d:WV_Target=LastVictim` to explore performance on your workload.
- "StealEarly" has been implemented. The default is not to steal early; pass `-d:WV_StealEarly=2`, for example, to allow workers to initiate a steal request when 2 or fewer tasks are left in their queue.

#### Performance

Weave has been thoroughly tested and tuned on a state-of-the-art matrix multiplication implementation, against competing pure-assembly, hand-tuned BLAS implementations, to reach high-performance computing scalability standards.

3 cases can trigger loop splitting in Weave:
- `loadBalance(Weave)`
- sharing work with idle child threads
- incoming thieves

The first 2 were not working properly and resulted in pathological performance cases. This has been fixed.

Strided loop iteration rounding has been fixed, as has compilation with metrics.

Executing a loop now counts as a single task for the adaptive steal policy. This prevents short loops from hindering the steal-half strategy, as it depends on the number of tasks executed per steal-request interval.

#### Internals

- Weave uses explicit finite state machines in several places.
- The memory pool now has the same interface as malloc/free. In the past, freeing a block required passing a thread ID, as this avoided an expensive `getThreadID` syscall; the new solution uses assembly code to get the address of the current thread's thread-local storage as a unique thread ID.
- Weave's memory subsystem now supports LLVM AddressSanitizer to detect memory bugs. Spurious (?) errors from Nim and Weave were not removed and are left as a future task.

### v0.2.0 - December 2019 - "Overture"

Weave's `EventNotifier` has been rewritten and formally verified. Combined with the use of raw Linux futexes to work around a condition-variable bug in glibc and musl, Weave's backoff system is now deadlock-free. The backoff flag has been renamed from `WV_EnableBackoff` to `WV_Backoff` and is now enabled by default.

Weave now supports Windows.

### v0.1.0 - December 2019 - "Arabesques"

Initial release