# Tutorial: Profiling DDlog Programs

**TODO**: Add a running example.

DDlog's profiling features are designed to help the programmer identify parts of the program that use the most CPU and memory. DDlog supports two profilers:

- DDlog's **self-profiler** runs as a thread inside the DDlog program and generates textual memory and CPU profiles that can be output at any point during the runtime of the program.

- [**DDShow**](https://github.com/Kixiron/ddshow) is a standalone profiler that ingests performance-related events from DDlog, or from any other program using Timely Dataflow, and produces a visual profile that can be explored in a browser, as well as a textual profile similar to the one generated by the self-profiler.

DDShow offers a richer (and growing) set of features and is the preferred development-time profiler. The self-profiler, on the other hand, is more resource-efficient and can even be enabled in a production environment to troubleshoot performance issues.

The user can enable one of the two profilers when starting a DDlog program, either from the CLI or via the Rust, C, or Java API.

## Self-profiler

To enable self-profiling, pass the `--self-profiler` flag to the DDlog-generated CLI executable:

```
./tutorial_ddlog/target/release/tutorial_cli --self-profiler < tutorial.dat
```

In Rust, use the `run_with_config()` function:

```
// Create a DDlog configuration with 1 worker thread and with the
// self-profiling feature enabled.
let config = Config::new()
    .with_timely_workers(1)
    .with_profiling_config(ProfilingConfig::SelfProfiling {
        // Directory to store profiles under or `None` for current directory.
        profile_directory: None
    });

// Instantiate the DDlog program with this configuration.
let (hddlog, init_state) = tutorial_ddlog::run_with_config(config, false)?;
```

In C, use the `ddlog_run_with_config()` function:

```
TODO
```

In Java, use the `DDlogConfig` class to create a DDlog configuration with self-profiling enabled:

```
import ddlogapi.DDlogAPI;
import ddlogapi.DDlogConfig;

DDlogConfig config = new DDlogConfig(2);
config.setProfilingConfig(DDlogConfig.selfProfiling());
this.api = new DDlogAPI(config, false);
```

When self-profiling is enabled, DDlog supports several profiling commands (also available through the Rust, C, and Java APIs):

1. `profile cpu on/off;` - enables/disables recording of CPU usage info in addition to memory profiling. CPU profiling is not enabled by default, as it can slow down the program somewhat, especially for large programs that handle many small updates.

1. `profile change on/off;` - enables/disables recording of the number of insertions and deletions per arrangement, used to construct the arrangement change profile (see below).

1. `profile;` - dumps CPU and memory profiles to an HTML file. This command can be invoked at any time during program execution to output performance data accumulated so far.

The profile attributes the program's CPU and memory usage to individual DDlog operators, highlighting program hotspots and informing potential optimizations. As an example, consider the following rule that computes all pairs of values `(n1,n2)` such that `n1`, `n2`, and their product all belong to the `Numbers` relation, and additionally `n1` is not divisible by `5` and `n2` is not divisible by `2`:

```
Pairs(n1, n2) :- Numbers(n1), Numbers(n2), Numbers(n1*n2), n1 % 5 != 0, n2 % 2 != 0.
```

We run the above program, initializing the `Numbers` relation to contain all numbers in the range `[0..200]`. The following screenshot shows a fragment of the resulting CPU profile. The top record in the profile, titled "Dataflow", shows the amount of CPU time used to execute the entire dataflow, in this case `380,792` microseconds.
The next several profile records show that most of this time is spent evaluating the above rule, and so this is where we should focus our optimization effort. In the screenshot we expanded the operator at the top of the profile. This operator took `179,262` microseconds, or nearly 50% of the entire program CPU time.

![CPU profile fragment](cpu1.png)

In the expanded view we see the source code location of the operator in question. DDlog evaluates the rule by first computing all pairs `(n1, n2)` as a Cartesian product of the `Numbers` relation with itself. Next it joins the resulting relation with `Numbers` in order to restrict it to only those pairs whose product also belongs to `Numbers`. To this end, it must first index ("arrange" in the differential dataflow terminology) this intermediate relation comprised of pairs `(n1, n2)`, using the value of expression `n1*n2` as a key. This is what the operator at the top of the profile does. Since the relation is large, indexing it is an expensive operation. We can confirm this by looking at the arrangement size profile section on the same HTML page:

![Arrangement size profile fragment](size1.png)

As expected, the size of the intermediate relation is `200 x 200 = 40,000`. This suggests a simple optimization that prunes the sets of candidate values of `n1` and `n2` early on:

```
Pairs(n1, n2) :- Numbers(n1), n1 % 5 != 0, Numbers(n2), n2 % 2 != 0, Numbers(n1*n2).
```

This reduces the arrangement size by a factor of `2.5`:

![Arrangement size profile fragment](size2.png)

and the time to compute the arrangement by a factor of `3`:

![CPU profile fragment](cpu2.png)

Consider the next operator in the profile, which computes the Cartesian product of the set of `n1` such that `n1 % 5 != 0` with `Numbers(n2)`:

![CPU profile fragment](cpu3.png)

We can reduce this time by first computing the set of `n2`, since `n2 % 2 != 0` prunes more values than `n1 % 5 != 0`:

```
Pairs(n1, n2) :- Numbers(n2), n2 % 2 != 0, Numbers(n1), n1 % 5 != 0, Numbers(n1*n2).
```

Indeed, this shaves a few more milliseconds off the program's CPU usage:

![CPU profile fragment](cpu4.png)

Internally, DDlog uses the [Differential Dataflow library (DD)](https://github.com/timelydataflow/differential-dataflow/) for incremental computation. Each DDlog operator gets translated into one or more lower-level DD operators assembled in a dataflow graph. In fact, each entry in a DDlog profile corresponds to one DD operator. The name of this operator is shown in the last column of the profile table. Since a DDlog operator can map into multiple DD operators, there can be multiple profile entries with the same description that differ only in the DD operator name.

Next, we take a closer look at the arrangement size profile. Arrangements are responsible for the majority of the memory consumption of a DDlog program. Several DDlog operators expect arranged inputs. We already saw how a prefix of a rule must be arranged in order to join it with a relation. This is the first arrangement in the profile shown below. In addition, the relation to join with, i.e., `Numbers`, must be arranged too. This arrangement must be indexed by the value of field `n` of the relation. Arranged relations in DDlog are described by **patterns**, e.g., `(Numbers{.n=_0}: Numbers)`, where numbered variables, e.g., `_0`, indicate one or more fields used as the key to index the relation.
![Numbers relation arranged by n](size-numbers-1.png)

The same relation may be arranged multiple times using different keys required by different join operations. For example, computing the Cartesian product of the `Numbers` relation with itself requires indexing the relation by an empty set of fields (a Cartesian product is a special case of the join operator with an empty join key). In fact, this arrangement is used twice, in the first and the second argument of the product, which is why we see two separate code fragments in the corresponding profile entry:

![Numbers relation arranged by ()](size-numbers-2.png)

There are several other kinds of arrangements in the profile, e.g., the `group_by` operator arranges its input using the group-by variables as a key. Each profile entry contains a hyperlink to documentation that describes the meaning of the associated operator.

Note that, in general, arrangement size can be larger than the number of elements in the underlying relation, since arrangements in Differential Dataflow are stored as temporally indexed data structures where the same record can occur multiple times with different weights. In addition to the arrangement size profile, the profiler outputs the peak arrangement size profile, which records the largest number of elements observed in each arrangement.

If change profiling was enabled for some parts of the execution, the profiler also outputs the "Counts of changes" profile. Unlike the arrangement size profile, which tracks the number of records in each arrangement, this profile shows the amount of churn. For example, adding one record and deleting one record will show up as two changes in the change profile, but will cancel out in the size profile. Identifying relations that see the most churn helps to understand the performance of the program, as such relations are responsible for most of the recomputation performed by DDlog.
In a typical profiling session, the user identifies expensive operators in the CPU profile and looks up the relations that are inputs to these operators in the change profile to identify the ones with high churn.

When change profiling is disabled, the recording stops, but the previously accumulated profile is preserved. By selectively enabling change profiling for a subset of transactions, the user can focus their analysis on specific parts of the program. For example, they may not be interested in the number of changes during the initial population of input relations, but may want to make sure that subsequent small incremental input updates do not cause excessive churn in derived relations.

### Self-profiler API

The self-profiler API allows retrieving individual profiles (the CPU profile, arrangement size and peak size profiles, and the change profile) in a format suitable for automatic processing. For instance, the developer may want to aggregate profiles obtained across multiple program runs with different datasets to determine the most expensive operators across many workloads. To this end, they may dump individual profiles to JSON files and later process them with a separate tool.

This API is available in Rust, C, and Java. The Rust version of the API, available via the `DDlogProfiling` trait, returns a strongly typed representation of the profiles, which can be converted to JSON using the `serde_json` crate (see `serde_json::to_string`):

```
pub trait DDlogProfiling {
    ...
    fn arrangement_size_profile(&self) -> Result<Vec<SizeProfileRecord>, String>;
    fn peak_arrangement_size_profile(&self) -> Result<Vec<SizeProfileRecord>, String>;
    fn change_profile(&self) -> Result<Option<Vec<SizeProfileRecord>>, String>;
    fn cpu_profile(&self) -> Result<Option<CpuProfile>, String>;
}
```

The C and Java APIs perform the JSON conversion internally and return the resulting JSON profile as a string:

```
extern char* ddlog_arrangement_size_profile(ddlog_prog prog);
extern char* ddlog_peak_arrangement_size_profile(ddlog_prog prog);
extern char* ddlog_change_profile(ddlog_prog prog);
extern char* ddlog_cpu_profile(ddlog_prog prog);
```

## DDShow-based profiling

To use DDShow, first install it using `cargo`:

```
cargo install --git https://github.com/Kixiron/ddshow
```

DDShow supports two profiling modes:

1. Live profiling, where the target program streams performance-related events to DDShow via network sockets.

1. Post-mortem profiling, where the target program writes profiling events to disk and DDShow later scans the recording and constructs a program profile.

### Live profiling

The CLI executable supports several options that configure DDShow-based profiling (use `--help` for a complete list). The easiest way to enable live profiling is with the `--ddshow` switch, which tells DDlog to start DDShow in the background and connect to it. By default, DDlog only records Timely Dataflow events, which is sufficient to generate the CPU profile of the program.
`--profile-differential` additionally enables the recording of Differential Dataflow events, used to generate the arrangement size profile:

```
./tutorial_ddlog/target/release/tutorial_cli --ddshow --profile-differential < tutorial.dat > tutorial.dump
[0s] Waiting for 1 connection on 127.0.0.1:51317: connected to 1 trace source
[0/1, 0s] ⠂ Replaying timely events: 0 events, 0/s
[0/1, 0s] ⡀ Replaying timely events: 13361 events, 47630/s
```

Alternatively, you can start DDShow manually in a separate shell:

```
# Shell 1: start DDShow.
ddshow --stream-encoding rkyv --differential --connections 2 --disable-timeline
[5s] ⠁ Waiting for 2 connections on 127.0.0.1:51317: connected to 0/2 sockets

# Shell 2: start the target program.
./tutorial_ddlog/target/release/tutorial_cli -w 2 --profile-timely --profile-differential < tutorial.dat
```

Note the DDShow command-line arguments used above:

- `--stream-encoding rkyv` - use the `rkyv` format for event traces. This is required when debugging DDlog applications.

- `--differential` - enable reading of the Differential Dataflow trace for arrangement size profiling.

- `--connections` - the number of timely worker threads in the target application. This argument must match the `-w` DDlog CLI flag.

- `--disable-timeline` - disable event timeline generation, which can be very expensive and slow to render for non-trivial programs.
### Post-mortem profiling

Use the `--timely-trace-dir` and `--differential-trace-dir` options to record performance events for post-mortem profiling:

```
./tutorial_ddlog/target/release/tutorial_cli --timely-trace-dir "dd_trace" --differential-trace-dir "dd_trace" < tutorial.dat
```

In Rust:

```
TODO
```

In C:

```
TODO
```

In Java:

```
import ddlogapi.DDlogAPI;
import ddlogapi.DDlogConfig;

DDlogConfig config = new DDlogConfig(1);
config.setProfilingConfig(
    DDlogConfig.timelyProfiling(
        DDlogConfig.logToDisk("timely_trace"),
        DDlogConfig.logDisabled(),
        DDlogConfig.logToDisk("timely_trace"))
);
this.api = new DDlogAPI(config, true);
```

**NOTE**: DDShow currently requires using the same directory for timely and differential traces.

Run `ddshow` to analyze the recorded event traces:

```
ddshow --replay-logs dd_trace/
[0s] Loading Timely replay from dd_trace/: loaded 2 replay files
Press enter to finish loading trace data (this will cause data to not be fully processed)...
[0/1, 5s] Finished replaying timely events: 36717 events, 17073/s
Processing data... done!
Wrote report file to report.txt
Wrote output graph to file:////home/lryzhyk/projects/differential-datalog/test/datalog_tests/dataflow-graph/graph.html
```

When running in either live or post-mortem mode, DDShow generates a text report in `report.txt` and an HTML profile in `dataflow-graph/graph.html`.