# Getting Started with DaCe

DaCe is a Python library that enables optimizing code with ease, from running on a single core to a full supercomputer. With the power of data-centric transformations, it can automatically map code for CPUs, GPUs, and FPGAs.

Let's get started with DaCe by importing it:

In [1]:
import dace

A data-centric program can be generated from several general-purpose and domain-specific languages. Our main frontend, however, is Python/numpy. To define a program, we take an existing function on numpy arrays and decorate it with `@dace.program`:

In [2]:
@dace.program
def getstarted(A):
    return A + A

Running our DaCe program prints several outputs and a prompt listing the available transformations we can apply. For the first step, we opt to apply none (press Enter) and proceed with compilation and running:

In [3]:
import numpy as np
a = np.random.rand(2, 3)
a

array([[0.74867876, 0.85403223, 0.16573784],
       [0.71994615, 0.29855314, 0.21483992]])

In [4]:
getstarted(a)

array([[1.49735752, 1.70806445, 0.33147568],
       [1.4398923 , 0.59710627, 0.42967985]])

The results are, as expected, `2*A`.

Now, let's inspect the intermediate representation of the data-centric program, its Stateful Dataflow Multigraph (SDFG):

In [5]:
getstarted.to_sdfg(a)

You can drag the handle at the bottom right to make the SDFG frame larger.

Notice the following four elements in the graph:

1. **State** (blue region): This is the control flow part of the application, represented as a state machine. Since there is no control flow in the data-centric representation of `A+A`, we see only one state encompassing the computation.
2. **Arrays** (circular nodes) and **Memlets** (arrows): These nodes represent disjoint N-dimensional memory regions (similar to numpy `ndarray`s), and the edges represent data that is moved throughout the state. Hovering over a memlet will show more information about the subset being moved.
3. **Tasklets** (octagon): These nodes represent the computational parts of the graph. Zooming into one will show its code (an addition operation in this case). Tasklets act as pure functions that can only work with the data coming into or out of their **connectors** (cyan circles on the node).
4. **Maps** (trapezoid): Anything enclosed between these two nodes (the map *scope*) is replicated as many times as specified on the node (in our case, `2*3` times). This creates parametric parallelism in the graph. Maps can be nested within each other for efficient parallelization and distribution of work.
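
To make the roles of maps and tasklets concrete, here is a minimal plain-Python sketch of what the graph above expresses. It is not how DaCe executes the program; the names `add_tasklet` and `run_map` are hypothetical and exist only for illustration.

```python
from itertools import product


def add_tasklet(in1, in2):
    """Pure function: it sees only the values delivered to its connectors."""
    return in1 + in2


def run_map(ranges, A):
    """Replicate the tasklet once per point of the map's index space."""
    rows, cols = ranges
    out = [[0.0] * cols for _ in range(rows)]
    # Every (i, j) point is independent, so the iterations could run in parallel.
    for i, j in product(range(rows), range(cols)):
        # The memlets move A[i][j] into both inputs and the result into out[i][j].
        out[i][j] = add_tasklet(A[i][j], A[i][j])
    return out


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
result = run_map((2, 3), A)  # 2*3 = 6 tasklet invocations
print(result)
```

The tasklet never indexes `A` itself; only the surrounding scope decides which element it receives. This is the separation between computation and data movement that the SDFG makes explicit.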

Unfortunately (or fortunately in some cases), this graph is specialized for a specific size of array (as given to it), and will not work on other sizes. To compile a program that works with general sizes, we'll need to use symbolic sizes.

## Symbols

DaCe includes a symbolic math engine (extending SymPy) to support symbolic expressions for sizes, ranges, accesses, and more. 

Any number of symbols can be used throughout a computation. Defining a symbol is as easy as calling:

In [6]:
N = dace.symbol('N')

We can now use this symbol in any computation or definition. For example, annotating the types of our function from above will yield a version that works with any size:

In [7]:
@dace.program
def getstarted_sym(A: dace.float64[N, 2*N]):
    return A + A

In [8]:
getstarted_sym.to_sdfg()

If we compile this code, any array with a shape of the form `Nx2N` can be passed in, and the value of `N` will be inferred automatically from that shape when the function is invoked:

In [9]:
getstarted_sym(np.random.rand(100, 200))

array([[1.63216549, 1.26522381, 0.21606686, ..., 0.56988572, 1.12572538,
        1.72701877],
       [0.3829452 , 1.52386969, 0.82165197, ..., 1.3105662 , 1.19336786,
        1.43671993],
       [1.55277426, 1.50918516, 1.30665626, ..., 1.06562809, 1.53069088,
        1.10071159],
       ...,
       [0.60629736, 1.73240929, 1.26797782, ..., 1.72034476, 1.56691557,
        0.22283613],
       [1.96245486, 1.60559508, 0.02009914, ..., 1.40944583, 1.44560312,
        0.37804927],
       [1.17875002, 0.96963921, 0.28278902, ..., 1.56747976, 0.4616313 ,
        0.94999278]])
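
The inference step can be pictured with a small plain-Python sketch. This is only an illustration of the idea for the `[N, 2*N]` annotation used above, not DaCe's actual mechanism; `infer_n` is a hypothetical helper.

```python
def infer_n(shape):
    """Return N such that shape == (N, 2*N), or raise if no such N exists."""
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D array, got shape {shape}")
    n, cols = shape
    # The second dimension is constrained by the first through the symbol N.
    if cols != 2 * n:
        raise ValueError(f"shape {shape} does not match (N, 2*N)")
    return n


print(infer_n((100, 200)))  # the shape used in the call above
```

Under this sketch, a shape such as `(100, 150)` would be rejected, because no single value of `N` satisfies both dimensions.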

## Performance

Given our symbolic SDFG, we would rather not recompile it every time it is called. Instead, we can pre-compile the graph into a shared library (an .so/.dll file):

In [12]:
csdfg = getstarted_sym.compile()

A compiled SDFG, however, has to be invoked like an SDFG: with keyword arguments only, including explicit values for symbols such as `N`:

In [13]:
b = csdfg(A=np.random.rand(10,20), N=np.int32(10))

We can now compare the performance of the compiled code against numpy on large arrays:

In [14]:
tester = np.random.rand(2000, 4000)

In [15]:
%timeit tester + tester

12 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit csdfg(A=tester, N=np.int32(2000))

3.86 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
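
The `%timeit` magic is only available in IPython and Jupyter. Outside a notebook, a similar measurement can be sketched with the standard-library `timeit` module. The helper name `best_ms_per_loop` and the stand-in workload are assumptions made for illustration. Note that it reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit


def best_ms_per_loop(fn, repeat=7, number=100):
    """Best average wall-clock time per call of fn(), in milliseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number * 1e3


# Stand-in workload so the sketch runs without DaCe or numpy installed.
data = list(range(10_000))
elapsed = best_ms_per_loop(lambda: [x + x for x in data], repeat=3, number=10)
print(f"{elapsed:.3f} ms per loop")
```

In the setting of this tutorial, one would pass `lambda: tester + tester` and `lambda: csdfg(A=tester, N=np.int32(2000))` as `fn` to compare the two implementations.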


## Explicit Dataflow

One can specify explicit dataflow in DaCe using the `for i in dace.map[begin:end]:` syntax, and define tasklets manually using `with dace.tasklet:`. Here is a real-world example (Scattering Self-Energies) with an 8-dimensional parallel computation:

In [17]:
# Declaration of symbolic variables
Nkz, NE, Nqz, Nw, N3D, NA, NB, Norb = (
    dace.symbol(name)
    for name in ['Nkz', 'NE', 'Nqz', 'Nw', 'N3D', 'NA', 'NB', 'Norb'])


@dace.program
def sse_sigma(neigh_idx: dace.int32[NA, NB],
              dH: dace.complex128[NA, NB, N3D, Norb, Norb],
              G: dace.complex128[Nkz, NE, NA, Norb, Norb],
              D: dace.complex128[Nqz, Nw, NA, NB, N3D, N3D],
              Sigma: dace.complex128[Nkz, NE, NA, Norb, Norb]):

    # Declaration of Map scope
    for k, E, q, w, i, j, a, b in dace.map[0:Nkz, 0:NE, 0:Nqz, 0:Nw,
                                           0:N3D, 0:N3D, 0:NA, 0:NB]:
        dHG = G[k - q, E - w, neigh_idx[a, b]] @ dH[a, b, i]
        dHD = dH[a, b, j] * D[q, w, a, b, i, j]
        Sigma[k, E, a] += dHG @ dHD
        
sse_sigma.to_sdfg()