{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "# Loop Generation and Cache Blocking\n", "\n", "## Author: Ken Sible\n", "\n", "## The following module demonstrates loop generation and cache blocking (loop tiling).\n", "\n", "### NRPy+ Source Code for this module:\n", "1. [loop.py](../edit/loop.py); [\\[**tutorial**\\]](Tutorial-Loop_Generation_Cache_Blocking.ipynb) The loop.py script will generate a single or nested loop of arbitrary dimension in C, and has support for [cache blocking](https://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Table of Contents\n", "$$\\label{toc}$$\n", "\n", "1. [Step 1](#generation): Demonstrate Loop Generation\n", " 1. [Step 1.a](#simple): Generate a simple, three-dimensional loop with specified options\n", " 1. [Step 1.b](#arbitrary): Generate a nested loop of arbitrary dimension with exact parameters\n", "1. [Step 2](#exercise): Exercise (Loop Generation)\n", "1. [Step 3](#tiling): Demonstrate Cache Blocking\n", "1. [Step 4](#latex_pdf_output): $\\LaTeX$ PDF Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Step 1: Demonstrate Loop Generation \\[Back to [top](#toc)\\]\n", "$$\\label{generation}$$\n", "\n", "In the following section, we demonstrate single and nested loop generation in C using NRPy+.\n", "\n", "In [loop.py](../edit/loop.py), the following functions are implemented for loop generation:\n", "\n", "- ```loop(idx_var, lower_bound, upper_bound, increment, pragma, padding=\" \", interior=\"\", tile_size=\"\")```\n", " + ```idx_var```: index variable for the loop (```idxvar[0]```$\\Rightarrow$ outermost loop, ```idxvar[N - 1]```$\\Rightarrow$ innermost loop)\n", " + ```lower_bound```:   lower bound for ```idxvar``` or ```idxvar[i]```\n", " + ```upper_bound```:   upper bound for ```idxvar``` or ```idxvar[i]```\n", " + ```increment```:     increment for ```idxvar``` or ```idxvar[i]```\n", " + ```pragma```: OpenMP pragma for ```idxvar``` or ```idxvar[i]```\n", " + ```padding```: (*optional*) padding before a line (tab stop)\n", " + ```interior```:   (*optional*) interior of the loop\n", " + ```tile_size```:   (*optional*) tile size for cache blocking\n", "- ```simple_loop(options, interior)```\n", " + ```options```: options for loop generation\n", " + ```interior```: interior of the loop" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.858169Z", "iopub.status.busy": "2021-09-27T22:38:08.857589Z", "iopub.status.idle": "2021-09-27T22:38:08.859580Z", "shell.execute_reply": "2021-09-27T22:38:08.859913Z" } }, "outputs": [], "source": [ "from loop import loop, simple_loop # Import NRPy+ module for loop generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Step 1.a: `simple_loop()` \\[Back to [top](#toc)\\]\n", "$$\\label{simple}$$\n", "\n", "The `simple_loop()` function will generate a simple loop in C (for use inside of a function) with specified options." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.862174Z", "iopub.status.busy": "2021-09-27T22:38:08.861780Z", "iopub.status.idle": "2021-09-27T22:38:08.863864Z", "shell.execute_reply": "2021-09-27T22:38:08.864219Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " #pragma omp parallel for\n", " for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {\n", " for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {\n", " for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {\n", " // \n", " } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)\n", " } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)\n", " } // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)\n", "\n" ] } ], "source": [ "# 'AllPoints': loop over all points on a numerical grid, including ghost zones\n", "print(simple_loop('AllPoints', '// '))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.866556Z", "iopub.status.busy": "2021-09-27T22:38:08.866163Z", "iopub.status.idle": "2021-09-27T22:38:08.868349Z", "shell.execute_reply": "2021-09-27T22:38:08.867989Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " #pragma omp parallel for\n", " for (int i2 = NGHOSTS; i2 < NGHOSTS+Nxx2; i2++) {\n", " for (int i1 = NGHOSTS; i1 < NGHOSTS+Nxx1; i1++) {\n", " for (int i0 = NGHOSTS; i0 < NGHOSTS+Nxx0; i0++) {\n", " // \n", " } // END LOOP: for (int i0 = NGHOSTS; i0 < NGHOSTS+Nxx0; i0++)\n", " } // END LOOP: for (int i1 = NGHOSTS; i1 < NGHOSTS+Nxx1; i1++)\n", " } // END LOOP: for (int i2 = NGHOSTS; i2 < NGHOSTS+Nxx2; i2++)\n", "\n" ] } ], "source": [ "# 'InteriorPoints': loop over the interior of a numerical grid, i.e. exclude ghost zones\n", "print(simple_loop('InteriorPoints', '// '))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.870877Z", "iopub.status.busy": "2021-09-27T22:38:08.870489Z", "iopub.status.idle": "2021-09-27T22:38:08.872593Z", "shell.execute_reply": "2021-09-27T22:38:08.872243Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " #pragma omp parallel for\n", " for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {\n", " const REAL xx2 = xx[2][i2];\n", " for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {\n", " const REAL xx1 = xx[1][i1];\n", " for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {\n", " const REAL xx0 = xx[0][i0];\n", " // \n", " } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)\n", " } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)\n", " } // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)\n", "\n" ] } ], "source": [ "# 'Read_xxs': read the xx[3][:] 1D coordinate arrays, as some interior dependency exists\n", "print(simple_loop('AllPoints Read_xxs', '// '))\n", "#HERE" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.874883Z", "iopub.status.busy": "2021-09-27T22:38:08.873948Z", "iopub.status.idle": "2021-09-27T22:38:08.876942Z", "shell.execute_reply": "2021-09-27T22:38:08.876549Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " #pragma omp parallel for\n", " for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {\n", " #include \"rfm_files/rfm_struct__read2.h\"\n", " for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {\n", " #include \"rfm_files/rfm_struct__read1.h\"\n", " for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {\n", " #include \"rfm_files/rfm_struct__read0.h\"\n", " // \n", " } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)\n", " } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)\n", " } // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)\n", "\n" ] } ], "source": [ "# 'enable_rfm_precompute': enable pre-computation of reference metric\n", "print(simple_loop('AllPoints enable_rfm_precompute', '// '))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.879492Z", "iopub.status.busy": "2021-09-27T22:38:08.879101Z", "iopub.status.idle": "2021-09-27T22:38:08.881453Z", "shell.execute_reply": "2021-09-27T22:38:08.881056Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " #pragma omp parallel for\n", " for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {\n", " #include \"rfm_files/rfm_struct__SIMD_outer_read2.h\"\n", " for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {\n", " #include \"rfm_files/rfm_struct__SIMD_outer_read1.h\"\n", " for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0 += SIMD_width) {\n", " #include \"rfm_files/rfm_struct__SIMD_inner_read0.h\"\n", " // \n", " } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0 += SIMD_width)\n", " } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)\n", " } // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)\n", "\n" ] } ], "source": [ "# 'enable_SIMD': enable SIMD support (https://en.wikipedia.org/wiki/SIMD)\n", "print(simple_loop('AllPoints enable_rfm_precompute enable_SIMD', '// '))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.884785Z", "iopub.status.busy": "2021-09-27T22:38:08.884397Z", "iopub.status.idle": "2021-09-27T22:38:08.886270Z", "shell.execute_reply": "2021-09-27T22:38:08.886616Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {\n", " for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {\n", " for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {\n", " // \n", " } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)\n", " } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)\n", " } // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)\n", "\n" ] } ], "source": [ "# 'DisableOpenMP': disable OpenMP parallelization (https://en.wikipedia.org/wiki/OpenMP)\n", "print(simple_loop('AllPoints DisableOpenMP', '// '))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Step 1.b: `loop()` \\[Back to [top](#toc)\\]\n", "$$\\label{arbitrary}$$\n", "\n", "The `loop()` function will generate a nested loop of arbitrary dimension in C with exact parameters (i.e. more control than `simple_loop()`)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.889722Z", "iopub.status.busy": "2021-09-27T22:38:08.889335Z", "iopub.status.idle": "2021-09-27T22:38:08.891213Z", "shell.execute_reply": "2021-09-27T22:38:08.891539Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " for (int i = 0; i < N; i++) {\n", " // \n", " } // END LOOP: for (int i = 0; i < N; i++)\n", "\n" ] } ], "source": [ "# Generate a one-dimensional loop over i from 0 to N, stepping by 1, with specified loop body\n", "print(loop('i', '0', 'N', '1', '', interior='// '))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.894478Z", "iopub.status.busy": "2021-09-27T22:38:08.894075Z", "iopub.status.idle": "2021-09-27T22:38:08.896371Z", "shell.execute_reply": "2021-09-27T22:38:08.895966Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " for (int i = 0; i < Nx; i++) {\n", " for (int j = 0; j < Ny; j++) {\n", " // \n", " } // END LOOP: for (int j = 0; j < Ny; j++)\n", " } // END LOOP: for (int i = 0; i < Nx; i++)\n", "\n" ] } ], "source": [ "# Generate a two-dimensional loop over i from 0 to Nx, stepping by 1,\n", "# and j from 0 to Ny, stepping by 1, with specified loop body\n", "print(loop(['i', 'j'], ['0', '0'], ['Nx', 'Ny'], ['1', '1'], ['', ''], interior='// '))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Step 2: Exercise (Loop Generation) \\[Back to [top](#toc)\\]\n", "$$\\label{exercise}$$\n", "\n", "**Goal:** Reproduce the following output using the loop generation infrastructure in NRPy+ ([solution](Tutorial-Loop_Generation_Cache_Blocking_soln.ipynb))\n", "\n", "```\n", "for (int n = 0; n < (Nt - 1); n++) {\n", " u[n][0] = u[n][Nx] = 0;\n", " for (int k = 1; k < (Nx - 1); k++) {\n", " u[n + 1][k] = u[n][k] + r*(u[n][k + 1] - 2*u[n][k] + u[n][k - 1]);\n", " } // END LOOP: for (int k = 1; k < (Nx - 1); k++)\n", " for (int k = 0; k < Nx; k++) {\n", " u[n][k] = u[n + 1][k];\n", " } // END LOOP: for (int k = 0; k < Nx; k++)\n", "} // END LOOP: for (int n = 0; n < (Nt - 1); n++)\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.898920Z", "iopub.status.busy": "2021-09-27T22:38:08.898536Z", "iopub.status.idle": "2021-09-27T22:38:08.900590Z", "shell.execute_reply": "2021-09-27T22:38:08.900198Z" } }, "outputs": [], "source": [ "# Write Solution Here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Step 3: Demonstrate Cache Blocking \\[Back to [top](#toc)\\]\n", "$$\\label{tiling}$$\n", "\n", "In the following section, we demonstrate cache blocking (loop tiling) using NRPy+. The advantage of cache blocking is improved [spatial locality](https://en.wikipedia.org/wiki/Locality_of_reference) by blocking or tiling a data structure to fit inside cache. We minimize the number of cache misses that occur (reduce memory bandwidth pressure) by reusing the subset of our data structure that was cached rather than accessing main memory ([source](https://software.intel.com/en-us/articles/cache-blocking-techniques)). The following example of matrix-vector multiplication will demonstrate cache blocking using NRPy+." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.904360Z", "iopub.status.busy": "2021-09-27T22:38:08.903963Z", "iopub.status.idle": "2021-09-27T22:38:08.906216Z", "shell.execute_reply": "2021-09-27T22:38:08.905858Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "// Untiled Loop\n", " for (int i = 0; i < N; i++) {\n", " for (int j = 0; j < N; j++) {\n", " c[i] += a[i][j] * b[j];\n", " } // END LOOP: for (int j = 0; j < N; j++)\n", " } // END LOOP: for (int i = 0; i < N; i++)\n", "\n", "// Tiled Loop\n", "#define MIN(x, y) (((x) < (y)) ? (x) : (y))\n", " for (int iB = 0; iB < N; iB += 2) {\n", " for (int jB = 0; jB < N; jB += 2) {\n", " for (int i = iB; i < MIN(N, iB + 2); i++) {\n", " for (int j = jB; j < MIN(N, jB + 2); j++) {\n", " c[i] += a[i][j] * b[j];\n", " } // END LOOP: for (int j = jB; j < MIN(N, jB + 2); j++)\n", " } // END LOOP: for (int i = iB; i < MIN(N, iB + 2); i++)\n", " } // END LOOP: for (int jB = 0; jB < N; jB += 2)\n", " } // END LOOP: for (int iB = 0; iB < N; iB += 2)\n", "\n" ] } ], "source": [ "print('// Untiled Loop')\n", "print(loop(['i', 'j'], ['0', '0'], ['N', 'N'], ['1', '1'], ['', ''], interior='c[i] += a[i][j] * b[j];'))\n", "\n", "print('// Tiled Loop\\n#define MIN(x, y) (((x) < (y)) ? (x) : (y))')\n", "print(loop(['i', 'j'], ['0', '0'], ['N', 'N'], ['1', '1'], ['', ''], interior='c[i] += a[i][j] * b[j];', tile_size=['2', '2']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order of Memory Access (Untiled): `(a[0][0], b[0]), (a[0][1], b[1]), (a[0][2], b[2]), (a[0][3], b[3]), ...`\n", "\n", "Order of Memory Access (Tiled):     `(a[0][0], b[0]), (a[0][1], b[1]), (a[1][0], b[0]), (a[1][1], b[1]), ...`\n", "\n", "We should remark that cache blocking might not be a valid loop optimization whenever the order of memory access affects the resulting output. However, that potential issue does not occur in our example with matrix-vector multiplication. Moreover, the block or tile size will depend on the CPU architecture, and hence experimentation is required to determine the optimal size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Step 4: Output this notebook to $\\LaTeX$-formatted PDF file \\[Back to [top](#toc)\\]\n", "$$\\label{latex_pdf_output}$$\n", "\n", "The following code cell converts this Jupyter notebook into a proper, clickable $\\LaTeX$-formatted PDF file. After the cell is successfully run, the generated PDF may be found in the root NRPy+ tutorial directory, with filename\n", "[Tutorial-Loop_Generation_Cache_Blocking.pdf](Tutorial-Loop_Generation_Cache_Blocking.pdf) (Note that clicking on this link may not work; you may need to open the PDF file through another means.)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2021-09-27T22:38:08.908842Z", "iopub.status.busy": "2021-09-27T22:38:08.908442Z", "iopub.status.idle": "2021-09-27T22:38:11.334602Z", "shell.execute_reply": "2021-09-27T22:38:11.334246Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created Tutorial-Loop_Generation_Cache_Blocking.tex, and compiled LaTeX\n", " file to PDF file Tutorial-Loop_Generation_Cache_Blocking.pdf\n" ] } ], "source": [ "import cmdline_helper as cmd # NRPy+: Multi-platform Python command-line interface\n", "cmd.output_Jupyter_notebook_to_LaTeXed_PDF(\"Tutorial-Loop_Generation_Cache_Blocking\")" ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0rc2" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3 }, "nbformat": 4, "nbformat_minor": 2 }