{ "cells": [ { "cell_type": "markdown", "id": "a71215ab", "metadata": {}, "source": [ "Installing (updating) the following libraries for your Sagemaker\n", "instance." ] }, { "cell_type": "code", "execution_count": null, "id": "17107283", "metadata": {}, "outputs": [], "source": [ "!pip install .. # installing d2l\n" ] }, { "cell_type": "markdown", "id": "7a4cb8fa", "metadata": { "origin_pos": 0 }, "source": [ "# 自动并行\n", ":label:`sec_auto_para`\n", "\n", "深度学习框架(例如,MxNet、飞桨和PyTorch)会在后端自动构建计算图。利用计算图,系统可以了解所有依赖关系,并且可以选择性地并行执行多个不相互依赖的任务以提高速度。例如, :numref:`sec_async`中的 :numref:`fig_asyncgraph`独立初始化两个变量。因此,系统可以选择并行执行它们。\n", "\n", "通常情况下单个操作符将使用所有CPU或单个GPU上的所有计算资源。例如,即使在一台机器上有多个CPU处理器,`dot`操作符也将使用所有CPU上的所有核心(和线程)。这样的行为同样适用于单个GPU。因此,并行化对单设备计算机来说并不是很有用,而并行化对于多个设备就很重要了。虽然并行化通常应用在多个GPU之间,但增加本地CPU以后还将提高少许性能。例如, :cite:`Hadjis.Zhang.Mitliagkas.ea.2016`则把结合GPU和CPU的训练应用到计算机视觉模型中。借助自动并行化框架的便利性,我们可以依靠几行Python代码实现相同的目标。对自动并行计算的讨论主要集中在使用CPU和GPU的并行计算上,以及计算和通信的并行化内容。\n", "\n", "请注意,本节中的实验至少需要两个GPU来运行。\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "8c944f1a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:11:59.505418Z", "iopub.status.busy": "2023-08-18T07:11:59.504686Z", "iopub.status.idle": "2023-08-18T07:12:02.958789Z", "shell.execute_reply": "2023-08-18T07:12:02.957933Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import torch\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "4c8e7569", "metadata": { "origin_pos": 4 }, "source": [ "## 基于GPU的并行计算\n", "\n", "从定义一个具有参考性的用于测试的工作负载开始:下面的`run`函数将执行$10$次*矩阵-矩阵*乘法时需要使用的数据分配到两个变量(`x_gpu1`和`x_gpu2`)中,这两个变量分别位于选择的不同设备上。\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "5e7b039a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:02.987012Z", "iopub.status.busy": "2023-08-18T07:12:02.986327Z", "iopub.status.idle": "2023-08-18T07:12:05.221346Z", "shell.execute_reply": "2023-08-18T07:12:05.220262Z" }, "origin_pos": 6, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "devices = d2l.try_all_gpus()\n", "def run(x):\n", " return [x.mm(x) for _ in range(50)]\n", "\n", "x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])\n", "x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])" ] }, { "cell_type": "markdown", "id": "c2f2ffe6", "metadata": { "origin_pos": 9, "tab": [ "pytorch" ] }, "source": [ "现在使用函数来处理数据。通过在测量之前需要预热设备(对设备执行一次传递)来确保缓存的作用不影响最终的结果。`torch.cuda.synchronize()`函数将会等待一个CUDA设备上的所有流中的所有核心的计算完成。函数接受一个`device`参数,代表是哪个设备需要同步。如果device参数是`None`(默认值),它将使用`current_device()`找出的当前设备。\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "970d8c24", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:05.225646Z", "iopub.status.busy": "2023-08-18T07:12:05.224864Z", "iopub.status.idle": "2023-08-18T07:12:07.664593Z", "shell.execute_reply": "2023-08-18T07:12:07.663740Z" }, "origin_pos": 12, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU1 time: 0.4600 sec\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "GPU2 time: 0.4706 sec\n" ] } ], "source": [ "run(x_gpu1)\n", "run(x_gpu2) # 预热设备\n", "torch.cuda.synchronize(devices[0])\n", "torch.cuda.synchronize(devices[1])\n", "\n", "with d2l.Benchmark('GPU1 time'):\n", " run(x_gpu1)\n", " torch.cuda.synchronize(devices[0])\n", "\n", "with d2l.Benchmark('GPU2 time'):\n", " run(x_gpu2)\n", " torch.cuda.synchronize(devices[1])" ] }, { "cell_type": "markdown", "id": "4df4f720", "metadata": { "origin_pos": 15, "tab": [ "pytorch" ] }, 
"source": [ "如果删除两个任务之间的`synchronize`语句,系统就可以在两个设备上自动实现并行计算。\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "d6a567e4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:07.668313Z", "iopub.status.busy": "2023-08-18T07:12:07.667763Z", "iopub.status.idle": "2023-08-18T07:12:08.130167Z", "shell.execute_reply": "2023-08-18T07:12:08.129377Z" }, "origin_pos": 18, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU1 & GPU2: 0.4580 sec\n" ] } ], "source": [ "with d2l.Benchmark('GPU1 & GPU2'):\n", " run(x_gpu1)\n", " run(x_gpu2)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "a04f1ffe", "metadata": { "origin_pos": 20 }, "source": [ "在上述情况下,总执行时间小于两个部分执行时间的总和,因为深度学习框架自动调度两个GPU设备上的计算,而不需要用户编写复杂的代码。\n", "\n", "## 并行计算与通信\n", "\n", "在许多情况下,我们需要在不同的设备之间移动数据,比如在CPU和GPU之间,或者在不同的GPU之间。例如,当执行分布式优化时,就需要移动数据来聚合多个加速卡上的梯度。让我们通过在GPU上计算,然后将结果复制回CPU来模拟这个过程。\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "3b71f533", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:08.133753Z", "iopub.status.busy": "2023-08-18T07:12:08.133184Z", "iopub.status.idle": "2023-08-18T07:12:10.950227Z", "shell.execute_reply": "2023-08-18T07:12:10.949308Z" }, "origin_pos": 22, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "在GPU1上运行: 0.4608 sec\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "复制到CPU: 2.3504 sec\n" ] } ], "source": [ "def copy_to_cpu(x, non_blocking=False):\n", " return [y.to('cpu', non_blocking=non_blocking) for y in x]\n", "\n", "with d2l.Benchmark('在GPU1上运行'):\n", " y = run(x_gpu1)\n", " torch.cuda.synchronize()\n", "\n", "with d2l.Benchmark('复制到CPU'):\n", " y_cpu = copy_to_cpu(y)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "5290ab0c", "metadata": { "origin_pos": 25, "tab": [ "pytorch" ] }, "source": [ "这种方式效率不高。注意到当列表中的其余部分还在计算时,我们可能就已经开始将`y`的部分复制到CPU了。例如,当计算一个小批量的(反传)梯度时。某些参数的梯度将比其他参数的梯度更早可用。因此,在GPU仍在运行时就开始使用PCI-Express总线带宽来移动数据是有利的。在PyTorch中,`to()`和`copy_()`等函数都允许显式的`non_blocking`参数,这允许在不需要同步时调用方可以绕过同步。设置`non_blocking=True`以模拟这个场景。\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "b6ecdc54", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:10.954084Z", "iopub.status.busy": "2023-08-18T07:12:10.953336Z", "iopub.status.idle": "2023-08-18T07:12:12.728692Z", "shell.execute_reply": "2023-08-18T07:12:12.727837Z" }, "origin_pos": 28, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "在GPU1上运行并复制到CPU: 1.7703 sec\n" ] } ], "source": [ "with d2l.Benchmark('在GPU1上运行并复制到CPU'):\n", " y = run(x_gpu1)\n", " y_cpu = copy_to_cpu(y, True)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "58a269e8", "metadata": { "origin_pos": 30 }, "source": [ "两个操作所需的总时间少于它们各部分操作所需时间的总和。请注意,与并行计算的区别是通信操作使用的资源:CPU和GPU之间的总线。事实上,我们可以在两个设备上同时进行计算和通信。如上所述,计算和通信之间存在的依赖关系是必须先计算`y[i]`,然后才能将其复制到CPU。幸运的是,系统可以在计算`y[i]`的同时复制`y[i-1]`,以减少总的运行时间。\n", "\n", "最后,本节给出了一个简单的两层多层感知机在CPU和两个GPU上训练时的计算图及其依赖关系的例子,如 :numref:`fig_twogpu`所示。手动调度由此产生的并行程序将是相当痛苦的。这就是基于图的计算后端进行优化的优势所在。\n", "\n", "![在一个CPU和两个GPU上的两层的多层感知机的计算图及其依赖关系](../img/twogpu.svg)\n", ":label:`fig_twogpu`\n", "\n", "## 小结\n", "\n", "* 现代系统拥有多种设备,如多个GPU和多个CPU,还可以并行地、异步地使用它们。\n", "* 现代系统还拥有各种通信资源,如PCI Express、存储(通常是固态硬盘或网络存储)和网络带宽,为了达到最高效率可以并行使用它们。\n", "* 后端可以通过自动化地并行计算和通信来提高性能。\n", "\n", "## 练习\n", "\n", "1. 在本节定义的`run`函数中执行了八个操作,并且操作之间没有依赖关系。设计一个实验,看看深度学习框架是否会自动地并行地执行它们。\n", "1. 
{ "cell_type": "markdown", "id": "58a269e8", "metadata": { "origin_pos": 30 }, "source": [
"The total time required for both operations is less than the sum of their parts. Note that this task differs from parallel computation in the resource it uses: the bus between the CPU and the GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between computation and communication: `y[i]` must be computed before it can be copied to the CPU. Fortunately, the system can copy `y[i-1]` while computing `y[i]`, which reduces the total running time.\n",
"\n",
"We conclude with an illustration of the computational graph and its dependencies for a simple two-layer multilayer perceptron trained on one CPU and two GPUs, as depicted in :numref:`fig_twogpu`. It would be quite painful to schedule the resulting parallel program manually. This is where having a graph-based computing backend to perform the optimization is advantageous.\n",
"\n",
"![The computational graph and its dependencies of a two-layer multilayer perceptron on a CPU and two GPUs](../img/twogpu.svg)\n",
":label:`fig_twogpu`\n",
"\n",
"## Summary\n",
"\n",
"* Modern systems have a variety of devices, such as multiple GPUs and CPUs, and they can be used in parallel and asynchronously.\n",
"* Modern systems also have a variety of resources for communication, such as PCI Express, storage (typically solid-state drives or network storage), and network bandwidth. They can be used in parallel for peak efficiency.\n",
"* The backend can improve performance through automatic parallelization of computation and communication.\n",
"\n",
"## Exercises\n",
"\n",
"1. The `run` function defined in this section performs a number of operations with no dependencies between them. Design an experiment to see whether the deep learning framework automatically executes them in parallel.\n",
"1. When the workload of an individual operator is sufficiently small, parallelization can help even on a single CPU or GPU. Design an experiment to verify this.\n",
"1. Design an experiment that uses parallel computation and communication on CPUs and GPUs.\n",
"1. Use a debugger such as NVIDIA's [Nsight](https://developer.nvidia.com/nsight-compute-2019_5) to verify that your code is efficient.\n",
"1. Design computation tasks with more complex data dependencies and run experiments to see whether you can obtain the correct results while improving performance.\n"
] }, { "cell_type": "markdown", "id": "88f15d8c", "metadata": { "origin_pos": 32, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/2794)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "name": "conda_pytorch_p36" }, "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }