{ "cells": [ { "cell_type": "markdown", "id": "a71215ab", "metadata": {}, "source": [ "Installing (updating) the following libraries for your Sagemaker\n", "instance." ] }, { "cell_type": "code", "execution_count": null, "id": "17107283", "metadata": {}, "outputs": [], "source": [ "!pip install .. # installing d2l\n" ] }, { "cell_type": "markdown", "id": "7a4cb8fa", "metadata": { "origin_pos": 0 }, "source": [ "# 自动并行\n", ":label:`sec_auto_para`\n", "\n", "深度学习框架(例如,MxNet、飞桨和PyTorch)会在后端自动构建计算图。利用计算图,系统可以了解所有依赖关系,并且可以选择性地并行执行多个不相互依赖的任务以提高速度。例如, :numref:`sec_async`中的 :numref:`fig_asyncgraph`独立初始化两个变量。因此,系统可以选择并行执行它们。\n", "\n", "通常情况下单个操作符将使用所有CPU或单个GPU上的所有计算资源。例如,即使在一台机器上有多个CPU处理器,`dot`操作符也将使用所有CPU上的所有核心(和线程)。这样的行为同样适用于单个GPU。因此,并行化对单设备计算机来说并不是很有用,而并行化对于多个设备就很重要了。虽然并行化通常应用在多个GPU之间,但增加本地CPU以后还将提高少许性能。例如, :cite:`Hadjis.Zhang.Mitliagkas.ea.2016`则把结合GPU和CPU的训练应用到计算机视觉模型中。借助自动并行化框架的便利性,我们可以依靠几行Python代码实现相同的目标。对自动并行计算的讨论主要集中在使用CPU和GPU的并行计算上,以及计算和通信的并行化内容。\n", "\n", "请注意,本节中的实验至少需要两个GPU来运行。\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "8c944f1a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:11:59.505418Z", "iopub.status.busy": "2023-08-18T07:11:59.504686Z", "iopub.status.idle": "2023-08-18T07:12:02.958789Z", "shell.execute_reply": "2023-08-18T07:12:02.957933Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import torch\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "4c8e7569", "metadata": { "origin_pos": 4 }, "source": [ "## 基于GPU的并行计算\n", "\n", "从定义一个具有参考性的用于测试的工作负载开始:下面的`run`函数将执行$10$次*矩阵-矩阵*乘法时需要使用的数据分配到两个变量(`x_gpu1`和`x_gpu2`)中,这两个变量分别位于选择的不同设备上。\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "5e7b039a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:02.987012Z", "iopub.status.busy": "2023-08-18T07:12:02.986327Z", "iopub.status.idle": "2023-08-18T07:12:05.221346Z", "shell.execute_reply": "2023-08-18T07:12:05.220262Z" }, "origin_pos": 6, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "devices = d2l.try_all_gpus()\n", "def run(x):\n", " return [x.mm(x) for _ in range(50)]\n", "\n", "x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])\n", "x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])" ] }, { "cell_type": "markdown", "id": "c2f2ffe6", "metadata": { "origin_pos": 9, "tab": [ "pytorch" ] }, "source": [ "现在使用函数来处理数据。通过在测量之前需要预热设备(对设备执行一次传递)来确保缓存的作用不影响最终的结果。`torch.cuda.synchronize()`函数将会等待一个CUDA设备上的所有流中的所有核心的计算完成。函数接受一个`device`参数,代表是哪个设备需要同步。如果device参数是`None`(默认值),它将使用`current_device()`找出的当前设备。\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "970d8c24", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:05.225646Z", "iopub.status.busy": "2023-08-18T07:12:05.224864Z", "iopub.status.idle": "2023-08-18T07:12:07.664593Z", "shell.execute_reply": "2023-08-18T07:12:07.663740Z" }, "origin_pos": 12, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU1 time: 0.4600 sec\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "GPU2 time: 0.4706 sec\n" ] } ], "source": [ "run(x_gpu1)\n", "run(x_gpu2) # 预热设备\n", "torch.cuda.synchronize(devices[0])\n", "torch.cuda.synchronize(devices[1])\n", "\n", "with d2l.Benchmark('GPU1 time'):\n", " run(x_gpu1)\n", " torch.cuda.synchronize(devices[0])\n", "\n", "with d2l.Benchmark('GPU2 time'):\n", " run(x_gpu2)\n", " torch.cuda.synchronize(devices[1])" ] }, { "cell_type": "markdown", "id": "4df4f720", "metadata": { "origin_pos": 15, "tab": [ "pytorch" ] }, "source": [ "如果删除两个任务之间的`synchronize`语句,系统就可以在两个设备上自动实现并行计算。\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "d6a567e4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:07.668313Z", "iopub.status.busy": "2023-08-18T07:12:07.667763Z", "iopub.status.idle": "2023-08-18T07:12:08.130167Z", "shell.execute_reply": "2023-08-18T07:12:08.129377Z" }, "origin_pos": 18, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU1 & GPU2: 0.4580 sec\n" ] } ], "source": [ "with d2l.Benchmark('GPU1 & GPU2'):\n", " run(x_gpu1)\n", " run(x_gpu2)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "a04f1ffe", "metadata": { "origin_pos": 20 }, "source": [ "在上述情况下,总执行时间小于两个部分执行时间的总和,因为深度学习框架自动调度两个GPU设备上的计算,而不需要用户编写复杂的代码。\n", "\n", "## 并行计算与通信\n", "\n", "在许多情况下,我们需要在不同的设备之间移动数据,比如在CPU和GPU之间,或者在不同的GPU之间。例如,当执行分布式优化时,就需要移动数据来聚合多个加速卡上的梯度。让我们通过在GPU上计算,然后将结果复制回CPU来模拟这个过程。\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "3b71f533", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:08.133753Z", "iopub.status.busy": "2023-08-18T07:12:08.133184Z", "iopub.status.idle": "2023-08-18T07:12:10.950227Z", "shell.execute_reply": "2023-08-18T07:12:10.949308Z" }, "origin_pos": 22, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "在GPU1上运行: 0.4608 sec\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "复制到CPU: 2.3504 sec\n" ] } ], "source": [ "def copy_to_cpu(x, non_blocking=False):\n", " return [y.to('cpu', non_blocking=non_blocking) for y in x]\n", "\n", "with d2l.Benchmark('在GPU1上运行'):\n", " y = run(x_gpu1)\n", " torch.cuda.synchronize()\n", "\n", "with d2l.Benchmark('复制到CPU'):\n", " y_cpu = copy_to_cpu(y)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "5290ab0c", "metadata": { "origin_pos": 25, "tab": [ "pytorch" ] }, "source": [ "这种方式效率不高。注意到当列表中的其余部分还在计算时,我们可能就已经开始将`y`的部分复制到CPU了。例如,当计算一个小批量的(反传)梯度时。某些参数的梯度将比其他参数的梯度更早可用。因此,在GPU仍在运行时就开始使用PCI-Express总线带宽来移动数据是有利的。在PyTorch中,`to()`和`copy_()`等函数都允许显式的`non_blocking`参数,这允许在不需要同步时调用方可以绕过同步。设置`non_blocking=True`以模拟这个场景。\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "b6ecdc54", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:12:10.954084Z", "iopub.status.busy": "2023-08-18T07:12:10.953336Z", "iopub.status.idle": "2023-08-18T07:12:12.728692Z", "shell.execute_reply": "2023-08-18T07:12:12.727837Z" }, "origin_pos": 28, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "在GPU1上运行并复制到CPU: 1.7703 sec\n" ] } ], "source": [ "with d2l.Benchmark('在GPU1上运行并复制到CPU'):\n", " y = run(x_gpu1)\n", " y_cpu = copy_to_cpu(y, True)\n", " torch.cuda.synchronize()" ] }, { "cell_type": "markdown", "id": "58a269e8", "metadata": { "origin_pos": 30 }, "source": [ "两个操作所需的总时间少于它们各部分操作所需时间的总和。请注意,与并行计算的区别是通信操作使用的资源:CPU和GPU之间的总线。事实上,我们可以在两个设备上同时进行计算和通信。如上所述,计算和通信之间存在的依赖关系是必须先计算`y[i]`,然后才能将其复制到CPU。幸运的是,系统可以在计算`y[i]`的同时复制`y[i-1]`,以减少总的运行时间。\n", "\n", "最后,本节给出了一个简单的两层多层感知机在CPU和两个GPU上训练时的计算图及其依赖关系的例子,如 :numref:`fig_twogpu`所示。手动调度由此产生的并行程序将是相当痛苦的。这就是基于图的计算后端进行优化的优势所在。\n", "\n", "![在一个CPU和两个GPU上的两层的多层感知机的计算图及其依赖关系](../img/twogpu.svg)\n", ":label:`fig_twogpu`\n", "\n", "## 小结\n", "\n", "* 现代系统拥有多种设备,如多个GPU和多个CPU,还可以并行地、异步地使用它们。\n", "* 现代系统还拥有各种通信资源,如PCI Express、存储(通常是固态硬盘或网络存储)和网络带宽,为了达到最高效率可以并行使用它们。\n", "* 后端可以通过自动化地并行计算和通信来提高性能。\n", "\n", "## 练习\n", "\n", "1. 在本节定义的`run`函数中执行了八个操作,并且操作之间没有依赖关系。设计一个实验,看看深度学习框架是否会自动地并行地执行它们。\n", "1. 当单个操作符的工作量足够小,即使在单个CPU或GPU上,并行化也会有所帮助。设计一个实验来验证这一点。\n", "1. 设计一个实验,在CPU和GPU这两种设备上使用并行计算和通信。\n", "1. 使用诸如NVIDIA的[Nsight](https://developer.nvidia.com/nsight-compute-2019_5)之类的调试器来验证代码是否有效。\n", "1. 设计并实验具有更加复杂的数据依赖关系的计算任务,以查看是否可以在提高性能的同时获得正确的结果。\n" ] }, { "cell_type": "markdown", "id": "88f15d8c", "metadata": { "origin_pos": 32, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/2794)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "name": "conda_pytorch_p36" }, "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }