{
"cells": [
{
"cell_type": "markdown",
"id": "ee2a8bd4",
"metadata": {},
"source": [
"The following additional libraries are needed to run this\n",
"notebook. Note that running on Colab is experimental; please report a GitHub\n",
"issue if you have any problems."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "222756d1",
"metadata": {},
"outputs": [],
"source": [
"!pip install d2l==0.17.6\n"
]
},
{
"cell_type": "markdown",
"id": "29e77831",
"metadata": {
"origin_pos": 0
},
"source": [
"# Minibatch Stochastic Gradient Descent\n",
":label:`sec_minibatch_sgd`\n",
"\n",
"So far we encountered two extremes in the approach to gradient-based learning: :numref:`sec_gd` uses the full dataset to compute gradients and to update parameters, one pass at a time. Conversely, :numref:`sec_sgd` processes one observation at a time to make progress. Each of them has its own drawbacks. Gradient descent is not particularly *data efficient* whenever the data is very similar. Stochastic gradient descent is not particularly *computationally efficient* since CPUs and GPUs cannot exploit the full power of vectorization. This suggests that there might be a happy medium, and in fact, that is what we have been using so far in the examples we discussed.\n",
"\n",
"## Vectorization and Caches\n",
"\n",
"At the heart of the decision to use minibatches is computational efficiency. This is most easily understood when considering parallelization to multiple GPUs and multiple servers. In this case we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers we already arrive at a minibatch size of 128.\n",
"\n",
"Things are a bit more subtle when it comes to single GPUs or even CPUs. These devices have multiple types of memory, often multiple types of compute units, and different bandwidth constraints between them. For instance, a CPU has a small number of registers and then L1, L2, and in some cases even L3 cache (which is shared between the different processor cores). Moving from L1 to L3, these caches increase in size and latency while their bandwidth decreases. Suffice it to say, the processor can perform many more operations than the main memory interface can supply data for.\n",
"\n",
"* A 2GHz CPU with 16 cores and AVX-512 vectorization can process up to $2 \\cdot 10^9 \\cdot 16 \\cdot 32 \\approx 10^{12}$ bytes per second. The capability of GPUs easily exceeds this number by a factor of 100. On the other hand, a midrange server processor might not have much more than 100 GB/s bandwidth, i.e., less than one tenth of what would be required to keep the processor fed. To make matters worse, not all memory access is created equal. First, memory interfaces are typically 64 bits wide or wider (e.g., up to 384 bits on GPUs), hence reading a single byte incurs the cost of a much wider access.\n",
"* Second, there is significant overhead for the first access whereas sequential access is relatively cheap (this is often called a burst read). There are many more things to keep in mind, such as caching when we have multiple sockets, chiplets, and other structures. A detailed discussion of this is beyond the scope of this section. See e.g., this [Wikipedia article](https://en.wikipedia.org/wiki/Cache_hierarchy) for a more in-depth discussion.\n",
"\n",
"The way to alleviate these constraints is to use a hierarchy of CPU caches that are actually fast enough to supply the processor with data. This is *the* driving force behind batching in deep learning. To keep matters simple, consider matrix-matrix multiplication, say $\\mathbf{A} = \\mathbf{B}\\mathbf{C}$. We have a number of options for calculating $\\mathbf{A}$. For instance, we could try the following:\n",
"\n",
"1. We could compute $\\mathbf{A}_{ij} = \\mathbf{B}_{i,:} \\mathbf{C}_{:,j}$, i.e., we could compute it elementwise by means of dot products between a row of $\\mathbf{B}$ and a column of $\\mathbf{C}$.\n",
"1. We could compute $\\mathbf{A}_{:,j} = \\mathbf{B} \\mathbf{C}_{:,j}$, i.e., we could compute it one column at a time. Likewise we could compute $\\mathbf{A}$ one row $\\mathbf{A}_{i,:}$ at a time.\n",
"1. We could simply compute $\\mathbf{A} = \\mathbf{B} \\mathbf{C}$.\n",
"1. We could break $\\mathbf{B}$ and $\\mathbf{C}$ into smaller block matrices and compute $\\mathbf{A}$ one block at a time.\n",
"\n",
"If we follow the first option, we will need to copy one row and one column vector into the CPU each time we want to compute an element $\\mathbf{A}_{ij}$. Even worse, because matrix elements are stored sequentially in memory, we have to access many disjoint locations for one of the two vectors as we read them. The second option is much more favorable: we can keep the column vector $\\mathbf{C}_{:,j}$ in the CPU cache while we traverse $\\mathbf{B}$. This halves the memory bandwidth requirement, with correspondingly faster access. Of course, option 3 is most desirable. Unfortunately, most matrices might not entirely fit into cache (this is what we are discussing after all). However, option 4 offers a practically useful alternative: we can move blocks of the matrix into cache and multiply them locally. Optimized libraries take care of this for us. Let us have a look at how efficient these operations are in practice.\n",
"\n",
"Beyond computational efficiency, the overhead introduced by Python and by the deep learning framework itself is considerable. Recall that each time we execute a command, the Python interpreter hands it to the framework's backend (TensorFlow in this notebook), which has to dispatch it and deal with it during scheduling. Such overhead can be quite detrimental. In short, it is highly advisable to use vectorization (and matrices) whenever possible.\n"
]
},
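{
"cell_type": "markdown",
"id": "b1f0c3a2",
"metadata": {},
"source": [
"Option 4 above can be sketched in a few lines. The following is only a minimal NumPy illustration under stated assumptions, not what an optimized library actually does: the helper name `blocked_matmul`, the arrays `B_np` and `C_np`, and the tile size `bs=64` are made up for this example, and a real implementation would choose the tile size to match the cache sizes of the hardware.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def blocked_matmul(B, C, bs=64):\n",
"    # Hypothetical helper: compute B @ C one (bs x bs) tile at a time\n",
"    n, k = B.shape\n",
"    k2, m = C.shape\n",
"    assert k == k2, 'inner dimensions must agree'\n",
"    A = np.zeros((n, m), dtype=np.result_type(B, C))\n",
"    for i in range(0, n, bs):          # rows of the output tile\n",
"        for j in range(0, m, bs):      # columns of the output tile\n",
"            for p in range(0, k, bs):  # walk along the shared dimension\n",
"                # Each step touches only three small tiles\n",
"                A[i:i+bs, j:j+bs] += B[i:i+bs, p:p+bs] @ C[p:p+bs, j:j+bs]\n",
"    return A\n",
"\n",
"rng = np.random.default_rng(0)\n",
"B_np = rng.normal(size=(256, 256))\n",
"C_np = rng.normal(size=(256, 256))\n",
"# Compare against the one-shot product (option 3)\n",
"print(np.allclose(blocked_matmul(B_np, C_np), B_np @ C_np))\n",
"```\n",
"\n",
"In pure Python the three loops add interpreter overhead, so this sketch is meant to show the access pattern rather than to be fast. The point is that every inner step works on tiles small enough to stay in cache."
]
},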
{
"cell_type": "code",
"execution_count": 1,
"id": "ef283114",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:11:17.789325Z",
"iopub.status.busy": "2022-11-12T22:11:17.788538Z",
"iopub.status.idle": "2022-11-12T22:11:28.186706Z",
"shell.execute_reply": "2022-11-12T22:11:28.185559Z"
},
"origin_pos": 3,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"from d2l import tensorflow as d2l\n",
"\n",
"timer = d2l.Timer()\n",
"A = tf.Variable(tf.zeros((256, 256)))\n",
"B = tf.Variable(tf.random.normal([256, 256], 0, 1))\n",
"C = tf.Variable(tf.random.normal([256, 256], 0, 1))"
]
},
{
"cell_type": "markdown",
"id": "b80ac485",
"metadata": {
"origin_pos": 4
},
"source": [
"Element-wise assignment simply iterates over all rows of $\\mathbf{B}$ and all columns of $\\mathbf{C}$, assigning each dot product to the corresponding entry of $\\mathbf{A}$.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "82025e46",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:11:28.191479Z",
"iopub.status.busy": "2022-11-12T22:11:28.190762Z",
"iopub.status.idle": "2022-11-12T22:13:32.003024Z",
"shell.execute_reply": "2022-11-12T22:13:32.002155Z"
},
"origin_pos": 7,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"data": {
"text/plain": [
"123.80362010002136"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute A = BC one element at a time\n",
"timer.start()\n",
"for i in range(256):\n",
"    for j in range(256):\n",
"        A[i, j].assign(tf.tensordot(B[i, :], C[:, j], axes=1))\n",
"timer.stop()"
]
},
{
"cell_type": "markdown",
"id": "987b561a",
"metadata": {
"origin_pos": 8
},
"source": [
"A faster strategy is to perform column-wise assignment.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e681f4bf",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.007096Z",
"iopub.status.busy": "2022-11-12T22:13:32.006421Z",
"iopub.status.idle": "2022-11-12T22:13:32.384292Z",
"shell.execute_reply": "2022-11-12T22:13:32.383507Z"
},
"origin_pos": 11,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"data": {
"text/plain": [
"0.3721303939819336"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute A = BC one column at a time\n",
"timer.start()\n",
"for j in range(256):\n",
"    A[:, j].assign(tf.tensordot(B, C[:, j], axes=1))\n",
"timer.stop()"
]
},
{
"cell_type": "markdown",
"id": "df4344b9",
"metadata": {
"origin_pos": 12
},
"source": [
"Lastly, the most effective approach is to perform the entire operation in one block. Let us compare the speed of the three approaches.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e125dfdd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.388285Z",
"iopub.status.busy": "2022-11-12T22:13:32.387439Z",
"iopub.status.idle": "2022-11-12T22:13:32.416693Z",
"shell.execute_reply": "2022-11-12T22:13:32.415894Z"
},
"origin_pos": 15,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"performance in Gigaflops: element 0.000, column 0.090, full 1.412\n"
]
}
],
"source": [
"timer.start()\n",
"A.assign(tf.tensordot(B, C, axes=1))\n",
"timer.stop()\n",
"\n",
"# A 256 x 256 matrix product takes 256**3 multiply-add pairs. Multiply and\n",
"# add count as separate operations (fused in practice)\n",
"gflop = 2 * 256**3 / 1e9\n",
"gigaflops = [gflop / i for i in timer.times]\n",
"print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '\n",
"      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')"
]
},
{
"cell_type": "markdown",
"id": "ce85767b",
"metadata": {
"origin_pos": 16
},
"source": [
"## Minibatches\n",
"\n",
":label:`sec_minibatches`\n",
"\n",
"In the past we took it for granted that we would read *minibatches* of data rather than single observations to update parameters. We now give a brief justification for it. Processing single observations requires us to perform many single matrix-vector (or even vector-vector) multiplications, which is quite expensive and which incurs a significant overhead on the part of the underlying deep learning framework. This applies both to evaluating a network on data (often referred to as inference) and to computing gradients to update parameters. That is, this applies whenever we perform $\\mathbf{w} \\leftarrow \\mathbf{w} - \\eta_t \\mathbf{g}_t$ where\n",
"\n",
"$$\\mathbf{g}_t = \\partial_{\\mathbf{w}} f(\\mathbf{x}_{t}, \\mathbf{w})$$\n",
"\n",
"We can increase the *computational* efficiency of this operation by applying it to a minibatch of observations at a time. That is, we replace the gradient $\\mathbf{g}_t$ over a single observation by one over a small batch\n",
"\n",
"$$\\mathbf{g}_t = \\partial_{\\mathbf{w}} \\frac{1}{|\\mathcal{B}_t|} \\sum_{i \\in \\mathcal{B}_t} f(\\mathbf{x}_{i}, \\mathbf{w})$$\n",
"\n",
"Let us see what this does to the statistical properties of $\\mathbf{g}_t$: since both $\\mathbf{x}_t$ and also all elements of the minibatch $\\mathcal{B}_t$ are drawn uniformly at random from the training set, the expectation of the gradient remains unchanged. The variance, on the other hand, is reduced significantly. Since the minibatch gradient is the average of $b := |\\mathcal{B}_t|$ independent gradients, its variance shrinks by a factor of $b$ and hence its standard deviation is scaled by a factor of $b^{-\\frac{1}{2}}$. This, by itself, is a good thing, since it means that the updates are more reliably aligned with the full gradient.\n",
"\n",
"Naively this would indicate that choosing a large minibatch $\\mathcal{B}_t$ would be universally desirable. Alas, after some point, the additional reduction in standard deviation is minimal when compared to the linear increase in computational cost. In practice we pick a minibatch that is large enough to offer good computational efficiency while still fitting into the memory of a GPU. To illustrate the savings let us have a look at some code. In it we perform the same matrix-matrix multiplication, but this time broken up into \"minibatches\" of 64 columns at a time.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bb804efc",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.420605Z",
"iopub.status.busy": "2022-11-12T22:13:32.419823Z",
"iopub.status.idle": "2022-11-12T22:13:32.431888Z",
"shell.execute_reply": "2022-11-12T22:13:32.431036Z"
},
"origin_pos": 19,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"performance in Gigaflops: block 4.749\n"
]
}
],
"source": [
"timer.start()\n",
"for j in range(0, 256, 64):\n",
"    A[:, j:j+64].assign(tf.tensordot(B, C[:, j:j+64], axes=1))\n",
"timer.stop()\n",
"# Same operation count as above: 2 * 256**3 floating point operations\n",
"print(f'performance in Gigaflops: block {2 * 256**3 / 1e9 / timer.times[3]:.3f}')"
]
},
{
"cell_type": "markdown",
"id": "a5bbbb64",
"metadata": {
"origin_pos": 20
},
"source": [
"As we can see, the computation on the minibatch is at least as efficient as on the full matrix. A word of caution is in order. In :numref:`sec_batch_norm` we used a type of regularization that was heavily dependent on the amount of variance in a minibatch. As we increase the minibatch size, the variance decreases and with it the benefit of the noise-injection due to batch normalization. See e.g., :cite:`Ioffe.2017` for details on how to rescale and compute the appropriate terms.\n",
"\n",
"## Reading the Dataset\n",
"\n",
"Let us have a look at how minibatches are efficiently generated from data. In the following we use a dataset developed by NASA to test the wing [noise from different aircraft](https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise) to compare these optimization algorithms. For convenience we only use the first $1,500$ examples. The data is whitened for preprocessing, i.e., we remove the mean and rescale the variance to $1$ per coordinate.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e3da0b99",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.435606Z",
"iopub.status.busy": "2022-11-12T22:13:32.435076Z",
"iopub.status.idle": "2022-11-12T22:13:32.440717Z",
"shell.execute_reply": "2022-11-12T22:13:32.439933Z"
},
"origin_pos": 23,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"#@save\n",
"d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',\n",
"                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')\n",
"\n",
"#@save\n",
"def get_data_ch11(batch_size=10, n=1500):\n",
"    data = np.genfromtxt(d2l.download('airfoil'),\n",
"                         dtype=np.float32, delimiter='\\t')\n",
"    # Standardize every column to zero mean and unit variance\n",
"    data = (data - data.mean(axis=0)) / data.std(axis=0)\n",
"    # The last column is the label; keep only the first n examples\n",
"    data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),\n",
"                               batch_size, is_train=True)\n",
"    return data_iter, data.shape[1]-1"
]
},
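{
"cell_type": "markdown",
"id": "c4d7e9f1",
"metadata": {},
"source": [
"The preprocessing step in `get_data_ch11` subtracts the per-column mean and divides by the per-column standard deviation. As a small sanity check, the sketch below applies the same transformation to a made-up array. The values in `toy` are purely illustrative and unrelated to the airfoil data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy data: 4 examples, 2 features on very different scales\n",
"toy = np.array([[1.0, 100.0],\n",
"                [2.0, 300.0],\n",
"                [3.0, 500.0],\n",
"                [4.0, 700.0]], dtype=np.float32)\n",
"\n",
"# Same transformation as in get_data_ch11\n",
"standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)\n",
"\n",
"print(standardized.mean(axis=0))  # close to 0 for every column\n",
"print(standardized.std(axis=0))   # close to 1 for every column\n",
"```\n",
"\n",
"After this step both columns live on the same scale, so a single learning rate is reasonable for all coordinates."
]
},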
{
"cell_type": "markdown",
"id": "d72b2306",
"metadata": {
"origin_pos": 24
},
"source": [
"## Implementation from Scratch\n",
"\n",
"Recall the minibatch stochastic gradient descent implementation from :numref:`sec_linear_scratch`. In the following we provide a slightly more general implementation. For convenience it has the same call signature as the other optimization algorithms introduced later in this chapter. Specifically, we add the state\n",
"input `states` and place the hyperparameters in the dictionary `hyperparams`. In\n",
"addition, we average the loss over the examples of each minibatch in the training\n",
"function, so the gradient in the optimization algorithm does not need to be\n",
"divided by the batch size.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "97d05001",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.444485Z",
"iopub.status.busy": "2022-11-12T22:13:32.443767Z",
"iopub.status.idle": "2022-11-12T22:13:32.448102Z",
"shell.execute_reply": "2022-11-12T22:13:32.447304Z"
},
"origin_pos": 27,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"def sgd(params, grads, states, hyperparams):\n",
"    # `states` is unused by plain SGD; it is kept for a uniform signature\n",
"    for param, grad in zip(params, grads):\n",
"        param.assign_sub(hyperparams['lr']*grad)"
]
},
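{
"cell_type": "markdown",
"id": "d9a2b6e3",
"metadata": {},
"source": [
"To see why the update needs no division by the batch size, here is a minimal NumPy sketch of minibatch SGD on synthetic linear regression data. Everything in it is an assumption made for illustration and not part of `d2l`: the helper `sgd_np`, the synthetic data, the learning rate of 0.1, and the batch size of 10. Because the loss is averaged over the minibatch before differentiating, the gradient is already a per-example mean.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sgd_np(params, grads, states, hyperparams):\n",
"    # Hypothetical NumPy counterpart of sgd; updates params in place\n",
"    for param, grad in zip(params, grads):\n",
"        param -= hyperparams['lr'] * grad\n",
"\n",
"rng = np.random.default_rng(0)\n",
"true_w, true_b = np.array([2.0, -3.4]), 4.2\n",
"X = rng.normal(size=(1000, 2))\n",
"y = X @ true_w + true_b + 0.01 * rng.normal(size=1000)\n",
"\n",
"w, b = np.zeros(2), np.zeros(1)\n",
"batch_size = 10\n",
"for epoch in range(3):\n",
"    idx = rng.permutation(len(X))  # reshuffle every epoch\n",
"    for start in range(0, len(X), batch_size):\n",
"        batch = idx[start:start + batch_size]\n",
"        err = X[batch] @ w + b - y[batch]\n",
"        # Gradients of the minibatch-averaged loss mean(err**2 / 2)\n",
"        dw = X[batch].T @ err / len(batch)\n",
"        db = np.array([err.mean()])\n",
"        sgd_np([w, b], [dw, db], None, {'lr': 0.1})\n",
"\n",
"print(w, b)  # should end up close to true_w and true_b\n",
"```\n",
"\n",
"Passing `None` for `states` mirrors how plain SGD is called in this chapter, since it keeps no state between steps."
]
},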
{
"cell_type": "markdown",
"id": "286dddf6",
"metadata": {
"origin_pos": 28
},
"source": [
"Next, we implement a generic training function to facilitate the use of the other optimization algorithms introduced later in this chapter. It initializes a linear regression model and can be used to train the model with minibatch stochastic gradient descent and other algorithms introduced subsequently.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a16e824a",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.451943Z",
"iopub.status.busy": "2022-11-12T22:13:32.451223Z",
"iopub.status.idle": "2022-11-12T22:13:32.459993Z",
"shell.execute_reply": "2022-11-12T22:13:32.459192Z"
},
"origin_pos": 31,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"#@save\n",
"def train_ch11(trainer_fn, states, hyperparams, data_iter,\n",
"               feature_dim, num_epochs=2):\n",
"    # Initialization\n",
"    w = tf.Variable(tf.random.normal(shape=(feature_dim, 1),\n",
"                                     mean=0, stddev=0.01), trainable=True)\n",
"    b = tf.Variable(tf.zeros(1), trainable=True)\n",
"\n",
"    # Train\n",
"    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss\n",
"    animator = d2l.Animator(xlabel='epoch', ylabel='loss',\n",
"                            xlim=[0, num_epochs], ylim=[0.22, 0.35])\n",
"    n, timer = 0, d2l.Timer()\n",
"\n",
"    for _ in range(num_epochs):\n",
"        for X, y in data_iter:\n",
"            with tf.GradientTape() as g:\n",
"                # Average the loss so the gradient is a per-example mean\n",
"                l = tf.math.reduce_mean(loss(net(X), y))\n",
"\n",
"            dw, db = g.gradient(l, [w, b])\n",
"            trainer_fn([w, b], [dw, db], states, hyperparams)\n",
"            n += X.shape[0]\n",
"            if n % 200 == 0:\n",
"                # Pause the timer while evaluating and plotting\n",
"                timer.stop()\n",
"                p = n/X.shape[0]\n",
"                q = p/tf.data.experimental.cardinality(data_iter).numpy()\n",
"                r = (d2l.evaluate_loss(net, data_iter, loss),)\n",
"                animator.add(q, r)\n",
"                timer.start()\n",
"    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/epoch')\n",
"    return timer.cumsum(), animator.Y[0]"
]
},
{
"cell_type": "markdown",
"id": "a6eff7e6",
"metadata": {
"origin_pos": 32
},
"source": [
"Let us see how optimization proceeds for batch gradient descent. This can be achieved by setting the minibatch size to 1500 (i.e., to the total number of examples). As a result the model parameters are updated only once per epoch. There is little progress. In fact, after 6 steps progress stalls.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "766bada2",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:13:32.463679Z",
"iopub.status.busy": "2022-11-12T22:13:32.463092Z",
"iopub.status.idle": "2022-11-12T22:13:33.485933Z",
"shell.execute_reply": "2022-11-12T22:13:33.485087Z"
},
"origin_pos": 33,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.247, 0.032 sec/epoch\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"