{
"cells": [
{
"cell_type": "markdown",
"id": "1241fc98",
"metadata": {},
"source": [
"The following additional libraries are needed to run this\n",
"notebook. Note that running on Colab is experimental, please report a Github\n",
"issue if you have any problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e2dcede",
"metadata": {},
"outputs": [],
"source": [
"!pip install d2l==0.17.6\n"
]
},
{
"cell_type": "markdown",
"id": "61a96526",
"metadata": {
"origin_pos": 0
},
"source": [
"# RMSProp\n",
":label:`sec_rmsprop`\n",
"\n",
"\n",
"One of the key issues in :numref:`sec_adagrad` is that the learning rate decreases at a predefined schedule of effectively $\\mathcal{O}(t^{-\\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.\n",
"\n",
":cite:`Tieleman.Hinton.2012` proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\\mathbf{g}_t$ into a state vector $\\mathbf{s}_t = \\mathbf{s}_{t-1} + \\mathbf{g}_t^2$. As a result $\\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linearly as the algorithm converges.\n",
"\n",
"One way of fixing this problem would be to use $\\mathbf{s}_t / t$. For reasonable distributions of $\\mathbf{g}_t$ this will converge. Unfortunately it might take a very long time until the limit behavior starts to matter since the procedure remembers the full trajectory of values. An alternative is to use a leaky average in the same way we used in the momentum method, i.e., $\\mathbf{s}_t \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1-\\gamma) \\mathbf{g}_t^2$ for some parameter $\\gamma > 0$. Keeping all other parts unchanged yields RMSProp.\n",
"\n",
"## The Algorithm\n",
"\n",
"Let us write out the equations in detail.\n",
"\n",
"$$\\begin{aligned}\n",
" \\mathbf{s}_t & \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1 - \\gamma) \\mathbf{g}_t^2, \\\\\n",
" \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t + \\epsilon}} \\odot \\mathbf{g}_t.\n",
"\\end{aligned}$$\n",
"\n",
"The constant $\\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\\mathbf{s}_t$ yields\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathbf{s}_t & = (1 - \\gamma) \\mathbf{g}_t^2 + \\gamma \\mathbf{s}_{t-1} \\\\\n",
"& = (1 - \\gamma) \\left(\\mathbf{g}_t^2 + \\gamma \\mathbf{g}_{t-1}^2 + \\gamma^2 \\mathbf{g}_{t-2} + \\ldots, \\right).\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"As before in :numref:`sec_momentum` we use $1 + \\gamma + \\gamma^2 + \\ldots, = \\frac{1}{1-\\gamma}$. Hence the sum of weights is normalized to $1$ with a half-life time of an observation of $\\gamma^{-1}$. Let us visualize the weights for the past 40 time steps for various choices of $\\gamma$.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "fcb7bea3",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:34.648575Z",
"iopub.status.busy": "2022-11-12T22:00:34.648009Z",
"iopub.status.idle": "2022-11-12T22:00:37.314176Z",
"shell.execute_reply": "2022-11-12T22:00:37.313320Z"
},
"origin_pos": 3,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"import math\n",
"import tensorflow as tf\n",
"from d2l import tensorflow as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6c3bbdd0",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:37.414351Z",
"iopub.status.busy": "2022-11-12T22:00:37.317778Z",
"iopub.status.idle": "2022-11-12T22:00:38.675457Z",
"shell.execute_reply": "2022-11-12T22:00:38.674573Z"
},
"origin_pos": 4,
"tab": [
"tensorflow"
]
},
"outputs": [
],
"source": [
"d2l.set_figsize()\n",
"gammas = [0.95, 0.9, 0.8, 0.7]\n",
"for gamma in gammas:\n",
" x = tf.range(40).numpy()\n",
" d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')\n",
"d2l.plt.xlabel('time');"
]
},
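{
"cell_type": "markdown",
"id": "a3f9c210",
"metadata": {},
"source": [
"As a quick numeric sanity check of the normalization claim above, the leaky-average weights $(1-\\gamma)\\gamma^t$ sum to (approximately) $1$, and after roughly $(1-\\gamma)^{-1}$ steps the weight of an observation has decayed to about $e^{-1}$ of its initial value. The following is a small illustrative sketch in plain Python; the choice of $\\gamma = 0.9$ and the 1000-term horizon are arbitrary.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2e541",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"gamma = 0.9\n",
"# The geometric series of weights (1 - gamma) * gamma**t converges to 1,\n",
"# so the leaky average is a properly normalized weighted average.\n",
"weight_sum = sum((1 - gamma) * gamma ** t for t in range(1000))\n",
"print(f'sum of weights for gamma={gamma}: {weight_sum:.6f}')\n",
"\n",
"# After about 1 / (1 - gamma) steps the weight of an observation has decayed\n",
"# to roughly exp(-1) of its initial value (the effective averaging horizon).\n",
"steps = round(1 / (1 - gamma))\n",
"print(f'gamma**{steps} = {gamma ** steps:.3f}, exp(-1) = {math.exp(-1):.3f}')"
]
},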
{
"cell_type": "markdown",
"id": "297dea93",
"metadata": {
"origin_pos": 5
},
"source": [
"## Implementation from Scratch\n",
"\n",
"As before we use the quadratic function $f(\\mathbf{x})=0.1x_1^2+2x_2^2$ to observe the trajectory of RMSProp. Recall that in :numref:`sec_adagrad`, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\\eta$ is controlled separately this does not happen with RMSProp.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "46ce23f5",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:38.679871Z",
"iopub.status.busy": "2022-11-12T22:00:38.679148Z",
"iopub.status.idle": "2022-11-12T22:00:38.795118Z",
"shell.execute_reply": "2022-11-12T22:00:38.794253Z"
},
"origin_pos": 6,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 20, x1: -0.010599, x2: 0.000000\n"
]
}
],
"source": [
"def rmsprop_2d(x1, x2, s1, s2):\n",
" g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6\n",
" s1 = gamma * s1 + (1 - gamma) * g1 ** 2\n",
" s2 = gamma * s2 + (1 - gamma) * g2 ** 2\n",
" x1 -= eta / math.sqrt(s1 + eps) * g1\n",
" x2 -= eta / math.sqrt(s2 + eps) * g2\n",
" return x1, x2, s1, s2\n",
"\n",
"def f_2d(x1, x2):\n",
" return 0.1 * x1 ** 2 + 2 * x2 ** 2\n",
"\n",
"eta, gamma = 0.4, 0.9\n",
"d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))"
]
},
{
"cell_type": "markdown",
"id": "d5663ad1",
"metadata": {
"origin_pos": 7
},
"source": [
"Next, we implement RMSProp to be used in a deep network. This is equally straightforward.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1e9e1043",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:38.798994Z",
"iopub.status.busy": "2022-11-12T22:00:38.798388Z",
"iopub.status.idle": "2022-11-12T22:00:38.802835Z",
"shell.execute_reply": "2022-11-12T22:00:38.802015Z"
},
"origin_pos": 9,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"def init_rmsprop_states(feature_dim):\n",
" s_w = tf.Variable(tf.zeros((feature_dim, 1)))\n",
" s_b = tf.Variable(tf.zeros(1))\n",
" return (s_w, s_b)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e2dca20a",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:38.806140Z",
"iopub.status.busy": "2022-11-12T22:00:38.805611Z",
"iopub.status.idle": "2022-11-12T22:00:38.810623Z",
"shell.execute_reply": "2022-11-12T22:00:38.809845Z"
},
"origin_pos": 12,
"tab": [
"tensorflow"
]
},
"outputs": [],
"source": [
"def rmsprop(params, grads, states, hyperparams):\n",
" gamma, eps = hyperparams['gamma'], 1e-6\n",
" for p, s, g in zip(params, states, grads):\n",
" s[:].assign(gamma * s + (1 - gamma) * tf.math.square(g))\n",
" p[:].assign(p - hyperparams['lr'] * g / tf.math.sqrt(s + eps))"
]
},
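{
"cell_type": "markdown",
"id": "c8e4f172",
"metadata": {},
"source": [
"As a quick check of the two functions above, the following illustrative sketch applies a single update to toy parameters (the shapes and gradient values are arbitrary). Because $\\mathbf{s}$ is initialized at zero, both weight coordinates move by roughly $\\eta/\\sqrt{1-\\gamma}$ on the first step, even though their gradients differ by two orders of magnitude.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1a5b963",
"metadata": {},
"outputs": [],
"source": [
"# Toy shapes mimicking a linear model with two features and a bias.\n",
"w = tf.Variable(tf.ones((2, 1)))\n",
"b = tf.Variable(tf.zeros(1))\n",
"states = init_rmsprop_states(feature_dim=2)\n",
"grads = [tf.constant([[0.1], [10.0]]), tf.constant([1.0])]\n",
"rmsprop([w, b], grads, states, {'lr': 0.01, 'gamma': 0.9})\n",
"# Both weight coordinates change by about 0.03 despite very different gradients.\n",
"print('w:', w.numpy().ravel(), 'b:', b.numpy())"
]
},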
{
"cell_type": "markdown",
"id": "f12a86c5",
"metadata": {
"origin_pos": 13
},
"source": [
"We set the initial learning rate to 0.01 and the weighting term $\\gamma$ to 0.9. That is, $\\mathbf{s}$ aggregates on average over the past $1/(1-\\gamma) = 10$ observations of the square gradient.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "10fc8356",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:38.814151Z",
"iopub.status.busy": "2022-11-12T22:00:38.813630Z",
"iopub.status.idle": "2022-11-12T22:00:45.069291Z",
"shell.execute_reply": "2022-11-12T22:00:45.068397Z"
},
"origin_pos": 14,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.244, 0.113 sec/epoch\n"
]
}
],
"source": [
"data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)\n",
"d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),\n",
" {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);"
]
},
{
"cell_type": "markdown",
"id": "0544deb4",
"metadata": {
"origin_pos": 15
},
"source": [
"## Concise Implementation\n",
"\n",
"Since RMSProp is a rather popular algorithm it is also available directly in Keras as `tf.keras.optimizers.RMSprop`. All we need to do is pass it to the concise training function, assigning $\\gamma$ to the parameter `rho`.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6af8f7a1",
"metadata": {
"execution": {
"iopub.execute_input": "2022-11-12T22:00:45.073187Z",
"iopub.status.busy": "2022-11-12T22:00:45.072871Z",
"iopub.status.idle": "2022-11-12T22:00:53.684771Z",
"shell.execute_reply": "2022-11-12T22:00:53.683934Z"
},
"origin_pos": 18,
"tab": [
"tensorflow"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.243, 0.139 sec/epoch\n"
]
}
],
"source": [
"trainer = tf.keras.optimizers.RMSprop\n",
"d2l.train_concise_ch11(trainer, {'learning_rate': 0.01, 'rho': 0.9},\n",
" data_iter)"
]
},
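{
"cell_type": "markdown",
"id": "e6c7d084",
"metadata": {},
"source": [
"The same optimizer can also be constructed and used on its own. The following minimal sketch (with an arbitrary toy variable and loss) shows how $\\gamma$ maps to the `rho` argument of `tf.keras.optimizers.RMSprop` and how a single step is applied.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2b8a395",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: one RMSProp step applied directly via the Keras optimizer.\n",
"# gamma from the text corresponds to the rho argument here.\n",
"opt = tf.keras.optimizers.RMSprop(learning_rate=0.01, rho=0.9)\n",
"x = tf.Variable([1.0, 2.0])\n",
"with tf.GradientTape() as tape:\n",
"    loss = tf.reduce_sum(0.1 * x ** 2)\n",
"grads = tape.gradient(loss, [x])\n",
"opt.apply_gradients(zip(grads, [x]))\n",
"print(x.numpy())"
]
},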
{
"cell_type": "markdown",
"id": "ac2077ae",
"metadata": {
"origin_pos": 19
},
"source": [
"## Summary\n",
"\n",
"* RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.\n",
"* RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.\n",
"* The learning rate needs to be scheduled by the experimenter in practice.\n",
"* The coefficient $\\gamma$ determines how long the history is when adjusting the per-coordinate scale.\n",
"\n",
"## Exercises\n",
"\n",
"1. What happens experimentally if we set $\\gamma = 1$? Why?\n",
"1. Rotate the optimization problem to minimize $f(\\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?\n",
"1. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST. Experiment with different choices for adjusting the learning rate.\n",
"1. Would you want to adjust $\\gamma$ as optimization progresses? How sensitive is RMSProp to this?\n"
]
},
{
"cell_type": "markdown",
"id": "d28b3a77",
"metadata": {
"origin_pos": 22,
"tab": [
"tensorflow"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1075)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}