{ "cells": [ { "cell_type": "markdown", "metadata": { "origin_pos": 0 }, "source": [ "# RMSProp\n", ":label:`sec_rmsprop`\n", "\n", "One of the key issues in :numref:`sec_adagrad` is that the learning rate decreases at a predefined schedule of effectively $\\mathcal{O}(t^{-\\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.\n", "\n", ":cite:`Tieleman.Hinton.2012` proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\\mathbf{g}_t$ into a state vector $\\mathbf{s}_t = \\mathbf{s}_{t-1} + \\mathbf{g}_t^2$. As a result $\\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linarly as the algorithm converges.\n", "\n", "One way of fixing this problem would be to use $\\mathbf{s}_t / t$. For reasonable distributions of $\\mathbf{g}_t$ this will converge. Unfortunately it might take a very long time until the limit behavior starts to matter since the procedure remembers the full trajectory of values. An alternative is to use a leaky average in the same way we used in the momentum method, i.e., $\\mathbf{s}_t \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1-\\gamma) \\mathbf{g}_t^2$ for some parameter $\\gamma > 0$. Keeping all other parts unchanged yields RMSProp.\n", "\n", "## The Algorithm\n", "\n", "Let us write out the equations in detail.\n", "\n", "$$\\begin{aligned}\n", " \\mathbf{s}_t & \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1 - \\gamma) \\mathbf{g}_t^2, \\\\\n", " \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t + \\epsilon}} \\odot \\mathbf{g}_t.\n", "\\end{aligned}$$\n", "\n", "The constant $\\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\\mathbf{s}_t$ yields\n", "\n", "$$\n", "\\begin{aligned}\n", "\\mathbf{s}_t & = (1 - \\gamma) \\mathbf{g}_t^2 + \\gamma \\mathbf{s}_{t-1} \\\\\n", "& = (1 - \\gamma) \\left(\\mathbf{g}_t^2 + \\gamma \\mathbf{g}_{t-1}^2 + \\gamma^2 \\mathbf{g}_{t-2} + \\ldots, \\right).\n", "\\end{aligned}\n", "$$\n", "\n", "As before in :numref:`sec_momentum` we use $1 + \\gamma + \\gamma^2 + \\ldots, = \\frac{1}{1-\\gamma}$. Hence the sum of weights is normalized to $1$ with a half-life time of an observation of $\\gamma^{-1}$. 
Let us visualize the weights for the past 40 time steps for various choices of $\\gamma$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load ../utils/djl-imports\n", "%load ../utils/plot-utils\n", "%load ../utils/Functions.java\n", "%load ../utils/GradDescUtils.java\n", "%load ../utils/Accumulator.java\n", "%load ../utils/StopWatch.java\n", "%load ../utils/Training.java\n", "%load ../utils/TrainingChapter11.java" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NDManager manager = NDManager.newBaseManager();\n", "\n", "float[] gammas = new float[]{0.95f, 0.9f, 0.8f, 0.7f};\n", "\n", "NDArray timesND = manager.arange(40f);\n", "float[] times = timesND.toFloatArray();\n", "display(GradDescUtils.plotGammas(times, gammas, 600, 400));" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 2 }, "source": [ "## Implementation from Scratch\n", "\n", "As before, we use the quadratic function $f(\\mathbf{x})=0.1x_1^2+2x_2^2$ to observe the trajectory of RMSProp. Recall that in :numref:`sec_adagrad`, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\\eta$ is controlled separately, this does not happen with RMSProp.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "origin_pos": 3, "tab": [ "mxnet" ] }, "outputs": [], "source": [ "float eta = 0.4f;\n", "float gamma = 0.9f;\n", "\n", "Function<Float[], Float[]> rmsProp2d = (state) -> {\n", "    Float x1 = state[0], x2 = state[1], s1 = state[2], s2 = state[3];\n", "    float g1 = 0.2f * x1;\n", "    float g2 = 4 * x2;\n", "    float eps = (float) 1e-6;\n", "    // Leaky average of the squared gradients\n", "    s1 = gamma * s1 + (1 - gamma) * g1 * g1;\n", "    s2 = gamma * s2 + (1 - gamma) * g2 * g2;\n", "    // Per-coordinate scaled update\n", "    x1 -= eta / (float) Math.sqrt(s1 + eps) * g1;\n", "    x2 -= eta / (float) Math.sqrt(s2 + eps) * g2;\n", "    return new Float[]{x1, x2, s1, s2};\n", "};\n", "\n", "BiFunction<Float, Float, Float> f2d = (x1, x2) -> {\n", "    return 0.1f * x1 * x1 + 2 * x2 * x2;\n", "};\n", "\n", "GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(rmsProp2d, 20));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![RmsProp Gradient Descent 2D.](https://d2l-java-resources.s3.amazonaws.com/img/chapter_optim-rmsprop-gd2d.svg)" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 4 }, "source": [ "Next, we implement RMSProp to be used in a deep network.
This is equally straightforward.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NDList initRmsPropStates(int featureDimension) {\n", "    NDManager manager = NDManager.newBaseManager();\n", "    NDArray sW = manager.zeros(new Shape(featureDimension, 1));\n", "    NDArray sB = manager.zeros(new Shape(1));\n", "    return new NDList(sW, sB);\n", "}\n", "\n", "public class Optimization {\n", "    public static void rmsProp(NDList params, NDList states, Map<String, Float> hyperparams) {\n", "        float gamma = hyperparams.get(\"gamma\");\n", "        float eps = (float) 1e-6;\n", "        for (int i = 0; i < params.size(); i++) {\n", "            NDArray param = params.get(i);\n", "            NDArray state = states.get(i);\n", "            // Update state: state = gamma * state + (1 - gamma) * param.gradient^2\n", "            state.muli(gamma).addi(param.getGradient().square().mul(1 - gamma));\n", "            // Update parameter: param -= lr * param.gradient / sqrt(state + eps)\n", "            param.subi(param.getGradient().mul(hyperparams.get(\"lr\")).div(state.add(eps).sqrt()));\n", "        }\n", "    }\n", "}" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 6 }, "source": [ "We set the initial learning rate to 0.01 and the weighting term $\\gamma$ to 0.9. That is, $\\mathbf{s}$ aggregates on average over the past $1/(1-\\gamma) = 10$ observations of the square gradient.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);\n", "\n", "public TrainingChapter11.LossTime trainRmsProp(float lr, float gamma, int numEpochs) \n", "        throws IOException, TranslateException {\n", "    int featureDimension = airfoil.getColumnNames().size();\n", "    Map<String, Float> hyperparams = new HashMap<>();\n", "    hyperparams.put(\"lr\", lr);\n", "    hyperparams.put(\"gamma\", gamma);\n", "    return TrainingChapter11.trainCh11(Optimization::rmsProp, \n", "                                       initRmsPropStates(featureDimension), \n", "                                       hyperparams, airfoil, \n", "                                       featureDimension, numEpochs);\n", "}\n", "\n", "trainRmsProp(0.01f, 0.9f, 2);" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 8 }, "source": [ "## Concise Implementation\n", "\n", "Since RMSProp is a rather popular algorithm, it is also available in `Optimizer`. We create an instance of `RmsProp` and set its learning rate and the optional `rho` parameter, which plays the role of $\\gamma$ above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Tracker lrt = Tracker.fixed(0.01f);\n", "Optimizer rmsProp = Optimizer.rmsprop().optLearningRateTracker(lrt).optRho(0.9f).build();\n", "\n", "TrainingChapter11.trainConciseCh11(rmsProp, airfoil, 2);" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 10 }, "source": [ "## Summary\n", "\n", "* RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.\n", "* RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.\n", "* The learning rate needs to be scheduled by the experimenter in practice.\n", "* The coefficient $\\gamma$ determines how much history is used when adjusting the per-coordinate scale.\n", "\n", "## Exercises\n", "\n", "1. What happens experimentally if we set $\\gamma = 1$? Why?\n", "1. Rotate the optimization problem to minimize $f(\\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?\n", "1. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST.
Experiment with different choices for adjusting the learning rate.\n", "1. Would you want to adjust $\\gamma$ as optimization progresses? How sensitive is RMSProp to this?\n" ] } ], "metadata": { "kernelspec": { "display_name": "Java", "language": "java", "name": "java" }, "language_info": { "codemirror_mode": "java", "file_extension": ".jshell", "mimetype": "text/x-java-source", "name": "Java", "pygments_lexer": "java", "version": "14.0.2+12" } }, "nbformat": 4, "nbformat_minor": 4 }