{ "cells": [ { "cell_type": "markdown", "metadata": { "origin_pos": 0 }, "source": [ "# RMSProp\n", ":label:`sec_rmsprop`\n", "\n", "One of the key issues in :numref:`sec_adagrad` is that the learning rate decreases at a predefined schedule of effectively $\\mathcal{O}(t^{-\\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.\n", "\n", ":cite:`Tieleman.Hinton.2012` proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\\mathbf{g}_t$ into a state vector $\\mathbf{s}_t = \\mathbf{s}_{t-1} + \\mathbf{g}_t^2$. As a result $\\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linarly as the algorithm converges.\n", "\n", "One way of fixing this problem would be to use $\\mathbf{s}_t / t$. For reasonable distributions of $\\mathbf{g}_t$ this will converge. Unfortunately it might take a very long time until the limit behavior starts to matter since the procedure remembers the full trajectory of values. An alternative is to use a leaky average in the same way we used in the momentum method, i.e., $\\mathbf{s}_t \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1-\\gamma) \\mathbf{g}_t^2$ for some parameter $\\gamma > 0$. Keeping all other parts unchanged yields RMSProp.\n", "\n", "## The Algorithm\n", "\n", "Let us write out the equations in detail.\n", "\n", "$$\\begin{aligned}\n", " \\mathbf{s}_t & \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1 - \\gamma) \\mathbf{g}_t^2, \\\\\n", " \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t + \\epsilon}} \\odot \\mathbf{g}_t.\n", "\\end{aligned}$$\n", "\n", "The constant $\\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\\mathbf{s}_t$ yields\n", "\n", "$$\n", "\\begin{aligned}\n", "\\mathbf{s}_t & = (1 - \\gamma) \\mathbf{g}_t^2 + \\gamma \\mathbf{s}_{t-1} \\\\\n", "& = (1 - \\gamma) \\left(\\mathbf{g}_t^2 + \\gamma \\mathbf{g}_{t-1}^2 + \\gamma^2 \\mathbf{g}_{t-2} + \\ldots, \\right).\n", "\\end{aligned}\n", "$$\n", "\n", "As before in :numref:`sec_momentum` we use $1 + \\gamma + \\gamma^2 + \\ldots, = \\frac{1}{1-\\gamma}$. Hence the sum of weights is normalized to $1$ with a half-life time of an observation of $\\gamma^{-1}$. 
Let us visualize the weights for the past 40 time steps for various choices of $\\gamma$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load ../utils/djl-imports\n", "%load ../utils/plot-utils\n", "%load ../utils/Functions.java\n", "%load ../utils/GradDescUtils.java\n", "%load ../utils/Accumulator.java\n", "%load ../utils/StopWatch.java\n", "%load ../utils/Training.java\n", "%load ../utils/TrainingChapter11.java" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NDManager manager = NDManager.newBaseManager();\n", "\n", "float[] gammas = new float[]{0.95f, 0.9f, 0.8f, 0.7f};\n", "\n", "NDArray timesND = manager.arange(40f);\n", "float[] times = timesND.toFloatArray();\n", "display(GradDescUtils.plotGammas(times, gammas, 600, 400));" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 2 }, "source": [ "## Implementation from Scratch\n", "\n", "As before, we use the quadratic function $f(\\mathbf{x})=0.1x_1^2+2x_2^2$ to observe the trajectory of RMSProp. Recall that in :numref:`sec_adagrad`, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\\eta$ is controlled separately, this does not happen with RMSProp.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "origin_pos": 3, "tab": [ "mxnet" ] }, "outputs": [], "source": [ "float eta = 0.4f;\n", "float gamma = 0.9f;\n", "\n", "Function<Float[], Float[]> rmsProp2d = (state) -> {\n", "    Float x1 = state[0], x2 = state[1], s1 = state[2], s2 = state[3];\n", "    float g1 = 0.2f * x1;\n", "    float g2 = 4 * x2;\n", "    float eps = (float) 1e-6;\n", "    // Leaky average of the squared gradients\n", "    s1 = gamma * s1 + (1 - gamma) * g1 * g1;\n", "    s2 = gamma * s2 + (1 - gamma) * g2 * g2;\n", "    // Per-coordinate scaled update\n", "    x1 -= eta / (float) Math.sqrt(s1 + eps) * g1;\n", "    x2 -= eta / (float) Math.sqrt(s2 + eps) * g2;\n", "    return new Float[]{x1, x2, s1, s2};\n", "};\n", "\n", "BiFunction<Float, Float, Float> f2d = (x1, x2) -> {\n", "    return 0.1f * x1 * x1 + 2 * x2 * x2;\n", "};\n", "\n", "GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(rmsProp2d, 20));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![RmsProp Gradient Descent 2D.](https://d2l-java-resources.s3.amazonaws.com/img/chapter_optim-rmsprop-gd2d.svg)" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 4 }, "source": [ "Next, we implement RMSProp to be used in a deep network.
This is equally straightforward.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NDList initRmsPropStates(int featureDimension) {\n", "    NDManager manager = NDManager.newBaseManager();\n", "    NDArray sW = manager.zeros(new Shape(featureDimension, 1));\n", "    NDArray sB = manager.zeros(new Shape(1));\n", "    return new NDList(sW, sB);\n", "}\n", "\n", "public class Optimization {\n", "    public static void rmsProp(NDList params, NDList states, Map<String, Float> hyperparams) {\n", "        float gamma = hyperparams.get(\"gamma\");\n", "        float eps = (float) 1e-6;\n", "        for (int i = 0; i < params.size(); i++) {\n", "            NDArray param = params.get(i);\n", "            NDArray state = states.get(i);\n", "            // Update state: state = gamma * state + (1 - gamma) * param.gradient^2\n", "            state.muli(gamma).addi(param.getGradient().square().mul(1 - gamma));\n", "            // Update parameter: param -= lr * param.gradient / sqrt(state + eps)\n", "            param.subi(param.getGradient().mul(hyperparams.get(\"lr\")).div(state.add(eps).sqrt()));\n", "        }\n", "    }\n", "}" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 6 }, "source": [ "We set the initial learning rate to 0.01 and the weighting term $\\gamma$ to 0.9. That is, $\\mathbf{s}$ aggregates on average over the past $1/(1-\\gamma) = 10$ observations of the square gradient.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);\n", "\n", "public TrainingChapter11.LossTime trainRmsProp(float lr, float gamma, int numEpochs) \n", "        throws IOException, TranslateException {\n", "    int featureDimension = airfoil.getColumnNames().size();\n", "    Map<String, Float> hyperparams = new HashMap<>();\n", "    hyperparams.put(\"lr\", lr);\n", "    hyperparams.put(\"gamma\", gamma);\n", "    return TrainingChapter11.trainCh11(Optimization::rmsProp, \n", "                                       initRmsPropStates(featureDimension), \n", "                                       hyperparams, airfoil, \n", "                                       featureDimension, numEpochs);\n", "}\n", "\n", "trainRmsProp(0.01f, 0.9f, 2);" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 8 }, "source": [ "## Concise Implementation\n", "\n", "Since RMSProp is a rather popular algorithm, it is also available in `Optimizer`. We create an instance of `RmsProp` and set its learning rate and the optional `rho` parameter, which plays the role of $\\gamma$ above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Tracker lrt = Tracker.fixed(0.01f);\n", "Optimizer rmsProp = Optimizer.rmsprop().optLearningRateTracker(lrt).optRho(0.9f).build();\n", "\n", "TrainingChapter11.trainConciseCh11(rmsProp, airfoil, 2);" ] }, { "cell_type": "markdown", "metadata": { "origin_pos": 10 }, "source": [ "## Summary\n", "\n", "* RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.\n", "* RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.\n", "* The learning rate needs to be scheduled by the experimenter in practice.\n", "* The coefficient $\\gamma$ determines how much history is used when adjusting the per-coordinate scale.\n", "\n", "## Exercises\n", "\n", "1. What happens experimentally if we set $\\gamma = 1$? Why?\n", "1. Rotate the optimization problem to minimize $f(\\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?\n", "1. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST.
Experiment with different choices for adjusting the learning rate.\n", "1. Would you want to adjust $\\gamma$ as optimization progresses? How sensitive is RMSProp to this?\n" ] } ], "metadata": { "kernelspec": { "display_name": "Java", "language": "java", "name": "java" }, "language_info": { "codemirror_mode": "java", "file_extension": ".jshell", "mimetype": "text/x-java-source", "name": "Java", "pygments_lexer": "java", "version": "14.0.2+12" } }, "nbformat": 4, "nbformat_minor": 4 }