{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(color_codes=True)\n", "\n", "from IPython.display import Image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chapter 9 On-policy Prediction with Approximation\n", "=========\n", "\n", "approximate value function: parameterized function $\\hat{v}(s, w) \\approx v_\\pi(s)$\n", "\n", "+ applicable to partially observable problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9.1 Value-function Approximation\n", "\n", "$s \\to u$: $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.\n", "\n", "We use machine learning methods and pass to them the $s \\to g$ of each update as a training example. Then we interperet the approximate function they produce as an estimated value function.\n", "\n", "not all function approximation methods are equally well suited for use in reinforcement learning:\n", "+ learn efficiently from incrementally acquired data: many traditional methods assume a static training set over which multiple passes are made.\n", "+ are able to handle nonstationary target functions." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 9.2 The Prediction Objective (VE)\n", "\n", "which states we care most about: a state distribution $\\mu(s) \\geq 0$, $\\sum_s \\mu(s) = 1$.\n", "+ Often $\\mu(s)$ is chosen to be the fraction of time spent in $s$.\n", "\n", "objective function, the Mean Squared Value Error, denoted $\\overline{VE}$:\n", "\n", "\\begin{equation}\n", " \\overline{VE}(w) \\doteq \\sum_{s \\in \\delta} \\mu(s) \\left [ v_\\pi (s) - \\hat{v}(s, w) \\right ]^2\n", "\\end{equation}\n", "\n", "where $v_\\pi(s)$ is the true value and $\\hat{v}(s, w)$ is the approximate value.\n", "\n", "Note that best $\\overline{VE}$ is no guarantee of our ultimate purpose: to find a better policy.\n", "+ global optimum.\n", "+ local optimum.\n", "+ don't convergence, or diverge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9.3 Stochastic-gradient and Semi-gradient Methods\n", "\n", "SGD: well suited to online reinforcement learning.\n", "\n", "\\begin{align}\n", " w_{t+1} &\\doteq w_t - \\frac1{2} \\alpha \\nabla \\left [ v_\\pi(S_t) - \\hat{v}(S_t, w_t) \\right ]^2 \\\\\n", " &= w_t + \\alpha \\left [ \\color{blue}{v_\\pi(S_t)} - \\hat{v}(S_t, w_t) \\right ] \\nabla \\hat{v}(S_t, w_t) \\\\\n", " &\\approx w_t + \\alpha \\left [ \\color{blue}{U_t} - \\hat{v}(S_t, w_t) \\right ] \\nabla \\hat{v}(S_t, w_t) \\\\\n", "\\end{align}\n", "\n", "$S_t \\to U_t$, is not the true value $v_\\pi(S_t)$, but some, possibly random, approximation to it. (前面各种方法累计的value):\n", "+ If $U_t$ is an unbiased estimate, $w_t$ is guaranteed to converge to a local optimum.\n", "+ Otherwise, like boostrappig target or DP target => semi-gradient methods. (might do not converge as robustly as gradient methods)\n", " - significantly faster learning.\n", " - enable learning to be continual and online.\n", " \n", "state aggregation: states are grouped together, with one estimated value for each group." 
] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 9.4 Linear Methods\n", "\n", "For every state $s$, there is a real-valued feature vector $x(s) \\doteq (x_1(s), x_2(s), \\dots, x_d(s))^T$:\n", "\n", "\\begin{equation}\n", " \\hat{v}(s, w) \\doteq w^T x(s) \\doteq \\sum_{i=1}^d w_i x_i(s)\n", "\\end{equation}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9.5 Feature Construction for Linear Methods\n", "\n", "Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learing systems.\n", "\n", "+ Polynomials\n", "+ Fourier Basis: low dimension, easy to select, global properities\n", "+ Coarse Coding\n", "+ Tile Coding: convolution kernel?\n", "+ Radial Basis Functions" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 9.6 Selecting Step-Size Parameters Manually\n", "\n", "A good rule of thumb for setting the step-size parameter of linear SGD methods is then $\\alpha \\doteq (\\gamma \\mathbf{E}[x^T x])^{-1}$\n", "\n", "\n", "\n", "### 9.7 Nonlinear Function Approximation: Artificial Neural Networks\n", "\n", "+ANN, CNN\n", "\n", "\n", "### 9.8 Least-Squares TD\n", "\n", "$w_{TD} = A^{-1} b$: data efficient, while expensive computation\n", "\n", "\n", "### 9.9 Memory-based Function Approximation\n", "\n", "nearest neighbor method\n", "\n", "\n", "### 9.10 Kernel-based Function Approximation\n", "\n", "RBF function\n", "\n", "\n", "### 9.11 Looking Deeper at On-policy Learning: Interest and Emphasis\n", "\n", "more interested in some states than others:\n", "+ interest $I_t$: the degree to which we are interested in accurately valuing the state at time $t$.\n", "+ emphaisis $M_t$: \n", "\n", "\\begin{align}\n", " w_{t+n} & \\doteq w_{t+n-1} + \\alpha M_t \\left [ G_{t:t+n} - \\hat{v}(S_t, w_{t+n-1} \\right ] \\nabla \\hat{v}(S_t, w_{t+n-1}) \\\\ \n", " M_t & = I_t + \\gamma^n M_{t-n}, \\qquad 0 \\leq t < T\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }