{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DiscreteDP\n", "\n", "***Implementation Details***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Daisuke Oyama** \n", "*Faculty of Economics, University of Tokyo*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook describes the implementation details of the DiscreteDP class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the theoretical background and notation,\n", "see the lecture [Discrete Dynamic Programming](http://quant-econ.net/py/discrete_dp.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DiscreteDP class currently implements the following solution algorithms:\n", "\n", "* value iteration;\n", "* policy iteration (default);\n", "* modified policy iteration.\n", "\n", "Policy iteration computes an exact optimal policy in finitely many iterations,\n", "while value iteration and modified policy iteration return an $\\varepsilon$-optimal policy\n", "for a prespecified value of $\\varepsilon$.\n", "\n", "Value iteration relies on (only) the fact that\n", "the Bellman operator $T$ is a contraction mapping\n", "and thus iterative application of $T$ to any initial function $v^0$\n", "converges to its unique fixed point $v^*$.\n", "\n", "Policy iteration more closely exploits the particular structure of the problem,\n", "where each iteration consists of a policy evaluation step,\n", "which computes the value $v_{\\sigma}$ of a policy $\\sigma$\n", "by solving the linear equation $v = T_{\\sigma} v$,\n", "and a policy improvement step, which computes a $v_{\\sigma}$-greedy policy.\n", "\n", "Modified policy iteration replaces the policy evaluation step\n", "in policy iteration with \"partial policy evaluation\",\n", "which computes an approximation of the value of a policy $\\sigma$\n", "by iterating $T_{\\sigma}$ for a specified number of times." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we describe our implementation of these algorithms more in detail. \n", "(While not explicit, in the actual implementation each algorithm is terminated\n", "when the number of iterations reaches iter_max.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Value iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DiscreteDP.value_iteration(v_init, epsilon, iter_max)\n", "\n", "1. Choose any $v^0 \\in \\mathbb{R}^n$, and\n", " specify $\\varepsilon > 0$; set $i = 0$.\n", "2. Compute $v^{i+1} = T v^i$.\n", "3. If $\\lVert v^{i+1} - v^i\\rVert < [(1 - \\beta) / (2\\beta)] \\varepsilon$,\n", " then go to step 4;\n", " otherwise, set $i = i + 1$ and go to step 2.\n", "4. Compute a $v^{i+1}$-greedy policy $\\sigma$, and return $v^{i+1}$ and $\\sigma$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given $\\varepsilon > 0$,\n", "the value iteration algorithm terminates in a finite number of iterations,\n", "and returns an $\\varepsilon/2$-approximation of the optimal value funciton and\n", "an $\\varepsilon$-optimal policy function\n", "(unless iter_max is reached)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DiscreteDP.policy_iteration(v_init, iter_max)\n", "\n", "1. Choose any $v^0 \\in \\mathbb{R}^n$ and compute a $v^0$-greedy policy $\\sigma^0$;\n", " set $i = 0$.\n", "2. [Policy evaluation]\n", " Compute the value $v_{\\sigma^i}$ by solving the equation $v = T_{\\sigma^i} v$.\n", "3. [Policy improvement]\n", " Compute a $v_{\\sigma^i}$-greedy policy $\\sigma^{i+1}$;\n", " let $\\sigma^{i+1} = \\sigma^i$ if possible.\n", "4. If $\\sigma^{i+1} = \\sigma^i$,\n", " then return $v_{\\sigma^i}$ and $\\sigma^{i+1}$;\n", " otherwise, set $i = i + 1$ and go to step 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The policy iteration algorithm terminates in a finite number of iterations, and\n", "returns an optimal value function and an optimal policy function\n", "(unless iter_max is reached)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modified policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DiscreteDP.modified_policy_iteration(v_init, epsilon, iter_max, k)\n", "\n", "1. Choose any $v^0 \\in \\mathbb{R}^n$, and\n", " specify $\\varepsilon > 0$ and $k \\geq 0$;\n", " set $i = 0$.\n", "2. [Policy improvement]\n", " Compute a $v^i$-greedy policy $\\sigma^{i+1}$;\n", " let $\\sigma^{i+1} = \\sigma^i$ if possible (for $i \\geq 1$).\n", "3. Compute $u = T v^i$ ($= T_{\\sigma^{i+1}} v^i$).\n", " If $\\mathrm{span}(u - v^i) < [(1 - \\beta) / \\beta] \\varepsilon$, then go to step 5;\n", " otherwise go to step 4.\n", "4. [Partial policy evaluation]\n", " Compute $v^{i+1} = (T_{\\sigma^{i+1}})^k u$ ($= (T_{\\sigma^{i+1}})^{k+1} v^i$).\n", " Set $i = i + 1$ and go to step 2.\n", "5. Return\n", " $v = u + [\\beta / (1 - \\beta)] [(\\min(u - v^i) + \\max(u - v^i)) / 2] \\mathbf{1}$\n", " and $\\sigma_{i+1}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given $\\varepsilon > 0$,\n", "provided that $v^0$ is such that $T v^0 \\geq v^0$,\n", "the modified policy iteration algorithm terminates in a finite number of iterations,\n", "and returns an $\\varepsilon/2$-approximation of the optimal value funciton and\n", "an $\\varepsilon$-optimal policy function\n", "(unless iter_max is reached)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Remarks*\n", "\n", "* Here we employ the termination criterion based on the *span semi-norm*,\n", " where $\\mathrm{span}(z) = \\max(z) - \\min(z)$ for $z \\in \\mathbb{R}^n$.\n", " Since $\\mathrm{span}(T v - v) \\leq 2\\lVert T v - v\\rVert$,\n", " this reaches $\\varepsilon$-optimality faster than the norm-based criterion\n", " as employed in the value iteration above.\n", "* Except for the termination criterion,\n", " modified policy is equivalent to value iteration if $k = 0$ and\n", " to policy iteration in the limit as $k \\to \\infty$.\n", "* Thus, if one would like to have value iteration with the span-based rule,\n", " run modified policy iteration with $k = 0$.\n", "* In returning a value function, our implementation is slightly different from\n", " that by Puterman (2005), Section 6.6.3, pp.201-202, which uses\n", " $u + [\\beta / (1 - \\beta)] \\min(u - v^i) \\mathbf{1}$.\n", "* The condition for convergence, $T v^0 \\geq v^0$, is satisfied\n", " for example when $v^0 = v_{\\sigma}$ for some policy $\\sigma$,\n", " or when $v^0(s) = \\min_{(s', a)} r(s', a)$ for all $s$.\n", " If v_init is not specified, it is set to the latter, $\\min_{(s', a)} r(s', a))$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Illustration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We illustrate the algorithms above\n", "by the simple example from Puterman (2005), Section 3.1, pp.33-35." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from quantecon.markov import DiscreteDP" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "n = 2 # Number of states\n", "m = 2 # Number of actions\n", "\n", "# Reward array\n", "R = [[5, 10],\n", " [-1, -float('inf')]]\n", "\n", "# Transition probability array\n", "Q = [[(0.5, 0.5), (0, 1)],\n", " [(0, 1), (0.5, 0.5)]] # Probabilities in Q[1, 1] are arbitrary\n", "\n", "# Discount rate\n", "beta = 0.95\n", "\n", "ddp = DiscreteDP(R, Q, beta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analytical solution:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def sigma_star(beta):\n", " sigma = np.empty(2, dtype=int)\n", " sigma[1] = 0\n", " if beta > 10/11:\n", " sigma[0] = 0\n", " else:\n", " sigma[0] = 1\n", " return sigma\n", "\n", "def v_star(beta):\n", " v = np.empty(2)\n", " v[1] = -1 / (1 - beta)\n", " if beta > 10/11:\n", " v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))\n", " else:\n", " v[0] = (10 - 11*beta) / (1 - beta)\n", " return v" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sigma_star(beta=beta)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. ])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v_star(beta=beta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Value iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solve the problem by value iteration;\n", "see Example 6.3.1, p.164 in Puterman (2005)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "epsilon = 1e-2\n", "v_init = [0, 0]\n", "res_vi = ddp.solve(method='value_iteration', v_init=v_init, epsilon=epsilon)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of iterations required to satisfy the termination criterion:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "162" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_vi.num_iter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned value function:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -8.5665053 , -19.99507673])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_vi.v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is indeed an $\\varepsilon/2$-approximation of $v^*$:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(res_vi.v - v_star(beta=beta)).max() < epsilon/2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned policy function:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_vi.sigma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Value iteration converges very slowly.\n", "Let us replicate Table 6.3.1 on p.165:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "num_reps = 164\n", "values = np.empty((num_reps, n))\n", "diffs = np.empty(num_reps)\n", "spans = np.empty(num_reps)\n", "v = np.array([0, 0])\n", "\n", "values[0] = v\n", "diffs[0] = np.nan\n", "spans[0] = np.nan\n", "\n", "for i in range(1, num_reps):\n", " v_new = ddp.bellman_operator(v)\n", " values[i] = v_new\n", " diffs[i] = np.abs(v_new - v).max()\n", " spans[i] = (v_new - v).max() - (v_new - v).min()\n", " v = v_new" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
$v^i(0)$$v^i(1)$$\\lVert v^i - v^{i-1}\\rVert$
$i$
00.0000000.000000NaN
110.000000-1.00000010.000000
29.275000-1.9500000.950000
38.479375-2.8525000.902500
47.672766-3.7098750.857375
56.882373-4.5243810.814506
66.120046-5.2981620.773781
75.390395-6.0332540.735092
84.694642-6.7315910.698337
94.032449-7.3950120.663420
103.402783-8.0252610.630249
20-1.401710-12.8302820.377354
30-4.278653-15.7072250.225936
40-6.001185-17.4297570.135276
50-7.032529-18.4611000.080995
60-7.650033-19.0786040.048495
70-8.019755-19.4483260.029035
80-8.241121-19.6696930.017385
90-8.373661-19.8022330.010409
100-8.453018-19.8815890.006232
110-8.500532-19.9291030.003731
120-8.528980-19.9575510.002234
130-8.546013-19.9745840.001338
140-8.556211-19.9847830.000801
150-8.562317-19.9908890.000480
160-8.565973-19.9945450.000287
161-8.566246-19.9948180.000273
162-8.566505-19.9950770.000259
163-8.566751-19.9953230.000246
\n", "
" ], "text/plain": [ " $v^i(0)$ $v^i(1)$ $\\lVert v^i - v^{i-1}\\rVert$\n", "$i$ \n", "0 0.000000 0.000000 NaN\n", "1 10.000000 -1.000000 10.000000\n", "2 9.275000 -1.950000 0.950000\n", "3 8.479375 -2.852500 0.902500\n", "4 7.672766 -3.709875 0.857375\n", "5 6.882373 -4.524381 0.814506\n", "6 6.120046 -5.298162 0.773781\n", "7 5.390395 -6.033254 0.735092\n", "8 4.694642 -6.731591 0.698337\n", "9 4.032449 -7.395012 0.663420\n", "10 3.402783 -8.025261 0.630249\n", "20 -1.401710 -12.830282 0.377354\n", "30 -4.278653 -15.707225 0.225936\n", "40 -6.001185 -17.429757 0.135276\n", "50 -7.032529 -18.461100 0.080995\n", "60 -7.650033 -19.078604 0.048495\n", "70 -8.019755 -19.448326 0.029035\n", "80 -8.241121 -19.669693 0.017385\n", "90 -8.373661 -19.802233 0.010409\n", "100 -8.453018 -19.881589 0.006232\n", "110 -8.500532 -19.929103 0.003731\n", "120 -8.528980 -19.957551 0.002234\n", "130 -8.546013 -19.974584 0.001338\n", "140 -8.556211 -19.984783 0.000801\n", "150 -8.562317 -19.990889 0.000480\n", "160 -8.565973 -19.994545 0.000287\n", "161 -8.566246 -19.994818 0.000273\n", "162 -8.566505 -19.995077 0.000259\n", "163 -8.566751 -19.995323 0.000246" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame()\n", "df[0], df[1], df[2], df[3] = values[:, 0], values[:, 1], diffs, spans\n", "df.columns = '$v^i(0)$', '$v^i(1)$', \\\n", " '$\\\\lVert v^i - v^{i-1}\\\\rVert$', '$\\\\mathrm{span}(v^i - v^{i-1})$'\n", "\n", "iter_nums = pd.Series(list(range(num_reps)), name='$i$')\n", "df.index = iter_nums\n", "\n", "display_nums = \\\n", " list(range(10)) + [10*i for i in range(1, 16)] + [160+i for i in range(4)]\n", "df.iloc[display_nums, [0, 1, 2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, the span decreases faster than the norm;\n", "the following replicates Table 6.6.1, page 205:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
$\\lVert v^i - v^{i-1}\\rVert$$\\mathrm{span}(v^i - v^{i-1})$
$i$
110.0000001.100000e+01
20.9500002.250000e-01
30.9025001.068750e-01
40.8573755.076562e-02
50.8145062.411367e-02
60.7737811.145399e-02
70.7350925.440647e-03
80.6983372.584307e-03
90.6634201.227546e-03
100.6302495.830844e-04
110.5987372.769651e-04
120.5688001.315584e-04
200.3773543.409318e-07
300.2259361.993445e-10
400.1352761.154632e-13
500.0809950.000000e+00
600.0484951.776357e-15
\n", "
" ], "text/plain": [ " $\\lVert v^i - v^{i-1}\\rVert$ $\\mathrm{span}(v^i - v^{i-1})$\n", "$i$ \n", "1 10.000000 1.100000e+01\n", "2 0.950000 2.250000e-01\n", "3 0.902500 1.068750e-01\n", "4 0.857375 5.076562e-02\n", "5 0.814506 2.411367e-02\n", "6 0.773781 1.145399e-02\n", "7 0.735092 5.440647e-03\n", "8 0.698337 2.584307e-03\n", "9 0.663420 1.227546e-03\n", "10 0.630249 5.830844e-04\n", "11 0.598737 2.769651e-04\n", "12 0.568800 1.315584e-04\n", "20 0.377354 3.409318e-07\n", "30 0.225936 1.993445e-10\n", "40 0.135276 1.154632e-13\n", "50 0.080995 0.000000e+00\n", "60 0.048495 1.776357e-15" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[list(range(1, 13)) + [10*i for i in range(2, 7)], [2, 3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The span-based termination criterion is satisfied when $i = 11$:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0005263157894736847" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epsilon * (1-beta) / beta" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spans[11] < epsilon * (1-beta) / beta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, modified policy iteration with $k = 0$ terminates with $11$ iterations:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "epsilon = 1e-2\n", "v_init = [0, 0]\n", "k = 0\n", "res_mpi_1 = ddp.solve(method='modified_policy_iteration',\n", " v_init=v_init, epsilon=epsilon, k=k)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_mpi_1.num_iter" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -8.56904799, -19.99736883])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_mpi_1.v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If $\\{\\sigma^i\\}$ is the sequence of policies obtained by policy iteration\n", "with an initial policy $\\sigma^0$,\n", "one can show that $T^i v_{\\sigma^0} \\leq v_{\\sigma^i}$ ($\\leq v^*$),\n", "so that the number of iterations required for policy iteration is smaller than\n", "that for value iteration at least weakly,\n", "and indeed in many cases, the former is significantly smaller than the latter." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "v_init = [0, 0]\n", "res_pi = ddp.solve(method='policy_iteration', v_init=v_init)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_pi.num_iter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Policy iteration returns the exact optimal value function (up to rounding errors):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. ])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_pi.v" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.5527136788005009e-15" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(res_pi.v - v_star(beta=beta)).max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To look into the iterations:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iterate 0\n", " value: [0 0]\n", " policy: [1 0]\n", "Iterate 1\n", " value: [ -9. -20.]\n", " policy: [0 0]\n", "Iterate 2\n", " value: [ -8.57142857 -20. ]\n", " policy: [0 0]\n", "Terminated\n" ] } ], "source": [ "v = np.array([0, 0])\n", "sigma = np.array([-1, -1]) # Dummy\n", "sigma_new = ddp.compute_greedy(v)\n", "i = 0\n", "\n", "while True:\n", " print('Iterate {0}'.format(i))\n", " print(' value: {0}'.format(v))\n", " print(' policy: {0}'.format(sigma_new))\n", " if np.array_equal(sigma_new, sigma):\n", " break\n", " sigma[:] = sigma_new\n", " v = ddp.evaluate_policy(sigma)\n", " sigma_new = ddp.compute_greedy(v)\n", " i += 1\n", "\n", "print('Terminated')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See Example 6.4.1, pp.176-177." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modified policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The evaluation step in policy iteration\n", "which solves the linear equation $v = T_{\\sigma} v$\n", "to obtain the policy value $v_{\\sigma}$\n", "can be expensive for problems with a large number of states.\n", "Modified policy iteration is to reduce the cost of this step\n", "by using an approximation of $v_{\\sigma}$ obtained by iteration of $T_{\\sigma}$.\n", "The tradeoff is that this approach only computes an $\\varepsilon$-optimal policy,\n", "and for small $\\varepsilon$, takes a larger number of iterations than policy iteration\n", "(but much smaller than value iteration)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "epsilon = 1e-2\n", "v_init = [0, 0]\n", "k = 6\n", "res_mpi = ddp.solve(method='modified_policy_iteration',\n", " v_init=v_init, epsilon=epsilon, k=k)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_mpi.num_iter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned value function:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -8.57137101, -19.99993638])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_mpi.v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is indeed an $\\varepsilon/2$-approximation of $v^*$:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(res_mpi.v - v_star(beta=beta)).max() < epsilon/2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To look into the iterations:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iterate 0\n", " v: [0 0]\n", "Iterate 1\n", " sigma: [1 0]\n", " T_sigma(v): [ 10. -1.]\n", " span: 11.0\n", " T_sigma^k+1(v): [ 4.96674592 -6.03325408]\n", "Iterate 2\n", " sigma: [0 0]\n", " T_sigma(v): [ 4.49340863 -6.73159137]\n", " span: 0.22499999999999964\n", " T_sigma^k+1(v): [ 1.17973283 -10.24650042]\n", "Iterate 3\n", " sigma: [0 0]\n", " T_sigma(v): [ 0.69328539 -10.7341754 ]\n", " span: 0.0012275460282897832\n", " T_sigma^k+1(v): [ -1.7602088 -13.18876747]\n", "Iterate 4\n", " sigma: [0 0]\n", " T_sigma(v): [ -2.10076373 -13.5293291 ]\n", " span: 6.6971966727891186e-06\n", "Terminated\n", " sigma: [0 0]\n", " v: [ -8.57137101 -19.99993638]\n" ] } ], "source": [ "epsilon = 1e-2\n", "v = np.array([0, 0])\n", "k = 6\n", "i = 0\n", "print('Iterate {0}'.format(i))\n", "print(' v: {0}'.format(v))\n", "\n", "sigma = np.empty(n, dtype=int) # Store the policy function\n", "\n", "while True:\n", " i += 1\n", " u = ddp.bellman_operator(v, sigma=sigma)\n", " diff = u - v\n", " span = diff.max() - diff.min()\n", " print('Iterate {0}'.format(i))\n", " print(' sigma: {0}'.format(sigma))\n", " print(' T_sigma(v): {0}'.format(u))\n", " print(' span: {0}'.format(span))\n", " if span < epsilon * (1-ddp.beta) / ddp.beta:\n", " v = u + ((diff.max() + diff.min()) / 2) * \\\n", " (ddp.beta / (1 - ddp.beta))\n", " break\n", " ddp.operator_iteration(ddp.T_sigma(sigma), v=u, max_iter=k)\n", " v = u\n", " print(' T_sigma^k+1(v): {0}'.format(v))\n", "\n", "print('Terminated')\n", "print(' sigma: {0}'.format(sigma))\n", "print(' v: {0}'.format(v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare this with the implementation with the norm-based termination rule\n", "as described in Example 6.5.1, pp.187-188." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* M.L. Puterman,\n", " [*Markov Decision Processes: Discrete Stochastic Dynamic Programming*](http://onlinelibrary.wiley.com/book/10.1002/9780470316887),\n", " Wiley-Interscience, 2005." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }