{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DiscreteDP\n", "\n", "***Getting Started with a Simple Example***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Daisuke Oyama**\n", "\n", "*Faculty of Economics, University of Tokyo*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates via a simple example how to use the `DiscreteDP` module." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import division, print_function\n", "import numpy as np\n", "from scipy import sparse\n", "from quantecon.markov import DiscreteDP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A two-state example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us consider the following two-state dynamic program,\n", "taken from Puterman (2005), Section 3.1, pp.33-35;\n", "see also Example 6.2.1, pp.155-156.\n", "\n", "* There are two possible states $0$ and $1$.\n", "\n", "* At state $0$, you may choose either \"stay\", say action $0$, or \"move\", action $1$.\n", "\n", "* At state $1$, there is no way to move, so that you can only stay, i.e.,\n", " $0$ is the only available action.\n", " (You may alternatively distinguish between the action \"staty\" at state $0$\n", " and that at state $1$, and call the latter action $2$;\n", " but here we choose to refer to the both actions as action $0$.)\n", "\n", "* At state $0$,\n", " if you choose action $0$ (stay),\n", " then you receive a reward $5$, and\n", " in the next period the state will remain at $0$ with probability $1/2$,\n", " but it moves to $1$ with probability $1/2$.\n", "\n", "* If you choose action $1$ (move),\n", " then you receive a reward $10$, and\n", " the state in the next period will be $1$ with probability $1$.\n", "\n", "* At state $1$, where the only action you can take is $0$ (stay),\n", " you receive a reward $-1$, and\n", " the state will remain at $1$ with probability $1$.\n", "\n", "* You want to maximize the sum of discounted expected reward flows\n", " with discount factor $\\beta \\in [0, 1)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The optimization problem consists of:\n", "\n", "* the state space: $S = \\{0, 1\\}$;\n", "\n", "* the action space: $A = \\{0, 1\\}$;\n", "\n", "* the set of feasible state-action pairs\n", " $\\mathit{SA} = \\{(0, 0), (0, 1), (1, 0)\\} \\subset S \\times A$;\n", "\n", "* the reward function $r\\colon \\mathit{SA} \\to \\mathbb{R}$, where\n", " $$\n", " r(0, 0) = 5,\\ r(0, 1) = 10,\\ r(1, 0) = -1;\n", " $$\n", "\n", "* the transition probability function $q \\colon \\mathit{SA} \\to \\Delta(S)$, where\n", " $$\n", " \\begin{aligned}\n", " &(q(0 | 0, 0), q(1 | 0, 0)) = (1/2, 1/2), \\\\\n", " &(q(0 | 0, 1), q(1 | 0, 1)) = (0, 1), \\\\\n", " &(q(0 | 1, 0), q(1 | 1, 0)) = (0, 1);\n", " \\end{aligned}\n", " $$\n", " \n", "* the discount factor $\\beta \\in [0, 1)$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Belmann equation for this problem is:\n", "$$\n", "\\begin{aligned}\n", "v(0) &= \\max \\left\\{5 + \\beta \\left(\\frac{1}{2} v(0) + \\frac{1}{2} v(1)\\right),\n", " 10 + \\beta v(1)\\right\\}, \\\\\n", "v(1) &= (-1) + \\beta v(1).\n", "\\end{aligned}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This problem is simple enough to solve by hand:\n", "the optimal value function $v^*$ is given by\n", "$$\n", "\\begin{aligned}\n", "&v(0) =\n", "\\begin{cases}\n", "\\dfrac{5 - 5.5 \\beta}{(1 - 0.5 \\beta) (1 - \\beta)} & \\text{if $\\beta > \\frac{10}{11}$} \\\\\n", "\\dfrac{10 - 11 \\beta}{1 - \\beta} & \\text{otherwise},\n", "\\end{cases}\\\\\n", "&v(1) = -\\frac{1}{1 - \\beta},\n", "\\end{aligned}\n", "$$\n", "and the optimal policy function $\\sigma^*$ is given by\n", "$$\n", "\\begin{aligned}\n", "&\\sigma^*(0) =\n", "\\begin{cases}\n", "0 & \\text{if $\\beta > \\frac{10}{11}$} \\\\\n", "1 & \\text{otherwise},\n", "\\end{cases}\\\\\n", "&\\sigma^*(1) = 0.\n", "\\end{aligned}\n", "$$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def v_star(beta):\n", " v = np.empty(2)\n", " v[1] = -1 / (1 - beta)\n", " if beta > 10/11:\n", " v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))\n", " else:\n", " v[0] = (10 - 11*beta) / (1 - beta)\n", " return v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to solve this problem numerically by using the `DiscreteDP` class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will set $\\beta = 0.95$ ($> 10/11$), for which the anlaytical solution is:\n", "$\\sigma^* = (0, 0)$ and" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. ])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v_star(beta=0.95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formulating the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to represent the data for instantiating a `DiscreteDP` object.\n", "Let $n$, $m$, and $L$ denote the numbers of states, actions,\n", "and feasbile state-action pairs, respectively;\n", "in the above example, $n = 2$, $m = 2$, and $L = 3$.\n", "\n", "1. `DiscreteDP(R, Q, beta)`\n", " \n", " with parameters:\n", " \n", " * $n \\times m$ reward array `R`,\n", " * $n \\times m \\times n$ transition probability array `Q`, and\n", " * discount factor `beta`,\n", " \n", " where `R[s, a]` is the reward for action `a` when the state is `s` and\n", " `Q[s, a, s']` is the probability that the state in the next period is `s'`\n", " when the current state is `s` and the action chosen is `a`.\n", "\n", "2. 
`DiscreteDP(R, Q, beta, s_indices, a_indices)`\n", "\n", "   with parameters:\n", " \n", "   * length $L$ reward vector `R`,\n", "   * $L \\times n$ transition probability array `Q`,\n", "   * discount factor `beta`,\n", "   * length $L$ array `s_indices`, and\n", "   * length $L$ array `a_indices`,\n", " \n", "   where the pairs `(s_indices[0], a_indices[0])`, ..., `(s_indices[L-1], a_indices[L-1])`\n", "   enumerate feasible state-action pairs, and\n", "   `R[i]` is the reward for action `a_indices[i]` when the state is `s_indices[i]` and\n", "   `Q[i, s']` is the probability that the state in the next period is `s'`\n", "   when the current state is `s_indices[i]` and the action chosen is `a_indices[i]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a `DiscreteDP` instance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us illustrate the two formulations with the simple example at the outset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Product formulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This formulation is straightforward\n", "when the number of feasible actions is constant across states,\n", "so that the set of feasible state-action pairs is naturally represented\n", "by the product $S \\times A$,\n", "while any problem can in fact be represented in this way\n", "by defining the reward `R[s, a]` to be $-\\infty$ when action `a` is infeasible at state `s`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To apply this approach to the current example,\n", "we consider the effectively equivalent problem\n", "in which at both states $0$ and $1$,\n", "both actions $0$ (stay) and $1$ (move) are available,\n", "but action $1$ yields a reward $-\\infty$ at state $1$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reward array `R` is an $n \\times m$ 2-dimensional array:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "R = [[5, 10],\n", "     [-1, -float('inf')]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transition probability array `Q` is an $n \\times m \\times n$ 3-dimensional array:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Q = [[(0.5, 0.5), (0, 1)],\n", "     [(0, 1), (0.5, 0.5)]]  # Probabilities in Q[1, 1] are arbitrary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the transition probabilities for the state-action pair $(s, a) = (1, 1)$ are arbitrary,\n", "since $a = 1$ is infeasible at $s = 1$ in the original problem."
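] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Equivalently (a minimal sketch; `DiscreteDP` also accepts the nested lists above as they are), the same arrays can be built with NumPy, using `-np.inf` to mark the infeasible pair:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Same R and Q as above, built as NumPy arrays;\n", "# -np.inf marks the infeasible pair (s, a) = (1, 1), and Q[1, 1] stays arbitrary\n", "R = np.array([[5, 10],\n", "              [-1, -np.inf]])\n", "Q = np.array([[(0.5, 0.5), (0, 1)],\n", "              [(0, 1), (0.5, 0.5)]])"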
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us set the discount factor $\\beta$ to be $0.95$:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "beta = 0.95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are ready to create a `DiscreteDP` instance:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ddp = DiscreteDP(R, Q, beta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### State-action pairs formulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the number of feasible actions varies across states,\n", "it can be inefficient in terms of memory usage\n", "to extend the domain by treating infeasible actions\n", "to be \"feasible but yielding reward $-\\infty$\".\n", "This formulation takes the set of feasible state-action pairs as is,\n", "defining `R` to be a 1-dimensional array of length `L`\n", "and `Q` to be a 2-dimensional array of shape `(L, n)`,\n", "where `L` is the number of feasible state-action pairs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we have to list all the feasible state-action pairs.\n", "For our example, they are: $(s, a) = (0, 0), (0, 1), (1, 0)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have arrays `s_indices` and ` a_indices` of length $3$\n", "contain the indices of states and actions, respectively." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s_indices = [0, 0, 1] # State indices\n", "a_indices = [0, 1, 0] # Action indices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reward vector `R` is a length $L$ 1-dimensional array:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Rewards for (s, a) = (0, 0), (0, 1), (1, 0), respectively\n", "R = [5, 10, -1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transition probability array `Q` is an $L \\times n$ 2-dimensional array:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Probability vectors for (s, a) = (0, 0), (0, 1), (1, 0), respectively\n", "Q = [(0.5, 0.5), (0, 1), (0, 1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the discount factor, set $\\beta = 0.95$ as before:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "beta = 0.95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a `DiscreteDP` instance:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ddp_sa = DiscreteDP(R, Q, beta, s_indices, a_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importantly, this formulation allows us to represent the transition probability array `Q`\n", "as a [`scipy.sparse`](http://docs.scipy.org/doc/scipy/reference/sparse.html) matrix\n", "(of any format),\n", "which is useful for large and sparse problems." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, let us convert the above ndarray `Q` to the Coordinate (coo) format:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.sparse\n", "Q = scipy.sparse.coo_matrix(Q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass it to `DiscreteDP` with the other parameters:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ddp_sparse = DiscreteDP(R, Q, beta, s_indices, a_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, the matrix `Q` is converted to the Compressed Sparse Row (csr) format:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<3x2 sparse matrix of type ''\n", "\twith 4 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddp_sparse.Q" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.5, 0.5],\n", " [ 0. , 1. ],\n", " [ 0. , 1. ]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddp_sparse.Q.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solving the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us solve our model.\n", "Currently, `DiscreteDP` supports the following solution algorithms:\n", "\n", "* policy iteration;\n", "* value iteration;\n", "* modified policy iteration.\n", "\n", "(The methods are the same across the formulations.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We solve the model first by policy iteration,\n", "which gives the exact solution:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))\n", "res = ddp.solve(method='policy_iteration', v_init=v_init)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`res` contains the information about the solution result:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " method: 'policy iteration'\n", " max_iter: 250\n", " mc: Markov chain with transition matrix \n", "P = \n", "[[ 0.5 0.5]\n", " [ 0. 1. ]]\n", " sigma: array([0, 0])\n", " v: array([ -8.57142857, -20. ])\n", " num_iter: 2" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The optimal policy function:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 0])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.sigma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The optimal value function:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. 
])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This coincides with the analytical solution:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. ])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v_star(beta)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(res.v, v_star(beta))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of iterations:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.num_iter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that the value of the policy `[0, 0]` is actually equal to the optimal value `v`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ -8.57142857, -20. ])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddp.evaluate_policy(res.sigma)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ True, True], dtype=bool)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddp.evaluate_policy(res.sigma) == res.v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`res.mc` is the controlled Markov chain given by the optimal policy `[0, 0]`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Markov chain with transition matrix \n", "P = \n", "[[ 0.5 0.5]\n", " [ 0. 1. ]]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.mc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Value iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, solve the model by value iteration,\n", "which returns an $\\varepsilon$-optimal solution for a specified value of $\\varepsilon$:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "epsilon = 1e-2 # Convergece tolerance, optional(default=1e-3)\n", "v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))\n", "res_vi = ddp.solve(method='value_iteration', v_init=v_init,\n", " epsilon=epsilon)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " method: 'value iteration'\n", " max_iter: 250\n", " mc: Markov chain with transition matrix \n", "P = \n", "[[ 0.5 0.5]\n", " [ 0. 1. 
]]\n", " sigma: array([0, 0])\n", " v: array([ -8.5665053 , -19.99507673])\n", " epsilon: 0.01\n", " num_iter: 162" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_vi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The computed policy function `res1.sigma` is an $\\varepsilon$-optimal policy,\n", "and the value function `res1.v` is an $\\varepsilon/2$-approximation\n", "of the true optimal value function." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.0049232745189442539" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(v_star(beta) - res_vi.v).max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Modified policy iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, solve the model by modified policy iteration:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "epsilon = 1e-2 # Convergece tolerance, optional(defaul=1e-3)\n", "v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))\n", "res_mpi = ddp.solve(method='modified_policy_iteration', v_init=v_init,\n", " epsilon=epsilon)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " mc: Markov chain with transition matrix \n", "P = \n", "[[ 0.5 0.5]\n", " [ 0. 1. ]]\n", " sigma: array([0, 0])\n", " v: array([ -8.57142826, -19.99999965])\n", " epsilon: 0.01\n", " max_iter: 250\n", " method: 'modified policy iteration'\n", " k: 20\n", " num_iter: 3" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_mpi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modified policy function also returns an $\\varepsilon$-optimal policy function\n", "and an $\\varepsilon/2$-approximate value function:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3.4711384344632279e-07" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(v_star(beta) - res_mpi.v).max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* M.L. Puterman,\n", " [*Markov Decision Processes: Discrete Stochastic Dynamic Programming*](http://onlinelibrary.wiley.com/book/10.1002/9780470316887),\n", " Wiley-Interscience, 2005." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }