{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem 1. [20 points] Transients of the Value Iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For $0<\\beta<1$, consider the **finite state infinite horizon discounted MDP** in discrete time, given by\n", "\n", "$$\\begin{align*}\n", "&\\underset{\\gamma\\in\\Gamma}{\\text{inf}}\\quad\\mathbb{E}\\left[\\sum_{t=0}^{\\infty}\\beta^{t}\\:c\\left(x_{t},u_{t}\\right)\\right]\\\\\n", "& x \\sim \\text{Markov}\\left(P(u)\\right), \\quad [P_{ij}(u)]:= \\mathbb{P}\\left(x_{t+1} = j \\mid x_{t} = i, u_{t} = u\\right), \\quad i,j=1,...,n, \\quad u\\in\\mathcal{U}.\n", "\\end{align*}$$\n", "\n", "Let $\\mathcal{F}$ be the class of all functions $z:\\{1,...,n\\} \\mapsto \\mathbb{R}$, and define the **nonlinear** Bellman operator $T : \\mathcal{F} \\mapsto \\mathcal{F}$ as\n", "\n", "$$(Tz)(i):=\\underset{u\\in\\mathcal{U}}{\\text{inf}}\\quad\\bigg\\{c(i,u) + \\beta \\sum_{j=1}^{n}P_{ij}(u)z(j)\\bigg\\}, \\quad i=1,...,n.$$\n", "\n", "Let $h_{0}(i) := 0$ for $i=1,...,n$, and consider the $k$-th transient of the value iteration: $h_{k}=T^{k}h_{0}$. Here, $T^{k}$ denotes the $k$-fold composition of the Bellman operator, i.e., $T^{k}\\equiv\\underbrace{T \\circ T \\circ ... \\circ T}_{k\\;\\text{times}}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (a) [15 points] Transient value function iterate and the finite horizon OCP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the value function associated with the **$k$-stage finite horizon OCP** be\n", "\n", "$$\\begin{align*}W_{k}(i) \\: := \\: \n", "&\\underset{\\gamma\\in\\Gamma}{\\text{inf}}\\quad\\mathbb{E}\\left[\\sum_{t=0}^{k-1}\\beta^{t}\\:c\\left(x_{t},u_{t}\\right) \\:\\bigg\\vert\\: x_{0}=i\\right]\\\\\n", "& x \\sim \\text{Markov}\\left(P(u)\\right), \\quad [P_{ij}(u)]:= \\mathbb{P}\\left(x_{t+1} = j \\mid x_{t} = i, u_{t} = u\\right), \\quad i,j=1,...,n, \\quad u\\in\\mathcal{U}.\n", "\\end{align*}$$\n", "\n", "Carefully prove that $h_{k}(i)=W_{k}(i)$, $i=1,...,n$. [Hint: Induction]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution for part (a):\n", "For $k=1$, we have \n", "$$h_{1}(i):=Th_{0}(i) = \\underset{u\\in\\mathcal{U}}{\\inf} \\{c(i,u) + 0\\} = \\underset{u\\in\\mathcal{U}}{\\inf} c(i,u).$$\n", "On the other hand, \n", "$$W_{1}(i) := \\underset{\\gamma\\in\\Gamma}{\\inf}\\:\\mathbb{E}\\left[\\beta^{0}c(x_{0},u_{0}) \\mid x_{0}=i\\right] = \\underset{\\gamma\\in\\{\\gamma_{0}\\}}{\\inf}\\:c(i,u_{0}) = \\underset{u_{0}\\in\\mathcal{U}}{\\inf}\\:c(i,u_{0}).$$\n", "Therefore, $h_{1}(i)=W_{1}(i)$ for all $i=1,...,n$.\n", "\n", "For some $k > 1$, let us now make the inductive hypothesis that $h_{k}(i) = W_{k}(i)$ for all $i=1,...,n$. We will now prove that $h_{k+1}(i)=W_{k+1}(i)$. For this, note that\n", "\n", "$$\\begin{align*}\n", "W_{k+1}(i) \\: &:= \\: \n", "\\underset{\\gamma\\in\\Gamma}{\\text{inf}}\\quad\\mathbb{E}\\left[\\sum_{t=0}^{k}\\beta^{t}\\:c\\left(x_{t},u_{t}\\right) \\:\\bigg\\vert\\: x_{0}=i\\right]\\\\\n", "&= \\underset{\\gamma\\in\\Gamma}{\\inf}\\quad\\mathbb{E}\\left[\\beta^{0}c(x_{0},u_{0})\\mid x_{0}=i\\right] \\: + \\: \\beta\\mathbb{E}\\left[c(x_1, u_1) + \\beta c(x_2, u_2) + ... + \\beta^{k-1}c(x_{k}, u_{k}) \\mid x_{0}=i\\right]\\\\\n", "&= \\underset{\\gamma\\in\\Gamma}{\\inf}\\quad \\bigg\\{c(i,u) + \\beta\\sum_{j=1}^{n}P_{ij}(u)\\underset{\\gamma\\in\\Gamma}{\\inf}\\mathbb{E}\\left[\\sum_{t=0}^{k-1}\\beta^{t}c(x_t, u_t) \\mid x_{0}=j\\right]\\bigg\\}\\\\\n", "&= \\underset{u\\in\\mathcal{U}}{\\inf}\\quad \\bigg\\{c(i,u) + \\beta\\sum_{j=1}^{n}P_{ij}(u)W_{k}(j)\\bigg\\} \\qquad\\text{(by definition)}\\\\\n", "&= \\underset{u\\in\\mathcal{U}}{\\inf}\\quad \\bigg\\{c(i,u) + \\beta\\sum_{j=1}^{n}P_{ij}(u)h_{k}(j)\\bigg\\} \\qquad\\text{(by inductive hypothesis)}\\\\\n", "&= (Th_{k})(i) =: h_{k+1}(i), \n", "\\end{align*}\n", "$$\n", "\n", "which completes the inductive proof." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (b) [5 points] Interpretation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Give clear physical interpretation of the mathematical result derived in part (a)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution for part (b):\n", "The result derived in part (a) shows that the $k$-step transient of the value iteration associated with an infinite horizon discounted MDP, **starting from the zero function**, coincides with the corresponding value function of a finite $k$-horizon discounted MDP. This is surprising since the value iteration was introduced as an asymptotic algorithm to solve the infinite horizon MDP to leverage the contraction properties of the Bellman operator. While the initial guess $h_{0}(i)$ is irrelevant in the asymptotic sense, the transient iterates of course depend on the initial guess. But a-priori it is hard to imagine if the transient iterates would have any physical meaning. This exercise reveals that for the special initial guess of zero function, the transients of the value iterate do have a natural meaning." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Problem 2. [30 points] Finite Horizon LQR Redux" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this exercise is to revisit the **finite horizon LQR in discrete time via dynamic programming**. To recap the treatment via PMP, please review Lecture 9, slides 18-22. The problem considered is as follows:\n", "\n", "$$\\begin{align*} \n", "&\\underset{\\{\\gamma_{t}\\}_{t=0}^{N-1}\\in\\Gamma}{\\inf}\\:\\frac{1}{2}\\bigg\\{x_{N}^{\\top}Mx_{N} + \\sum_{t=0}^{N-1}\\left(x_{t}^{\\top}Qx_{t} + u_{t}^{\\top}Ru_{t}\\right)\\bigg\\}\\\\\n", "& x_{t+1} = Ax_{t} + Bu_{t}, \\quad x_{0}\\;\\text{given}, \\quad x\\in\\mathbb{R}^{n}, \\quad u\\in\\mathbb{R}^{m},\n", "\\end{align*}$$\n", "\n", "wherein as usual $M,Q \\succeq 0$, $R \\succ 0$, and $\\Gamma$ is the set of all (time-varying) history-dependent randomized policies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (a) [(2 + 2) + 6 = 10 points] The dynamic programming equation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the value function $V_{k}$ be defined as\n", "\n", "$$V_{k}(x) := \\underset{\\{\\gamma_{t}\\}_{t=k}^{N-1}\\in\\Gamma}{\\inf}\\:\\frac{1}{2}\\bigg\\{x_{N}^{\\top}Mx_{N} + \\sum_{t=k}^{N-1}\\left(x_{t}^{\\top}Qx_{t} + u_{t}^{\\top}Ru_{t}\\right)\\:\\bigg\\vert\\: x_{k}=x\\bigg\\}.$$\n", "\n", "(a.1) What is the physical interpretation of $V_{k}(x)$? What is the physical interpretation of $V_{0}(x_{0})$?\n", "\n", "(a.2) Write down the dynamic programming equation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution for part (a):\n", "(a.1) $V_{k}(x)$ is the optimal (in LQR sense) cost-to-go starting from the state location $x$ at time $t=k\\in\\{0,1,...,N-1\\}$. \n", "\n", "$V_{0}(x_{0})$ is the optimal (in LQR sense) cost-to-go starting from the state location (i.e., specific initial condition) $x_{0}$ at time $t=0$.\n", "\n", "(a.2) In this case, the terminal cost: $c_{N}(x,u) = \\frac{1}{2}x^{\\top}Mx$, and the running cost: $c_{k}(x,u) = \\frac{1}{2}\\left(x^{\\top}Qx + u^{\\top}R u\\right)$. Therefore, the DP equation is given by the following backward recursion:\n", "\n", "$$\\begin{align*}\n", "V_{k}(x) &= \\underset{u\\in\\mathbb{R}^{m}}{\\inf}\\:\\frac{1}{2}\\bigg\\{x^{\\top}Qx + u^{\\top}R u + V_{k+1}\\left(Ax + Bu\\right)\\bigg\\}, \\quad k=N-1,N-2,...,0,\\\\\n", "V_{N}(x) &= \\frac{1}{2}x^{\\top}Mx.\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (b) [(10 + 5) + 5 = 20 points] Solve the dynamic programming equation " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b.1) To solve the recursion derived in (a.2), start with the guess that the value function $V_{k+1}(x) = \\frac{1}{2}x^{\\top}P_{k+1} x$ for some $P_{k+1} \\succeq 0$. Then **prove that** the value function $V_{k}(x)$ has the same structural form, i.e., $V_{k}(x) = \\frac{1}{2}x^{\\top}P_{k} x$ for some matrix $P_{k}$, expressed as an **explicit function** of $P_{k+1}$. From there, **prove that** $P_{k} \\succeq 0$. \n", "\n", "(b.2) Use your answer in part (b.1) to **show that** the optimal policy is a time-varying linear state feedback. Is this conclusion same or different from what we derived via PMP in Lecture 9? **Explain**. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Solution for part (b):\n", "(b.1) Using the guess $V_{k+1}(x)=\\frac{1}{2}x^{\\top}P_{k+1}x$, $P_{k+1}\\succeq 0$, in the DP recursion in (a.2), we get $P_{N} = M$, and \n", "$$V_{k}(x) = \\underset{u\\in\\mathbb{R}^{m}}{\\inf}\\:\\frac{1}{2}\\bigg\\{x^{\\top}Qx + u^{\\top}R u + \\left(Ax + Bu\\right)^{\\top}P_{k+1}\\left(Ax + Bu\\right)\\bigg\\}, \\quad k=N-1,N-2,...,0.$$\n", "Performing the minimization over $u$ indicated above, results in the arginf $u^{*}$ given by\n", "\n", "$$u^{*} = - (R+B^{\\top}P_{k+1}B)^{-1}B^{\\top}P_{k+1}Ax = - K_{k+1}x, \\quad\\text{where}\\quad K_{k+1}:=(R+B^{\\top}P_{k+1}B)^{-1}B^{\\top}P_{k+1}A.$$\n", "\n", "Substituting $u^{*}$ back in the RHS of the DP equation, we get\n", "\n", "$$V_{k}(x) = \\frac{1}{2}x^{\\top}\\bigg\\{Q + A^{\\top}P_{k+1}A - 2A^{\\top}P_{k+1}BK_{k+1} + K_{k+1}^{\\top}\\left(R+B^{\\top}P_{k+1}B\\right)K_{k+1}\\bigg\\}x.$$\n", "\n", "Subsituting for $K_{k+1}$, notice that $K_{k+1}^{\\top}\\left(R+B^{\\top}P_{k+1}B\\right)K_{k+1} = A^{\\top}P_{k+1}B(R+B^{\\top}P_{k+1}B)^{-1}B^{\\top}P_{k+1}A = A^{\\top}P_{k+1}BK_{k+1}$, which simplifies $V_{k}(x)$ as\n", "\n", "$$\\begin{align*}\n", "V_{k}(x) &= \\frac{1}{2}x^{\\top}\\bigg\\{Q + A^{\\top}P_{k+1}A - A^{\\top}P_{k+1}BK_{k+1}\\bigg\\}x\\\\\n", "&= \\frac{1}{2}x^{\\top}\\bigg\\{Q + A^{\\top}P_{k+1}A - A^{\\top}P_{k+1}B(R+B^{\\top}P_{k+1}B)^{-1}B^{\\top}P_{k+1}A\\bigg\\}x.\n", "\\end{align*}\n", "$$\n", "Therefore, $V_{k}(x) = \\frac{1}{2}x^{\\top}P_{k}x$, where\n", "\n", "$$P_{k} = Q + A^{\\top}P_{k+1}A - A^{\\top}P_{k+1}B(R+B^{\\top}P_{k+1}B)^{-1}B^{\\top}P_{k+1}A, \\quad k=N-1,N-2,...,0,$$\n", "\n", "and $P_{N}=M$.\n", "\n", "**Let us now prove that the matrix $P_{k}$ given by the above backward matrix recursion, is positive semi-definite, given that $P_{k+1},Q \\succeq \\mathbf{0}$, and $R \\succ \\mathbf{0}$.** \n", "\n", "That $P_{k}$ is symmetric follows by noting that the RHS of the above recursion is invariant under matrix transposition. To show positive semi-definiteness, we show two ways.\n", "\n", "**First way:**\n", "We rewrite the matrix recursion $P_{k}=f(P_{k+1})$ above as\n", "\n", "$$P_{k} = \\left(A - BK_{k+1}\\right)^{\\top}P_{k+1}\\left(A - BK_{k+1}\\right) + \\begin{pmatrix}\n", "I\\\\\n", "-K_{k+1}\n", "\\end{pmatrix}^{\\top}\\begin{pmatrix}Q & \\mathbf{0}\\\\\n", "\\mathbf{0} & R\\end{pmatrix}\\begin{pmatrix}\n", "I\\\\\n", "-K_{k+1}\n", "\\end{pmatrix} \\succeq \\mathbf{0}.$$\n", "\n", "**Second way:**\n", "we put the matrix recursion $P_{k}=f(P_{k+1})$ in the form given in Lecture 9, p. 21, as follows.\n", "\n", "$$P_{k} = Q + A^{\\top}P_{k+1}^{1/2}\\left[I - P_{k+1}^{1/2}B\\left(I + R^{-1}B^{\\top}P_{k+1}B\\right)^{-1}R^{-1}B^{\\top}P_{k+1}^{1/2}\\right]P_{k+1}^{1/2}A. \\qquad\\qquad (*)$$\n", "\n", "**Claim:** $I - P_{k+1}^{1/2}B\\left(I + R^{-1}B^{\\top}P_{k+1}B\\right)^{-1}R^{-1}B^{\\top}P_{k+1}^{1/2} = \\left(I + P_{k+1}^{1/2}BR^{-1}B^{\\top}P_{k+1}^{1/2}\\right)^{-1}.$\n", "\n", "**Proof of claim:** Let $U:=P_{k+1}^{1/2}BR^{-1/2}$. Then $\\left(I + P_{k+1}^{1/2}BR^{-1}B^{\\top}P_{k+1}^{1/2}\\right)^{-1} = (I + UU^{\\top})^{-1} = I - U(I + U^{\\top}U)^{-1}U$, where the last equality is [a special case of Woodbury matrix identity](https://en.wikipedia.org/wiki/Woodbury_matrix_identity#Third).\n", "\n", "Therefore, \n", "$$\\begin{align*}\n", "\\left(I + P_{k+1}^{1/2}BR^{-1}B^{\\top}P_{k+1}^{1/2}\\right)^{-1} = I - U(I + U^{\\top}U)^{-1}U &= I - P_{k+1}^{1/2}BR^{-1/2}\\left(I + R^{-1/2}B^{\\top}P_{k+1}BR^{-1/2}\\right)^{-1}\\\\\n", "&= I - P_{k+1}^{1/2}BR^{-1/2}\\left(R^{1/2}R^{-1/2} + R^{1/2}R^{-1}B^{\\top}P_{k+1}BR^{-1/2}\\right)^{-1}\\\\\n", "&= I - P_{k+1}^{1/2}B\\left(I + R^{-1}B^{\\top}P_{k+1}B\\right)^{-1}R^{-1}B^{\\top}P_{k+1}^{1/2}.\n", "\\end{align*}\n", "$$ **end of proof of the claim.**\n", "\n", "Based on the above result/claim, the expression in square bracket in $(*)$ can be replaced by $\\left(I + P_{k+1}^{1/2}BR^{-1}B^{\\top}P_{k+1}^{1/2}\\right)^{-1}$. Consequently, we have\n", "\n", "$$P_{k} = Q + A^{\\top}P_{k+1}^{1/2}\\left(I + P_{k+1}^{1/2}BR^{-1}B^{\\top}P_{k+1}^{1/2}\\right)^{-1}P_{k+1}^{1/2}A,$$\n", "\n", "which is exactly the recursion given in Lec. 9, p. 21. Since the term in parenthesis is positive definite, and $Q \\succeq \\mathbf{0}$, hence $P_{k} \\succeq \\mathbf{0}$.\n", "\n", "\n", "(b.2) In part (b.1), we have already shown that $u^{*} = -K_{k+1}x$, i.e., the optimal control is a (time-varying) linear state feedback.\n", "\n", "This conclusion and the expression for optimal feedback gain matrix, are same as the one we derived via PMP in Lecture 9. They should give the same results as in this case, the DP and the PMP are simply two different ways to solve the same deterministic OCP instance (namely the finite horizon discrete time LQR)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }