{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dynamic programming" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this exercise is to find the optimal policy for the recycling robot.\n", " \n", "In this problem, a recycling robot has to search for empty cans to collect (each can defines a \"reward\" given to the robot). It can also decide to stay where it is to save its battery and wait that somebody brings it a can (which gives less cans in average than actively searching for them). \n", "\n", "The robot has two battery levels, *high* and *low*. \n", "\n", "* In the *high* level, the robot can either search or wait. \n", "\n", "* In the *low* state, three actions are possible: search, wait and recharge. \n", "\n", "State-action transitions are probabilistic, i.e. they bring the robot in different states based on different probabilities $\\alpha$ and $\\beta$.\n", "\n", "This problem defines a finite MDP, with two states *high* and *low* corresponding to the battery level. The actions *search* and *wait* are possible in the *high* and *low* states, while the action *recharge* is only possible in the *low* state.\n", "\n", "$$\n", "\\begin{aligned}\n", " \\mathcal{S} &=& \\{ \\text{high}, \\text{low} \\} \\\\\n", " \\mathcal{A}(\\text{high} ) &=& \\{ \\text{search}, \\text{wait} \\} \\\\\n", " \\mathcal{A}(\\text{low} ) &=& \\{ \\text{search}, \\text{wait}, \\text{recharge} \\}\n", "\\end{aligned}\n", "$$\n", "\n", "The action *search* brings on average a reward of $\\mathcal{R}^\\text{search}$, the action *wait* a reward of $\\mathcal{R}^\\text{wait}$, the action *recharge* brings no reward, but allows to get in the *high* state.\n", "\n", "Note that if the robot decides to search in the *low* state, there is a probability $1 - \\beta$ that it totally empties its battery, requiring human intervention. This is punished with a negative reward of -3.\n", "\n", "The transition and reward probabilities of each transition is defined in the following table, completely defining a MDP.\n", "\n", "| s | s' | a | p(s' / s, a) | r(s, a, s') |\n", "|:-------------:|:-------------:|:------------:|:-----------------:|:-----------------------------:|\n", "| high | high | search | $\\alpha$ | $\\mathcal{R}^\\text{search}$ |\n", "| high | low | search | $1 - \\alpha$ | $\\mathcal{R}^\\text{search}$ |\n", "| low | high | search | $1 - \\beta$ | $-3$ |\n", "| low | low | search | $\\beta$ | $\\mathcal{R}^\\text{search}$ |\n", "| high | high | wait | $1$ | $\\mathcal{R}^\\text{wait}$ |\n", "| high | low | wait | $0$ | $\\mathcal{R}^\\text{wait}$ |\n", "| low | high | wait | $0$ | $\\mathcal{R}^\\text{wait}$ |\n", "| low | low | wait | $1$ | $\\mathcal{R}^\\text{wait}$ |\n", "| low | high | recharge | $1$ | $0$ |\n", "| low | low | recharge | $0$ | $0$ |\n", "\n", "The goal of this exercise is to find the optimal policy $\\pi^*$ of the robot, i.e to find for each state the action that should be performed systematically in order to gather the maximum of reward on the long term. 
\n", "\n", "We will apply here two **dynamic programming** methods, policy iteration and value iteration, to solve the Bellman equations.\n", "\n", "The Bellman equation for the state function is:\n", "\n", "$$V^{\\pi} (s) = \\sum_{a \\in \\mathcal{A}(s)} \\pi(s, a) \\, \\sum_{s' \\in \\mathcal{S}} p(s' | s, a) \\, [ r(s, a, s') + \\gamma \\, V^{\\pi} (s') ]$$\n", "\n", "**Q:** On paper, adapt the Bellman equation to the problem. First, for every state $s$ and possible action $a$, find the optimal value of the action with the form:\n", "\n", "$$Q^{\\pi} (s, a) = f( V^\\pi (\\text{high}), V^\\pi (\\text{low}), \\alpha, \\beta, \\gamma, \\mathcal{R}^{\\text{search}}, \\mathcal{R}^{\\text{wait}} )$$\n", "\n", "Deduce the Bellman equation for the two states $V^\\pi (\\text{high})$ and $V^\\pi (\\text{low})$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Policy Iteration\n", "\n", "Now that we have the Bellman equations for the two states high and low, we can solve them using **iterative policy evaluation** for a fixed policy $\\pi$. \n", "\n", "### Iterative policy evaluation\n", "\n", "Let's start by setting the parameters of the MDP. In the rest of the exercise, you will modify these parameters to investigate how it changes the optimal policy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Transition probabilities\n", "alpha = 0.3\n", "beta = 0.2\n", "\n", "# Discount parameter\n", "gamma = 0.7\n", "\n", "# Expected rewards\n", "r_search = 6.0\n", "r_wait = 2.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to represent states and actions in a MDP. The suggestion for this exercise is to use dictionaries here the keys are the actions' name and the vaues are indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nb_states = 2\n", "nb_actions = 3\n", "\n", "s = {'high': 0, 'low': 1}\n", "a = {'search': 0, 'wait': 1, 'recharge': 2}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using dictionaries, you can access numpy arrays with `s['high']` or `a['recharge']` instead of 0 and 2, what will make the code readable.\n", "\n", "The next step is to initialize numpy arrays where we will store the V and Q values. `V` will have only two elements for high and low, while `Q` will be a 2x3 matrix with one element for each state-action pair. Notice that (high, recharge) is not a possible action, so this element will not be be updated." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "V = np.zeros(nb_states)\n", "Q = np.zeros((nb_states, nb_actions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then access the individual values with `V[s['high']]` or `Q[s['low'], a['wait']]`.\n", "\n", "We can now evaluate a policy $\\pi$. In dynamic programming, the policies are deterministic, as we want to estimate the optimal policy.\n", "\n", "To implement the policy, we just need to assign the index of an action to each state, i.e. $\\pi(s)$. The following cell creates an initial policy $\\pi$ where the agent **searches** in both states high and low. We here make sure that the array contains integers (0, 1 or 2), but that is not even necessary." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pi = np.array([a['search'], a['search']], dtype=int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Evaluate this policy using iterative policy evaluation.\n", "\n", "We would normally only need to update the V-value of the two states using:\n", "\n", "$$\n", " V (s) \\leftarrow \\sum_{a \\in \\mathcal{A}(s)} \\pi(s, a) \\, \\sum_{s' \\in \\mathcal{S}} p(s' | s, a) \\, [ r(s, a, s') + \\gamma \\, V (s') ] \\quad \\forall s \\in \\mathcal{S}\n", "$$\n", "\n", "The code will be more readable if you first update the Q-values of the 5 state-action pairs:\n", "\n", "$$\n", " Q (s, a) \\leftarrow \\sum_{s' \\in \\mathcal{S}} p(s' | s, a) \\, [ r(s, a, s') + \\gamma \\, V (s') ] \\quad \\forall s \\in \\mathcal{S}\n", "$$\n", "\n", "and only then update the two V-values:\n", "\n", "$$\n", " V (s) \\leftarrow \\sum_{a \\in \\mathcal{A}(s)} \\pi(s, a) \\, Q(s, a)\n", "$$\n", "\n", "These updates should normally be applied until the V-values converge. For simplicity, we could decide to simply apply 50 updates or so, and hope that it is enough.\n", "\n", "Record the V-value of the two states after each update and plot them. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Do the V-values converge? How fast? What do the final values represent? Change the value of $\\gamma$ and conclude on its importance (do not forget to reset the V and Q arrays to 0!). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Print the Q-values at the end of the policy evaluation. What would the greedy policy with respect to these Q-values? \n", "\n", "**Q:** Change the initial policy to this policy and evaluate it. What happens? Compare the final value of the states under both policies. Which one is the best?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Policy iteration\n", "\n", "Improving the policy is now straightforward. We just to look at the Q-values in each state, and change the policy so that it takes the action with the maximal Q-value. If this does not change the policy (we still take the same actions), we have found the optimal policy, we can stop.\n", "\n", "**Q:** Implement policy iteration.\n", "\n", "Do not forget to reset the V and Q arrays at the beginning of the cell, as well as the original policy.\n", "\n", "Use an infinite loop that you will quit when the policy has not changed between two iterations. Something like:\n", "\n", "```python\n", "while True:\n", " # 1 - Policy evaluation\n", " for k in range(50):\n", " # Update the values\n", " \n", " # 2 - Policy improvement\n", " \n", " if pi != pi_old:\n", " break\n", "```\n", "\n", "**Beware:** if you simply assign the policy to another array and modify the policy:\n", "\n", "```python\n", "pi_old = pi\n", "pi[s['high']] = a['search']\n", "```\n", "\n", "`pi_old` will also change! You need to `.copy()` the policy." 
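, "\n", "As a minimal sketch of the improvement step (assuming `Q` has just been recomputed by the evaluation loop above), note that the greedy choice in *high* has to be restricted to *search* and *wait*: `Q[s['high'], a['recharge']]` is never updated and stays at 0, so it must not take part in the argmax. The exact code below is only an illustration:\n", "\n", "```python\n", "# Hypothetical sketch of step 2 only, not the full policy iteration loop\n", "pi_old = pi.copy()                             # copy, so that pi_old does not follow pi\n", "\n", "# Greedy improvement, restricted to the actions available in each state\n", "pi[s['high']] = np.argmax(Q[s['high'], :2])    # only search (0) and wait (1) exist in high\n", "pi[s['low']] = np.argmax(Q[s['low'], :])       # all three actions exist in low\n", "\n", "if np.all(pi == pi_old):                       # element-wise comparison needs np.all()\n", "    print(\"The policy is stable: optimal policy found.\")\n", "```"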
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Value iteration\n", "\n", "In value iteration, we merge the policy evaluation and improvement in a single update rule: \n", "\n", "$$\n", " V (s) \\leftarrow \\max_{a \\in \\mathcal{A}(s)} \\sum_{s' \\in \\mathcal{S}} p(s' | s, a) \\, [ r(s, a, s') + \\gamma \\, V (s') ]\n", "$$\n", "\n", "The value of state takes the value of its greedy action. The policy is therefore implicitly greedy w.r.t the Q-values.\n", "\n", "The algorithm becomes:\n", "\n", "* while not converged:\n", "\n", " * for all states $s$:\n", " \n", " * Update the value estimates with:\n", " \n", " $$\n", " V (s) \\leftarrow \\max_{a \\in \\mathcal{A}(s)} \\sum_{s' \\in \\mathcal{S}} p(s' | s, a) \\, [ r(s, a, s') + \\gamma \\, V (s') ]\n", " $$\n", " \n", "**Q:** Modify your previous code to implement value iteration. Use a fixed number of iterations (e.g. 50) as in policy evaluation. Visualize the evolution of the V-values and print the greedy policy after each iteration. Conclude." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Change the value of the discount factor $\\gamma =0.3$ so that the agent becomes short-sighted: it only takes into account the immediate rewards, but forgets about the long-term. Does it change the strategy? Explain why." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Change $\\gamma$ to 0.99 (far-sighted agent). What does it change and why? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Change the parameters to:\n", "\n", "$$\\alpha = 0.01 \\quad \\beta = 0.2 \\quad \\gamma = 0.7 \\quad \\mathcal{R}^{\\text{search}} = 6 \\quad \\mathcal{R}^{\\text{wait}} = 5$$\n", "\n", "Find the optimal policy. What is the optimal action to be taken in the state *high*, although the probability to stay in this state is very small? Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Find a set of parameters where it would be optimal to search while in the *low* state." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "3d24234067c217f49dc985cbc60012ce72928059d528f330ba9cb23ce737906d" } } }, "nbformat": 4, "nbformat_minor": 4 }