{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <center>NYU CSCI-UA 9472 Artificial Intelligence </center>\n",
    "   ### <center> Q learning and Generalization in RL </center>\n",
    "\n",
    "<center>In this session, we will discuss Q-learning and generalization through parametric models</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1. Tic Tac Toe\n",
    "\n",
    "We consider a simple Tic Tac Toe game such as shown below. \n",
    "\n",
    "\n",
    "<img src=\"TicTacToe.png\" width=\"400\" height=\"60\">\n",
    "\n",
    "Implement a simple TD Learning reinforcement learning framework for this game. Recall that the temporal difference (TD) update rule is given by \n",
    "\n",
    "\\begin{align*}\n",
    "U^\\pi(s) \\leftarrow U^\\pi(s) + \\alpha\\left(R(s) + \\gamma U^\\pi(s')- U^\\pi(s)\\right)\n",
    "\\end{align*}\n",
    "\n",
    "The agent can start by alternating between exploration and exploitation but should ultimately exploit the utility.\n",
    "\n",
    "Hint: add a reward of -.1 for each action and a reward of +100 for a win and -100 for a loss\n",
    "\n",
    "##### Question 1.1. Single Agent \n",
    "\n",
    "Start by implementing a simple RL agent that plays against a random oponent. We will train the agent for a couple of episodes (game completions) and then evaluate it on a couple more games (through the total number of wins). How many possible states are there in tic tac toe?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "won = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while won != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "        \n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Question 1.2. Parametric representation\n",
    "\n",
    "Modify your answer to the first question so that the utility is stored as a linear model in some features. Try different models and compare the result with your previous answer. Possible features include \n",
    "\n",
    "- A representation of the board with each feature taking the value 0,1 or 2 depending on whether it contains a '+', a 'O' or is empty  \n",
    "- The number of rows/columns/diagonals with two of the agent's pieces and one emtpy field\n",
    "- The number of rows/columns/diagonals with two of the opponent's pieces and one empty field\n",
    "- Is there one of the agent's piece in the center \n",
    "- Number of agent's pieces in corners\n",
    "- Number of rows/columns/diagonals with one of the agent's pieces and two empty fields\n",
    "- Number of rows/columns/diagonals with three of the agent's pieces\n",
    "-...\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "won = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while won != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "        \n",
    "        \n",
    "      "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Question 1.3. Adversarial Framework\n",
    "\n",
    "We now want to consider an adversarial environment in which two agents are competing against each other. \n",
    "\n",
    "##### Question 1.3.a competition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "won = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while won != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Question 1.3.b Adversarial training \n",
    "\n",
    "Competition can be good in training too provided that the trainees are actually willing to learn. \n",
    "In this second question we want the agents to train against each other. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "won = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while won != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2. Obstacle free environment\n",
    "\n",
    "We consider the obstacle-free environment below. \n",
    "\n",
    "<img src=\"simpleObstacleFree2.png\" width=\"400\" height=\"60\">\n",
    "\n",
    "For this simple environment, implement a direct utility estimation agent for which the utility is stored in tabular form first and through the simple linear model \n",
    "\n",
    "\\begin{align*}\n",
    "\\hat{U}_\\theta(x, y) = \\theta_0 + \\theta_1 x + \\theta_2 y\n",
    "\\end{align*}\n",
    "\n",
    "then. train the two agents for several episodes and compare their performance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "exit = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while exit != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 3. Bringing back the obstacles\n",
    "\n",
    "#### Question 3.1. Q learning\n",
    "\n",
    "We now want to add obstacles back. \n",
    "\n",
    "<img src=\"obstacleMaze01.png\" width=\"400\" height=\"60\">\n",
    "\n",
    "\n",
    "\n",
    "To handle the obstacles, we will consider a Q learning approach. Implement both the traditional TD Q-learning agent  \n",
    "\n",
    "\\begin{align*}\n",
    "Q[s, a] \\leftarrow Q[s, a] + \\eta \\left(R[s] + \\gamma \\max_{a'} Q[s', a'] - Q[s, a]\\right)\n",
    "\\end{align*}\n",
    "\n",
    "as well as the SARSA update \n",
    "\n",
    "\\begin{align*}\n",
    "Q[s, a] \\leftarrow Q[s, a] + \\eta \\left(R[s] + \\gamma Q[s', a'] - Q[s, a]\\right)\n",
    "\\end{align*}\n",
    "\n",
    "Use an exploration/exploitation framework or implement an exploration function of the form\n",
    "\n",
    "\\begin{align*}\n",
    "f(U, n) &= \\left\\{\\begin{array}{ll}\n",
    "R^+ & \\text{if $n<N_e$}\\\\\n",
    "U & \\text{otherwise}\n",
    "\\end{array}\\right.\n",
    "\\end{align*}\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "exit = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while exit != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "      "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " #### Question 3.2. Parametric Q-learning\n",
    " \n",
    " We now want to store the Q-table as a parametric function of the form. You can for example take this function to be of the form \n",
    " \n",
    "\\begin{align*}\n",
    "Q[s, a] = \\theta_0 + \\theta_1 x + \\theta_2 y + \\theta_3 a\n",
    "\\end{align*}\n",
    "\n",
    "Compare the performance of the two agents. Recall that for a parametric representation of a TD learning agent, the parameters are updated as \n",
    "\n",
    "\\begin{align*}\n",
    "\\mathbf{\\theta}_i \\leftarrow \\mathbf{\\theta}_i + \\eta \\left[R[s] + \\gamma \\max_{a'} \\hat{Q}_{\\theta}[s', a'] - \\hat{Q}[s, a]\\right]\\frac{\\partial \\hat{Q}_\\theta[s, a]}{\\partial \\theta_i}\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "exit = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while exit != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "      "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " #### Question 3.3. Towards deep Q-learning\n",
    " \n",
    " Encode the Q-table by means of a neural network which takes as input a position $(x, y)$ and returns the probability of each move being optimal for that particular position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "exit = False # should switch to true when 3 crosses are aligned \n",
    "\n",
    "for episode in np.arange():\n",
    "\n",
    "    while exit != True:\n",
    "        \n",
    "        \n",
    "        # add your simulation here\n",
    "        \n",
    "      "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}