{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multi-agent Reinforcement Learning using PettingZoo: Tic-tac-toe example part II\n", "\n", "*Gertjan Verhoeven*\n", "\n", "*March 2021*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having learned the basics of PettingZoo, and how to use hashing with `defaultdict` to build up a Q-table, we are ready to implement Q-learning on the tic-tac-toe environment using PettingZoo. Before we get into the coding part, first some general remarks about Q-learning in a two-player alternating turn game." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training multi-agent games using single agent techniques\n", "\n", "Here we will train a multi-agent game using single agent Q-learning. This means that from the perspective of each player, the other player is part of the environment. It follows that the learned strategy of the first player is tuned to the strategy used by the second player. The second player's strategy can be thought of forming a part of the environment the first player learns to perform optimally in.\n", "\n", "Therefore, if we use Q-learning against a player that uses a random strategy, The Q-learner optimizes for play in an evironment **that contains a player that uses a random strategy**.\n", "\n", "Here, we let **both** players learn using Q-learning with constant learning parameter $\\alpha = 0.6$, and have them use the same decreasing $\\epsilon$ exploration rate. So both players start off playing randomly ($\\epsilon = 1$), and after 200.000 steps end up exploiting the learned policy for 97.5% of the time.\n", "\n", "\n", "We expect that this produces Q-tables for both players that perform well against random players and human players.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Q-learning with two players: remembering previous states and actions\n", "\n", "Many implementations of Q-learning for Tic-Tac-Toe play out a full game and memorize the sequence of states for both players. After a single game has ended, the \"Q-learning\" part is done in one go, starting from the end state en propagating back to the beginning. \n", "\n", "This is possible for Tic-Tac-Toe because no state is ever visited twice during a single game. However, this is not true for all games, so I think it is better to implement Q-learning in a more general way, where we learn each time the player \"steps\" through the game." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In single agent RL using Gym, we start in a state, make a move, end up in a new state, and collect our reward. This allowed us to use variables like `current_state` and `next_state` within a single RL iteration for the Q-learning. \n", "\n", "In multi-agent (two player alternating turn, to be more precise), an agent observes a state, chooses an action, THEN hands over the turn, where the other agent observes the state, and chooses an action, then control returns to the first agent, which sees the result of its previous action, and finds itself in a new state.\n", "\n", "Thus, we need a way to store the previous state and previous action for each player to use Q-learning. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "I went for two independent Q-tables, and player 1 always moves first (this is how the environment works). So we will end up with two Q-tables: one for the player going first, and one for the player going second. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Coding Q-learning in PettingZoo step by step" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from pettingzoo.classic import tictactoe_v3\n", "import random\n", "import numpy as np\n", "from collections import defaultdict\n", "import dill" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It is advisable to have the actual Q-learning and helper functions in a separate `marl_qlearning.py` Python script, which we import and call from the Jupyter notebook, i.e.:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from marl_qlearning import *" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We build on the code from the previous notebook.\n", "We already have `encode_state()` to convert the observation into a unique state label.\n" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import hashlib\n", "\n", "def encode_state(observation):\n", "    # encode observation as bytes\n", "    obs_bytes = str(observation).encode('utf-8')\n", "    # create md5 hash\n", "    m = hashlib.md5(obs_bytes)\n", "    # return hash as hex digest\n", "    state = m.hexdigest()\n", "    return state" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 1: Create epsilon_greedy_policy()\n", "\n", "We need a function that contains the epsilon-greedy policy functionality.\n", "\n", "Code a function `epsilon_greedy_policy()` that takes as arguments:\n", "\n", "* `nA`, the number of actions\n", "* `Q`, the list of Q-table dictionaries for both players, indexed by agent and by state\n", "* `agent`, the agent currently acting\n", "* `action_mask`, containing the available actions in the current state\n", "* `state`, the hash of the current state\n", "* `eps`, the value of the exploration parameter epsilon\n", "\n", "(Hint: I used `Q[agent][state][action_mask == 1]` to select the Q-values of all available actions for an agent in a state. A toy illustration of this indexing trick follows the function stub below.)\n" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def epsilon_greedy_policy(nA, Q, agent, action_mask, \\\n", "                          state, eps):\n", "    # your code here\n", "    return action" ] }
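, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the hint concrete, here is a toy illustration with made-up Q-values (so not the exercise solution): the boolean action mask selects the Q-values of the legal moves, and `np.flatnonzero()` maps the best masked value back to an action index." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# toy illustration of the masking hint (made-up numbers, not the solution)\n", "q_row = np.array([0.1, -0.2, 0.5, 0.0, 0.3, 0.0, 0.0, -0.1, 0.2])  # Q-values for 9 actions\n", "action_mask = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1])                 # 1 = legal move\n", "\n", "legal_q = q_row[action_mask == 1]            # Q-values of the legal actions only\n", "legal_actions = np.flatnonzero(action_mask)  # indices of the legal actions\n", "best_action = legal_actions[np.argmax(legal_q)]\n", "print(best_action)  # 2: the legal action with the highest Q-value" ] }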
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Create update_Q_value()\n", "\n", "Code a function `update_Q_value()` that takes as arguments:\n", "\n", "* `Q`, the list of Q-table dictionaries for both players, indexed by agent and by state\n", "* `agent`, the agent currently acting\n", "* `previous_state`, the hash of the previous state\n", "* `previous_action`, the previous action\n", "* `reward`, the reward received since the previous action\n", "* `alpha`, the learning parameter\n", "* `gamma`, the discounting parameter\n", "* `current_state`, the hash value of the current state\n" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def update_Q_value(Q, agent, previous_state, previous_action, \\\n", "                   reward, alpha, gamma, current_state=None):\n", "    # your code here\n", "    return Q" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3: Create marl_q_learning()\n", "\n", "Finally, the main program, where all the parts come together.\n", "For this function, we build on the PettingZoo code from the previous notebook.\n", "\n", "Code a function `marl_q_learning()` that takes as arguments:\n", "\n", "* `multi_env`, a PettingZoo multi-agent environment\n", "* `num_episodes`, the number of episodes to run\n", "* `alpha`, the learning parameter\n", "* `gamma`, the discounting parameter, with default 1\n", "* `eps_start`, the starting value for epsilon\n", "* `eps_decay`, the decay factor for epsilon\n", "* `eps_min`, the minimal value for epsilon\n" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def marl_q_learning(multi_env, num_episodes, alpha, gamma=1.0, \\\n", "                    eps_start=1.0, eps_decay=.99999, \\\n", "                    eps_min=0.025):\n", "\n", "    multi_env.reset()\n", "    Q = {}\n", "    for agent in multi_env.agents:\n", "        nA = multi_env.action_spaces[agent].n\n", "        Q[agent] = defaultdict(lambda: np.zeros(nA))\n", "\n", "    epsilon = eps_start\n", "\n", "    i_episode = 1\n", "\n", "    prev_state = {'player_1': -1,\n", "                  'player_2': -1}\n", "    prev_action = {'player_1': -1,\n", "                   'player_2': -1}\n", "\n", "    # keep iterating over the active agents until the episode limit is reached\n", "    while i_episode <= num_episodes:\n", "        for agent in multi_env.agent_iter():\n", "\n", "            # get observation (state) for current agent\n", "            observation, reward, done, info = multi_env.last()\n", "\n", "            # encode the observation into a state label\n", "            state = encode_state(observation)\n", "\n", "            # perform q-learning with update_Q_value()\n", "            # your code here\n", "\n", "            # store current state\n", "            prev_state[agent] = state\n", "\n", "            if not done:\n", "                # choose action using epsilon_greedy_policy()\n", "                # your code here\n", "                multi_env.step(action)\n", "\n", "                # store chosen action\n", "                prev_action[agent] = action\n", "            else:\n", "                # agent is done\n", "                multi_env.step(None)\n", "\n", "                # reset env and memory\n", "                multi_env.reset()\n", "                prev_state = {'player_1': -1,\n", "                              'player_2': -1}\n", "                prev_action = {'player_1': -1,\n", "                               'player_2': -1}\n", "                # bump episode\n", "                i_episode += 1\n", "                # decrease epsilon\n", "                epsilon = max(epsilon*eps_decay, eps_min)\n", "\n", "    return Q" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Learning the optimal policy for Tic-Tac-Toe\n", "\n", "We run the Q-learning algorithm for 500,000 episodes and use `dill` to save the Q dictionary to disk." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from marl_qlearning import *\n", "\n", "env = tictactoe_v3.env()\n", "\n", "random.seed(123)\n", "\n", "fullrun = 0\n", "\n", "if fullrun:\n", "    Q = marl_q_learning(env, 500_000, alpha = 0.6, gamma = 0.95)\n", "    with open('cached/Q.pkl', 'wb') as f:\n", "        dill.dump(Q, f)\n", "else:\n", "    with open('cached/Q.pkl', 'rb') as f:\n", "        Q = dill.load(f)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Test the learned policy against a Random Player\n", "\n", "The original plan was to use the learned policy greedily for **both** players. However, this would result in deterministic moves on both sides if, for each state, a single action has the highest Q-value, so it would not be very informative about how good our AI players have become.\n", "\n", "Let's try out what was learned by playing a few thousand games that fully exploit the Q-table against a randomly playing opponent. From https://planspace.org/20191103-q_learning_tic_tac_toe_briefly/, we expect at least 93% wins. We do not expect 100% wins, because the random player can \"by accident\" play the right moves to obtain a draw, which is the best Player 2 can do. We do not expect Player 1 to lose. \n", "\n", "To do so, I took the `marl_q_learning()` function, removed the Q-learning part and the epsilon decay part, and added a conditional step that sets epsilon to 0 when Player 1 chooses its action and to 1 when Player 2 chooses its action. I named this function `test_policy()`. I managed to get an outcome with 93% wins for Player 1 and 7% draws. Curious whether this is actually the best one can do for this game...\n" ] }
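, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of what such a `test_policy()` function could look like; the actual function in `marl_qlearning.py` may differ in its details. It assumes Player 1 picks the legal action with the highest Q-value (epsilon = 0), Player 2 picks uniformly among the legal actions (epsilon = 1), and wins, draws and losses are tallied from Player 1's final reward. Calling it as `test_policy(env, Q, 5000)` should then give proportions comparable to the percentages quoted above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hedged sketch of test_policy(): Player 1 exploits Q greedily, Player 2 plays\n", "# random legal moves; relies on encode_state() and the Q-table from above\n", "def test_policy(multi_env, Q, num_episodes=1000):\n", "    results = {'win': 0, 'draw': 0, 'loss': 0}\n", "    for _ in range(num_episodes):\n", "        multi_env.reset()\n", "        final_reward = {}\n", "        for agent in multi_env.agent_iter():\n", "            observation, reward, done, info = multi_env.last()\n", "            final_reward[agent] = reward\n", "            if done:\n", "                multi_env.step(None)\n", "                continue\n", "            action_mask = observation['action_mask']\n", "            legal_actions = np.flatnonzero(action_mask)\n", "            if agent == 'player_1':\n", "                # greedy: best legal action according to the learned Q-table\n", "                state = encode_state(observation)\n", "                action = legal_actions[np.argmax(Q[agent][state][action_mask == 1])]\n", "            else:\n", "                # random opponent\n", "                action = np.random.choice(legal_actions)\n", "            multi_env.step(action)\n", "        # classify the episode from Player 1's final reward (+1 win, 0 draw, -1 loss)\n", "        if final_reward['player_1'] > 0:\n", "            results['win'] += 1\n", "        elif final_reward['player_1'] == 0:\n", "            results['draw'] += 1\n", "        else:\n", "            results['loss'] += 1\n", "    return results" ] }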
, { "cell_type": "markdown", "metadata": {}, "source": [ "## License\n", "\n", "The code in this notebook is copyrighted by Gertjan Verhoeven and licensed under the new BSD (3-clause) license:\n", "\n", "https://opensource.org/licenses/BSD-3-Clause\n", "\n", "The text and figures in this notebook (if any) are copyrighted by Gertjan Verhoeven and licensed under the CC BY-NC 4.0 license:\n", "\n", "https://creativecommons.org/licenses/by-nc/4.0/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Spyder)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 5 }