{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PPO\n", "\n", "The goal of this exercise is to use the `tianshou` library to apply PPO on the cartpole environment. `tianshou` is the latest and most up-to-date DRL library. It is based on pytorch for the deep networks and is the only library currently compatible with gymnasium, not gym.\n", "\n", "Github: \\\n", "Documentation: \n", "\n", "Install it in your virtual environment simply with:\n", "\n", "```bash\n", "pip install -U tianshou\n", "```\n", "\n", "It will also install pytorch, which becomes double use with tensorflow, but well, storage is cheap...\n", "\n", "Let's first import the usual stuff:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " import google.colab\n", " IN_COLAB = True\n", "except:\n", " IN_COLAB = False\n", "\n", "if IN_COLAB:\n", " !pip install -U gymnasium pygame moviepy\n", " !pip install -U tianshou" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "rng = np.random.default_rng()\n", "import matplotlib.pyplot as plt\n", "import os\n", "from IPython.display import clear_output\n", "from collections import deque\n", "\n", "import gymnasium as gym\n", "print(\"gym version:\", gym.__version__)\n", "\n", "import tianshou as ts\n", "\n", "import torch\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "\n", "import pygame\n", "from moviepy.editor import ImageSequenceClip, ipython_display\n", "\n", "\n", "class GymRecorder(object):\n", " \"\"\"\n", " Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.\n", " \n", " The environment must have the render_mode `rgb_array_list`.\n", " \"\"\"\n", " def __init__(self, env):\n", " self.env = env\n", " self._frames = []\n", "\n", " def record(self, frames):\n", " \"To be called at the end of an episode.\"\n", " for frame in frames:\n", " self._frames.append(np.array(frame))\n", "\n", " def make_video(self, filename):\n", " \"Generates the gif video.\"\n", " directory = os.path.dirname(os.path.abspath(filename))\n", " if not os.path.exists(directory):\n", " os.mkdir(directory)\n", " self.clip = ImageSequenceClip(list(self._frames), fps=self.env.metadata[\"render_fps\"])\n", " self.clip.write_gif(filename, fps=self.env.metadata[\"render_fps\"], loop=0)\n", " del self._frames\n", " self._frames = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## tianshou\n", "\n", "### Structure\n", "\n", "``tianshou`` provides an implementation of most model-free algorithms seen in the course: DQN and its variants, A3C, DDPG, PPO and more. It also has several offline RL algorithms. You can see the list of algorithms here:\n", "\n", "\n", "\n", "``tianshou`` relies on several concepts, which are explained here:\n", "\n", "\n", "\n", "![](https://tianshou.readthedocs.io/en/latest/_images/concepts_arch2.png)\n", "\n", "* The **policy** is actually the DRL algorithm (DQN, PPO), not the mapping from states into actions used in the course. It relies on one (or more) neural networks called the **model**.\n", "* The interaction of the policy with the environment is done by the **collector**. By default, the collector used **distributed learning**, i.e. it uses parallel workers to interact with copies of the environment, thereby speeding up data collection. This is used even for algorithms which do not need distributed learning (DQN), as it is only beneficial. 
\n", "* The data collected by the collector is stored in a **buffer**, which can be an ERM for off-policy algorithms or a temporary buffer for on-policy ones.\n", "* The (distributed) data is stored in **batches**. How data circulates between the collector, the policy and the buffer during training is controlled by the **trainer**.\n", "\n", "Let's demonstrate this interaction with a dummy DQN network on Cartpole:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "env = gym.make('CartPole-v0')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Policy\n", "\n", "The first step is to create the neural network for the DQN network. It must have `env.observation_space.shape=4` input neurons and `env.action_space.n=2` discrete output neurons. Let's put a single hidden layer with 32 neurons for now:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "net = ts.utils.net.common.Net(\n", " env.observation_space.shape,\n", " env.action_space.n,\n", " hidden_sizes=[64, 64],\n", " device=device,\n", ").to(device)\n", "\n", "optim = torch.optim.Adam(net.parameters(), lr=0.001)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`optim` is the Adam optimizer in pytorch, modifying all parameters (weights and biases) of the value network. Check the doc of `Net()` if you want a more specific architecture.\n", "\n", "The output layer of the network is discrete, so that tianshou knows how to sample an action from the output (here the output neurons represent the Q-values, but it could be logits of a continuous policy). \n", "\n", "Now that we have the neural network, we can create the DQN policy object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "policy = ts.policy.DQNPolicy(\n", " model=net, # value network\n", " optim=optim, # optimizer\n", " discount_factor=0.95, # gamma\n", " target_update_freq=1000, # how often to update the target network\n", " action_space=env.action_space, # action space\n", ")\n", "policy.set_eps(0.1) # epsilon-greedy action selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the doc of `DQNPolicy` for additional parameters (e.g. 
to implement a double duelling DQN).\n", "\n", "We can now use the policy to interact with the environment as usual and visualize a trial with an untrained network:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Evaluation mode\n", "policy.eval() \n", "\n", "# Create a recordable environment\n", "env = gym.make('CartPole-v0', render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "# Sample the initial state\n", "state, info = env.reset()\n", "\n", "# One episode:\n", "done = False\n", "return_episode = 0\n", "while not done:\n", "\n", " # Select an action from the learned policy\n", " action = policy.forward(ts.data.Batch(obs=[state], info=None)).act[0]\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = env.step(action)\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", "\n", " # Update undiscounted return\n", " return_episode += reward\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", "print(\"Return:\", return_episode)\n", "\n", "recorder.record(env.render())\n", "video = \"videos/cartpole-before.gif\"\n", "recorder.make_video(video)\n", "ipython_display(video)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The action selection is done by calling `policy.forward()` on a batch of data containing only the current state:\n", "\n", "```python\n", "action = policy.forward(ts.data.Batch(obs=[state], info=None)).act[0]\n", "```\n", "\n", "### Collector\n", "\n", "As we have seen in the DQN exercise, using a neural network with a batch size of 1 is extremely inefficient and slow. It is much better to use **distributed learning** and parallel workers to collect data. That way, we can form a minibatch of states that can be processed efficiently by the NN. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector = ts.data.Collector(\n", " policy=policy, \n", " env=ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)]),\n", " exploration_noise=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`policy` is the exploration policy. The `exploration_noise` flag allows to switch exploration on and off. For a discrete DQN policy, this impacts the $\\epsilon$-greedy action selection scheme, but other algorithms might use another mechanism (softmax, Gaussian policies, Ornstein-Uhlenbeck, noisy parameters, etc).\n", "\n", "`ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)])` means that we create 10 copies of the Cartpole environment which will be acted upon in parallel using the policy.\n", "\n", "Let's collect some data with the collector. We can either collect a fixed number of steps (over the parallel workers) or episodes. Let's start with 10 steps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.collect(n_step=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's weird, we apparently did not receive any reward. Let's try to collect more steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.collect(n_step=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, returns are only reported at the end of an episode. With 1000 steps (in parallel over 10 workers, i.e. each of them did 100 steps), we collected around 100 episodes of length 9 or 10, i.e. 
the cartpole falls right away, as expected with a random policy.\n", "\n", "Can we collect complete episodes? Yes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.collect(n_episode=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "But where is the data, i.e. the collected transitions? Nowhere, because we forgot to create a buffer to store them. Let's fix that mistake." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector = ts.data.Collector(\n", " policy=policy, \n", " env=ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)]),\n", " buffer=ts.data.VectorReplayBuffer(1000, 10),\n", " exploration_noise=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We preallocate an ERM of max 1000 transitions for each of the 10 workers. One could use a single replay buffer 10 times bigger, but tianshou requires it . Let's collect some episodes and look at the data stored in the first buffer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.collect(n_episode=10)\n", "collector.buffer.buffers[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that each element of the transitions (s, a, r, s', done, terminated) is saved in a preallocated array of 1000 entries. The first replay buffer has only saved one short episode, so most of the data is zero.\n", "\n", "Sampling a minibatch of transitions is easy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.buffer.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transitions come randomly from the workers, so we do not need to worry about it.\n", "\n", "We can reset the buffers with the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.reset_buffer(keep_statistics=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we print the buffer, data still seems to be there:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.buffer.buffers[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But sampling returns an error:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collector.buffer.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interaction loop\n", "\n", "We do not even need to actually sample the buffer, because `policy.update()` takes the buffer and a batch size as input. The following code implements DQN on Cartpole, with an okayish choice of hyperparameters. The main interaction loop consists of:\n", "\n", "1. `collector.collect()`: Collect 100 samples using the 10 workers and store them in the ERM.\n", "2. `policy.update()`: Sample `repeat=10` minitaches of 64 transitions from the buffer and learn from them.\n", "3. 
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Interaction loop\n",
"\n",
"We do not even need to sample the buffer ourselves, because `policy.update()` takes the buffer and a batch size as input. The following code implements DQN on Cartpole, with an okayish choice of hyperparameters. The main interaction loop consists of:\n",
"\n",
"1. `collector.collect()`: Collect 100 samples using the 10 workers and store them in the ERM.\n",
"2. `policy.update()`: Sample `repeat=10` minibatches of 64 transitions from the buffer and learn from them.\n",
"3. `test_collector.collect()`: Test the performance by running 10 episodes without exploration.\n"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Model\n",
"net = ts.utils.net.common.Net(\n",
"    env.observation_space.shape,\n",
"    env.action_space.n,\n",
"    hidden_sizes=[64, 64],\n",
"    device=device,\n",
").to(device)\n",
"\n",
"optim = torch.optim.Adam(net.parameters(), lr=0.001)\n",
"\n",
"# Policy\n",
"policy = ts.policy.DQNPolicy(\n",
"    model=net,                      # value network\n",
"    optim=optim,                    # optimizer\n",
"    discount_factor=0.99,           # gamma\n",
"    estimation_step=1,              # n-step returns\n",
"    is_double=False,                # double Q-learning\n",
"    target_update_freq=120,         # how often to update the target network\n",
"    action_space=env.action_space,  # action space\n",
")\n",
"policy.set_eps(0.1)  # epsilon-greedy action selection\n",
"\n",
"# Collector for training\n",
"collector = ts.data.Collector(\n",
"    policy=policy,\n",
"    env=ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)]),\n",
"    buffer=ts.data.VectorReplayBuffer(20000, 10),\n",
"    exploration_noise=True\n",
")\n",
"# Collector for testing (without exploration). No need for a buffer\n",
"test_collector = ts.data.Collector(\n",
"    policy=policy,\n",
"    env=ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)]),\n",
"    exploration_noise=False\n",
")\n",
"# Pre-fill the training buffer with random transitions\n",
"collector.collect(n_step=1000, random=True)\n",
"\n",
"# Interaction\n",
"returns = []\n",
"for iteration in range(1000):\n",
"\n",
"    # Training mode\n",
"    policy.train()\n",
"\n",
"    # Collect transitions\n",
"    result = collector.collect(n_step=100)\n",
"\n",
"    # Train the DQN network on minibatches\n",
"    policy.update(\n",
"        buffer=collector.buffer,\n",
"        sample_size=0,  # use the whole buffer\n",
"        batch_size=64,\n",
"        repeat=10,\n",
"    )\n",
"\n",
"    # Test 10 episodes\n",
"    policy.eval()\n",
"    result = test_collector.collect(n_episode=10)\n",
"    mean_reward = result['rew']\n",
"    #print(iteration, \":\", mean_reward)\n",
"    returns.append(mean_reward)\n",
"\n",
"plt.figure()\n",
"plt.plot(np.array(returns))\n",
"plt.xlabel(\"Epochs\")\n",
"plt.ylabel(\"Returns\")\n",
"plt.show()"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Evaluation mode\n",
"policy.eval()\n",
"\n",
"# Create a recordable environment\n",
"env = gym.make('CartPole-v0', render_mode=\"rgb_array_list\")\n",
"recorder = GymRecorder(env)\n",
"\n",
"# Sample the initial state\n",
"state, info = env.reset()\n",
"\n",
"# One episode:\n",
"done = False\n",
"return_episode = 0\n",
"while not done:\n",
"    # Select an action from the learned policy\n",
"    action = policy.forward(ts.data.Batch(obs=[state], info=None)).act[0]\n",
"    # Sample a single transition\n",
"    next_state, reward, terminal, truncated, info = env.step(action)\n",
"    # End of the episode\n",
"    done = terminal or truncated\n",
"    # Update undiscounted return\n",
"    return_episode += reward\n",
"    # Go in the next state\n",
"    state = next_state\n",
"print(\"Return:\", return_episode)\n",
"\n",
"recorder.record(env.render())\n",
"video = \"videos/cartpole-dqn.gif\"\n",
"recorder.make_video(video)\n",
"ipython_display(video)"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"**Q:** Understand and run the code. Experiment with the hyperparameters and compare them to the previous exercise. In particular, what is `estimation_step` in the constructor of `DQNPolicy` (try setting it to 3)? What is its influence?\n",
"\n",
"**Q:** Implement scheduling of the exploration parameter with the right hyperparameters."
] },
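{ "cell_type": "markdown", "metadata": {}, "source": [
"As an aside, the hand-written interaction loop above is exactly what the **trainer** mentioned at the beginning automates: data collection, network updates, periodic testing, and callbacks such as an exploration schedule. A rough sketch, assuming the class-based trainer API of recent tianshou releases (older releases expose the same functionality as a function, `ts.trainer.offpolicy_trainer()`, and the argument names may differ between versions):\n",
"\n",
"```python\n",
"result = ts.trainer.OffpolicyTrainer(\n",
"    policy=policy,\n",
"    train_collector=collector,\n",
"    test_collector=test_collector,\n",
"    max_epoch=10,           # number of training epochs\n",
"    step_per_epoch=10000,   # environment steps per epoch\n",
"    step_per_collect=100,   # steps collected before each update phase\n",
"    update_per_step=0.1,    # gradient updates per collected step\n",
"    episode_per_test=10,    # test episodes after each epoch\n",
"    batch_size=64,\n",
"    # decaying epsilon-greedy exploration during training, greedy at test time\n",
"    train_fn=lambda epoch, env_step: policy.set_eps(max(0.01, 0.1 - env_step / 100000)),\n",
"    test_fn=lambda epoch, env_step: policy.set_eps(0.0),\n",
").run()\n",
"```"
] },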
{ "cell_type": "markdown", "metadata": {}, "source": [
"## PPO\n",
"\n",
"Now that DQN works on Cartpole, let's use PPO and compare its performance to DQN.\n",
"\n",
"You will need to use the PPO policy, obviously:\n",
"\n",
"```python\n",
"policy = ts.policy.PPOPolicy(\n",
"    actor=actor,\n",
"    critic=critic,\n",
"    optim=optim,\n",
"    dist_fn=torch.distributions.Categorical,\n",
"    action_space=env.action_space,\n",
"    discount_factor=0.99,\n",
"    max_grad_norm=0.5,\n",
"    eps_clip=0.2,\n",
"    gae_lambda=0.95,\n",
"    deterministic_eval=True,\n",
"    action_scaling=False,\n",
")\n",
"```\n",
"\n",
"It has many more hyperparameters, which can be left at their default values (or not, depending on the time you have). Check the doc for their meaning. The important thing is that you now need an actor and a critic, not a single network.\n",
"\n",
"One way to do this is to use the actor/critic classes provided by tianshou:\n",
"\n",
"```python\n",
"features = ts.utils.net.common.Net(\n",
"    state_shape=env.observation_space.shape,\n",
"    hidden_sizes=[64, 64],\n",
"    device=device)\n",
"\n",
"actor = ts.utils.net.discrete.Actor(\n",
"    preprocess_net=features,\n",
"    action_shape=env.action_space.n,\n",
"    device=device).to(device)\n",
"\n",
"critic = ts.utils.net.discrete.Critic(\n",
"    preprocess_net=features,\n",
"    device=device).to(device)\n",
"\n",
"actor_critic = ts.utils.net.common.ActorCritic(actor=actor, critic=critic)\n",
"\n",
"optim = torch.optim.Adam(actor_critic.parameters(), lr=0.001)\n",
"```\n",
"\n",
"``features`` is the shared feature extractor between the actor and the critic. ``actor`` is the policy head (one neuron per discrete action), ``critic`` is a single output neuron for the value V(s). ``actor_critic`` is the combined two-headed network.\n",
"\n",
"When defining the PPO policy, `dist_fn=torch.distributions.Categorical` specifies how exploration is performed, here a softmax over the two actions left and right.\n",
"\n",
"**Q:** Implement PPO on Cartpole. You will need to find the right hyperparameters for the task.\n",
"\n",
"Remember that learning is **on-policy**, so the transition buffer must be emptied after training the network. Your interaction loop must therefore look like this:\n",
"\n",
"```python\n",
"for iteration in range(N):\n",
"\n",
"    # Collect enough on-policy steps\n",
"    result = collector.collect(...)\n",
"\n",
"    # Update the PPO network by learning from the on-policy buffer\n",
"    policy.update(...)\n",
"\n",
"    # Empty the buffer as we are on-policy\n",
"    collector.reset_buffer(keep_statistics=False)\n",
"```\n",
"\n",
"Do not hesitate to collect many on-policy steps before training the network."
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"**Q:** How does it compare to DQN once the right hyperparameters are found? What is their influence? Play especially with `eps_clip`, the $\\epsilon$ threshold used to clip the importance sampling weight in the PPO loss.\n",
"\n",
"**Q:** Apply PPO to more complex environments available in gymnasium. Beware that for continuous action spaces, you will need to use continuous actor/critic networks, i.e.:\n",
"\n",
"```python\n",
"actor = ts.utils.net.continuous.Actor(...)\n",
"critic = ts.utils.net.continuous.Critic(...)\n",
"```\n",
"\n",
"instead of:\n",
"\n",
"```python\n",
"actor = ts.utils.net.discrete.Actor(...)\n",
"critic = ts.utils.net.discrete.Critic(...)\n",
"```\n",
"\n",
"The `dist_fn` argument of `PPOPolicy` must also be set accordingly. Check the doc!"
] }
], "metadata": { "kernelspec": { "display_name": "tianshou", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }