{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grid2Op integration with existing frameworks\n", "\n", "Try me out interactively with: [![Binder](./img/badge_logo.svg)](https://mybinder.org/v2/gh/rte-france/Grid2Op/master)\n", "\n", "\n", "**objectives** This notebooks briefly explains how to use grid2op with commonly used RL frameworks. It also explains the main methods / class of the `grid2op.gym_compat` module that ease grid2op integration with these frameworks.\n", "\n", "The structure is always very similar:\n", "1. Create a grid2op environment\n", "2. Convert it to a gym environment\n", "3. (optional) Customize the action space and observation space\n", "4. Use the framework to train an agent\n", "5. Embed the trained agent into a grid2op Agent to take valid grid2op actions.\n", "\n", "In this notebook, we will demonstrate its usage with 3 different framework. The code provided here are given as examples and we do not assume anything on their performance or fitness of use. More detailed example will be provided in the l2rpn-baselines repository in due time (work in progress at the time of writing this notebook). The 3 framework we will demonstrate an example of are:\n", "\n", "- ray (rllib): see [ray on github](https://github.com/ray-project/ray) or [rllib on github](https://github.com/ray-project/ray/blob/master/doc/source/rllib.rst)\n", "- stable-baselines3: see [stable-baselines3 on github](https://github.com/DLR-RM/stable-baselines3)\n", "- tf_agents: see [tf_agents on github](https://github.com/tensorflow/agents)\n", "\n", "Other RL frameworks are not cover here. If you already use them, let us know !\n", "- https://github.com/PaddlePaddle/PARL/blob/develop/README.md (used by the winner teams of Neurips competitions !) Work in progress.\n", "- https://github.com/wau/keras-rl2\n", "- https://github.com/deepmind/acme\n", "\n", "Note also that there is still the possibility to use past codes in the l2rpn-baselines repository: https://github.com/rte-france/l2rpn-baselines . This repository contains code snippets that can be reuse to make really nice agents on the l2rpn competitions. You can try it out :-) \n", "\n", "\n", "Execute the cell below by removing the `#` characters if you use google colab !\n", "\n", "Cell will look like:\n", "```python\n", "import sys\n", "!$sys.executable install grid2op[optional] # for use with google colab (grid2Op is not installed by default)\n", "!$sys.executable install tensorflow pytorch stable-baselines3 'ray[rllib]' tf_agents\n", "```\n", "\n", "It might take a while\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "# !$sys.executable install grid2op[optional] # for use with google colab (grid2Op is not installed by default)\n", "# !$sys.executable -m pip install stable-baselines3 'ray[rllib]' tf_agents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# because this notebook is part of some tests, we train the agent for only a small number of steps\n", "nb_step_train = 0 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Organisation of this notebook\n", "\n", "For the organisation of this notebook we decided to first detail features closer to grid2op to go later on \"higher level\" feature that are closer to \"standard\" gym representation (eg `Box` and `Discrete` space).\n", "\n", "Note the closer you are to grid2op the more grid2op feature you can use. 
For example, in a gym environment, it is not possible to use the \"simulate\" function (remember, this function allows you to use a simulator whose behaviour is close to the one of the environment) at all. Also, grid2op observations and actions come with a lot of different features (the capacity to add new attributes, to retrieve the graph of the grid, etc.) which cannot be used directly in gym.\n", "\n", "That being said, this notebook is organized as follows:\n", "\n", "- [Convert it to a gym environment](#Convert-it-to-a-gym-environment): basic use of the `gym_compat` grid2op module allowing you to convert a grid2op environment to a gym environment.\n", "- [Action space](#Action-space): basic usage of the action space, by removing redundant features (`gym_env.action_space.ignore_attr`) or transforming features from a continuous space to a discrete space (`ContinuousToDiscreteConverter`)\n", "- [Observation space](#Observation-space): basic usage of the observation space, by removing redundant features (`keep_only_attr`) or scaling the data to lie within a certain range (`ScalerAttrConverter`)\n", "- [Making the grid2op agent](#Making-the-grid2op-agent) explains how to make a grid2op agent once trained. Note that a more \"agent focused\" view is provided in the notebook [04_TrainingAnAgent](04_TrainingAnAgent.ipynb)!\n", "- [1) RLLIB](#1\\)-RLLIB): more advanced usage for customizing the observation space (`gym_env.observation_space.reencode_space` and `gym_env.observation_space.add_key`) or modifying the type of gym attributes (`MultiToTupleConverter`), as well as an example of how to use the RLLIB framework\n", "- [2) Stable baselines](#2\\)-Stable-baselines): even more advanced usage for customizing the observation space by concatenating it into a single \"Box\" (instead of a dictionary) thanks to `BoxGymObsSpace`, and how to use `BoxGymActSpace` if you are more focused on continuous actions and `MultiDiscreteActSpace` for discrete actions (**NB** in both cases there will be a loss of information compared to regular grid2op actions! for example it will be harder to have a representation of the graph of the grid there)\n", "- [3) Tf Agents](#3\\)-Tf-Agents) explains how to convert the action space into a \"Discrete\" gym space thanks to `DiscreteActSpace`\n", "\n", "In each section, we also explain concisely how to train the agent. Note that we did not spend any time customizing the default agents and training schemes. It is therefore rather unlikely that the agents trained here will perform well without further work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0) Recommended initial steps\n", "\n", "### Create a grid2op environment\n", "\n", "This is a rather standard step, with lots of inspiration drawn from the openAI gym framework, and there is absolutely no specificity here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import grid2op\n", "env_name = \"l2rpn_case14_sandbox\"\n", "env_glop = grid2op.make(env_name, test=True)  # NOTE: do not set the flag \"test=True\" for a real usage !\n", "# This flag is here for testing purposes !!!\n", "obs_glop = env_glop.reset()\n", "obs_glop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert it to a gym environment\n", "\n", "To that end, we recommend using the \"gym_compat\" module. 
More information is given in the [official grid2op documentation](https://grid2op.readthedocs.io/en/latest/gym.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gym\n", "import numpy as np\n", "from grid2op.gym_compat import GymEnv\n", "env_gym = GymEnv(env_glop)\n", "print(f\"The \\\"env_gym\\\" is a gym environment: {isinstance(env_gym, gym.Env)}\")\n", "obs_gym = env_gym.reset()\n", "# obs_gym" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customize the action space and observation space\n", "\n", "This step is optional, but highly recommended.\n", "\n", "By default, grid2op actions and observations are huge. Even for this very simplistic example, you have really important sizes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dim_act_space = np.sum([np.sum(env_gym.action_space[el].shape) for el in env_gym.action_space.spaces])\n", "print(f\"The size of the action space is : \"\n", " f\"{dim_act_space}\")\n", "dim_obs_space = np.sum([np.sum(env_gym.observation_space[el].shape).astype(int) \n", " for el in env_gym.observation_space.spaces])\n", "print(f\"The size of the observation space is : \"\n", " f\"{dim_obs_space}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Action space\n", "This is partly due because in grid2op, you can represent the same concept (*eg* reconnect a powerline) in different manners (in this case: either you \"toggle a switch\" - if the said powerline was connected, it will disconnect it, otherwise it will reconnect it- or you can say \"i want this line connected whatever its original state\"). This behaviour is detailed in the [official grid2op documentation](https://grid2op.readthedocs.io/en/latest/action.html#usage-examples).\n", "\n", "To (in general) reduce the action space by a factor of 2, you can represent these actions only using the change method (for example). You can do that with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example: ignore the \"set_status\" and \"set_bus\" type of actions, that are covered by the \"change_status\" and\n", "# \"change_bus\"\n", "\n", "env_gym.action_space = env_gym.action_space.ignore_attr(\"set_bus\").ignore_attr(\"set_line_status\")\n", "\n", "new_dim_act_space = np.sum([np.sum(env_gym.action_space[el].shape) for el in env_gym.action_space.spaces])\n", "print(f\"The new size of the action space is : {new_dim_act_space}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grid2op environments allow for both continuous and discrete action. 
For the sake of the example, let's \"convert\" the continuous actions into discrete ones (this is done by \"binning\" the values, as explained in more detail [in the documentation](https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.ContinuousToDiscreteConverter))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example: convert the continuous action type \"redispatch\" to a discrete action type\n", "from grid2op.gym_compat import ContinuousToDiscreteConverter\n", "env_gym.action_space = env_gym.action_space.reencode_space(\"redispatch\",\n", "                                                            ContinuousToDiscreteConverter(nb_bins=11)\n", "                                                            )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# And now our action space looks like:\n", "env_gym.action_space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation space\n", "\n", "For the observation space, we will remove lots of attributes (remember, it is only for the sake of the example here) and rescale some others so that their values lie roughly between 0. and 1., which stabilizes the learning process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first let's see which are the attributes in the observation space:\n", "# More information on\n", "# https://beta-grid2op.readthedocs.io/en/latest/observation.html#main-observation-attributes\n", "# and \n", "# https://grid2op.readthedocs.io/en/latest/gym.html#observation-space-and-action-space-customization\n", "env_gym.observation_space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's keep only the information about the flow on the powerlines `rho`, the generation `gen_p`, the load `load_p`, the representation of the topology `topo_vect` and the dispatch `actual_dispatch` (for the sake of the example, once again)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "env_gym.observation_space = env_gym.observation_space.keep_only_attr([\"rho\", \"gen_p\", \"load_p\", \"topo_vect\", \n", "                                                                      \"actual_dispatch\"])\n", "new_dim_obs_space = np.sum([np.sum(env_gym.observation_space[el].shape).astype(int) \n", "                            for el in env_gym.observation_space.spaces])\n", "print(f\"The new size of the observation space is : \"\n", "      f\"{new_dim_obs_space} (it was {dim_obs_space} before!)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One other detail here: the generation and loads are not scaled (they are given in MW). 
We recommend scaling them to have numbers roughly between 0 and 1 for stability during learning.\n", "\n", "This can be done pretty easily with the code below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.gym_compat import ScalerAttrConverter\n", "from gym.spaces import Box\n", "ob_space = env_gym.observation_space\n", "ob_space = ob_space.reencode_space(\"actual_dispatch\",\n", "                                   ScalerAttrConverter(substract=0.,\n", "                                                       divide=env_glop.gen_pmax\n", "                                                       )\n", "                                   )\n", "ob_space = ob_space.reencode_space(\"gen_p\",\n", "                                   ScalerAttrConverter(substract=0.,\n", "                                                       divide=env_glop.gen_pmax\n", "                                                       )\n", "                                   )\n", "ob_space = ob_space.reencode_space(\"load_p\",\n", "                                   ScalerAttrConverter(substract=obs_gym[\"load_p\"],\n", "                                                       divide=0.5 * obs_gym[\"load_p\"]\n", "                                                       )\n", "                                   )\n", "\n", "# for even more customization, you can use any function you want !\n", "shape_ = (env_glop.dim_topo, env_glop.dim_topo)\n", "env_gym.observation_space.add_key(\"connectivity_matrix\",\n", "                                  lambda obs: obs.connectivity_matrix(),  # can be any function returning a gym space\n", "                                  Box(shape=shape_,\n", "                                      low=np.zeros(shape_),\n", "                                      high=np.ones(shape_),\n", "                                      )  # this \"Box\" should represent the return type of the above function\n", "                                  )\n", "env_gym.observation_space = ob_space\n", "env_gym.observation_space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making the grid2op agent\n", "\n", "In this subsection we briefly explain how to wrap the trained agent (see below for training methods depending on the framework you want to use). The goal is to make this \"tutorial\" complete, in the sense that you will be able to use the trained agent in the regular grid2op framework, for example using the `Runner`.\n", "\n", "This subsection is compatible with all the code explained in this notebook, even though we demonstrate it with the env created above.\n", "\n", "The basic idea is really simple: you create a grid2op agent, initialize it with the gym_env (which you got from the `gym_compat` module) and use the \"gym_env.action_space.from_gym\" and \"gym_env.observation_space.to_gym\" functions to convert the actions and the observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.Agent import BaseAgent\n", "\n", "class AgentFromGym(BaseAgent):\n", "    def __init__(self, gym_env, trained_agent):\n", "        self.gym_env = gym_env\n", "        BaseAgent.__init__(self, gym_env.init_env.action_space)\n", "        self.trained_agent = trained_agent\n", "    def act(self, obs, reward, done):\n", "        gym_obs = self.gym_env.observation_space.to_gym(obs)\n", "        gym_act = self.trained_agent.act(gym_obs, reward, done)\n", "        grid2op_act = self.gym_env.action_space.from_gym(gym_act)\n", "        return grid2op_act" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this is it. You are done ;-)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1) RLLIB\n", "\n", "This part is not a tutorial on how to use rllib. Please refer to [their documentation](https://docs.ray.io/en/master/rllib.html) for more detailed information.\n", "\n", "As explained in the header of this notebook, we will follow the recommended usage:\n", "1. Create a grid2op environment (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "2. Convert it to a gym environment (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "3. 
(optional) Customize the action space and observation space (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "4. Use the framework to train an agent **(only this part is framework specific)**\n", "\n", "\n", "The issue with rllib is that it does not take into account MultiBinary nor MultiDiscrete action space (see \n", "see https://github.com/ray-project/ray/issues/1519) so we need some way to encode these types of actions. This can be done automatically with the `MultiToTupleConverter` provided in grid2op (as always, more information [in the documentation](https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.MultiToTupleConverter) ).\n", "\n", "We will then use this to customize our environment previously defined:\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import copy\n", "env_rllib = copy.deepcopy(env_gym)\n", "from grid2op.gym_compat import MultiToTupleConverter\n", "env_rllib.action_space = env_rllib.action_space.reencode_space(\"change_bus\", MultiToTupleConverter())\n", "env_rllib.action_space = env_rllib.action_space.reencode_space(\"change_line_status\", MultiToTupleConverter())\n", "env_rllib.action_space = env_rllib.action_space.reencode_space(\"redispatch\", MultiToTupleConverter())\n", "env_rllib.action_space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another specificity of RLLIB is that it handles creation of environments \"on its own\". This implies that you need to create a custom class representing an environment, rather a python object.\n", "\n", "And finally, you ask it to use this class, and learn a specific agent. This is really well explained in their documentation: https://docs.ray.io/en/master/rllib-env.html#configuring-environments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# gym specific, we simply do a copy paste of what we did in the previous cells, wrapping it in the\n", "# MyEnv class, and train a Proximal Policy Optimisation based agent\n", "import gym\n", "import ray\n", "import gym\n", "import numpy as np\n", " \n", "class MyEnv(gym.Env):\n", " def __init__(self, env_config):\n", " import grid2op\n", " from grid2op.gym_compat import GymEnv\n", " from grid2op.gym_compat import ScalerAttrConverter, ContinuousToDiscreteConverter, MultiToTupleConverter\n", "\n", " # 1. create the grid2op environment\n", " if not \"env_name\" in env_config:\n", " raise RuntimeError(\"The configuration for RLLIB should provide the env name\")\n", " nm_env = env_config[\"env_name\"]\n", " del env_config[\"env_name\"]\n", " self.env_glop = grid2op.make(nm_env, **env_config)\n", "\n", " # 2. create the gym environment\n", " self.env_gym = GymEnv(self.env_glop)\n", " obs_gym = self.env_gym.reset()\n", "\n", " # 3. 
(optional) customize it (see section above for more information)\n", " ## customize action space\n", " self.env_gym.action_space = self.env_gym.action_space.ignore_attr(\"set_bus\").ignore_attr(\"set_line_status\")\n", " self.env_gym.action_space = self.env_gym.action_space.reencode_space(\"redispatch\",\n", " ContinuousToDiscreteConverter(nb_bins=11)\n", " )\n", " self.env_gym.action_space = self.env_gym.action_space.reencode_space(\"change_bus\", MultiToTupleConverter())\n", " self.env_gym.action_space = self.env_gym.action_space.reencode_space(\"change_line_status\",\n", " MultiToTupleConverter())\n", " self.env_gym.action_space = self.env_gym.action_space.reencode_space(\"redispatch\", MultiToTupleConverter())\n", " ## customize observation space\n", " ob_space = self.env_gym.observation_space\n", " ob_space = ob_space.keep_only_attr([\"rho\", \"gen_p\", \"load_p\", \"topo_vect\", \"actual_dispatch\"])\n", " ob_space = ob_space.reencode_space(\"actual_dispatch\",\n", " ScalerAttrConverter(substract=0.,\n", " divide=self.env_glop.gen_pmax\n", " )\n", " )\n", " ob_space = ob_space.reencode_space(\"gen_p\",\n", " ScalerAttrConverter(substract=0.,\n", " divide=self.env_glop.gen_pmax\n", " )\n", " )\n", " ob_space = ob_space.reencode_space(\"load_p\",\n", " ScalerAttrConverter(substract=obs_gym[\"load_p\"],\n", " divide=0.5 * obs_gym[\"load_p\"]\n", " )\n", " )\n", " self.env_gym.observation_space = ob_space\n", "\n", " # 4. specific to rllib\n", " self.action_space = self.env_gym.action_space\n", " self.observation_space = self.env_gym.observation_space\n", "\n", " def reset(self):\n", " obs = self.env_gym.reset()\n", " return obs\n", "\n", " def step(self, action):\n", " obs, reward, done, info = self.env_gym.step(action)\n", " return obs, reward, done, info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test = MyEnv({\"env_name\": \"l2rpn_case14_sandbox\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now you can train it :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if nb_step_train: # remember: don't forge to change this number to perform an actual training !\n", " from ray.rllib.agents import ppo # import the type of agents\n", " # fist initialize ray\n", " ray.init()\n", " try:\n", " # then define a \"trainer\"\n", " trainer = ppo.PPOTrainer(env=MyEnv, config={\n", " \"env_config\": {\"env_name\":\"l2rpn_case14_sandbox\"}, # config to pass to env class\n", " })\n", " # and then train it for a given number of iteration\n", " for step in range(nb_step_train):\n", " trainer.train()\n", " finally: \n", " # shutdown ray\n", " ray.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB** We want to emphasize here that:\n", "- This encoding is far from being suitable here. It is shown as an example, mainly to demonstrate the use of some of the gym_compat module\n", "- The actions in particular are not really suited here. Actions in grid2op are relatively complex and encoding them this way does not seem like a great idea. For example, with this encoding, the agent will have to learn that it cannot act on more than 2 lines or two substations at the same time...\n", "- The \"PPO\" agent shown here, with some default parameters is unlikely to lead to a good agent. You might want to read litterature on past L2RPN agents or draw some inspiration from L2RPN baselines packages for more information." 
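, "\n", "As a last (hedged) illustration, once training has produced a `trainer`, you can plug it back into grid2op through the `AgentFromGym` wrapper defined in the section [Making the grid2op agent](#Making-the-grid2op-agent) and evaluate it, for example with the grid2op `Runner`. This is only a minimal sketch: it assumes the `env_rllib` gym environment defined above, and it relies on `trainer.compute_action`, which was the rllib API at the time of writing (check the documentation of the ray version you use). The small `RLLIBAgent` adapter is purely illustrative:\n", "\n", "```python\n", "from grid2op.Runner import Runner\n", "\n", "class RLLIBAgent:\n", "    # tiny adapter: expose the `act(gym_obs, reward, done)` method expected by AgentFromGym\n", "    def __init__(self, trainer):\n", "        self.trainer = trainer\n", "    def act(self, gym_obs, reward, done):\n", "        # ask the trained rllib policy for a gym action\n", "        return self.trainer.compute_action(gym_obs)\n", "\n", "my_agent = AgentFromGym(env_rllib, RLLIBAgent(trainer))\n", "# run one episode with the standard grid2op Runner\n", "runner = Runner(**env_glop.get_params_for_runner(), agentClass=None, agentInstance=my_agent)\n", "res = runner.run(nb_episode=1)\n", "```\n"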
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) Stable baselines\n", "\n", "This part is not a tutorial on how to use stable baselines. Please refer to [their documentation](https://stable-baselines3.readthedocs.io/en/master/) for more detailed information.\n", "\n", "As explained in the header of this notebook, we will follow the recommended usage:\n", "1. Create a grid2op environment (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "2. Convert it to a gym environment (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "3. (optional) Customize the action space and observation space (see section [0) Recommended initial steps](#0\\)-Recommended-initial-steps))\n", "4. Use the framework to train an agent **(only this part is framework specific)**\n", "\n", "\n", "The issue with stable beselines 3 is that it expects standard action / observation types as explained there:\n", "https://stable-baselines3.readthedocs.io/en/master/guide/algos.html#rl-algorithms\n", "\n", "> Non-array spaces such as Dict or Tuple are not currently supported by any algorithm.\n", "\n", "Unfortunately, it's not possible to convert without any \"loss of information\" an action space of dictionnary type to a vector.\n", "\n", "It is possible to use the grid2op framework in such cases, and in this section, we will explain how.\n", "\n", "\n", "First, as always, we convert the grid2op environment in a gym environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "env_sb = GymEnv(env_glop) # sb for \"stable baselines\"\n", "glop_obs = env_glop.reset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we need to convert everything into a \"Box\" as it is the only things that stable baselines seems to digest at time of writing (March 20201).\n", "\n", "### Observation Space\n", "\n", "We explain here how we convert an observation as a single Box. 
This step is rather easy, you just need to specify which attributes of the observation you want to keep and if you want so scale them (with the keword `subtract` and `divide`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.gym_compat import BoxGymObsSpace\n", "env_sb.observation_space = BoxGymObsSpace(env_sb.init_env.observation_space,\n", " attr_to_keep=[\"gen_p\", \"load_p\", \"topo_vect\",\n", " \"rho\", \"actual_dispatch\", \"connectivity_matrix\"],\n", " divide={\"gen_p\": env_glop.gen_pmax,\n", " \"load_p\": glop_obs.load_p,\n", " \"actual_dispatch\": env_glop.gen_pmax},\n", " functs={\"connectivity_matrix\": (\n", " lambda grid2obs: grid2obs.connectivity_matrix().flatten(),\n", " 0., 1., None, None,\n", " )\n", " }\n", " )\n", "obs_gym = env_sb.reset()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs_gym in env_sb.observation_space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB**: the above code is equivalent to something like:\n", "\n", "```python\n", "from gym.spaces import Box\n", "class BoxGymObsSpaceExample(Box):\n", " def __init__(self, observation_space)\n", " shape = observation_space.n_gen + \\ # dimension of gen_p\n", " observation_space.n_load + \\ # load_p\n", " observation_space.dim_topo + \\ # topo_vect\n", " observation_space.n_line + \\ # rho\n", " observation_space.n_gen + \\ # actual_dispatch\n", " observation_space.dim_topo ** 2 # connectivity_matrix\n", " \n", " ob_sp = observation_space\n", " # lowest value the attribute can take (see doc for more information)\n", " low = np.concatenate((np.full(shape=(ob_sp.n_gen,), fill_value=0., dtype=dt_float), # gen_p\n", " np.full(shape=(ob_sp.n_load,), fill_value=-np.inf, dtype=dt_float), # load_p\n", " np.full(shape=(ob_sp.dim_topo,), fill_value=-1., dtype=dt_float), # topo_vect\n", " np.full(shape=(ob_sp.n_line,), fill_value=0., dtype=dt_float), # rho\n", " np.full(shape=(ob_sp.n_line,), fill_value=-ob_sp.gen_pmax, dtype=dt_float), # actual_dispatch\n", " np.full(shape=(ob_sp.dim_topo**2,), fill_value=0., dtype=dt_float), # connectivity_matrix\n", " ))\n", " \n", " # highest value the attribute can take\n", " high = np.concatenate((np.full(shape=(ob_sp.n_gen,), fill_value=np.inf, dtype=dt_float), # gen_p\n", " np.full(shape=(ob_sp.n_load,), fill_value=np.inf, dtype=dt_float), # load_p\n", " np.full(shape=(ob_sp.dim_topo,), fill_value=2., dtype=dt_float), # topo_vect\n", " np.full(shape=(ob_sp.n_line,), fill_value=np.inf, dtype=dt_float), # rho\n", " np.full(shape=(ob_sp.n_line,), fill_value=ob_sp.gen_pmax, dtype=dt_float), # actual_dispatch\n", " np.full(shape=(ob_sp.dim_topo**2,), fill_value=1., dtype=dt_float), # connectivity_matrix\n", " ))\n", " Box.__init__(self, low=low, high=high, shape=shape)\n", " \n", " def to_gym(self, observation):\n", " res = np.concatenate((obs.gen_p / obs.gen_pmax,\n", " obs.prod_p / glop_obs.load_p,\n", " obs.topo_vect.astype(float),\n", " obs.rho,\n", " obs.actual_dispatch / env_glop.gen_pmax,\n", " obs.connectivity_matrix().flatten()\n", " ))\n", " return res\n", "```\n", "\n", "So if you want more customization, but making less generic code (the `BoxGymObsSpace` works for all the attribute of the observation) you can customize it by adapting the snippet above or read the documentation here (TODO).\n", "\n", "Only the \"to_gym\" function, and this exact signature is important in this case. 
It should take an observation in grid2op format and return the same observation in a format compatible with the gym Box (so a numpy array with the right shape and values in the right range).\n", " \n", "\n", "### Action space\n", "\n", "We now convert the grid2op actions into something that is neither a Tuple nor a Dict. The main restriction in these frameworks is that they do not allow for easy integration of environments where both discrete and continuous actions are possible.\n", "\n", "\n", "#### Using a BoxGymActSpace\n", "\n", "We can use the same kind of method explained above with the class `BoxGymActSpace`. In this case, you need to provide a way to convert a numpy array (an element of a gym Box) into a grid2op action.\n", "\n", "**NB** This method is particularly suited if you want to focus on the CONTINUOUS part of the action space, for example redispatching, curtailment or actions on storage units.\n", "\n", "Though we made it possible to also use discrete actions this way, we do not recommend it. Prefer using `MultiDiscreteActSpace` for that purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.gym_compat import BoxGymActSpace\n", "scale_gen = env_sb.init_env.gen_max_ramp_up + env_sb.init_env.gen_max_ramp_down\n", "scale_gen[~env_sb.init_env.gen_redispatchable] = 1.0\n", "env_sb.action_space = BoxGymActSpace(env_sb.init_env.action_space,\n", "                                     attr_to_keep=[\"redispatch\"],\n", "                                     multiply={\"redispatch\": scale_gen},\n", "                                     )\n", "obs_gym = env_sb.reset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB**: the above code is equivalent to something like:\n", "\n", "```python\n", "import numpy as np\n", "from gym.spaces import Box\n", "from grid2op.dtypes import dt_float\n", "\n", "class BoxGymActSpaceExample(Box):\n", "    def __init__(self, action_space):\n", "        act_sp = action_space\n", "        shape = (act_sp.n_gen,)  # redispatch\n", "        \n", "        # lowest value the attribute can take (see doc for more information)\n", "        low = np.full(shape=shape, fill_value=-1., dtype=dt_float)\n", "        \n", "        # highest value the attribute can take\n", "        high = np.full(shape=shape, fill_value=1., dtype=dt_float)\n", "        \n", "        Box.__init__(self, low=low, high=high, shape=shape)\n", "        \n", "        self.action_space = action_space\n", "        \n", "    def from_gym(self, gym_act):\n", "        res = self.action_space()\n", "        res.redispatch = gym_act * scale_gen  # scale_gen as defined in the cell above\n", "        return res\n", "```\n", "\n", "So if you want more customization, at the price of less generic code (the real `BoxGymActSpace` works for all the attributes of the action), you can adapt the snippet above or read the documentation here (TODO). The only important method you need to code is \"from_gym\": it should take an action as sampled from the gym Box and return a grid2op action.\n", "\n", "\n", "#### Using a MultiDiscreteActSpace\n", "\n", "We can use the same kind of method as above with the class `MultiDiscreteActSpace`, which is more suited to the discrete type of actions.\n", "\n", "In this case, you need to provide a way to convert a numpy array of integers (an element of a gym MultiDiscrete) into a grid2op action.\n", "\n", "**NB** This method is particularly suited if you want to focus on the DISCRETE part of the action space, for example `set_bus` or `change_line_status`."
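, "\n", "Once the next cell has been run, a quick sanity check (a sketch only, not executed here) is to sample a gym action and convert it back to a grid2op action, to see in a human readable way what the encoded vector actually does on the grid:\n", "\n", "```python\n", "gym_act = env_sb.action_space.sample()  # a numpy array of integers (one entry per dimension of the MultiDiscrete space)\n", "grid2op_act = env_sb.action_space.from_gym(gym_act)  # the corresponding grid2op action\n", "print(grid2op_act)  # grid2op actions have a human readable string representation\n", "```\n"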
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.gym_compat import MultiDiscreteActSpace\n", "reencoded_act_space = MultiDiscreteActSpace(env_sb.init_env.action_space,\n", " attr_to_keep=[\"set_line_status\", \"set_bus\", \"redispatch\"])\n", "env_sb.action_space = reencoded_act_space\n", "obs_gym = env_sb.reset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrapping all up and starting the training\n", "\n", "First, let's make sure our environment is compatible with stable baselines, thanks to their helper function.\n", "\n", "This means that " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from stable_baselines3.common.env_checker import check_env\n", "check_env(env_sb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So as we see, the environment seems to be compatible with stable baselines. Now we can start the training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from stable_baselines3 import PPO\n", "model = PPO(\"MlpPolicy\", env_sb, verbose=1)\n", "if nb_step_train:\n", " model.learn(total_timesteps=nb_step_train)\n", " # model.save(\"ppo_stable_baselines3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the goal of this section was not to demonstrate how to train a state of the art algorithm, but rather to demonstrate how to use grid2op with the stable baselines repository.\n", "\n", "Most importantly, the neural networks there are not customized for the environment, default parameters are used. This is unlikely to work at all !\n", "\n", "For more information and to use tips and tricks to get started with RL agents, the devs of \"stable baselines\" have done a really nice job. You can have some tips for training RL agents here\n", "https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html\n", "and consult any of the resources listed there https://stable-baselines3.readthedocs.io/en/master/guide/rl.html\n", "\n", "\n", "## 3) Tf Agents\n", "Lastly, the RL frameworks we will use is tf agents.\n", "\n", "Compared to the previous one, this framework is more verbose. In this notebook we will mimic what has been done in the https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb\n", "\n", "To that end, we will introduce the last \"gym transformer\" available in grid2op at time of writing. This function will transform the action space in a Discrete one. With this modeling, the agent can take an action on a substation, or act on a powerline or perform redispatching. But, as opposed to what is done previously, it cannot act on, say, a substation and a powerline at the same time.\n", "\n", "This limitation does not come from tf agents. But this limitation is necessary to run the tutorial of the DQN provided with tf agents.\n", "\n", "\n", "First we will build the observation space as for the stable baselines repository. 
See section [2) Stable baselines](#2\\)-Stable-baselines) for more information.\n", "\n", "### Observation space" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create the gym environment\n", "env_tfa = GymEnv(env_glop)  # tfa for \"tf agents\"\n", "glop_obs = env_glop.reset()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# customize the observation space\n", "env_tfa.observation_space = BoxGymObsSpace(env_tfa.init_env.observation_space,\n", "                                           attr_to_keep=[\"gen_p\", \"load_p\", \"topo_vect\",\n", "                                                         \"rho\", \"actual_dispatch\", \"connectivity_matrix\"],\n", "                                           divide={\"gen_p\": env_glop.gen_pmax,\n", "                                                   \"load_p\": glop_obs.load_p,\n", "                                                   \"actual_dispatch\": env_glop.gen_pmax},\n", "                                           functs={\"connectivity_matrix\": (\n", "                                               lambda grid2obs: grid2obs.connectivity_matrix().flatten(),\n", "                                               0., 1., None, None,\n", "                                               )\n", "                                           }\n", "                                           )\n", "obs_gym = env_tfa.reset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the observation space might need to be customized further. We do not claim that everything kept here is relevant, nor that all the information an agent would need is included. \n", "\n", "This example is only here to demonstrate how to use grid2op with the openai gym framework.\n", "\n", "### Action space\n", "\n", "As opposed to the previous action space, to use the tutorial of tf agents we need to customize the action space to output a single number (the id of the action you want to take).\n", "\n", "This can be done with the `DiscreteActSpace` gym converter, which behaves approximately the same way as `MultiDiscreteActSpace`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from grid2op.gym_compat import DiscreteActSpace\n", "reencoded_act_space = DiscreteActSpace(env_tfa.init_env.action_space,\n", "                                       attr_to_keep=[\"set_line_status\", \"set_bus\", \"redispatch\"])\n", "env_tfa.action_space = reencoded_act_space\n", "obs_gym = env_tfa.reset()\n", "print(env_tfa.action_space.from_gym(env_tfa.action_space.sample()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(env_tfa.action_space.from_gym(env_tfa.action_space.sample()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrapping up and starting the training\n", "\n", "And that is it. All the rest is done thanks to tf agents. \n", "\n", "tf agents is a lot more verbose than ray and stable baselines, but it allows for more control over what you want to do. For the sake of the example, we will only show the steps without detailing them.\n", "\n", "For more information, you can visit their github:\n", "https://github.com/tensorflow/agents\n", "\n", "website:\n", "https://www.tensorflow.org/agents/api_docs/python/tf_agents\n", "\n", "and the notebook that inspired this one:\n", "https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the above code, once again, only aims at showing how to integrate grid2op with tf agents. Its aim is not to showcase the best use of tensorflow, tf agents or grid2op.\n", "\n", "It is only an example for demonstration purposes and does not aim at providing an interesting agent at all. For that you might want to use something different from DQN, tune the hyper parameters (including the size of each neural network, the number of steps for which you train, the learning rate, 
etc.), define in a better fasshion the action space and observation space etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "from tf_agents.agents.dqn import dqn_agent\n", "from tf_agents.drivers import dynamic_step_driver\n", "from tf_agents.environments import suite_gym\n", "from tf_agents.environments import tf_py_environment\n", "from tf_agents.eval import metric_utils\n", "from tf_agents.metrics import tf_metrics\n", "from tf_agents.networks import sequential\n", "from tf_agents.policies import random_tf_policy\n", "from tf_agents.replay_buffers import tf_uniform_replay_buffer\n", "from tf_agents.trajectories import trajectory\n", "from tf_agents.specs import tensor_spec\n", "from tf_agents.utils import common\n", "\n", "# initialize the environment\n", "from tf_agents.environments.gym_wrapper import GymWrapper\n", "tf_env_train = tf_py_environment.TFPyEnvironment(GymWrapper(env_tfa))\n", "eval_env = tf_py_environment.TFPyEnvironment(GymWrapper(copy.deepcopy(env_tfa)))\n", "\n", "# meta parameters\n", "num_iterations = nb_step_train\n", "\n", "initial_collect_steps = 100\n", "collect_steps_per_iteration = 1\n", "replay_buffer_max_length = 100000\n", "batch_size = 64\n", "learning_rate = 1e-3\n", "log_interval = 200\n", "num_eval_episodes = 10 \n", "eval_interval = 1000\n", "\n", "# neural nets (for the agents)\n", "fc_layer_params = (100, 50)\n", "action_tensor_spec = tensor_spec.from_spec(tf_env_train.action_spec())\n", "num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1\n", "\n", "# Define a helper function to create Dense layers configured with the right\n", "# activation and kernel initializer.\n", "def dense_layer(num_units):\n", " return tf.keras.layers.Dense(\n", " num_units,\n", " activation=tf.keras.activations.relu,\n", " kernel_initializer=tf.keras.initializers.VarianceScaling(\n", " scale=2.0, mode='fan_in', distribution='truncated_normal'))\n", "\n", "# QNetwork consists of a sequence of Dense layers followed by a dense layer\n", "# with `num_actions` units to generate one q_value per available action as\n", "# it's output.\n", "dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]\n", "q_values_layer = tf.keras.layers.Dense(\n", " num_actions,\n", " activation=None,\n", " kernel_initializer=tf.keras.initializers.RandomUniform(\n", " minval=-0.03, maxval=0.03),\n", " bias_initializer=tf.keras.initializers.Constant(-0.2))\n", "q_net = sequential.Sequential(dense_layers + [q_values_layer])\n", "\n", "# optimizer (for training)\n", "optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)\n", "\n", "# just a variable to count the number of \"env.step\" performed\n", "train_step_counter = tf.Variable(0)\n", "\n", "# create the agent\n", "agent = dqn_agent.DqnAgent(\n", " tf_env_train.time_step_spec(),\n", " tf_env_train.action_spec(),\n", " q_network=q_net,\n", " optimizer=optimizer,\n", " td_errors_loss_fn=common.element_wise_squared_loss,\n", " train_step_counter=train_step_counter)\n", "agent.initialize()\n", "\n", "# for exploration\n", "random_policy = random_tf_policy.RandomTFPolicy(tf_env_train.time_step_spec(),\n", " tf_env_train.action_spec())\n", "\n", "# replay buffer (to store the past actions / states / rewards)\n", "replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(\n", " data_spec=agent.collect_data_spec,\n", " batch_size=tf_env_train.batch_size,\n", " max_length=replay_buffer_max_length)\n", "def 
collect_step(environment, policy, buffer):\n", " time_step = environment.current_time_step()\n", " action_step = policy.action(time_step)\n", " next_time_step = environment.step(action_step.action)\n", " traj = trajectory.from_transition(time_step, action_step, next_time_step)\n", " # Add trajectory to the replay buffer\n", " buffer.add_batch(traj)\n", "def collect_data(env, policy, buffer, steps):\n", " for _ in range(steps):\n", " collect_step(env, policy, buffer)\n", "collect_data(tf_env_train, random_policy, replay_buffer, initial_collect_steps)\n", "\n", "# generate the datasets\n", "# Dataset generates trajectories with shape [Bx2x...]\n", "dataset = replay_buffer.as_dataset(\n", " num_parallel_calls=3, \n", " sample_batch_size=batch_size, \n", " num_steps=2).prefetch(3)\n", "iterator = iter(dataset)\n", "\n", "# train it\n", "# (Optional) Optimize by wrapping some of the code in a graph using TF function.\n", "agent.train = common.function(agent.train)\n", "\n", "# Reset the train step\n", "agent.train_step_counter.assign(0)\n", "\n", "# Evaluate the agent's policy once before training.\n", "def compute_avg_return(environment, policy, num_episodes=10):\n", " total_return = 0.0\n", " for _ in range(num_episodes):\n", " time_step = environment.reset()\n", " episode_return = 0.0\n", "\n", " while not time_step.is_last():\n", " action_step = policy.action(time_step)\n", " time_step = environment.step(action_step.action)\n", " episode_return += time_step.reward\n", " total_return += episode_return\n", "\n", " avg_return = total_return / num_episodes\n", " return avg_return.numpy()[0]\n", "\n", "\n", "# See also the metrics module for standard implementations of different metrics.\n", "# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics\n", "\n", "avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)\n", "returns = [avg_return]\n", "\n", "for _ in range(num_iterations):\n", " # Collect a few steps using collect_policy and save to the replay buffer.\n", " collect_data(tf_env_train, agent.collect_policy, replay_buffer, collect_steps_per_iteration)\n", "\n", " # Sample a batch of data from the buffer and update the agent's network.\n", " experience, unused_info = next(iterator)\n", " trainer = agent.train(experience)\n", " train_loss = trainer.loss\n", " \n", " step = agent.train_step_counter.numpy()\n", "\n", " if step % log_interval == 0:\n", " print('step = {0}: loss = {1}'.format(step, train_loss))\n", "\n", " if step % eval_interval == 0:\n", " avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)\n", " print('step = {0}: Average Return = {1}'.format(step, avg_return))\n", " returns.append(avg_return)\n", " avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)\n", " \n", "if num_iterations:\n", " print('Final Average return aftre training for {} steps: {}'.format(step, avg_return))\n", " returns.append(avg_return)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }