{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Actions as vectors, and RL agent training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is recommended to have a look at the [0_basic_functionalities](0_basic_functionalities.ipynb), [1_Observation_Agents](1_Observation_Agents.ipynb) and [2_Action_GridManipulation](2_Action_GridManipulation.ipynb) notebooks before getting into this one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Objectives**\n", "\n", "In this notebook we will cover:\n", "* how to use the \"converters\": specific action spaces that let an agent manipulate a custom action representation\n", "* how to train a (deliberately simple) Agent using reinforcement learning\n", "* how to (quickly) inspect the actions taken by the Agent\n", "\n", "**NB** For this tutorial we train an Agent inspired by this blog post: [deep-reinforcement-learning-tutorial-with-open-ai-gym](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368). Many other reinforcement learning tutorials exist. The code shown in this notebook has no pretension other than to demonstrate how to use Grid2Op functionality to train a deep reinforcement learning Agent and inspect its behaviour. Nothing is implied about the performance, training strategy, type of Agent, meta parameters, etc. All of them are purely \"random\".\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import grid2op" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = None\n", "try:\n", " from jyquickhelper import add_notebook_menu\n", " res = add_notebook_menu()\n", "except ModuleNotFoundError:\n", " print(\"Impossible to automatically add a menu / table of content to this notebook.\\nYou can download \\\"jyquickhelper\\\" package with: \\n\\\"pip install jyquickhelper\\\"\")\n", "res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I) Manipulating action representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Grid2op package has been built with an \"object oriented\" perspective: almost everything is encapsulated in a dedicated `class`. This allows for more customization of the platform.\n", "\n", "The downside of this approach is that machine learning methods, and especially deep learning, often prefer to deal with vectors rather than with `complex` objects. Indeed, as we covered in the previous tutorials on the platform, building our own actions can be tedious and can sometimes require knowledge of the powergrid.\n", "\n", "On the contrary, in most standard Reinforcement Learning environments, actions have a higher-level representation. For example in pacman, there are 4 different types of actions: turn left, turn right, go up or go down. This allows for easy sampling (to achieve uniform sampling, you simply need to sample a number between 0 and 3 included) and an easy representation: each action is a different component of a vector of dimension 4 [because there are 4 actions]. \n", "\n", "On the other hand, this representation is not \"human friendly\". 
It is quite convenient in the case of pacman because the action space is rather small, making it possible to remember which action corresponds to which component. In the case of the grid2op package, however, there are hundreds, sometimes thousands of actions, making it impossible to remember which component corresponds to which action. We assume this does not really matter here, as tutorials on Reinforcement Learning with discrete action spaces often assume that actions are labelled with integers (such as in pacman for example).\n", "\n", "However, to make the training of RL agents easier, grid2op provides \"[Converters](https://grid2op.readthedocs.io/en/latest/converter.html)\" whose role is to allow an agent to deal with a custom representation of the action space. The class [AgentWithConverter](https://grid2op.readthedocs.io/en/latest/agent.html#grid2op.Agent.AgentWithConverter) is perfect for such usage." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/tezirg/Code/Grid2Op.BDonnot/getting_started/grid2op/MakeEnv.py:592: UserWarning:\n", "\n", "Your are using only 2 chronics for this environment. More can be download by running, from a command line:\n", "python -m grid2op.download --name \"case14_redisp\" --path_save PATH\\WHERE\\YOU\\WANT\\TO\\DOWNLOAD\\DATA\n", "\n" ] } ], "source": [ "# import the useful classes\n", "import numpy as np\n", "\n", "from grid2op import make\n", "from grid2op.Agent import RandomAgent \n", "from grid2op.Converter import IdToAct\n", "max_iter = 100 # to make computation much faster we will only consider 100 time steps instead of 287\n", "\n", "env = make(name_env=\"case14_redisp\")\n", "env.seed(0)\n", "my_agent = RandomAgent(env.action_space)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's it. 
This agent will be able to perform any action, but instead of going through the description of the actions from a power system point of view (i.e. setting what is connected to what, what is disconnected, etc.), it will simply choose an integer with the method `my_act`. This integer will then be converted back to a proper valid action.\n", "\n", "Here is an example of the action representation as seen by the Agent:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "172\n", "47\n", "117\n" ] } ], "source": [ "for el in range(3):\n", " print(my_agent.my_act(None, None))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And below you can see that the \"`act`\" function behaves as expected:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This action will:\n", "\t - NOT change anything to the injections\n", "\t - NOT perform any redispatching action\n", "\t - NOT force any line status\n", "\t - NOT switch any line status\n", "\t - NOT switch anything in the topology\n", "\t - Set the bus of the following element:\n", "\t \t - assign bus 2 to line (origin) 2 [on substation 8]\n", "\t \t - assign bus 2 to line (origin) 3 [on substation 8]\n", "\t \t - assign bus 1 to line (extremity) 16 [on substation 8]\n", "\t \t - assign bus 2 to line (extremity) 19 [on substation 8]\n", "\t \t - assign bus 1 to load 6 [on substation 8]\n", "This action will:\n", "\t - NOT change anything to the injections\n", "\t - NOT perform any redispatching action\n", "\t - force reconnection of 1 powerlines ([7])\n", "\t - NOT switch any line status\n", "\t - NOT switch anything in the topology\n", "\t - Set the bus of the following element:\n", "\t \t - assign bus 2 to line (origin) 7 [on substation 1]\n", "\t \t - assign bus 1 to line (extremity) 7 [on substation 2]\n", "This action will:\n", "\t - NOT change anything 
to the injections\n", "\t - NOT perform any redispatching action\n", "\t - NOT force any line status\n", "\t - NOT switch any line status\n", "\t - NOT switch anything in the topology\n", "\t - Set the bus of the following element:\n", "\t \t - assign bus 2 to line (origin) 2 [on substation 8]\n", "\t \t - assign bus 1 to line (origin) 3 [on substation 8]\n", "\t \t - assign bus 2 to line (extremity) 16 [on substation 8]\n", "\t \t - assign bus 2 to line (extremity) 19 [on substation 8]\n", "\t \t - assign bus 1 to load 6 [on substation 8]\n" ] } ], "source": [ "for el in range(3):\n", " print(my_agent.act(None, None))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB** Many of these actions are equivalent to the \"do nothing\" action in some situations. For example, trying to reconnect a powerline that is already connected will do nothing. The same holds for the topology: if everything is already connected to bus 1, then the action to connect things to bus 1 on the same substation will not affect the powergrid." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## II) Training an Agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this tutorial, we will show how to build a Q-learning Agent. Most of the code (except the neural network architecture) is inspired by this blog post: [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368).\n", "\n", "**Requirements** This notebook requires `keras` (imported here through `tensorflow.keras`) to be installed on your machine.\n", "\n", "As always in these notebooks, we will use the `case14_fromfile` Environment. No proper care has been taken to set the thermal limits on this grid. It's unlikely that the agent can learn anything in this context." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.A) Defining some \"helpers\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type of Agent we are using requires a bit of setup, independently of Grid2Op. We will reuse the code shown in \n", "[https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368) and in [Reinforcement-Learning-Tutorial](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial) by Abhinav Sagar, released under an *MIT license* found here: [MIT License](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE).\n", "\n", "This first section defines these classes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But first let's import the necessary dependencies" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#tf2.0 friendly\n", "import numpy as np\n", "import random\n", "import warnings\n", "with warnings.catch_warnings():\n", " warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", " import tensorflow.keras\n", " import tensorflow.keras.backend as K\n", " from tensorflow.keras.models import load_model, Sequential, Model\n", " from tensorflow.keras.optimizers import Adam\n", " from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, subtract, add\n", " from tensorflow.keras.layers import Input, Lambda, Concatenate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### a) Replay buffer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we define a \"replay buffer\", which is necessary to train the Agent." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Credit Abhinav Sagar: \n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial\n", "# Code under MIT license, available at:\n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE\n", "from collections import deque\n", "\n", "class ReplayBuffer:\n", " \"\"\"Constructs a buffer object that stores the past moves\n", " and samples a set of subsamples\"\"\"\n", "\n", " def __init__(self, buffer_size):\n", " self.buffer_size = buffer_size\n", " self.count = 0\n", " self.buffer = deque()\n", "\n", " def add(self, s, a, r, d, s2):\n", " \"\"\"Add an experience to the buffer\"\"\"\n", " # S represents current state, a is action,\n", " # r is reward, d is whether it is the end, \n", " # and s2 is next state\n", " experience = (s, a, r, d, s2)\n", " if self.count < self.buffer_size:\n", " self.buffer.append(experience)\n", " self.count += 1\n", " else: \n", " self.buffer.popleft()\n", " self.buffer.append(experience)\n", "\n", " def size(self):\n", " return self.count\n", "\n", " def sample(self, batch_size):\n", " \"\"\"Samples a total of elements equal to batch_size from buffer\n", " if buffer contains enough elements. 
Otherwise return all elements\"\"\"\n", "\n", " batch = []\n", "\n", " if self.count < batch_size:\n", " batch = random.sample(self.buffer, self.count)\n", " else:\n", " batch = random.sample(self.buffer, batch_size)\n", "\n", " # Maps each experience in batch in batches of states, actions, rewards\n", " # and new states\n", " s_batch, a_batch, r_batch, d_batch, s2_batch = list(map(np.array, list(zip(*batch))))\n", "\n", " return s_batch, a_batch, r_batch, d_batch, s2_batch\n", "\n", " def clear(self):\n", " self.buffer.clear()\n", " self.count = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Meta parameters of the methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we re-use the default parameters. Note that these can be optimized; nothing has been tuned for this example.\n", "\n", "For more information about them, please refer to the blog post of Abhinav Sagar [available here](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "DECAY_RATE = 0.9\n", "BUFFER_SIZE = 40000\n", "MINIBATCH_SIZE = 64\n", "TOT_FRAME = 3000000\n", "EPSILON_DECAY = 10000\n", "MIN_OBSERVATION = 42 #5000\n", "FINAL_EPSILON = 1/300 # have on average 1 random action per scenario of approx 287 time steps\n", "INITIAL_EPSILON = 0.1\n", "TAU = 0.01\n", "ALPHA = 1\n", "# Number of frames to \"throw\" into network\n", "NUM_FRAMES = 1 ## this has been changed compared to the original implementation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.B) Adaptation of the inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the original code, the models were used to play an Atari game and the inputs were images. 
For our system, the inputs are \"Observation\" objects converted to vectors.\n", "\n", "For a more detailed description of the code used, please check:\n", "* [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368)\n", "* and [Reinforcement-Learning-Tutorial](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial)\n", "\n", "\n", "Because our inputs are vectors rather than images, we adapted the original code from Abhinav Sagar:\n", "* We replaced convolutional layers with fully connected (dense) layers\n", "* We made sure not to look at the whole observation, but only at some parts of it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### a) extracting relevant information of observation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we extract relevant information about the dimension of the observation space, and the action space." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "OBSERVATION_SIZE = env.observation_space.size()\n", "NUM_ACTIONS = my_agent.action_space.n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few comments here.\n", "\n", "First, we don't change anything in the observation space. This means that the vector the agent will receive is really big, not scaled, and contains a lot of information that is not really useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Code the neural networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code of the neural networks has been changed only slightly to adapt it to our problem. The biggest changes come from removing the convolutional layers, as well as adapting the input and output sizes.\n", "\n", "For each of the methods below, we specify what has been adapted." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to emphasize here that these models are used through the \"`predict_movement`\" method. This method outputs an integer: the action id. It is therefore perfectly suited to a representation of actions as integers rather than as complete descriptions of what the agent is doing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.C) Making the code of the Agent and train it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the \"reference\" article [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368), the author Abhinav Sagar made a dedicated environment based on SpaceInvader in the gym repository. We proceed here in a similar way, but with the grid2op environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### a) Adapted code" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Credit Abhinav Sagar: \n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial\n", "# Code under MIT license, available at:\n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE\n", "\n", "class DeepQ(object):\n", " \"\"\"Constructs the desired deep q learning network\"\"\"\n", " def __init__(self, action_size, lr=1e-5, observation_size=OBSERVATION_SIZE):\n", " # It is not modified from Abhinav Sagar's code, except for adding the possibility to change the learning rate\n", " # the size of the action space is also passed as a parameter\n", " # (it used to be a global variable in the original code)\n", " self.action_size = action_size\n", " self.observation_size = observation_size\n", " self.model = None\n", " self.target_model = None\n", " self.lr_ = lr\n", " self.qvalue_evolution = []\n", " self.construct_q_network()\n", " \n", " def 
construct_q_network(self):\n", " # replacement of the Convolution layers by Dense layers, and change the size of the input space and output space\n", " \n", " # Uses the network architecture found in DeepMind paper\n", " self.model = Sequential()\n", " input_layer = Input(shape = (self.observation_size*NUM_FRAMES,))\n", " layer1 = Dense(self.observation_size*NUM_FRAMES)(input_layer)\n", " layer1 = Activation('relu')(layer1)\n", " layer2 = Dense(self.observation_size)(layer1)\n", " layer2 = Activation('relu')(layer2)\n", " layer3 = Dense(self.observation_size)(layer2)\n", " layer3 = Activation('relu')(layer3)\n", " layer4 = Dense(2*NUM_ACTIONS)(layer3)\n", " layer4 = Activation('relu')(layer4)\n", " output = Dense(NUM_ACTIONS)(layer4)\n", " \n", " self.model = Model(inputs=[input_layer], outputs=[output])\n", " self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " self.target_model = Model(inputs=[input_layer], outputs=[output])\n", " self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " self.target_model.set_weights(self.model.get_weights())\n", " \n", " def predict_movement(self, data, epsilon):\n", " \"\"\"Predict movement of game controler where is epsilon\n", " probability randomly move.\"\"\"\n", " # nothing has changed from the original implementation\n", " rand_val = np.random.random()\n", " q_actions = self.model.predict(data.reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " \n", " if rand_val < epsilon:\n", " opt_policy = np.random.randint(0, NUM_ACTIONS)\n", " else:\n", " opt_policy = np.argmax(np.abs(q_actions))\n", " \n", " self.qvalue_evolution.append(q_actions[0,opt_policy])\n", "\n", " return opt_policy, q_actions[0, opt_policy]\n", "\n", " def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):\n", " \"\"\"Trains network to fit given parameters\"\"\"\n", " # nothing has changed from the original implementation, except for changing the input dimension 'reshape'\n", " batch_size 
= s_batch.shape[0]\n", " targets = np.zeros((batch_size, NUM_ACTIONS))\n", "\n", " for i in range(batch_size):\n", " targets[i] = self.model.predict(s_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " fut_action = self.target_model.predict(s2_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " targets[i, a_batch[i]] = r_batch[i]\n", " if d_batch[i] == False:\n", " targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)\n", " loss = self.model.train_on_batch(s_batch, targets)\n", " # Print the loss every 100 iterations.\n", " if observation_num % 100 == 0:\n", " print(\"We had a loss equal to \", loss)\n", "\n", " def save_network(self, path):\n", " # Saves model at specified path as h5 file\n", " # nothing has changed\n", " self.model.save(path)\n", " print(\"Successfully saved network.\")\n", "\n", " def load_network(self, path):\n", " # nothing has changed\n", " self.model = load_model(path)\n", " print(\"Succesfully loaded network.\")\n", "\n", " def target_train(self):\n", " # nothing has changed from the original implementation\n", " model_weights = self.model.get_weights()\n", " target_model_weights = self.target_model.get_weights()\n", " for i in range(len(model_weights)):\n", " target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]\n", " self.target_model.set_weights(target_model_weights)\n", " \n", "class DuelQ(object):\n", " \"\"\"Constructs the desired deep q learning network\"\"\"\n", " def __init__(self, action_size, lr=0.00001, observation_size=OBSERVATION_SIZE):\n", " # It is not modified from Abhinav Sagar's code, except for adding the possibility to change the learning rate\n", " # in parameter is also present the size of the action space\n", " # (it used to be a global variable in the original code)\n", " self.action_size = action_size\n", " self.observation_size = observation_size\n", " self.lr_ = lr\n", " self.model = None\n", " self.qvalue_evolution = []\n", " 
self.construct_q_network()\n", "\n", " def construct_q_network(self):\n", " # Uses the network architecture found in DeepMind paper\n", " # The inputs and outputs size have changed, as well as replacing the convolution by dense layers.\n", " self.model = Sequential()\n", " \n", " input_layer = Input(shape = (self.observation_size*NUM_FRAMES,))\n", " lay1 = Dense(self.observation_size*NUM_FRAMES)(input_layer)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay2 = Dense(self.observation_size)(lay1)\n", " lay2 = Activation('relu')(lay2)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay2)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " fc1 = Dense(NUM_ACTIONS)(lay3)\n", " advantage = Dense(NUM_ACTIONS)(fc1)\n", " fc2 = Dense(NUM_ACTIONS)(lay3)\n", " value = Dense(1)(fc2)\n", " \n", " meaner = Lambda(lambda x: K.mean(x, axis=1) )\n", " mn_ = meaner(advantage) \n", " tmp = subtract([advantage, mn_]) # keras doesn't like this part...\n", " policy = add([tmp, value])\n", "\n", " self.model = Model(inputs=[input_layer], outputs=[policy])\n", " self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", "\n", " self.target_model = Model(inputs=[input_layer], outputs=[policy])\n", " self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " print(\"Successfully constructed networks.\")\n", " \n", " def predict_movement(self, data, epsilon):\n", " \"\"\"Predict movement of game controler where is epsilon\n", " probability randomly move.\"\"\"\n", " # only changes lie in adapting the input shape\n", " q_actions = self.model.predict(data.reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " opt_policy = np.argmax(q_actions)\n", " rand_val = np.random.random()\n", " if rand_val < epsilon:\n", " opt_policy = np.random.randint(0, NUM_ACTIONS)\n", " \n", " self.qvalue_evolution.append(q_actions[0,opt_policy])\n", "\n", " return opt_policy, q_actions[0, opt_policy]\n", "\n", " def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, 
observation_num):\n", " \"\"\"Trains network to fit given parameters\"\"\"\n", " # nothing has changed except adapting the input shapes\n", " batch_size = s_batch.shape[0]\n", " targets = np.zeros((batch_size, NUM_ACTIONS))\n", "\n", " for i in range(batch_size):\n", " targets[i] = self.model.predict(s_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " fut_action = self.target_model.predict(s2_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", " targets[i, a_batch[i]] = r_batch[i]\n", " if d_batch[i] == False:\n", " targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)\n", "\n", " loss = self.model.train_on_batch(s_batch, targets)\n", "\n", " # Print the loss every 100 iterations.\n", " if observation_num % 100 == 0:\n", " print(\"We had a loss equal to \", loss)\n", "\n", " def save_network(self, path):\n", " # Saves model at specified path as h5 file\n", " # nothing has changed\n", " self.model.save(path)\n", " print(\"Successfully saved network.\")\n", "\n", " def load_network(self, path):\n", " # nothing has changed\n", " self.model.load_weights(path)\n", " self.target_model.load_weights(path)\n", " print(\"Succesfully loaded network.\")\n", "\n", " def target_train(self):\n", " # nothing has changed\n", " model_weights = self.model.get_weights()\n", " self.target_model.set_weights(model_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another custom made version of the Q-Learning algorithm" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "class RealQ(object):\n", " \"\"\"Constructs the desired deep q learning network\"\"\"\n", " def __init__(self, action_size, lr=1e-5, observation_size=OBSERVATION_SIZE, mean_reg=False):\n", "\n", " self.action_size = action_size\n", " self.observation_size = observation_size\n", " self.model = None\n", " self.target_model = None\n", " self.lr_ = lr\n", " self.mean_reg = mean_reg\n", " \n", " 
self.qvalue_evolution=[]\n", "\n", " self.construct_q_network()\n", " \n", " def construct_q_network(self):\n", "\n", " self.model = Sequential()\n", " \n", " input_states = Input(shape = (self.observation_size,))\n", " input_action = Input(shape = (NUM_ACTIONS,))\n", " input_layer = Concatenate()([input_states, input_action])\n", " \n", " lay1 = Dense(self.observation_size)(input_layer)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay2 = Dense(self.observation_size)(lay1)\n", " lay2 = Activation('relu')(lay2)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay2)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " fc1 = Dense(NUM_ACTIONS)(lay3)\n", " advantage = Dense(1, activation = 'linear')(fc1)\n", " \n", " if self.mean_reg==True:\n", " advantage = Lambda(lambda x : x - K.mean(x))(advantage)\n", " \n", " self.model = Model(inputs=[input_states, input_action], outputs=[advantage])\n", " self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " \n", " self.model_copy = Model(inputs=[input_states, input_action], outputs=[advantage])\n", " self.model_copy.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " self.model_copy.set_weights(self.model.get_weights())\n", " \n", " def predict_movement(self, states, epsilon):\n", " \"\"\"Predict movement of game controler where is epsilon\n", " probability randomly move.\"\"\"\n", " # nothing has changed from the original implementation\n", " rand_val = np.random.random()\n", " q_actions = self.model.predict([np.tile(states.reshape(1, self.observation_size),(NUM_ACTIONS,1)), np.eye(NUM_ACTIONS)]).reshape(1,-1)\n", " if rand_val < epsilon:\n", " opt_policy = np.random.randint(0, NUM_ACTIONS)\n", " else:\n", " opt_policy = np.argmax(np.abs(q_actions))\n", " \n", " self.qvalue_evolution.append(q_actions[0,opt_policy])\n", " \n", " return opt_policy, q_actions[0,opt_policy]\n", "\n", " def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):\n", " \"\"\"Trains network to fit given 
parameters\"\"\"\n", " # nothing has changed from the original implementation, except for changing the input dimension 'reshape'\n", " batch_size = s_batch.shape[0]\n", " targets = np.zeros(batch_size)\n", " last_action=np.zeros((batch_size, NUM_ACTIONS))\n", " for i in range(batch_size):\n", " last_action[i,a_batch[i]] = 1\n", " q_pre = self.model.predict([s_batch[i].reshape(1, self.observation_size), last_action[i].reshape(1,-1)], batch_size=1).reshape(1,-1)\n", " q_fut = self.model_copy.predict([np.tile(s2_batch[i].reshape(1, self.observation_size),(NUM_ACTIONS,1)), np.eye(NUM_ACTIONS)]).reshape(1,-1)\n", " fut_action = np.max(q_fut)\n", " if d_batch[i] == False:\n", " targets[i] = ALPHA * (r_batch[i] + DECAY_RATE * fut_action - q_pre)\n", " else:\n", " targets[i] = ALPHA * (r_batch[i] - q_pre)\n", " loss = self.model.train_on_batch([s_batch, last_action], targets)\n", " # Print the loss every 100 iterations.\n", " if observation_num % 100 == 0:\n", " print(\"We had a loss equal to \", loss)\n", "\n", " def save_network(self, path):\n", " # Saves model at specified path as h5 file\n", " # nothing has changed\n", " self.model.save(path)\n", " print(\"Successfully saved network.\")\n", "\n", " def load_network(self, path):\n", " # nothing has changed\n", " self.model = load_model(path)\n", " print(\"Succesfully loaded network.\")\n", "\n", " def target_train(self):\n", " # soft update of the target network: blend the current weights into the target weights\n", " model_weights = self.model.get_weights() \n", " target_model_weights = self.model_copy.get_weights()\n", " for i in range(len(model_weights)):\n", " target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]\n", " # apply the blended weights (previously the raw model weights were copied, discarding the blend)\n", " self.model_copy.set_weights(target_model_weights)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A version of the SAC algorithm, as described in https://spinningup.openai.com/en/latest/algorithms/sac.html" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], 
"source": [ "class SAC(object):\n", " \"\"\"Constructs the desired deep q learning network\"\"\"\n", " def __init__(self, action_size, lr=1e-5, observation_size=OBSERVATION_SIZE):\n", "\n", " self.action_size = action_size\n", " self.observation_size = observation_size\n", " self.model = None\n", " self.target_model = None\n", " self.lr_ = lr\n", " self.average_reward = 0\n", " self.life_spent = 1\n", " self.qvalue_evolution = []\n", " self.Is_nan = False\n", "\n", " self.construct_q_network()\n", " \n", " def build_q_NN(self):\n", " model = Sequential()\n", " \n", " input_states = Input(shape = (self.observation_size,))\n", " input_action = Input(shape = (NUM_ACTIONS,))\n", " input_layer = Concatenate()([input_states, input_action])\n", " \n", " lay1 = Dense(self.observation_size)(input_layer)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay2 = Dense(self.observation_size)(lay1)\n", " lay2 = Activation('relu')(lay2)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay2)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " advantage = Dense(1, activation = 'linear')(lay3)\n", " \n", " model = Model(inputs=[input_states, input_action], outputs=[advantage])\n", " model.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " \n", " return model\n", " \n", " def construct_q_network(self):\n", " #construct double Q networks\n", " self.model_Q = self.build_q_NN()\n", " self.model_Q2 = self.build_q_NN()\n", "\n", " \n", " #state value function approximation\n", " self.model_value = Sequential()\n", " \n", " input_states = Input(shape = (self.observation_size,))\n", " \n", " lay1 = Dense(self.observation_size)(input_states)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay1)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " advantage = Dense(NUM_ACTIONS, activation = 'relu')(lay3)\n", " state_value = Dense(1, activation = 'linear')(advantage)\n", " \n", " self.model_value = Model(inputs=[input_states], outputs=[state_value])\n", " 
self.model_value.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " \n", " self.model_value_target = Sequential()\n", " \n", " input_states = Input(shape = (self.observation_size,))\n", " \n", " lay1 = Dense(self.observation_size)(input_states)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay1)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " advantage = Dense(NUM_ACTIONS, activation = 'relu')(lay3)\n", " state_value = Dense(1, activation = 'linear')(advantage)\n", " \n", " self.model_value_target = Model(inputs=[input_states], outputs=[state_value])\n", " self.model_value_target.compile(loss='mse', optimizer=Adam(lr=self.lr_))\n", " self.model_value_target.set_weights(self.model_value.get_weights())\n", " #policy function approximation\n", " \n", " self.model_policy = Sequential()\n", " input_states = Input(shape = (self.observation_size,))\n", " lay1 = Dense(self.observation_size)(input_states)\n", " lay1 = Activation('relu')(lay1)\n", " \n", " lay2 = Dense(self.observation_size)(lay1)\n", " lay2 = Activation('relu')(lay2)\n", " \n", " lay3 = Dense(2*NUM_ACTIONS)(lay2)\n", " lay3 = Activation('relu')(lay3)\n", " \n", " soft_proba = Dense(NUM_ACTIONS, activation=\"softmax\", kernel_initializer='uniform')(lay3)\n", " \n", " self.model_policy = Model(inputs=[input_states], outputs=[soft_proba])\n", " self.model_policy.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.lr_))\n", " \n", " print(\"Successfully constructed networks.\")\n", " \n", " def predict_movement(self, states, epsilon):\n", " \"\"\"Predict movement of game controler where is epsilon\n", " probability randomly move.\"\"\"\n", " # nothing has changed from the original implementation\n", " p_actions = self.model_policy.predict(states.reshape(1, self.observation_size)).ravel()\n", " rand_val = np.random.random()\n", " \n", " if rand_val < epsilon / 10:\n", " opt_policy = np.random.randint(0, NUM_ACTIONS)\n", " else:\n", " #opt_policy = 
np.random.choice(np.arange(NUM_ACTIONS, dtype=int), size=1, p = p_actions)\n", "            opt_policy = np.argmax(p_actions)\n", "\n", "        # np.int has been removed from recent NumPy releases, the builtin int is equivalent here\n", "        return int(opt_policy), p_actions[opt_policy]\n", "\n", "    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):\n", "        \"\"\"Trains networks to fit given parameters\"\"\"\n", "\n", "        # nothing has changed from the original implementation, except for changing the input dimension 'reshape'\n", "        batch_size = s_batch.shape[0]\n", "        target = np.zeros((batch_size, 1))\n", "        new_proba = np.zeros((batch_size, NUM_ACTIONS))\n", "\n", "        #training of the action state value networks\n", "        last_action=np.zeros((batch_size, NUM_ACTIONS))\n", "\n", "        for i in range(batch_size):\n", "            last_action[i,a_batch[i]] = 1\n", "\n", "            v_t = self.model_value_target.predict(s_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", "\n", "            self.qvalue_evolution.append(v_t[0])\n", "            fut_action = self.model_value_target.predict(s2_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", "\n", "            target[i,0] = r_batch[i] + (1 - d_batch[i]) * DECAY_RATE * fut_action\n", "\n", "\n", "        loss = self.model_Q.train_on_batch([s_batch, last_action], target)\n", "        loss_2 = self.model_Q2.train_on_batch([s_batch, last_action], target)\n", "\n", "        #training of the policy\n", "\n", "        for i in range(batch_size):\n", "            self.life_spent += 1\n", "            temp = 1 / np.log(self.life_spent) / 2\n", "            new_values = self.model_Q.predict([np.tile(s_batch[i].reshape(1, self.observation_size),(NUM_ACTIONS,1)),\n", "                                               np.eye(NUM_ACTIONS)]).reshape(1,-1)\n", "            new_values -= np.amax(new_values, axis=-1)\n", "            new_proba[i] = np.exp(new_values / temp) / np.sum(np.exp(new_values / temp), axis=-1)\n", "\n", "        loss_policy = self.model_policy.train_on_batch(s_batch, new_proba)\n", "\n", "        #training of the value_function\n", "        value_target = np.zeros(batch_size)\n", "        for i in 
range(batch_size):\n", "            target_pi = self.model_policy.predict(s_batch[i].reshape(1, self.observation_size*NUM_FRAMES), batch_size = 1)\n", "            action_v1 = self.model_Q.predict([np.tile(s_batch[i].reshape(1, self.observation_size),(NUM_ACTIONS,1)),\n", "                                              np.eye(NUM_ACTIONS)]).reshape(1,-1)\n", "            action_v2 = self.model_Q2.predict([np.tile(s_batch[i].reshape(1, self.observation_size),(NUM_ACTIONS,1)),\n", "                                               np.eye(NUM_ACTIONS)]).reshape(1,-1)\n", "            value_target[i] = np.fmin(action_v1[0,a_batch[i]], action_v2[0,a_batch[i]]) - np.sum(target_pi * np.log(target_pi + 1e-6))\n", "\n", "        loss_value = self.model_value.train_on_batch(s_batch, value_target.reshape(-1,1))\n", "\n", "        self.Is_nan = np.isnan(loss) + np.isnan(loss_2) + np.isnan(loss_policy) + np.isnan(loss_value)\n", "        # Print the loss every 100 iterations.\n", "        if observation_num % 100 == 0:\n", "            print(\"We had a loss equal to \", loss, loss_2, loss_policy, loss_value)\n", "\n", "    def save_network(self, path):\n", "        # Saves model at specified path as h5 file\n", "        # nothing has changed\n", "        self.model_policy.save(\"policy_\"+path)\n", "        self.model_value_target.save(\"value_\"+path)\n", "        print(\"Successfully saved network.\")\n", "\n", "    def load_network(self, path):\n", "        # nothing has changed\n", "        self.model_policy = load_model(\"policy_\"+path)\n", "        self.model_value_target = load_model(\"value_\"+path)\n", "        print(\"Successfully loaded network.\")\n", "\n", "    def target_train(self):\n", "        # soft update of the target network: target <- TAU * model + (1 - TAU) * target\n", "        model_weights = self.model_value.get_weights()\n", "        target_model_weights = self.model_value_target.get_weights()\n", "        for i in range(len(model_weights)):\n", "            target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]\n", "        # set the blended weights (and not the raw model weights) on the target network\n", "        self.model_value_target.set_weights(target_model_weights)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first show what the agent code looks like without the \"utilities\" to train it. 
As we can see, defining this agent is pretty simple. The only method that really needs to be adapted is \"`my_act`\", which takes only a few lines of code." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Credit Abhinav Sagar: \n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial\n", "# Code under MIT license, available at:\n", "# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE\n", "\n", "from grid2op.Parameters import Parameters\n", "from grid2op.Agent import AgentWithConverter\n", "from grid2op.Converter import IdToAct\n", "\n", "\n", "class DeepQAgent(AgentWithConverter):\n", "    # first change: An Agent must derive from grid2op.Agent (in this case AgentWithConverter, because we manipulate\n", "    # vectors instead of classes)\n", "\n", "    def convert_obs(self, observation):\n", "        return observation.to_vect()\n", "\n", "    def my_act(self, transformed_observation, reward, done=False):\n", "        if self.deep_q is None:\n", "            self.init_deep_q(transformed_observation)\n", "        predict_movement_int, *_ = self.deep_q.predict_movement(transformed_observation, epsilon=0.0)\n", "        # print(\"predict_movement_int: {}\".format(predict_movement_int))\n", "        return predict_movement_int\n", "\n", "    def init_deep_q(self, transformed_observation):\n", "        if self.deep_q is None:\n", "            # the first time an observation is observed, I set up the neural network with the proper dimensions.\n", "            if self.mode == \"DQN\":\n", "                cls = DeepQ\n", "            elif self.mode == \"DDQN\":\n", "                cls = DuelQ\n", "            elif self.mode == \"RealQ\":\n", "                cls = RealQ\n", "            elif self.mode == \"SAC\":\n", "                cls = SAC\n", "            else:\n", "                raise RuntimeError(\"Unknown neural network named \\\"{}\\\"\".format(self.mode))\n", "            self.deep_q = cls(self.action_space.size(), observation_size=transformed_observation.shape[0], lr=self.lr)\n", "\n", "    def __init__(self, action_space, mode=\"DDQN\", lr=1e-5):\n", "        # this function has been 
adapted.\n", "\n", "        # to build an AgentWithConverter, we need an action_space.\n", "        # No problem, we add it in the constructor.\n", "        AgentWithConverter.__init__(self, action_space, action_space_converter=IdToAct)\n", "\n", "        # and now back to the original implementation\n", "        self.replay_buffer = ReplayBuffer(BUFFER_SIZE)\n", "\n", "        # compared to the original implementation, I don't know the observation space size,\n", "        # because it depends on the components of the observation we want to look at. So the neural networks will\n", "        # be initialized the first time an observation is observed.\n", "        self.deep_q = None\n", "        self.mode = mode\n", "        self.lr=lr\n", "\n", "    def load_network(self, path):\n", "        # not modified compared to the original implementation\n", "        self.deep_q.load_network(path)\n", "\n", "    def convert_process_buffer(self):\n", "        \"\"\"Converts the list of NUM_FRAMES images in the process buffer\n", "        into one training sample\"\"\"\n", "        # here I simply concatenate the converted observations when there are several of them in the \"buffer\"\n", "        # this function existed in the original implementation, but has been adapted.\n", "        return np.concatenate(self.process_buffer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we will also define some utility classes (as defined in the blog post) to make the training easier." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from grid2op.Reward import L2RPNReward\n", "from grid2op.Reward import RedispReward\n", "\n", "class L2RPNReward_LoadWise(L2RPNReward):\n", "    \"\"\"\n", "    Update the L2RPN reward to take into account the fact that a change in the loads sum shall not be allocated as reward for the agent.\n", "\n", "    \"\"\"\n", "    def __init__(self):\n", "        super().__init__()\n", "\n", "    def initialize(self, env):\n", "        super().initialize(env)\n", "        self.reward_min = - 10 * env.backend.n_line\n", "        self.previous_loads = self.reward_max * np.ones(env.backend.n_line)\n", "\n", "    def __call__(self, action, env, has_error, is_done, is_illegal, is_ambiguous):\n", "        if not is_done and not has_error:\n", "            line_cap = self._L2RPNReward__get_lines_capacity_usage(env)\n", "\n", "            new_loads, _, _ = env.backend.loads_info()\n", "            new_flows = np.abs(env.backend.get_line_flow())\n", "            loads_variation = (np.sum(new_loads) - np.sum(self.previous_loads)) / np.sum(self.previous_loads)\n", "\n", "            res = np.sum(line_cap + loads_variation)\n", "        else:\n", "            # no more data to consider, no powerflow has been run, reward is what it is\n", "            res = self.reward_min\n", "        return res\n", "\n", "class L2RPNReward_LoadWise_ActionWise(L2RPNReward):\n", "    \"\"\"\n", "    Update the L2RPN reward to take into account the fact that a change in the loads sum shall not be allocated as reward for the agent.\n", "\n", "    \"\"\"\n", "    def __init__(self):\n", "        super().__init__()\n", "\n", "    def initialize(self, env):\n", "        super().initialize(env)\n", "        self.reward_min = - 10 * env.backend.n_line\n", "        self.previous_loads = self.reward_max * np.ones(env.backend.n_line)\n", "        self.last_action = env.helper_action_env({})\n", "\n", "    def __call__(self, action, env, has_error, is_done, is_illegal, is_ambiguous):\n", "        if not is_done and not has_error:\n", "            line_cap = self._L2RPNReward__get_lines_capacity_usage(env)\n", "\n", "            
new_loads, _, _ = env.backend.loads_info()\n", "            new_flows = np.abs(env.backend.get_line_flow())\n", "            loads_variation = (np.sum(new_loads) - np.sum(self.previous_loads)) / np.sum(self.previous_loads)\n", "\n", "            res = np.sum(line_cap + loads_variation)\n", "        else:\n", "            # no more data to consider, no powerflow has been run, reward is what it is\n", "            res = self.reward_min\n", "\n", "        res -= (action != env.helper_action_env({})) * (action == self.last_action) * env.backend.n_line / 2\n", "\n", "        self.last_action = action\n", "\n", "        return res\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "class TrainAgent(object):\n", "    def __init__(self, agent, reward_fun=RedispReward, env=None):\n", "        self.agent = agent\n", "        self.reward_fun = reward_fun\n", "        self.env = env\n", "\n", "    def _build_valid_env(self):\n", "        # now we are creating a valid Environment\n", "        # it's mandatory because no environment is created when the agent is created:\n", "        # an Agent should not get direct access to the environment, but can interact with it only by:\n", "        # * receiving reward\n", "        # * receiving observation\n", "        # * sending action\n", "\n", "        close_env = False\n", "\n", "        if self.env is None:\n", "            self.env = grid2op.make(action_class=type(self.agent.action_space({})),\n", "                                    reward_class=self.reward_fun)\n", "            close_env = True\n", "\n", "        # I make sure the action space of the user and the environment are the same.\n", "        if not isinstance(self.agent.init_action_space, type(self.env.action_space)):\n", "            raise RuntimeError(\"Impossible to build an agent with 2 different action space\")\n", "        if not isinstance(self.env.action_space, type(self.agent.init_action_space)):\n", "            raise RuntimeError(\"Impossible to build an agent with 2 different action space\")\n", "\n", "        # A buffer that keeps the last `NUM_FRAMES` images\n", "        self.agent.replay_buffer.clear()\n", "        self.agent.process_buffer = []\n", "\n", "        # make sure the 
environment is reset\n", "        obs = self.env.reset()\n", "        self.agent.process_buffer.append(self.agent.convert_obs(obs))\n", "        do_nothing = self.env.action_space()\n", "        for _ in range(NUM_FRAMES-1):\n", "            # Initialize buffer with the first frames\n", "            s1, r1, _, _ = self.env.step(do_nothing)\n", "            self.agent.process_buffer.append(self.agent.convert_obs(s1))\n", "        return close_env\n", "\n", "    def train(self, num_frames, env=None):\n", "        # this function existed in the original implementation, but has been slightly adapted.\n", "\n", "        # first we create an environment or make sure the given environment is valid\n", "        close_env = self._build_valid_env()\n", "\n", "        # below that, only slight modifications have been made. They are highlighted\n", "        observation_num = 0\n", "        curr_state = self.agent.convert_process_buffer()\n", "        epsilon = INITIAL_EPSILON\n", "        alive_frame = 0\n", "        total_reward = 0\n", "\n", "        while observation_num < num_frames:\n", "            if observation_num % 1000 == 999:\n", "                print((\"Executing loop %d\" %observation_num))\n", "\n", "            # Slowly decay the exploration rate (epsilon)\n", "            if epsilon > FINAL_EPSILON:\n", "                epsilon -= (INITIAL_EPSILON-FINAL_EPSILON)/EPSILON_DECAY\n", "\n", "            initial_state = self.agent.convert_process_buffer()\n", "            self.agent.process_buffer = []\n", "\n", "            # it's a bit less convenient than using the SpaceInvader environment.\n", "            # first we need to initialize the neural network\n", "            self.agent.init_deep_q(curr_state)\n", "            # then we need to predict the next move\n", "            predict_movement_int, predict_q_value = self.agent.deep_q.predict_movement(curr_state, epsilon)\n", "            # and then we convert it to a valid action\n", "            act = self.agent.convert_act(predict_movement_int)\n", "\n", "            reward, done = 0, False\n", "            for i in range(NUM_FRAMES):\n", "                temp_observation_obj, temp_reward, temp_done, _ = self.env.step(act)\n", "                # here it has been adapted too. 
The observation obtained from the environment is\n", "                # first converted to a vector\n", "\n", "                # below this line no changes have been made to the original implementation.\n", "                reward += temp_reward\n", "                self.agent.process_buffer.append(self.agent.convert_obs(temp_observation_obj))\n", "                done = done | temp_done\n", "\n", "            if done:\n", "                print(\"Lived with maximum time \", alive_frame)\n", "                print(\"Earned a total of reward equal to \", total_reward)\n", "                # reset the environment\n", "                self.env.reset()\n", "\n", "                alive_frame = 0\n", "                total_reward = 0\n", "\n", "            new_state = self.agent.convert_process_buffer()\n", "            self.agent.replay_buffer.add(initial_state, predict_movement_int, reward, done, new_state)\n", "            total_reward += reward\n", "            if self.agent.replay_buffer.size() > MIN_OBSERVATION:\n", "                s_batch, a_batch, r_batch, d_batch, s2_batch = self.agent.replay_buffer.sample(MINIBATCH_SIZE)\n", "                self.agent.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)\n", "                self.agent.deep_q.target_train()\n", "\n", "            # Save the network every 1000 iterations after 50000 iterations\n", "            if observation_num > 50000 and observation_num % 1000 == 0 and self.agent.deep_q.Is_nan == 0:\n", "                print(\"Saving Network\")\n", "                self.agent.deep_q.save_network(\"saved_agent_\"+self.agent.mode+\".h5\")\n", "\n", "            alive_frame += 1\n", "            observation_num += 1\n", "\n", "        if close_env:\n", "            print(\"closing env\")\n", "            self.env.close()\n", "\n", "    def calculate_mean(self, num_episode = 100, env=None):\n", "        # this method has been only slightly adapted from the original implementation\n", "\n", "        # Note that it is NOT the recommended method to evaluate an Agent. 
Please use \"grid2op.Runner\" instead\n", "\n", "        # first we create an environment or make sure the given environment is valid\n", "        close_env = self._build_valid_env()\n", "\n", "        reward_list = []\n", "        print(\"Printing scores of each trial\")\n", "        for i in range(num_episode):\n", "            done = False\n", "            tot_award = 0\n", "            self.env.reset()\n", "            while not done:\n", "                # the process buffer belongs to the agent, not to the trainer\n", "                state = self.agent.convert_process_buffer()\n", "\n", "                # same adaptation as in the \"train\" function.\n", "                predict_movement_int = self.agent.deep_q.predict_movement(state, 0.0)[0]\n", "                predict_movement = self.agent.convert_act(predict_movement_int)\n", "\n", "                # same adaptation as in the \"train\" function\n", "                observation_obj, reward, done, _ = self.env.step(predict_movement)\n", "                observation_vect_full = self.agent.convert_obs(observation_obj)\n", "\n", "                tot_award += reward\n", "                self.agent.process_buffer.append(observation_vect_full)\n", "                self.agent.process_buffer = self.agent.process_buffer[1:]\n", "            print(tot_award)\n", "            reward_list.append(tot_award)\n", "\n", "        if close_env:\n", "            self.env.close()\n", "        return np.mean(reward_list), np.std(reward_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can define the model (agent), and then train it.\n", "\n", "This is done exactly the same way as in the Abhinav Sagar implementation.\n", "\n", "**NB** The code below can take a few minutes to run. It's training a Deep Reinforcement Learning Agent after all. If this takes too long on your machine, you can always decrease the \"nb_frame\", and set it to 1000 for example. In this case, the Agent will probably not be really good.\n", "\n", "**NB** For a real Agent, it would take much longer to train." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/tezirg/Code/Grid2Op.BDonnot/getting_started/grid2op/MakeEnv.py:592: UserWarning:\n", "\n", "Your are using only 2 chronics for this environment. More can be download by running, from a command line:\n", "python -m grid2op.download --name \"case14_redisp\" --path_save PATH\\WHERE\\YOU\\WANT\\TO\\DOWNLOAD\\DATA\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully constructed networks.\n", "Lived with maximum time 3\n", "Earned a total of reward equal to 3265.510168153268\n", "Lived with maximum time 4\n", "Earned a total of reward equal to 3211.0176961992283\n", "Lived with maximum time 5\n", "Earned a total of reward equal to 4351.474102870227\n", "Lived with maximum time 5\n", "Earned a total of reward equal to 4297.903349713559\n", "Lived with maximum time 7\n", "Earned a total of reward equal to 6575.526890023941\n", "Lived with maximum time 4\n", "Earned a total of reward equal to 3211.0176961992283\n", "Lived with maximum time 2\n", "Earned a total of reward equal to 1075.9133274380465\n", "Lived with maximum time 51\n", "Earned a total of reward equal to 58436.087906076275\n", "Lived with maximum time 1\n", "Earned a total of reward equal to -10\n", "Lived with maximum time 7\n", "Earned a total of reward equal to 6490.193902990617\n", "Lived with maximum time 4\n", "Earned a total of reward equal to 3254.6406007128016\n", "Lived with maximum time 6\n", "Earned a total of reward equal to 5380.771770361067\n", "Successfully saved network.\n" ] } ], "source": [ "nb_frame = 100\n", "\n", "env = make(name_env=\"case14_redisp\")\n", "my_agent = DeepQAgent(env.action_space, mode=\"DDQN\")\n", "trainer = TrainAgent(agent=my_agent, env=env, reward_fun=RedispReward)\n", "trainer.train(nb_frame)\n", "trainer.agent.deep_q.save_network(\"saved_agent_\"+trainer.agent.mode+\".h5\")" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "# To see whether the agent learnt something during training: lets plot the learning curve" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, 100.0)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.figure(figsize=(30,20))\n", "plt.plot(my_agent.deep_q.qvalue_evolution)\n", "plt.axhline(y=0, linewidth=3, color='red')\n", "plt.xlim(0, len(my_agent.deep_q.qvalue_evolution))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The learning curve shown above is really poor. It's because the agent has not been trained for a long time (and because it uses a very poor input data). If you train this agent for approximately 10-12 hours, using only the relative flow (`obs.rho` see the last section of this notebook for an example) you will get the following:\n", "\n", "![](img/trained_agent.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## III) Evaluating the Agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, time to test this trained agent.\n", "\n", "To do that, we have multiple choices.\n", "\n", "Either we recode the \"DeepQAgent\" class to load the stored weights (that have been saved during trainig) when it is initialized (not covered in this notebook), or we can also directly specified the \"instance\" of the Agent to use in the Grid2Op Runner.\n", "\n", "To do that, it's fairly simple. First, you need to specify that you won't use the \"*agentClass*\" argument, by setting it to ``None``, and secondly you simply provide the agent to use in the *agentInstance* argument.\n", "\n", "**NB** If you don't do that, the Runner will be created (the constructor will raise an exception). And if you choose to use the \"*agentClass*\" argument, your agent will be reloaded from scratch. So **if it doesn't load the weights** it will behave as a non trained agent, unlikely to perform well on the task." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III.A) Evaluate the Agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have \"successfully\" trained our Agent, we will evaluate it. As opposed to the training, the evaluation is done classically using a standard Runner.\n", "\n", "Note that the Runner will use a \"scoring function\" that might be different from the \"reward function\" used during training. In our case, it's not: the Runner is built from the parameters of the environment used for training, so the same reward is used in both cases.\n", "\n", "In the code below, we comment on what can be different and what must be identical for training and evaluation of the model." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from grid2op.Runner import Runner\n", "from grid2op.Chronics import GridStateFromFileWithForecasts, Multifolder\n", "scoring_function = L2RPNReward\n", "\n", "dict_params = env.get_params_for_runner()\n", "dict_params[\"gridStateclass_kwargs\"][\"max_iter\"] = max_iter\n", "# make a runner from an initialized environment\n", "# agentClass is set to None and the trained agent is given through agentInstance (see the explanation above)\n", "runner = Runner(**dict_params, agentClass=None, agentInstance=my_agent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the Agent and save the results. Although we have used the \"runner.run\" call multiple times, we never really dived into its \"path_save\" argument. This path allows you to save lots of information about your Agent's behaviour. All the saved information is described in the documentation [here](file:///home/donnotben/Documents/Grid2Op/documentation/html/runner.html)." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The results for the trained agent are:\n", "\tFor chronics located at 0\n", "\t\t - cumulative reward: 121368.512117\n", "\t\t - number of time steps completed: 100 / 100\n" ] } ], "source": [ "import shutil\n", "path_save=\"trained_agent_log\"\n", "\n", "# delete the previous stored results\n", "if os.path.exists(path_save):\n", "    shutil.rmtree(path_save)\n", "\n", "# run the episode\n", "res = runner.run(nb_episode=1, path_save=path_save)\n", "print(\"The results for the trained agent are:\")\n", "for _, chron_name, cum_reward, nb_time_step, max_ts in res:\n", "    msg_tmp = \"\\tFor chronics located at {}\\n\".format(chron_name)\n", "    msg_tmp += \"\\t\\t - cumulative reward: {:.6f}\\n\".format(cum_reward)\n", "    msg_tmp += \"\\t\\t - number of time steps completed: {:.0f} / {:.0f}\".format(nb_time_step, max_ts)\n", "    print(msg_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III.B) Inspect the Agent " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please refer to the official documentation for more information about the content of the directory where the data are saved. Note that the saving of the information is triggered by the \"path_save\" argument sent to the \"runner.run\" function.\n", "\n", "Some of the information that will be present in this directory:\n", "If enabled, the `Runner` will save the information in a structured way. 
For each episode there will be a folder\n", "with:\n", "\n", "  - \"episode_meta.json\" that represents some meta information about:\n", "\n", "    - \"backend_type\": the name of the `grid2op.Backend` class used\n", "    - \"chronics_max_timestep\": the **maximum** number of timesteps for the chronics used\n", "    - \"chronics_path\": the path where the temporal data (chronics) are located\n", "    - \"env_type\": the name of the `grid2op.Environment` class used.\n", "    - \"grid_path\": the path where the powergrid has been loaded from\n", "\n", "  - \"episode_times.json\": gives some information about the total time spent in multiple parts of the runner, mainly the\n", "    `grid2op.Agent` (and especially its method `grid2op.Agent.act`) and the amount of time spent in the\n", "    `grid2op.Environment`\n", "\n", "  - \"_parameters.json\": is a JSON representation of the `grid2op.Parameters` used for this episode\n", "  - \"rewards.npy\" is a numpy 1d array giving the rewards at each time step. We adopted the convention that the stored\n", "    reward at index `i` is the one observed by the agent at time `i` and **NOT** the reward sent by the\n", "    `grid2op.Environment` after the action has been implemented.\n", "  - \"exec_times.npy\" is a numpy 1d array giving the execution time of each time step of the episode\n", "  - \"actions.npy\" gives the actions that have been taken by the `grid2op.Agent`. At row `i` of \"actions.npy\" is a\n", "    vectorized representation of the action performed by the agent at timestep `i` *ie.* **after** having observed\n", "    the observation present at row `i` of \"observations.npy\" and the reward shown in row `i` of \"rewards.npy\".\n", "  - \"disc_lines.npy\" gives which lines have been disconnected during the simulation of the cascading failure at each\n", "    time step. The same convention as for \"rewards.npy\" has been adopted. 
This means that the powerlines are\n", "    disconnected when the `grid2op.Agent` takes the `grid2op.Action` at time step `i`.\n", "  - \"observations.npy\" is a numpy 2d array representing the `grid2op.Observation` at the disposal of the\n", "    `grid2op.Agent` when it took its action.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can first look at the directory where the data are stored:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\t\t\tdict_env_modification_space.json\r\n", "dict_action_space.json\tdict_observation_space.json\r\n" ] } ], "source": [ "!ls trained_agent_log" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there is only one folder there. It's named \"0\" because, in the original data, this came from the folder named \"0\" (the original data are located at \"/home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/\")\n", "\n", "If there were multiple episodes, each episode would have its own folder, with a name as close as possible to the original name of the data. This is done to make the results easier to study.\n", "\n", "Now let's see what is inside this folder:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "actions.npy\t\t\t env_modifications.npy observations.npy\r\n", "agent_exec_times.npy\t\t episode_meta.json\t _parameters.json\r\n", "disc_lines_cascading_failure.npy episode_times.json\t rewards.npy\r\n" ] } ], "source": [ "!ls trained_agent_log/0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can for example load the \"actions\" performed by the Agent, and have a look at them.\n", "\n", "To do that we will load the action array (where each action is represented as a vector) and use the action_space to convert it back into valid action objects." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "all_actions = np.load(os.path.join(\"trained_agent_log\", \"0\", \"actions.npy\"))\n", "li_actions = []\n", "for i in range(all_actions.shape[0]):\n", "    try:\n", "        tmp = runner.env.action_space.from_vect(all_actions[i,:])\n", "        li_actions.append(tmp)\n", "    except Exception:\n", "        # stop at the first row that cannot be converted back to an action\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This allows us to have a deeper look at the actions, and their effects. Note that here, we used actions that can only **set** the line status, so looking at their effect is pretty straightforward.\n", "\n", "Also, note that as opposed to \"change\", if a powerline is already connected, trying to **set** it as connected has absolutely no impact." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_disc = 0\n", "line_reco = 0\n", "for act in li_actions:\n", "    dict_ = act.as_dict()\n", "    if \"set_line_status\" in dict_:\n", "        line_reco += dict_[\"set_line_status\"][\"nb_connected\"]\n", "        line_disc += dict_[\"set_line_status\"][\"nb_disconnected\"]\n", "line_reco" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see in our case, the agent always tries to reconnect a powerline. As all lines are always connected, this Agent basically does nothing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also do the same kind of post analysis for the observations, even though here, as the observations come from files, it's probably not particularly interesting." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_observations = np.load(os.path.join(\"trained_agent_log\", \"0\", \"observations.npy\"))\n", "li_observations = []\n", "nb_real_disc = 0\n", "for i in range(all_observations.shape[0]):\n", "    try:\n", "        tmp = runner.env.observation_space.from_vect(all_observations[i,:])\n", "        li_observations.append(tmp)\n", "        nb_real_disc += (np.sum(tmp.line_status) - tmp.line_status.shape[0])\n", "    except Exception:\n", "        # stop at the first row that cannot be converted back to an observation\n", "        break\n", "nb_real_disc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the types of actions the agent took:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of different actions played: 1\n" ] } ], "source": [ "actions_count = {}\n", "for act in li_actions:\n", "    act_as_vect = tuple(act.to_vect())\n", "    if act_as_vect not in actions_count:\n", "        actions_count[act_as_vect] = 0\n", "    actions_count[act_as_vect] += 1\n", "print(\"Number of different actions played: {}\".format(len(actions_count)))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This action will:\n", "\t - NOT change anything to the injections\n", "\t - NOT perform any redispatching action\n", "\t - NOT force any line status\n", "\t - NOT switch any line status\n", "\t - NOT switch anything in the topology\n", "\t - NOT force any particular bus configuration\n" ] } ], "source": [ "print(runner.env.action_space.from_vect(np.array(list(actions_count.keys())[0])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The agent only played one distinct action (note that this number can vary a lot depending on the number of training steps). This is not really good: the agent didn't learn anything." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IV) Improve your Agent " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw, the agent we developed was not really interesting. To improve it, we could think about:\n", "\n", "- a better encoding of the observation. For now everything is fed to the neural network, without any normalization of any kind. This is a real problem for learning algorithms.\n", "- a better neural network architecture (as said, we didn't pay any attention to it in our model)\n", "- training it for a longer time\n", "- adapting the learning rate and all the meta parameters of the learning algorithm.\n", "- etc.\n", "\n", "In this notebook, we will focus on changing the observation representation, by feeding the agent only part of the information.\n", "\n", "To do so, the only modification we need is to change the way the observations are converted, that is, the \"*convert_obs*\" method. Nothing else needs changing. Here for example we could think of only using the flow ratio (*ie* the current flows divided by the thermal limits) part of the observation (named rho) instead of feeding the whole observation." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "class DeepQAgent_Improved(DeepQAgent):\n", "    def __init__(self, action_space, mode=\"DDQN\"):\n", "        DeepQAgent.__init__(self, action_space, mode=mode)\n", "\n", "    def convert_obs(self, observation):\n", "        return observation.rho" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can reuse exactly the same code to train it :-)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully constructed networks.\n", "Lived with maximum time 27\n", "Earned a total of reward equal to 30455.13538629611\n", "Lived with maximum time 5\n", "Earned a total of reward equal to 4352.546473986533\n", "Lived with maximum time 43\n", "Earned a total of reward equal to 48400.2926498057\n", "Lived with maximum time 14\n", "Earned a total of reward equal to 13357.726978636152\n", "Lived with maximum time 9\n", "Earned a total of reward equal to 8691.820546873894\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "my_agent = DeepQAgent_Improved(env.action_space, mode=\"DDQN\")\n", "trainer = TrainAgent(agent=my_agent, env=env)\n", "trainer.train(nb_frame)\n", "\n", "plt.figure(figsize=(30,20))\n", "plt.plot(my_agent.deep_q.qvalue_evolution)\n", "plt.axhline(y=0, linewidth=3, color='red')\n", "_ = plt.xlim(0, len(my_agent.deep_q.qvalue_evolution))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }