{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Agent, RL and MultiEnvironment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is recommended to have a look at the [0_basic_functionalities](0_basic_functionalities.ipynb), [1_Observation_Agents](1_Observation_Agents.ipynb), [2_Action_GridManipulation](2_Action_GridManipulation.ipynb) and especially [3_TrainingAnAgent](3_TrainingAnAgent.ipynb) notebooks before getting into this one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Objectives**\n", "\n", "In this notebook we will cover:\n", "* what a \"MultiEnv\" is\n", "* how it can be used with an agent\n", "* how it can be used to train an agent that uses several environments" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Impossible to automatically add a menu / table of content to this notebook.\n", "You can download \"jyquickhelper\" package with: \n", "\"pip install jyquickhelper\"\n" ] } ], "source": [ "res = None\n", "try:\n", "    from jyquickhelper import add_notebook_menu\n", "    res = add_notebook_menu()\n", "except ModuleNotFoundError:\n", "    print(\"Impossible to automatically add a menu / table of content to this notebook.\\nYou can download \\\"jyquickhelper\\\" package with: \\n\\\"pip install jyquickhelper\\\"\")\n", "res" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import grid2op\n", "from grid2op.Reward import ConstantReward, FlatReward\n", "from tqdm.notebook import tqdm\n", "from grid2op.Runner import Runner\n", "import sys\n", "import os\n", "import numpy as np\n", "TRAINING_STEP = 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I) Download more data for the default environment.\n", "\n", "A lot of data have been made available for the default \"case14_realistic\" environment. Including this data in the package is not convenient. 
We chose instead to release them separately and make them easily available with a utility. To download them to the default directory (\"~/data_grid2op/case14_redisp\") on a Linux-based system, uncomment and run the following command:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# !$sys.executable -m grid2op.download --name \"case14_realistic\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## II) Make a regular environment and agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have downloaded the dataset, it is time to make an environment that will use all the data available. You can execute the following cell. If you see any error or warning, consider re-downloading the data, or adapting the keyword argument \"chronics_path\" to match the path where the data have been downloaded." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "env = grid2op.make(\"rte_case14_realistic\", test=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of data have been made available for the default \"rte_case14_realistic\" environment. Including this data in the package is not convenient. \n", "\n", "We chose instead to release them separately and make them easily available with a utility. To download them to the default directory (\"~/data_grid2op/case14_redisp\"), just pass the argument \"test=False\" to \"grid2op.make\" (or omit it, as \"test=False\" is the default value). This will download approximately 300 MB of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## III) Train a standard RL Agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure you are using a computer with at least 4 cores if you want to notice some speed-ups." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from grid2op.Environment import MultiEnvironment\n", "from grid2op.Agent import DoNothingAgent\n", "NUM_CORE = 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III.a) Using the standard OpenAI gym loop\n", "\n", "Here we demonstrate how to use the multi environment class. First let's create a multi environment." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# create a simple agent\n", "agent = DoNothingAgent(env.action_space)\n", "\n", "# create the multi environment class\n", "multi_envs = MultiEnvironment(env=env, nb_env=NUM_CORE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A multi environment is just like a regular environment, but instead of dealing with one action and one observation, it requires a list of actions to be sent, and returns a list of observations as well. \n", "\n", "It requires a grid2op environment to be initialized and creates some specific \"workers\", each a replication of the initial environment. None of the \"workers\" can be accessed directly. Supported methods are:\n", "- multi_env.reset\n", "- multi_env.step\n", "- multi_env.close\n", "\n", "These behave similarly to \"env.reset\", \"env.step\" and \"env.close\".\n", "\n", "\n", "It can be used in the following manner." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([.res object at 0x7f88c6c65f10>,\n", " .res object at 0x7f88c6c654c0>,\n", " .res object at 0x7f88c6c652b0>,\n", " .res object at 0x7f88c6c65b80>,\n", " .res object at 0x7f88c6c65580>,\n", " .res object at 0x7f88c6c65b20>,\n", " .res object at 0x7f88c6c65a30>,\n", " .res object at 0x7f88c6c652e0>],\n", " dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initialize some variables with the proper dimension\n", "obss = multi_envs.reset()\n", "rews = [env.reward_range[0] for i in range(NUM_CORE)]\n", "dones = [False for i in range(NUM_CORE)]\n", "obss" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[False, False, False, False, False, False, False, False]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dones" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, obss is not a single observation, but a list (a numpy array, to be precise) of 8 observations (one per worker, `NUM_CORE` in total), each one being an observation of a given \"worker\" environment.\n", "\n", "Worker environments are always called in the same order. It means the first observation of this vector will always correspond to the first worker environment. \n", "\n", "\n", "Similarly, the \"step\" function of a multi environment takes as input a list of actions, and each action is applied in its own environment. 
It returns a list of observations, a list of rewards, and a list of booleans indicating whether each worker environment suffered a game over (in that case this worker environment is automatically restarted using the \"reset\" method).\n", "\n", "Because worker environments are always called in the same order, the first action sent to the \"multi_env.step\" function will also be applied to the first environment.\n", "\n", "It is possible to use it as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([.res object at 0x7f88c6848a90>,\n", " .res object at 0x7f88c6848c40>,\n", " .res object at 0x7f890ce82610>,\n", " .res object at 0x7f890ce82550>,\n", " .res object at 0x7f890ce825b0>,\n", " .res object at 0x7f890ce82850>,\n", " .res object at 0x7f890d6f0f40>,\n", " .res object at 0x7f890d6f0970>],\n", " dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initialize the vector of actions that will be processed by each worker environment.\n", "acts = [None for _ in range(NUM_CORE)]\n", "for env_act_id in range(NUM_CORE):\n", "    acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])\n", " \n", "# feed them to the multi_env\n", "obss, rews, dones, infos = multi_envs.step(acts)\n", "\n", "# as explained, this is a vector of Observation (as many as NUM_CORE in this example)\n", "obss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multi environment loop is really close to the \"gym\" loop:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# perform the appropriate steps\n", "for i in range(10):\n", "    acts = [None for _ in range(NUM_CORE)]\n", "    for env_act_id in range(NUM_CORE):\n", "        acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])\n", "    obss, rews, dones, infos = multi_envs.step(acts)\n", "\n", "    # DO SOMETHING WITH THE 
AGENT IF YOU WANT\n", "    ## agent.train(obss, rews, dones)\n", " \n", "\n", "# close the environments created by the multi_env\n", "multi_envs.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example, 10 steps are performed on `NUM_CORE` environments in parallel. The agent has therefore acted `10 * NUM_CORE` (= `10 * 8 = 80` here) times on `NUM_CORE` different environments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III.b) Practical example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reuse the code of the notebook [3_TrainingAnAgent](3_TrainingAnAgent.ipynb) to train a new agent, but this time using more than one process of the machine. To further emphasize how multi environments work, we moved the code of some agents to a separate module ([ml_agent](ml_agent.py)) and focus here on the training part. \n", "\n", "Note that compared to the previous notebook, the code has been adapted to use \"batches\" of data when predicting movements. The input data is also restricted to:\n", "- the relative flow value\n", "- the powerline status\n", "- the topology vector\n", "\n", "All the other components of the observations are not used." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from ml_agent import TrainingParam, ReplayBuffer, DeepQAgent\n", "from grid2op.Agent import AgentWithConverter\n", "from grid2op.Reward import RedispReward\n", "from grid2op.Converter import IdToAct\n", "import numpy as np\n", "import random\n", "import warnings\n", "import pdb\n", "with warnings.catch_warnings():\n", "    warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", "    import tensorflow.keras\n", "    import tensorflow.keras.backend as K\n", "    from tensorflow.keras.models import load_model, Sequential, Model\n", "    from tensorflow.keras.optimizers import Adam\n", "    from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, subtract, add\n", "    from tensorflow.keras.layers import Input, Lambda, Concatenate" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "class TrainAgentMultiEnv(object):\n", "    def __init__(self, agent, nb_process, reward_fun=RedispReward, env=None, name=None):\n", "        # compared to the version shown in notebook 3, the process buffer has been moved into this class\n", "        # and we add a multi_envs attribute.\n", "        self.nb_process = nb_process\n", "        self.multi_envs = None\n", "        self.process_buffer = [[] for _ in range(self.nb_process)]\n", "        self.name = name\n", "        self.agent = agent\n", "        self.env = env\n", "        self.training_param = None\n", " \n", "    def close(self):\n", "        self.multi_envs.close()\n", " \n", "    def convert_process_buffer(self):\n", "        \"\"\"Converts the list of NUM_FRAMES images in the process buffer\n", "        into one training sample\"\"\"\n", "        # here I simply concatenate the action in case of multiple actions in the \"buffer\"\n", "        if self.training_param.NUM_FRAMES != 1:\n", "            raise RuntimeError(\"This has not been tested with self.training_param.NUM_FRAMES != 1 for now\")\n", "        return np.array([np.concatenate(el) for el in self.process_buffer])\n", " \n", "    def 
_build_valid_env(self, training_param):\n", "        # this function has also been adapted\n", "        create_new = False\n", "        if self.multi_envs is None:\n", "            create_new = True\n", "            # first we need to initialize the multi environment\n", "            self.multi_envs = MultiEnvironment(env=self.env, nb_env=self.nb_process)\n", " \n", "        # then, as before, we reset it\n", "        obss = self.multi_envs.reset()\n", "        for worker_id in range(self.nb_process):\n", "            self.process_buffer[worker_id].append(self.agent.convert_obs(obss[worker_id]))\n", " \n", "        # used in case of \"num frames\" != 1 (so not tested)\n", "        do_nothing = [self.env.action_space() for _ in range(self.nb_process)]\n", "        for _ in range(training_param.NUM_FRAMES-1):\n", "            # Initialize buffer with the first frames\n", "            s1, r1, _, _ = self.multi_envs.step(do_nothing)\n", "            for worker_id in range(self.nb_process):\n", "                # difference compared to previous implementation: we loop through all the observations\n", "                # and save them all\n", "                self.process_buffer[worker_id].append(self.agent.convert_obs(s1[worker_id])) \n", " \n", "        return create_new\n", " \n", "    def train(self, num_frames, training_param=TrainingParam()):\n", "        self.training_param = training_param\n", " \n", "        # first we create an environment or make sure the given environment is valid\n", "        close_env = self._build_valid_env(training_param)\n", " \n", "        # same as in the original implementation, except the process buffer is now in this class\n", "        observation_num = 0\n", "        curr_state = self.convert_process_buffer()\n", " \n", "        # we initialize the NN exactly as before\n", "        self.agent.init_deep_q(curr_state)\n", " \n", "        # some parameters have been moved to a class named \"training_param\" for convenience\n", "        epsilon = training_param.INITIAL_EPSILON\n", "        # now the number of alive frames and total reward depend on the \"underlying environment\". 
They are vectors instead\n", "        # of scalars\n", "        alive_frame = np.zeros(self.nb_process, dtype=int)\n", "        total_reward = np.zeros(self.nb_process, dtype=float)\n", "\n", "        with tqdm(total=num_frames) as pbar:\n", "            while observation_num < num_frames:\n", "                if observation_num % 1000 == 999:\n", "                    print((\"Executing loop %d\" %observation_num))\n", "                    # for efficient reading of data: at early stage of training, it is advised to load\n", "                    # data by chunk: the model will do game over pretty easily (no need to load all the dataset)\n", "                    tmp = min(10000 * (num_frames // observation_num), 10000)\n", "                    self.multi_envs.set_chunk_size(int(max(100, tmp)))\n", "\n", "                # Slowly decay the exploration rate (epsilon)\n", "                if epsilon > training_param.FINAL_EPSILON:\n", "                    epsilon -= (training_param.INITIAL_EPSILON-training_param.FINAL_EPSILON)/training_param.EPSILON_DECAY\n", "\n", "                initial_state = self.convert_process_buffer()\n", "                self.process_buffer = [[] for _ in range(self.nb_process)]\n", "\n", "                # TODO vectorize that in the Agent directly\n", "                # then we need to predict the next moves. 
Agents have been adapted to predict a batch of data\n", "                pm_i, pq_v = self.agent.deep_q.predict_movement(curr_state, epsilon)\n", "                # and build the convenient vectors (they were scalars before)\n", "                predict_movement_int = []\n", "                predict_q_value = []\n", "                acts = []\n", "                for p_id in range(self.nb_process):\n", "                    predict_movement_int.append(pm_i[p_id])\n", "                    predict_q_value.append(pq_v[p_id])\n", "                    # and then we convert it to a valid action\n", "                    acts.append(self.agent.convert_act(pm_i[p_id]))\n", "\n", "                # same loop as in notebook 3\n", "                reward, done = np.zeros(self.nb_process), np.full(self.nb_process, fill_value=False, dtype=bool)\n", "                for i in range(training_param.NUM_FRAMES):\n", "                    temp_observation_obj, temp_reward, temp_done, _ = self.multi_envs.step(acts)\n", "\n", "                    # we need to handle vectors for \"done\"\n", "                    reward[~temp_done] += temp_reward[~temp_done]\n", "                    # and then \"de stack\" the observations coming from different environments\n", "                    for worker_id, obs in enumerate(temp_observation_obj):\n", "                        self.process_buffer[worker_id].append(self.agent.convert_obs(temp_observation_obj[worker_id])) \n", "                    done = done | temp_done\n", "\n", "                    # increase by 1 the number of frames alive for relevant \"underlying environments\"\n", "                    alive_frame[~temp_done] += 1\n", "                    # loop through the environments where a game over occurred, and print the results\n", "                    for env_done_idx in np.where(temp_done)[0]:\n", "                        print(\"For env with id {}\".format(env_done_idx))\n", "                        print(\"\\tLived with maximum time \", alive_frame[env_done_idx])\n", "                        print(\"\\tEarned a total of reward equal to \", total_reward[env_done_idx])\n", "\n", "                reward[temp_done] = 0.\n", "                total_reward[temp_done] = 0.\n", "                total_reward += reward\n", "                alive_frame[temp_done] = 0\n", "\n", "                # vectorized version of the previous code\n", "                new_state = self.convert_process_buffer()\n", "                # same as before, but looping through the \"underlying environment\"\n", "                for sub_env_id in range(self.nb_process):\n", "                    
self.agent.replay_buffer.add(initial_state[sub_env_id],\n", "                                                 predict_movement_int[sub_env_id],\n", "                                                 reward[sub_env_id],\n", "                                                 done[sub_env_id],\n", "                                                 new_state[sub_env_id])\n", "\n", "                if self.agent.replay_buffer.size() > training_param.MIN_OBSERVATION:\n", "                    s_batch, a_batch, r_batch, d_batch, s2_batch = self.agent.replay_buffer.sample(training_param.MINIBATCH_SIZE)\n", "                    isfinite = self.agent.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)\n", "                    self.agent.deep_q.target_train()\n", "\n", "                    if not isfinite:\n", "                        # if the loss is not finite, I stop the learning\n", "                        print(\"ERROR INFINITE LOSS\")\n", "                        break\n", "\n", "\n", "                # Save the network every 10000 iterations\n", "                if observation_num % 10000 == 9999 or observation_num == num_frames-1:\n", "                    print(\"Saving Network\")\n", "                    if self.name is None:\n", "                        self.agent.deep_q.save_network(\"saved_notebook6.h5\")\n", "                    else:\n", "                        self.agent.deep_q.save_network(\"saved_notebook6_{}\".format(self.name))\n", "\n", "                observation_num += 1\n", "                pbar.update(1)\n", " \n", "        if close_env:\n", "            print(\"closing env\")\n", "            self.env.close()\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have redefined the class used to train the agent; we can now use it." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully constructed networks.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a9ca69ec05a8402ea5a1d140a8502e46", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "For env with id 4\n", "\tLived with maximum time 6\n", "\tEarned a total of reward equal to 1062.252197265625\n", "For env with id 3\n", "\tLived with maximum time 10\n", "\tEarned a total of reward equal to 2167.105224609375\n", "For env with id 6\n", "\tLived with maximum time 12\n", "\tEarned a total of reward equal to 1000.6795654296875\n", "For env with id 1\n", "\tLived with maximum time 14\n", "\tEarned a total of reward equal to 3311.049072265625\n", "For env with id 7\n", "\tLived with maximum time 15\n", "\tEarned a total of reward equal to -150.0\n", "For env with id 4\n", "\tLived with maximum time 10\n", "\tEarned a total of reward equal to 1053.3250732421875\n", "For env with id 2\n", "\tLived with maximum time 19\n", "\tEarned a total of reward equal to 2067.565185546875\n", "For env with id 4\n", "\tLived with maximum time 2\n", "\tEarned a total of reward equal to 1115.4794921875\n", "For env with id 6\n", "\tLived with maximum time 26\n", "\tEarned a total of reward equal to 952.8309326171875\n", "For env with id 4\n", "\tLived with maximum time 20\n", "\tEarned a total of reward equal to 2162.169921875\n", "For env with id 4\n", "\tLived with maximum time 2\n", "\tEarned a total of reward equal to -20.0\n", "For env with id 0\n", "\tLived with maximum time 47\n", "\tEarned a total of reward equal to 785.864501953125\n", "For env with id 7\n", "\tLived with maximum time 39\n", "\tEarned a total of reward equal to 3232.6942138671875\n", "For env 
with id 3\n", "\tLived with maximum time 48\n", "\tEarned a total of reward equal to 2020.0274658203125\n", "For env with id 1\n", "\tLived with maximum time 50\n", "\tEarned a total of reward equal to 4354.7083740234375\n", "For env with id 1\n", "\tLived with maximum time 7\n", "\tEarned a total of reward equal to 1052.818603515625\n", "For env with id 2\n", "\tLived with maximum time 55\n", "\tEarned a total of reward equal to 4237.5467529296875\n", "For env with id 0\n", "\tLived with maximum time 28\n", "\tEarned a total of reward equal to -280.0\n", "For env with id 5\n", "\tLived with maximum time 88\n", "\tEarned a total of reward equal to 33391.75183105469\n", "For env with id 1\n", "\tLived with maximum time 21\n", "\tEarned a total of reward equal to 960.6181640625\n", "For env with id 3\n", "\tLived with maximum time 37\n", "\tEarned a total of reward equal to -370.0\n", "For env with id 6\n", "\tLived with maximum time 57\n", "\tEarned a total of reward equal to 12161.2880859375\n", "Saving Network\n", "Successfully saved network.\n", "\n", "closing env\n" ] } ], "source": [ "agent_name = \"sac_1e5\"\n", "my_agent = DeepQAgent(env.action_space, mode=\"SAC\", training_param=TrainingParam())\n", "trainer = TrainAgentMultiEnv(agent=my_agent, env=env, nb_process=NUM_CORE, name=agent_name)\n", "# trainer = TrainAgent(agent=my_agent, env=env)\n", "trainer.train(TRAINING_STEP)\n", "trainer.close()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.figure(figsize=(30,20))\n", "plt.plot(my_agent.deep_q.qvalue_evolution)\n", "plt.axhline(y=0, linewidth=3, color='red')\n", "_ = plt.xlim(0, len(my_agent.deep_q.qvalue_evolution))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Q-value evolution of a DDQN agent trained for 100000 iterations on 8 cores with default parameters can look like:\n", "![](img/qvalue.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III.c) Assess the performance of the trained agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we evaluate the performance of a baseline, in this case the \"do nothing\" agent.\n", "\n", "**NB** The use of a Runner (see the first notebook) is particularly suited for that purpose. We are showing here how to quickly assess the performance." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bf7d2f1a6cf84602bbad581f5f9dc5f1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='episode', max=2.0, style=ProgressStyle(description_width=…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bf760e0fc772490c9662f2876fa58e3f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=1.0, bar_style='info', description='episode', max=1.0, style=ProgressStyle(…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7c8611fb0c2d49318dbaedfdf5d36cbc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=1.0, 
bar_style='info', description='episode', max=1.0, style=ProgressStyle(…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The results for the DoNothing agent are:\n", "\tFor chronics with id 000\n", "\t\t - cumulative reward: 122885.140625\n", "\t\t - number of time steps completed: 100 / 100\n", "\tFor chronics with id 001\n", "\t\t - cumulative reward: 122958.148438\n", "\t\t - number of time steps completed: 100 / 100\n" ] } ], "source": [ "NB_EPISODE = 2\n", "max_iter = 100\n", "# run the do nothing agent for the whole episode\n", "dn_agent = grid2op.Agent.DoNothingAgent(env.action_space)\n", "runner = Runner(**env.get_params_for_runner(), agentInstance=dn_agent, agentClass=None)\n", "res = runner.run(nb_episode=NB_EPISODE, max_iter=max_iter, pbar=tqdm)\n", "print(\"The results for the DoNothing agent are:\")\n", "for _, chron_id, cum_reward, nb_time_step, max_ts in res:\n", "    msg_tmp = \"\\tFor chronics with id {}\\n\".format(chron_id)\n", "    msg_tmp += \"\\t\\t - cumulative reward: {:.6f}\\n\".format(cum_reward)\n", "    msg_tmp += \"\\t\\t - number of time steps completed: {:.0f} / {:.0f}\".format(nb_time_step, max_ts)\n", "    print(msg_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we load the saved neural network, and we can now evaluate the fixed policy:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully constructed networks.\n", "Succesfully loaded network.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "50ad9742ff3f4863856407e5b16352c6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='episode', max=2.0, style=ProgressStyle(description_width=…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": 
"f2913a5679b74791ac30890815d2671f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=1.0, bar_style='info', description='episode', max=1.0, style=ProgressStyle(…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8fb9ca66d4724478b892edbd871a67c8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=1.0, bar_style='info', description='episode', max=1.0, style=ProgressStyle(…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The results for the trained agent are:\n", "\tFor chronics with id 000\n", "\t\t - cumulative reward: 122885.140625\n", "\t\t - number of time steps completed: 100 / 100\n", "\tFor chronics with id 001\n", "\t\t - cumulative reward: 122958.148438\n", "\t\t - number of time steps completed: 100 / 100\n" ] } ], "source": [ "obs = env.reset()\n", "trained_agent = DeepQAgent(env.action_space, mode=\"DDQN\", training_param=TrainingParam())\n", "trained_agent.init_deep_q(trained_agent.convert_obs(obs))\n", "trained_agent.load_network(\"saved_notebook6_{}\".format(agent_name))\n", "runner = Runner(**env.get_params_for_runner(),\n", "                agentInstance=trained_agent, agentClass=None)\n", "res = runner.run(nb_episode=NB_EPISODE,\n", "                 max_iter=max_iter, pbar=tqdm)\n", "print(\"The results for the trained agent are:\")\n", "for _, chron_id, cum_reward, nb_time_step, max_ts in res:\n", "    msg_tmp = \"\\tFor chronics with id {}\\n\".format(chron_id)\n", "    msg_tmp += \"\\t\\t - cumulative reward: {:.6f}\\n\".format(cum_reward)\n", "    msg_tmp += \"\\t\\t - number of time steps completed: {:.0f} / {:.0f}\".format(nb_time_step, max_ts)\n", "    print(msg_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A default DDQN agent trained on 8 cores on 1000000 
steps (so 8e6 steps in total, in 24h of training on a laptop) managed to perform 5637 steps, largely outperforming the \"do nothing\" agent (which performed only 2180 steps on the same 2 environments)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }