# Training Agent, action converters and l2rpn_baselines
Try me out interactively with: [![Binder](./img/badge_logo.svg)](https://mybinder.org/v2/gh/rte-france/Grid2Op/master)

It is recommended to have a look at the [00_SmallExample](00_SmallExample.ipynb), [02_Observation](02_Observation.ipynb) and [03_Action](03_Action.ipynb) notebooks before getting into this one.

**Objectives**

In this notebook we will show:
* how to make grid2op compatible with *gym* RL framework (short introduction to *gym_compat* module)
* how to transform grid2op actions / observations with gym "spaces" (https://gym.openai.com/docs/#spaces)
* how to train a (naive) Agent using reinforcement learning.
* how to inspect (rapidly) the actions taken by the Agent.

**NB** In this tutorial, we will use the 

This notebook does not cover the use of existing RL frameworks. Please consult the [11_IntegrationWithExistingRLFrameworks](11_IntegrationWithExistingRLFrameworks.ipynb) notebook for such information!


**Don't hesitate to check the grid2op module grid2op.gym_compat for a closer integration between grid2op and openAI gym. This module is documented at https://grid2op.readthedocs.io/en/latest/gym.html** 




Execute the cell below by removing the # character if you use google colab !

Cell will look like:
```python
!pip install grid2op[optional] # for use with google colab (grid2op is not installed by default)
```


In [None]:
# !pip install grid2op[optional] # for use with google colab (grid2Op is not installed by default)

In [None]:
import os
import sys
LEARNING_ITERATION = 50
EVAL_EPISODE = 1
MAX_EVAL_STEPS = 10

In [None]:
res = None
try:
    from jyquickhelper import add_notebook_menu
    res = add_notebook_menu()
except ModuleNotFoundError:
    print("Impossible to automatically add a menu / table of content to this notebook.\nYou can download \"jyquickhelper\" package with: \n\"pip install jyquickhelper\"")
res

## 0) Good practice

### A. defining a training, validation and test sets

As in other machine learning tasks, we highly recommend, before even trying to train an agent, to split the "episode data" (*eg* what are the loads / generations for each load / generator) into 3 datasets:
- "train" used to train the agent
- "val" used to validate the hyper parameters
- "test" at which you would look **only once** to report the agent performance in a scientific paper (for example)

Grid2op lets you do that with relative ease:

```python
import grid2op
env_name = "l2rpn_case14_sandbox" # or any other...
env = grid2op.make(env_name)

# extract 1% of the "chronics" to be used in the validation environment and 1% to be used
# in the test environment. The remaining 98% will be used for training
nm_env_train, nm_env_val, nm_env_test = env.train_val_split_random(pct_val=1., pct_test=1.)

# and now you can use the training set only to train your agent:
print(f'The name of the training environment is "{nm_env_train}"')
print(f'The name of the validation environment is "{nm_env_val}"')
print(f'The name of the test environment is "{nm_env_test}"')
```

And now, you can use the training environment to train your agent:

```python
import grid2op
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name+"_train")
```

Be careful, on Windows you might run into issues. Don't hesitate to have a look at the documentation of these functions if this is the case (see https://grid2op.readthedocs.io/en/latest/environment.html#grid2op.Environment.Environment.train_val_split and https://grid2op.readthedocs.io/en/latest/environment.html#grid2op.Environment.Environment.train_val_split_random)

### B. Not spending all of your time loading data...

In most grid2op environments, the "data" are loaded from the hard drive.

From experience, what happens (especially at the beginning of training) is that your agent survives a few steps (so taking a few milliseconds) before a game over. At this stage you will call `env.reset()` which will load the data of the next scenario.

This is the default behaviour and it is far from "optimal" (more time is spent loading data than performing actual useful computation). To avoid this, we encourage you:
- to use a "caching" mechanism, for example with the `MultifolderWithCache` class
- to read the data in small "chunks" (`env.chronics_handler.set_chunk_size(...)`).

More information is provided in https://grid2op.readthedocs.io/en/latest/environment.html#optimize-the-data-pipeline
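
As a rough sketch of these two suggestions, the helpers below show one way to set them up. They assume grid2op is installed; `keep_first_scenarios`, `make_cached_env`, `make_chunked_env` and the regular expression are hypothetical names chosen for this illustration, and the exact calls should be checked against the documentation linked above for your grid2op version.

```python
import re


def keep_first_scenarios(path):
    """Hypothetical filter: keep the scenarios whose folder name ends with 000 to 009."""
    return re.match(r".*00[0-9]$", path) is not None


def make_cached_env(env_name="l2rpn_case14_sandbox"):
    """Sketch: keep a few scenarios in memory instead of reading them at each reset."""
    # grid2op is imported here so that defining the helper does not require it
    import grid2op
    from grid2op.Chronics import MultifolderWithCache

    env = grid2op.make(env_name, chronics_class=MultifolderWithCache)
    # only cache a subset of the scenarios, otherwise everything ends up in memory
    env.chronics_handler.real_data.set_filter(keep_first_scenarios)
    # the cache is filled when the data are reset
    env.chronics_handler.real_data.reset()
    return env


def make_chunked_env(env_name="l2rpn_case14_sandbox", chunk_size=100):
    """Sketch: read the data `chunk_size` rows at a time instead of all at once."""
    import grid2op

    env = grid2op.make(env_name)
    env.chronics_handler.set_chunk_size(chunk_size)
    return env
```

Which option works best depends on how much memory you have and on how long your agent survives: caching tends to help when the same scenarios are replayed many times, reading by chunks when episodes end after a few steps.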

### C. Use a fast simulator

Grid2op uses a "backend" to compute the powerflows and return the next observation (after `env.step(...)`). Some backends are faster than the default one. For example, we strongly encourage you to use the "lightsim2grid" backend.

You can install it with `pip install lightsim2grid`

And use it with:
```python
import grid2op
from lightsim2grid import LightSimBackend
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name+"_train", backend=LightSimBackend(), ...)
```

## I) Action representation

### What's this ?

The Grid2op package has been built with an "object-oriented" perspective: almost everything is encapsulated in a dedicated `class`. This allows for more customization of the platform.

The downside of this approach is that machine learning methods, especially in deep learning, often prefer to deal with vectors rather than with "complex" objects. Indeed, as we saw in the previous tutorials on the platform, building our own actions can be tedious and can sometimes require significant knowledge of the powergrid.

On the contrary, in most of the standard Reinforcement Learning environments, actions have a higher level representation. For example in pacman, there are 4 different types of actions: "turn left", "turn right", "go up" and "go down". This allows for easy sampling (to sample uniformly, you simply need to randomly pick a number between 0 and 3 included) and an easy representation: each action can be represented as a different component of a vector of dimension 4 [because there are 4 actions].
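
To make this concrete, here is a minimal sketch in plain Python of the sampling and of the vector representation described above. The helper names `sample_action` and `one_hot` are ours and are not part of grid2op or gym.

```python
import random


def sample_action(n_actions, rng=random):
    """Uniformly pick an action id between 0 and n_actions - 1 (both included)."""
    return rng.randrange(n_actions)


def one_hot(action_id, n_actions):
    """Represent an action id as a vector with a single component set to 1."""
    vect = [0] * n_actions
    vect[action_id] = 1
    return vect


ACTIONS = ["turn left", "turn right", "go up", "go down"]
act_id = sample_action(len(ACTIONS))
print(ACTIONS[act_id], one_hot(act_id, len(ACTIONS)))
```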

On the other hand, this representation is not "human friendly". It is quite convenient in the case of pacman because the action space is rather small, making it possible to remember which action corresponds to which component, but in the case of the grid2op package, there are hundreds or even thousands of actions. We suppose that we do not really care about this here, as tutorials on Reinforcement Learning with discrete action space often assume that actions are labeled with integers (such as in pacman for example).

Converting grid2op actions into "machine readable" ones is the major difficulty as there is no unique way to do so. In grid2op we offer some predefined "functions" to do so:

- `BoxGymActSpace` will convert the action space into a gym "Box". It is rather straightforward, especially for **continuous** types of actions (such as *redispatching*, *curtailment* or actions on *storage units*). Representing the discrete actions (on powerlines and on substations) is not an easy task with it. We would not recommend using it if your focus is on topology. More information on https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.BoxGymActSpace
- `MultiDiscreteActSpace` is similar to `BoxGymActSpace` but mainly focused on the **discrete** actions (*lines status* and *substation reconfiguration*). Actions are represented with a gym "MultiDiscrete" space. It allows you to perform any number of actions you want (which might be illegal) and comes with few restrictions. It handles continuous actions through "binning" (which is not ideal but doable). We recommend using this transformation if the algorithm you want to use is able to deal with the "MultiDiscrete" gym action type. More information is given at https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.MultiDiscreteActSpace
- `DiscreteActSpace` is similar to `MultiDiscreteActSpace` in the sense that it focuses on **discrete** actions. It comes with a main restriction though: you can only do one action at a time. For example, you cannot "modify a substation" AND "disconnect a powerline" with the same action. More information is provided at https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.DiscreteActSpace. We recommend using it if you want to focus on **discrete** actions and the algorithm you want to use is not able to deal with `MultiDiscreteActSpace`.
- You can also fully customize the way you "represent" the action. More information is given in the notebook [11_IntegrationWithExistingRLFrameworks](11_IntegrationWithExistingRLFrameworks.ipynb)
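
To give an idea of what "binning" a continuous action means, here is a minimal sketch in plain Python. It is not the code used by `MultiDiscreteActSpace`: the function names, the range and the number of bins are all assumptions made for this illustration.

```python
def value_to_bin(value, low, high, n_bins):
    """Map a continuous value in [low, high] to an integer bin in [0, n_bins - 1]."""
    value = min(max(value, low), high)  # clip to the allowed range
    width = (high - low) / n_bins
    return min(int((value - low) / width), n_bins - 1)


def bin_to_value(bin_id, low, high, n_bins):
    """Map a bin back to a continuous value (here the center of the bin)."""
    width = (high - low) / n_bins
    return low + (bin_id + 0.5) * width


# hypothetical redispatching range of +/- 5 MW split in 10 bins
low, high, n_bins = -5.0, 5.0, 10
bin_id = value_to_bin(1.3, low, high, n_bins)
print(bin_id, bin_to_value(bin_id, low, high, n_bins))
```

The value recovered from a bin is in general not the value that was binned, which is one reason why binning is described above as "not ideal but doable".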

In the next section we will show an agent working with `DiscreteActSpace`. The code shown can easily be adapted to the other types of action spaces.

### Create a gym compatible environment

The first step is to "convert" your environment into a gym environment, for easier manipulation.

In [None]:
from grid2op.gym_compat import GymEnv
import grid2op
from gym import Env
from gym.utils.env_checker import check_env
try:
    from lightsim2grid import LightSimBackend
    bk_cls = LightSimBackend
except ImportError as exc:
    print(f"Error: {exc} when importing faster LightSimBackend")
    from grid2op.Backend import PandaPowerBackend
    bk_cls = PandaPowerBackend
 
env_name = "l2rpn_case14_sandbox"
training_env = grid2op.make(env_name, test=True, backend=bk_cls()) # we put "test=True" in this notebook because...
# it's a notebook to explain things. Of course, do not put "test=True" if you really want
# to train an agent...
gym_env = GymEnv(training_env)

In [None]:
isinstance(gym_env, Env)

In [None]:
check_env(gym_env, warn=False)

As you can see, `gym_env` is really an environment from gym. It also meets the gym requirements and passes the checks performed by gym.

By default however, the action space and observation space are dictionaries, which is not convenient for most machine learning algorithms (they need to be transformed into vectors somehow).

In [None]:
gym_env.action_space

In [None]:
gym_env.observation_space

This is why it is often a good idea to customize the environment to have proper observations and actions. 

This is covered more in depth on the notebook [11_IntegrationWithExistingRLFrameworks](11_IntegrationWithExistingRLFrameworks.ipynb) (especially for the observation space). 

Here we use simple converters to focus on the training of the agent. You can set them with:

In [None]:
from grid2op.gym_compat import DiscreteActSpace
gym_env.action_space = DiscreteActSpace(training_env.action_space,
 attr_to_keep=["set_bus" , "set_line_status_simple"])
gym_env.action_space

In [None]:
from grid2op.gym_compat import BoxGymObsSpace
gym_env.observation_space = BoxGymObsSpace(training_env.observation_space,
 attr_to_keep=["rho"])
gym_env.observation_space

And now your agent will receive a "box" as observation:

In [None]:
obs = gym_env.reset()
obs

And will be able to interact with the environment with standard "int" (integers that are the normal way to represent "Discrete" gym space)

In [None]:
obs, reward, done, info = gym_env.step(0) # perform action labeled 0
obs

In [None]:
obs, reward, done, info = gym_env.step(53) # perform action labeled 53
obs

### Even better gym environment
The environment shown here directly "maps" a grid2op environment to a gym environment: at each "grid2op step" the agent is asked to choose an action.


After a few iterations of the "l2rpn" competitions, it appears that the vast majority of top performers rely on "heuristics" to handle the problem posed in grid2op.

For example, most of the time the "agent" does not take any action when all the flows are below a certain threshold (say 90%) and only acts when at least one powerline sees a flow above this limit.

In `l2rpn_baselines` we directly embedded the possibility to have a "gym environment" that applies some of these heuristics. For example, the "gym environment" keeps performing "steps" (with the do nothing action) while all the flows are below a certain threshold. Otherwise it asks the agent for an action. In this setting, the agent is directly trained with the heuristics used at inference time.

This is available in:

```python
from l2rpn_baselines.utils import GymEnvWithReco, GymEnvWithRecoWithDN
```
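
To illustrate the "do nothing while the grid is safe" heuristic described above, here is a minimal sketch in plain Python. It is not the implementation used in `l2rpn_baselines`: `needs_agent_action` and `step_until_danger` are hypothetical helpers, and the sketch only assumes that the environment has a `step(action)` method returning `(obs, reward, done, info)` and that observations have a `rho` attribute.

```python
def needs_agent_action(rho, threshold=0.9):
    """True when at least one powerline has a flow above the threshold."""
    return any(r > threshold for r in rho)


def step_until_danger(env, do_nothing, obs, threshold=0.9):
    """Keep playing "do nothing" while all the flows are below the threshold.

    Returns the first observation for which the agent has to decide, the reward
    accumulated in the meantime and the "done" flag.
    """
    total_reward = 0.0
    done = False
    while not done and not needs_agent_action(obs.rho, threshold):
        obs, reward, done, _ = env.step(do_nothing)
        total_reward += reward
    return obs, total_reward, done
```

With such a helper, the agent is only queried on the observations returned by `step_until_danger`, so it never has to learn by itself when to "do nothing".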


## II) Train an Agent

In this tutorial we will train an agent that uses only discrete actions thanks to the "stable baselines 3" RL package. 

More precisely, we will use the "PPO" algorithm. We do not cover the details of such algorithm here.

**NB** here we show a minimal complete code to get started. We recommend however to use the l2rpn_baselines `PPO_SB3` model (that integrates all this, with much better customization) for such purpose.

In [None]:
from stable_baselines3 import PPO
nn_model = PPO(env=gym_env,
 learning_rate=1e-3,
 policy="MlpPolicy",
 policy_kwargs={"net_arch": [100, 100, 100]},
 n_steps=2,
 batch_size=8,
 verbose=True,
 )

In [None]:
nn_model.learn(total_timesteps=LEARNING_ITERATION)

### Important notes

A few notes here:
- the hyper parameters ("net_arch", "learning_rate", "n_steps", "gamma", "batch_size", ...) are not fine tuned and are unlikely to lead to good results
- the number of training steps is way too low (it should be much greater, probably >= 1M)
- there are no "heuristics" embedded here. Agents perform best when they act only when the grid is "in danger" (otherwise it takes a while for an agent to learn to "do nothing")
- we use a "test" environment, this is why we specified "test=True" at the environment creation.
- observations are not scaled
- the reward is not tuned for a particular goal: the default reward is used and this might not be a good thing to do for the particular agent we are trying to build.

All of these probably lead to a quite terrible agent...

### Other important notes

We do not recommend training a "real" model with such a simple setup. In particular, you might need to save your agent (at different stages of training), log the progress using tensorboard (for example) etc.

This can be done with "callbacks", which are not covered in this "getting started".
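
For reference only, a minimal sketch of what saving the model during training could look like with the `CheckpointCallback` of stable baselines 3. The helper `learn_with_checkpoints`, the saving directory, the saving frequency and the file prefix are assumptions made for this illustration.

```python
def learn_with_checkpoints(model, total_timesteps,
                           save_dir="./saved_model", save_freq=1000):
    """Train `model` and save a copy of it every `save_freq` steps."""
    # imported here so that defining the helper does not require stable baselines 3
    from stable_baselines3.common.callbacks import CheckpointCallback

    checkpoint = CheckpointCallback(save_freq=save_freq,
                                    save_path=save_dir,
                                    name_prefix="ppo_grid2op")  # hypothetical prefix
    model.learn(total_timesteps=total_timesteps, callback=checkpoint)
    return model
```

It could then be called with `learn_with_checkpoints(nn_model, LEARNING_ITERATION)`. For tensorboard, the `PPO` constructor accepts a `tensorboard_log` argument giving the directory where the logs are written.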


### Yet another important note
If you choose to use "l2rpn baselines", it will be as easy to train an agent taking into account all of the above if you use:

```python
from l2rpn_baselines.utils import GymEnvWithRecoWithDN
from l2rpn_baselines.PPO_SB3 import train
```

You can have a look at the "examples" section of the l2rpn baselines repository (https://github.com/rte-france/l2rpn-baselines/tree/master/examples)





## III) Evaluating the Agent

And now, it is time to test the agent that we trained on different scenarios (ideally) using another environment (to mimic a real process)

To do that, we have two main choices:
1) we use the trained model, convert the other environment to a gym env, run the environment and ask the neural network what it should do
2) embed the trained neural network in a "grid2op agent" class and use the built-in grid2op runner transparently

Solution 1) is to be preferred for "quick and dirty" tests. In all other cases we strongly recommend using solution 2), as it spares you from recoding the "gym loop" and lets you benefit from the saving functionality (so as to be able to inspect your agent in grid2viz for example) etc.
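
For completeness, the "gym loop" of solution 1) could look like the minimal sketch below. The helper `run_gym_episode` is hypothetical. It assumes the same gym interface as the rest of this notebook (`reset()` returns the observation and `step(...)` returns 4 values) and a model with a `predict` method, as is the case for stable baselines 3 models.

```python
def run_gym_episode(gym_env, model, max_steps=10):
    """Play one episode with a trained model.

    Returns the cumulated reward and the number of steps performed.
    """
    obs = gym_env.reset()
    total_reward, nb_steps, done = 0.0, 0, False
    while not done and nb_steps < max_steps:
        # ask the model for the (gym) action to perform
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = gym_env.step(action)
        total_reward += float(reward)
        nb_steps += 1
    return total_reward, nb_steps
```

Nothing is saved on disk with this loop, which is one of the reasons why solution 2) is recommended for anything more than a quick test.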

We will code a minimal agent to leverage solution 2. As always, this is much easier to use with l2rpn_baselines package...

### III.A) Create the "grid2op agent"

In [None]:
from grid2op.Agent import BaseAgent

class MyPPOAgent(BaseAgent):
    def __init__(self,
                 action_space,  # grid2op action space
                 used_trained_gym_env,  # gym env used for training
                 trained_ppo_nnet,  # neural net after training
                 ):
        super().__init__(action_space)
        self.nn_model = trained_ppo_nnet
        self.gym_env = used_trained_gym_env

    def act(self, observation, reward, done=False):
        # convert the grid2op observation to the gym representation used for training
        gym_obs = self.gym_env.observation_space.to_gym(observation)
        # ask the neural network which (gym) action to perform
        gym_act, _ = self.nn_model.predict(gym_obs, deterministic=True)
        # convert this gym action back to a grid2op action
        grid2op_act = self.gym_env.action_space.from_gym(gym_act)
        return grid2op_act
 
my_agent = MyPPOAgent(training_env.action_space, gym_env, nn_model)

**NB** Of course, if you use "l2rpn_baselines" you are not forced to recode it. Everything is properly set up for you to use directly the "trained l2rpn_baselines agent" in the grid2op runner !

### III.B) Perform the evaluation

Now we create the test environment (in this case it's the same as the training one, but we really do recommend to have different environments with different data...)

For a real experiment this could look like:

```python
testing_env = grid2op.make(env_name+"_test", ...)
```

In [None]:
testing_env = grid2op.make(env_name, test=True, backend=bk_cls()) 

In [None]:
from l2rpn_baselines.DuelQSimple import evaluate
import shutil
from tqdm.notebook import tqdm
from grid2op.Runner import Runner
from grid2op.Episode import EpisodeData

save_path = "saved_agent_PPO_{}".format(LEARNING_ITERATION)
path_save_results = "{}_results".format(save_path)
shutil.rmtree(path_save_results, ignore_errors=True)

test_runner = Runner(**testing_env.get_params_for_runner(),
 agentInstance=my_agent, agentClass=None)
res = test_runner.run(nb_episode=EVAL_EPISODE,
 max_iter=MAX_EVAL_STEPS,
 pbar=tqdm,
 path_save=f"./{path_save_results}")

### III.C) Inspect the Agent

Please refer to the official documentation for more information about the contents of the directory where the data is saved. Note that saving the information is triggered by the "path_save" argument of the "runner.run" function.

The information contained in this output is saved in a structured way. For each episode, it includes:

- "episode_meta.json": json file that represents some meta information about:
    - "backend_type": the name of the `grid2op.Backend` class used
    - "chronics_max_timestep": the **maximum** number of timesteps for the chronics used
    - "chronics_path": the path where the temporal data (chronics) are located
    - "env_type": the name of the `grid2op.Environment` class used.
    - "grid_path": the path where the powergrid has been loaded from
- "episode_times.json": json file that gives some information about the total time spent in multiple parts of the runner, mainly the `grid2op.Agent` (and especially its method `grid2op.Agent.act`) and the `grid2op.Environment`
- "_parameters.json": json representation of the `grid2op.Parameters` used for this episode
- "rewards.npy": numpy 1d-array giving the rewards at each time step. We adopted the convention that the stored reward at index `i` is the one observed by the agent at time `i` and **NOT** the reward sent by the `grid2op.Environment` after the action has been taken.
- "exec_times.npy": numpy 1d-array giving the execution time for each time step in the episode
- "actions.npy": numpy 2d-array giving the actions that have been taken by the `grid2op.Agent`. At row `i` of "actions.npy" is a vectorized representation of the action performed by the agent at timestep `i` *ie.* **after** having observed the observation present at row `i` of "observations.npy" and the reward shown in row `i` of "rewards.npy".
- "disc_lines.npy": numpy 2d-array that tells which lines have been disconnected during the simulation of the cascading failure at each time step. The same convention as for "rewards.npy" has been adopted. This means that the powerlines are disconnected when the `grid2op.Agent` takes the `grid2op.Action` at time step `i`.
- "observations.npy": numpy 2d-array representing the `grid2op.Observation` at the disposal of the `grid2op.Agent` when it took its action.

We can first look at the directory where the data is stored:

In [None]:
import os
os.listdir(path_save_results)

As we can see, there is one folder per evaluated episode (`EVAL_EPISODE` of them here), each corresponding to a chronics.
There are also additional json files.

Now let's see what is inside one of these folders:

In [None]:
EpisodeData.list_episode(path_save_results)

For example, we can load the actions chosen by the Agent, and have a look at them.

To do that, we will load the episode from disk with `EpisodeData.from_disk`, which gives access to the actions as `Action` objects.

In [None]:
all_episodes = EpisodeData.list_episode(path_save_results)
this_episode = EpisodeData.from_disk(*all_episodes[0])
li_actions = this_episode.actions

This allows us to have a deeper look at the actions, and their effects.

Now we will inspect the actions that have been taken by the agent:

In [None]:
line_disc = 0
line_reco = 0
for act in li_actions:
    dict_ = act.as_dict()
    if "set_line_status" in dict_:
        line_reco += dict_["set_line_status"]["nb_connected"]
        line_disc += dict_["set_line_status"]["nb_disconnected"]
print(f'Total reconnected lines : {line_reco}')
print(f'Total disconnected lines : {line_disc}')

As we can see, during this episode, our agent never tries to disconnect or reconnect a line.

We can also analyse the observations of the recorded episode :

In [None]:
li_observations = this_episode.observations
nb_real_disc = 0
for obs_ in li_observations:
    nb_real_disc += (obs_.line_status == False).sum()
print(f'Total number of disconnected powerlines cumulated over all the timesteps : {nb_real_disc}')

We can also look at the kind of actions that the agent chose:

In [None]:
actions_count = {}
for act in li_actions:
    act_as_vect = tuple(act.to_vect())
    if act_as_vect not in actions_count:
        actions_count[act_as_vect] = 0
    actions_count[act_as_vect] += 1
print("The agent did {} different valid actions:\n".format(len(actions_count)))

The actions chosen by the agent were :

In [None]:
for act in li_actions:
 print(act)

## IV) Improve your Agent 

As we stated above, the goal of this notebook was not to show a "state of the art" agent but rather to explain with a "minimal example" how you could get started into training an agent operating a powergrid.

To improve your agent, we strongly recommend to use the l2rpn_baselines repository, to fine tune the hyper parameters of your agent, to train it for longer on more diverse data, to use a different RL algorithm (maybe PPO is not the best one for this environment?) etc.

Using some pre training (for example training the policy in a supervised fashion using heuristic or human demonstration data) might be a good solution.

Another promising tool would be to use some "curriculum learning" where, at the beginning, the agent would interact with a simplified version of the environment (easier rules, no disconnection due to overflow, no cooldown after an action has been made, etc.) and, after some time, when the agent starts to perform well, the difficulty would be increased.
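
As an illustration of this idea, the sketch below defines a possible schedule of rules. It is only a sketch under assumptions: the two helpers, the stage boundaries (30% and 70% of the training) and the final cooldown of 3 steps are arbitrary choices made for this example. The keys are names of attributes of `grid2op.Parameters`, and `apply_rules` assumes the environment exposes `parameters` and `change_parameters(...)` (the change being taken into account at the next `env.reset()`).

```python
def curriculum_parameters(progress, final_cooldown=3):
    """Rules to use for a training progress between 0.0 (start) and 1.0 (end)."""
    if progress < 0.3:
        # easiest stage: no disconnection due to overflow, no cooldown
        return {"NO_OVERFLOW_DISCONNECTION": True,
                "NB_TIMESTEP_COOLDOWN_LINE": 0,
                "NB_TIMESTEP_COOLDOWN_SUB": 0}
    if progress < 0.7:
        # intermediate stage: overflows disconnect powerlines, still no cooldown
        return {"NO_OVERFLOW_DISCONNECTION": False,
                "NB_TIMESTEP_COOLDOWN_LINE": 0,
                "NB_TIMESTEP_COOLDOWN_SUB": 0}
    # last stage: overflows and cooldowns are both active
    return {"NO_OVERFLOW_DISCONNECTION": False,
            "NB_TIMESTEP_COOLDOWN_LINE": final_cooldown,
            "NB_TIMESTEP_COOLDOWN_SUB": final_cooldown}


def apply_rules(env, rules):
    """Set the given rules on the environment parameters."""
    param = env.parameters
    for name, value in rules.items():
        setattr(param, name, value)
    env.change_parameters(param)
```

During training, one could call `apply_rules(training_env, curriculum_parameters(progress))` from time to time, followed by a reset of the environment.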

Yet another possibility would be to use some kind of "model based" reinforcement learning, such as alpha-* models (*eg* alpha go).