{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Monte-Carlo control "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" import google.colab\n",
" IN_COLAB = True\n",
"except:\n",
" IN_COLAB = False\n",
"\n",
"if IN_COLAB:\n",
" !pip install -U gymnasium pygame moviepy\n",
" !pip install gymnasium[box2d]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gym version: 0.26.3\n"
]
}
],
"source": [
"import numpy as np\n",
"rng = np.random.default_rng()\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"\n",
"import gymnasium as gym\n",
"print(\"gym version:\", gym.__version__)\n",
"\n",
"from moviepy.editor import ImageSequenceClip, ipython_display\n",
"\n",
"class GymRecorder(object):\n",
" \"\"\"\n",
" Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.\n",
" \n",
" The environment must have the render_mode `rgb_array_list`.\n",
" \"\"\"\n",
" def __init__(self, env):\n",
" self.env = env\n",
" self._frames = []\n",
"\n",
" def record(self, frames):\n",
" \"To be called at the end of an episode.\"\n",
" for frame in frames:\n",
" self._frames.append(np.array(frame))\n",
"\n",
" def make_video(self, filename):\n",
" \"Generates the gif video.\"\n",
" directory = os.path.dirname(os.path.abspath(filename))\n",
" if not os.path.exists(directory):\n",
" os.mkdir(directory)\n",
" self.clip = ImageSequenceClip(list(self._frames), fps=self.env.metadata[\"render_fps\"])\n",
" self.clip.write_gif(filename, fps=self.env.metadata[\"render_fps\"], loop=0)\n",
" del self._frames\n",
" self._frames = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The taxi environment\n",
"\n",
"In this exercise, we are going to apply **on-policy Monte-Carlo control** on the Taxi environment available in gym:\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the environment in ansi mode, initialize it, and render the first state:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"374\n",
"+---------+\n",
"|R: | : :G|\n",
"| : | : : |\n",
"| : : : : |\n",
"| | : |\u001b[43m \u001b[0m: |\n",
"|\u001b[35mY\u001b[0m| : |\u001b[34;1mB\u001b[0m: |\n",
"+---------+\n",
"\n",
"\n"
]
}
],
"source": [
"env = gym.make(\"Taxi-v3\", render_mode='ansi')\n",
"state, info = env.reset()\n",
"print(state)\n",
"print(env.render())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The agent is the black square. It can move up, down, left or right if there is no wall (the pipes and dashes). Its goal is to pick clients at the blue location and drop them off at the purple location. These locations are fixed (R, G, B, Y), but which one is the pick-up location and which one is the drop-off destination changes between each episode.\n",
"\n",
"**Q:** Re-run the previous cell multiple times to observe the diversity of initial states.\n",
"\n",
"The following cell prints the action space of the environment: "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Action Space: Discrete(6)\n",
"Number of actions: 6\n"
]
}
],
"source": [
"print(\"Action Space:\", env.action_space)\n",
"print(\"Number of actions:\", env.action_space.n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 6 discrete actions: south, north, east, west, pickup, dropoff.\n",
" \n",
"Let's now look at the observation space (state space):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"State Space: Discrete(500)\n",
"Number of states: 500\n"
]
}
],
"source": [
"print(\"State Space:\", env.observation_space)\n",
"print(\"Number of states:\", env.observation_space.n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 500 discrete states. What are they?\n",
"\n",
"* The taxi can be anywhere in the 5x5 grid, giving 25 different locations.\n",
"* The passenger can be at any of the four locations R, G, B, Y or in the taxi: 5 values.\n",
"* The destination can be any of the four locations: 4 values.\n",
"\n",
"This gives indeed 25x5x4 = 500 different combinations.\n",
"\n",
"The internal representation of a state is a number between 0 and 499. You can use the `encode` and `decode` methods of the environment to relate it to the state variables."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"State: 224\n",
"State: [3, 1, 2, 0]\n"
]
}
],
"source": [
"state = env.encode(2, 1, 1, 0) # (taxi row, taxi column, passenger index, destination index)\n",
"print(\"State:\", state)\n",
"\n",
"state = env.decode(328) \n",
"print(\"State:\", list(state))"
]
},
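{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check of the two outputs above, the index can be worked out by hand. The formula below is a sketch of the mixed-radix encoding used by Taxi-v3, where $r$ and $c$ (taxi row and column), $p$ (passenger index) and $d$ (destination index) are notation introduced here, not names from the library:\n",
"\n",
"$$s = ((5 r + c) \\cdot 5 + p) \\cdot 4 + d$$\n",
"\n",
"For `encode(2, 1, 1, 0)`, this gives $((5 \\cdot 2 + 1) \\cdot 5 + 1) \\cdot 4 + 0 = 56 \\cdot 4 = 224$. Decoding reverses the steps with successive remainders: $328 = 82 \\cdot 4 + 0$, $82 = 16 \\cdot 5 + 2$ and $16 = 3 \\cdot 5 + 1$, hence $(r, c, p, d) = (3, 1, 2, 0)$."
]
},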
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reward function is simple:\n",
"\n",
"* r = 20 when delivering the client at the correct location.\n",
"* r = -10 when picking or dropping a client illegally (picking where there is no client, dropping a client somewhere else, etc)\n",
"* r = -1 for all other transitions in order to incent the agent to be as fast as possible.\n",
"\n",
"The actions pickup and dropoff are very dangerous: take them at the wrong time and your return will be very low. The navigation actions are less critical.\n",
"\n",
"Depending on the initial state, the taxi will need at least 10 steps to deliver the client, so the maximal return you can expect is around 10 (+20 for the success, -1 for all the steps). \n",
"\n",
"The task is episodic: if you have not delivered the client within 200 steps, the episode stops (no particular reward)."
]
},
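{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the arithmetic explicit, here is a sketch of the return under the reward function above, assuming no discounting and no illegal pickup or dropoff. $T$, the step at which the client is delivered, is notation introduced here:\n",
"\n",
"$$R = 20 + (T - 1) \\cdot (-1) = 21 - T$$\n",
"\n",
"An episode solved in $T = 10$ steps therefore yields $R = 11$, while an episode truncated after 200 steps without any illegal action yields $R = -200$. Each illegal pickup or dropoff costs 10 instead of 1, so it lowers the return by a further 9."
]
},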
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random agent\n",
"\n",
"Let's now define a random agent that just samples the action space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q:** Modify the random agent of last time, so that it accepts the `GymRecorder` that generates the .gif file.\n",
"\n",
"```python\n",
"def train(self, nb_episodes, recorder=None):\n",
"```\n",
"\n",
"The environment should be started in 'rgb_array_list' mode, not 'ansi'. The game looks different but has the same rules.\n",
"\n",
"```python\n",
"env = gym.make(\"Taxi-v3\", render_mode='rgb_array_list')\n",
"recorder = GymRecorder(env)\n",
"```\n",
"\n",
"As episodes in Taxi can be quite long, only the last episode should be recorded:\n",
"\n",
"```python\n",
"if recorder is not None and episode == nb_episodes -1:\n",
" recorder.record(self.env.render())\n",
"```\n",
"\n",
"Perform 10 episodes, plot the obtained returns and vidualize the last episode."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"class RandomAgent:\n",
" \"\"\"\n",
" Random agent exploring uniformly the environment.\n",
" \"\"\"\n",
" \n",
" def __init__(self, env):\n",
" self.env = env\n",
" \n",
" def act(self, state):\n",
" \"Returns a random action by sampling the action space.\"\n",
" return self.env.action_space.sample()\n",
" \n",
" def update(self, state, action, reward, next_state):\n",
" \"Updates the agent using the transition (s, a, r, s').\"\n",
" pass\n",
" \n",
" def train(self, nb_episodes, recorder=None):\n",
" \"Runs the agent on the environment for nb_episodes. Returns the list of obtained rewards.\"\n",
" # List of returns\n",
" returns = []\n",
"\n",
" for episode in range(nb_episodes):\n",
"\n",
" # Sample the initial state\n",
" state, info = self.env.reset()\n",
"\n",
" return_episode = 0.0\n",
" done = False\n",
" while not done:\n",
"\n",
" # Select an action randomly\n",
" action = self.env.action_space.sample()\n",
" \n",
" # Sample a single transition\n",
" next_state, reward, terminal, truncated, info = self.env.step(action)\n",
" \n",
" # Go in the next state\n",
" state = next_state\n",
"\n",
" # Update return\n",
" return_episode += reward\n",
"\n",
" # End of the episode\n",
" done = terminal or truncated\n",
"\n",
" # Record at the end of the episode\n",
" if recorder is not None and episode == nb_episodes -1:\n",
" recorder.record(self.env.render())\n",
" \n",
" # Append return\n",
" returns.append(return_episode)\n",
"\n",
" return returns"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"MoviePy - Building file videos/taxi.gif with imageio.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"data": {
"text/html": [
"