{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "ctpoSTL9oGq7" }, "source": [ "# Gym environments" ] }, { "cell_type": "markdown", "metadata": { "id": "PsWSoHfooGrJ" }, "source": [ "## Installing gym\n", "\n", "In this course, we will mostly address RL environments available in the **OpenAI Gym** framework.\n", "\n", "It provides a multitude of RL problems, from simple text-based problems with a few dozen states (Gridworld, Taxi) to continuous control problems (CartPole, Pendulum) to Atari games (Breakout, Space Invaders) to complex robotics simulators (MuJoCo).\n", "\n", "However, `gym` has not been maintained by OpenAI since September 2022. We will instead use the `gymnasium` library from the Farama Foundation, which continues to maintain and improve it.\n", "\n", "You can install gymnasium and its dependencies using:\n", "\n", "```bash\n", "pip install -U gymnasium pygame moviepy swig\n", "pip install gymnasium[classic_control]\n", "pip install gymnasium[box2d]\n", "```\n", "\n", "For this exercise and the following ones, we will focus on simple environments whose installation is straightforward: toy text, classic control and box2d. More complex environments based on Atari games or the MuJoCo physics simulator are described in the last (optional) section of this notebook, as they require additional dependencies. \n", "\n", "On Colab, `gym` cannot open graphical windows to visualize the environments, as this is not possible in the browser. We will see a workaround that produces videos instead. 
Running the following cell in Colab should allow you to run the simplest environments:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "YxLOmLKIoGrN" }, "outputs": [], "source": [ "try:\n", "    import google.colab\n", "    IN_COLAB = True\n", "except ImportError:\n", "    IN_COLAB = False\n", "\n", "if IN_COLAB:\n", "    !pip install -U gymnasium pygame moviepy swig\n", "    !pip install gymnasium[box2d]\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gym version: 0.29.1\n" ] } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import os\n", "\n", "import gymnasium as gym\n", "print(\"gym version:\", gym.__version__)\n", "\n", "from moviepy.editor import ImageSequenceClip, ipython_display\n", "\n", "class GymRecorder(object):\n", "    \"\"\"\n", "    Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.\n", "\n", "    The environment must have the render_mode `rgb_array_list`.\n", "    \"\"\"\n", "    def __init__(self, env):\n", "        self.env = env\n", "        self._frames = []\n", "\n", "    def record(self, frames):\n", "        \"To be called at the end of an episode.\"\n", "        for frame in frames:\n", "            self._frames.append(np.array(frame))\n", "\n", "    def make_video(self, filename):\n", "        \"Generates the gif video.\"\n", "        directory = os.path.dirname(os.path.abspath(filename))\n", "        # Create the output directory (and any missing parents) if needed\n", "        os.makedirs(directory, exist_ok=True)\n", "        self.clip = ImageSequenceClip(list(self._frames), fps=self.env.metadata[\"render_fps\"])\n", "        self.clip.write_gif(filename, fps=self.env.metadata[\"render_fps\"], loop=0)\n", "        # Reset the frame buffer for the next recording\n", "        self._frames = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interacting with an environment\n", "\n", "A gym environment is created using:\n", "\n", "```python\n", "env = gym.make('CartPole-v1', render_mode=\"human\")\n", "```\n", "\n", "where 'CartPole-v1' should be replaced by the environment you 
want to interact with. The following cell lists the environments available to you (including the different versions)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CartPole-v0\n", "CartPole-v1\n", "MountainCar-v0\n", "MountainCarContinuous-v0\n", "Pendulum-v1\n", "Acrobot-v1\n", "phys2d/CartPole-v0\n", "phys2d/CartPole-v1\n", "phys2d/Pendulum-v0\n", "LunarLander-v2\n", "LunarLanderContinuous-v2\n", "BipedalWalker-v3\n", "BipedalWalkerHardcore-v3\n", "CarRacing-v2\n", "Blackjack-v1\n", "FrozenLake-v1\n", "FrozenLake8x8-v1\n", "CliffWalking-v0\n", "Taxi-v3\n", "tabular/Blackjack-v0\n", "tabular/CliffWalking-v0\n", "Reacher-v2\n", "Reacher-v4\n", "Pusher-v2\n", "Pusher-v4\n", "InvertedPendulum-v2\n", "InvertedPendulum-v4\n", "InvertedDoublePendulum-v2\n", "InvertedDoublePendulum-v4\n", "HalfCheetah-v2\n", "HalfCheetah-v3\n", "HalfCheetah-v4\n", "Hopper-v2\n", "Hopper-v3\n", "Hopper-v4\n", "Swimmer-v2\n", "Swimmer-v3\n", "Swimmer-v4\n", "Walker2d-v2\n", "Walker2d-v3\n", "Walker2d-v4\n", "Ant-v2\n", "Ant-v3\n", "Ant-v4\n", "Humanoid-v2\n", "Humanoid-v3\n", "Humanoid-v4\n", "HumanoidStandup-v2\n", "HumanoidStandup-v4\n", "GymV26Environment-v0\n", "GymV21Environment-v0\n", "Adventure-v0\n", "AdventureDeterministic-v0\n", "AdventureNoFrameskip-v0\n", "Adventure-v4\n", "AdventureDeterministic-v4\n", "AdventureNoFrameskip-v4\n", "Adventure-ram-v0\n", "Adventure-ramDeterministic-v0\n", "Adventure-ramNoFrameskip-v0\n", "Adventure-ram-v4\n", "Adventure-ramDeterministic-v4\n", "Adventure-ramNoFrameskip-v4\n", "AirRaid-v0\n", "AirRaidDeterministic-v0\n", "AirRaidNoFrameskip-v0\n", "AirRaid-v4\n", "AirRaidDeterministic-v4\n", "AirRaidNoFrameskip-v4\n", "AirRaid-ram-v0\n", "AirRaid-ramDeterministic-v0\n", "AirRaid-ramNoFrameskip-v0\n", "AirRaid-ram-v4\n", "AirRaid-ramDeterministic-v4\n", "AirRaid-ramNoFrameskip-v4\n", "Alien-v0\n", "AlienDeterministic-v0\n", "AlienNoFrameskip-v0\n", 
"Alien-v4\n", "AlienDeterministic-v4\n", "AlienNoFrameskip-v4\n", "Alien-ram-v0\n", "Alien-ramDeterministic-v0\n", "Alien-ramNoFrameskip-v0\n", "Alien-ram-v4\n", "Alien-ramDeterministic-v4\n", "Alien-ramNoFrameskip-v4\n", "Amidar-v0\n", "AmidarDeterministic-v0\n", "AmidarNoFrameskip-v0\n", "Amidar-v4\n", "AmidarDeterministic-v4\n", "AmidarNoFrameskip-v4\n", "Amidar-ram-v0\n", "Amidar-ramDeterministic-v0\n", "Amidar-ramNoFrameskip-v0\n", "Amidar-ram-v4\n", "Amidar-ramDeterministic-v4\n", "Amidar-ramNoFrameskip-v4\n", "Assault-v0\n", "AssaultDeterministic-v0\n", "AssaultNoFrameskip-v0\n", "Assault-v4\n", "AssaultDeterministic-v4\n", "AssaultNoFrameskip-v4\n", "Assault-ram-v0\n", "Assault-ramDeterministic-v0\n", "Assault-ramNoFrameskip-v0\n", "Assault-ram-v4\n", "Assault-ramDeterministic-v4\n", "Assault-ramNoFrameskip-v4\n", "Asterix-v0\n", "AsterixDeterministic-v0\n", "AsterixNoFrameskip-v0\n", "Asterix-v4\n", "AsterixDeterministic-v4\n", "AsterixNoFrameskip-v4\n", "Asterix-ram-v0\n", "Asterix-ramDeterministic-v0\n", "Asterix-ramNoFrameskip-v0\n", "Asterix-ram-v4\n", "Asterix-ramDeterministic-v4\n", "Asterix-ramNoFrameskip-v4\n", "Asteroids-v0\n", "AsteroidsDeterministic-v0\n", "AsteroidsNoFrameskip-v0\n", "Asteroids-v4\n", "AsteroidsDeterministic-v4\n", "AsteroidsNoFrameskip-v4\n", "Asteroids-ram-v0\n", "Asteroids-ramDeterministic-v0\n", "Asteroids-ramNoFrameskip-v0\n", "Asteroids-ram-v4\n", "Asteroids-ramDeterministic-v4\n", "Asteroids-ramNoFrameskip-v4\n", "Atlantis-v0\n", "AtlantisDeterministic-v0\n", "AtlantisNoFrameskip-v0\n", "Atlantis-v4\n", "AtlantisDeterministic-v4\n", "AtlantisNoFrameskip-v4\n", "Atlantis-ram-v0\n", "Atlantis-ramDeterministic-v0\n", "Atlantis-ramNoFrameskip-v0\n", "Atlantis-ram-v4\n", "Atlantis-ramDeterministic-v4\n", "Atlantis-ramNoFrameskip-v4\n", "BankHeist-v0\n", "BankHeistDeterministic-v0\n", "BankHeistNoFrameskip-v0\n", "BankHeist-v4\n", "BankHeistDeterministic-v4\n", "BankHeistNoFrameskip-v4\n", "BankHeist-ram-v0\n", 
"BankHeist-ramDeterministic-v0\n", "BankHeist-ramNoFrameskip-v0\n", "BankHeist-ram-v4\n", "BankHeist-ramDeterministic-v4\n", "BankHeist-ramNoFrameskip-v4\n", "BattleZone-v0\n", "BattleZoneDeterministic-v0\n", "BattleZoneNoFrameskip-v0\n", "BattleZone-v4\n", "BattleZoneDeterministic-v4\n", "BattleZoneNoFrameskip-v4\n", "BattleZone-ram-v0\n", "BattleZone-ramDeterministic-v0\n", "BattleZone-ramNoFrameskip-v0\n", "BattleZone-ram-v4\n", "BattleZone-ramDeterministic-v4\n", "BattleZone-ramNoFrameskip-v4\n", "BeamRider-v0\n", "BeamRiderDeterministic-v0\n", "BeamRiderNoFrameskip-v0\n", "BeamRider-v4\n", "BeamRiderDeterministic-v4\n", "BeamRiderNoFrameskip-v4\n", "BeamRider-ram-v0\n", "BeamRider-ramDeterministic-v0\n", "BeamRider-ramNoFrameskip-v0\n", "BeamRider-ram-v4\n", "BeamRider-ramDeterministic-v4\n", "BeamRider-ramNoFrameskip-v4\n", "Berzerk-v0\n", "BerzerkDeterministic-v0\n", "BerzerkNoFrameskip-v0\n", "Berzerk-v4\n", "BerzerkDeterministic-v4\n", "BerzerkNoFrameskip-v4\n", "Berzerk-ram-v0\n", "Berzerk-ramDeterministic-v0\n", "Berzerk-ramNoFrameskip-v0\n", "Berzerk-ram-v4\n", "Berzerk-ramDeterministic-v4\n", "Berzerk-ramNoFrameskip-v4\n", "Bowling-v0\n", "BowlingDeterministic-v0\n", "BowlingNoFrameskip-v0\n", "Bowling-v4\n", "BowlingDeterministic-v4\n", "BowlingNoFrameskip-v4\n", "Bowling-ram-v0\n", "Bowling-ramDeterministic-v0\n", "Bowling-ramNoFrameskip-v0\n", "Bowling-ram-v4\n", "Bowling-ramDeterministic-v4\n", "Bowling-ramNoFrameskip-v4\n", "Boxing-v0\n", "BoxingDeterministic-v0\n", "BoxingNoFrameskip-v0\n", "Boxing-v4\n", "BoxingDeterministic-v4\n", "BoxingNoFrameskip-v4\n", "Boxing-ram-v0\n", "Boxing-ramDeterministic-v0\n", "Boxing-ramNoFrameskip-v0\n", "Boxing-ram-v4\n", "Boxing-ramDeterministic-v4\n", "Boxing-ramNoFrameskip-v4\n", "Breakout-v0\n", "BreakoutDeterministic-v0\n", "BreakoutNoFrameskip-v0\n", "Breakout-v4\n", "BreakoutDeterministic-v4\n", "BreakoutNoFrameskip-v4\n", "Breakout-ram-v0\n", "Breakout-ramDeterministic-v0\n", 
"Breakout-ramNoFrameskip-v0\n", "Breakout-ram-v4\n", "Breakout-ramDeterministic-v4\n", "Breakout-ramNoFrameskip-v4\n", "Carnival-v0\n", "CarnivalDeterministic-v0\n", "CarnivalNoFrameskip-v0\n", "Carnival-v4\n", "CarnivalDeterministic-v4\n", "CarnivalNoFrameskip-v4\n", "Carnival-ram-v0\n", "Carnival-ramDeterministic-v0\n", "Carnival-ramNoFrameskip-v0\n", "Carnival-ram-v4\n", "Carnival-ramDeterministic-v4\n", "Carnival-ramNoFrameskip-v4\n", "Centipede-v0\n", "CentipedeDeterministic-v0\n", "CentipedeNoFrameskip-v0\n", "Centipede-v4\n", "CentipedeDeterministic-v4\n", "CentipedeNoFrameskip-v4\n", "Centipede-ram-v0\n", "Centipede-ramDeterministic-v0\n", "Centipede-ramNoFrameskip-v0\n", "Centipede-ram-v4\n", "Centipede-ramDeterministic-v4\n", "Centipede-ramNoFrameskip-v4\n", "ChopperCommand-v0\n", "ChopperCommandDeterministic-v0\n", "ChopperCommandNoFrameskip-v0\n", "ChopperCommand-v4\n", "ChopperCommandDeterministic-v4\n", "ChopperCommandNoFrameskip-v4\n", "ChopperCommand-ram-v0\n", "ChopperCommand-ramDeterministic-v0\n", "ChopperCommand-ramNoFrameskip-v0\n", "ChopperCommand-ram-v4\n", "ChopperCommand-ramDeterministic-v4\n", "ChopperCommand-ramNoFrameskip-v4\n", "CrazyClimber-v0\n", "CrazyClimberDeterministic-v0\n", "CrazyClimberNoFrameskip-v0\n", "CrazyClimber-v4\n", "CrazyClimberDeterministic-v4\n", "CrazyClimberNoFrameskip-v4\n", "CrazyClimber-ram-v0\n", "CrazyClimber-ramDeterministic-v0\n", "CrazyClimber-ramNoFrameskip-v0\n", "CrazyClimber-ram-v4\n", "CrazyClimber-ramDeterministic-v4\n", "CrazyClimber-ramNoFrameskip-v4\n", "Defender-v0\n", "DefenderDeterministic-v0\n", "DefenderNoFrameskip-v0\n", "Defender-v4\n", "DefenderDeterministic-v4\n", "DefenderNoFrameskip-v4\n", "Defender-ram-v0\n", "Defender-ramDeterministic-v0\n", "Defender-ramNoFrameskip-v0\n", "Defender-ram-v4\n", "Defender-ramDeterministic-v4\n", "Defender-ramNoFrameskip-v4\n", "DemonAttack-v0\n", "DemonAttackDeterministic-v0\n", "DemonAttackNoFrameskip-v0\n", "DemonAttack-v4\n", 
"DemonAttackDeterministic-v4\n", "DemonAttackNoFrameskip-v4\n", "DemonAttack-ram-v0\n", "DemonAttack-ramDeterministic-v0\n", "DemonAttack-ramNoFrameskip-v0\n", "DemonAttack-ram-v4\n", "DemonAttack-ramDeterministic-v4\n", "DemonAttack-ramNoFrameskip-v4\n", "DoubleDunk-v0\n", "DoubleDunkDeterministic-v0\n", "DoubleDunkNoFrameskip-v0\n", "DoubleDunk-v4\n", "DoubleDunkDeterministic-v4\n", "DoubleDunkNoFrameskip-v4\n", "DoubleDunk-ram-v0\n", "DoubleDunk-ramDeterministic-v0\n", "DoubleDunk-ramNoFrameskip-v0\n", "DoubleDunk-ram-v4\n", "DoubleDunk-ramDeterministic-v4\n", "DoubleDunk-ramNoFrameskip-v4\n", "ElevatorAction-v0\n", "ElevatorActionDeterministic-v0\n", "ElevatorActionNoFrameskip-v0\n", "ElevatorAction-v4\n", "ElevatorActionDeterministic-v4\n", "ElevatorActionNoFrameskip-v4\n", "ElevatorAction-ram-v0\n", "ElevatorAction-ramDeterministic-v0\n", "ElevatorAction-ramNoFrameskip-v0\n", "ElevatorAction-ram-v4\n", "ElevatorAction-ramDeterministic-v4\n", "ElevatorAction-ramNoFrameskip-v4\n", "Enduro-v0\n", "EnduroDeterministic-v0\n", "EnduroNoFrameskip-v0\n", "Enduro-v4\n", "EnduroDeterministic-v4\n", "EnduroNoFrameskip-v4\n", "Enduro-ram-v0\n", "Enduro-ramDeterministic-v0\n", "Enduro-ramNoFrameskip-v0\n", "Enduro-ram-v4\n", "Enduro-ramDeterministic-v4\n", "Enduro-ramNoFrameskip-v4\n", "FishingDerby-v0\n", "FishingDerbyDeterministic-v0\n", "FishingDerbyNoFrameskip-v0\n", "FishingDerby-v4\n", "FishingDerbyDeterministic-v4\n", "FishingDerbyNoFrameskip-v4\n", "FishingDerby-ram-v0\n", "FishingDerby-ramDeterministic-v0\n", "FishingDerby-ramNoFrameskip-v0\n", "FishingDerby-ram-v4\n", "FishingDerby-ramDeterministic-v4\n", "FishingDerby-ramNoFrameskip-v4\n", "Freeway-v0\n", "FreewayDeterministic-v0\n", "FreewayNoFrameskip-v0\n", "Freeway-v4\n", "FreewayDeterministic-v4\n", "FreewayNoFrameskip-v4\n", "Freeway-ram-v0\n", "Freeway-ramDeterministic-v0\n", "Freeway-ramNoFrameskip-v0\n", "Freeway-ram-v4\n", "Freeway-ramDeterministic-v4\n", "Freeway-ramNoFrameskip-v4\n", 
"Frostbite-v0\n", "FrostbiteDeterministic-v0\n", "FrostbiteNoFrameskip-v0\n", "Frostbite-v4\n", "FrostbiteDeterministic-v4\n", "FrostbiteNoFrameskip-v4\n", "Frostbite-ram-v0\n", "Frostbite-ramDeterministic-v0\n", "Frostbite-ramNoFrameskip-v0\n", "Frostbite-ram-v4\n", "Frostbite-ramDeterministic-v4\n", "Frostbite-ramNoFrameskip-v4\n", "Gopher-v0\n", "GopherDeterministic-v0\n", "GopherNoFrameskip-v0\n", "Gopher-v4\n", "GopherDeterministic-v4\n", "GopherNoFrameskip-v4\n", "Gopher-ram-v0\n", "Gopher-ramDeterministic-v0\n", "Gopher-ramNoFrameskip-v0\n", "Gopher-ram-v4\n", "Gopher-ramDeterministic-v4\n", "Gopher-ramNoFrameskip-v4\n", "Gravitar-v0\n", "GravitarDeterministic-v0\n", "GravitarNoFrameskip-v0\n", "Gravitar-v4\n", "GravitarDeterministic-v4\n", "GravitarNoFrameskip-v4\n", "Gravitar-ram-v0\n", "Gravitar-ramDeterministic-v0\n", "Gravitar-ramNoFrameskip-v0\n", "Gravitar-ram-v4\n", "Gravitar-ramDeterministic-v4\n", "Gravitar-ramNoFrameskip-v4\n", "Hero-v0\n", "HeroDeterministic-v0\n", "HeroNoFrameskip-v0\n", "Hero-v4\n", "HeroDeterministic-v4\n", "HeroNoFrameskip-v4\n", "Hero-ram-v0\n", "Hero-ramDeterministic-v0\n", "Hero-ramNoFrameskip-v0\n", "Hero-ram-v4\n", "Hero-ramDeterministic-v4\n", "Hero-ramNoFrameskip-v4\n", "IceHockey-v0\n", "IceHockeyDeterministic-v0\n", "IceHockeyNoFrameskip-v0\n", "IceHockey-v4\n", "IceHockeyDeterministic-v4\n", "IceHockeyNoFrameskip-v4\n", "IceHockey-ram-v0\n", "IceHockey-ramDeterministic-v0\n", "IceHockey-ramNoFrameskip-v0\n", "IceHockey-ram-v4\n", "IceHockey-ramDeterministic-v4\n", "IceHockey-ramNoFrameskip-v4\n", "Jamesbond-v0\n", "JamesbondDeterministic-v0\n", "JamesbondNoFrameskip-v0\n", "Jamesbond-v4\n", "JamesbondDeterministic-v4\n", "JamesbondNoFrameskip-v4\n", "Jamesbond-ram-v0\n", "Jamesbond-ramDeterministic-v0\n", "Jamesbond-ramNoFrameskip-v0\n", "Jamesbond-ram-v4\n", "Jamesbond-ramDeterministic-v4\n", "Jamesbond-ramNoFrameskip-v4\n", "JourneyEscape-v0\n", "JourneyEscapeDeterministic-v0\n", "JourneyEscapeNoFrameskip-v0\n", 
"JourneyEscape-v4\n", "JourneyEscapeDeterministic-v4\n", "JourneyEscapeNoFrameskip-v4\n", "JourneyEscape-ram-v0\n", "JourneyEscape-ramDeterministic-v0\n", "JourneyEscape-ramNoFrameskip-v0\n", "JourneyEscape-ram-v4\n", "JourneyEscape-ramDeterministic-v4\n", "JourneyEscape-ramNoFrameskip-v4\n", "Kangaroo-v0\n", "KangarooDeterministic-v0\n", "KangarooNoFrameskip-v0\n", "Kangaroo-v4\n", "KangarooDeterministic-v4\n", "KangarooNoFrameskip-v4\n", "Kangaroo-ram-v0\n", "Kangaroo-ramDeterministic-v0\n", "Kangaroo-ramNoFrameskip-v0\n", "Kangaroo-ram-v4\n", "Kangaroo-ramDeterministic-v4\n", "Kangaroo-ramNoFrameskip-v4\n", "Krull-v0\n", "KrullDeterministic-v0\n", "KrullNoFrameskip-v0\n", "Krull-v4\n", "KrullDeterministic-v4\n", "KrullNoFrameskip-v4\n", "Krull-ram-v0\n", "Krull-ramDeterministic-v0\n", "Krull-ramNoFrameskip-v0\n", "Krull-ram-v4\n", "Krull-ramDeterministic-v4\n", "Krull-ramNoFrameskip-v4\n", "KungFuMaster-v0\n", "KungFuMasterDeterministic-v0\n", "KungFuMasterNoFrameskip-v0\n", "KungFuMaster-v4\n", "KungFuMasterDeterministic-v4\n", "KungFuMasterNoFrameskip-v4\n", "KungFuMaster-ram-v0\n", "KungFuMaster-ramDeterministic-v0\n", "KungFuMaster-ramNoFrameskip-v0\n", "KungFuMaster-ram-v4\n", "KungFuMaster-ramDeterministic-v4\n", "KungFuMaster-ramNoFrameskip-v4\n", "MontezumaRevenge-v0\n", "MontezumaRevengeDeterministic-v0\n", "MontezumaRevengeNoFrameskip-v0\n", "MontezumaRevenge-v4\n", "MontezumaRevengeDeterministic-v4\n", "MontezumaRevengeNoFrameskip-v4\n", "MontezumaRevenge-ram-v0\n", "MontezumaRevenge-ramDeterministic-v0\n", "MontezumaRevenge-ramNoFrameskip-v0\n", "MontezumaRevenge-ram-v4\n", "MontezumaRevenge-ramDeterministic-v4\n", "MontezumaRevenge-ramNoFrameskip-v4\n", "MsPacman-v0\n", "MsPacmanDeterministic-v0\n", "MsPacmanNoFrameskip-v0\n", "MsPacman-v4\n", "MsPacmanDeterministic-v4\n", "MsPacmanNoFrameskip-v4\n", "MsPacman-ram-v0\n", "MsPacman-ramDeterministic-v0\n", "MsPacman-ramNoFrameskip-v0\n", "MsPacman-ram-v4\n", "MsPacman-ramDeterministic-v4\n", 
"MsPacman-ramNoFrameskip-v4\n", "NameThisGame-v0\n", "NameThisGameDeterministic-v0\n", "NameThisGameNoFrameskip-v0\n", "NameThisGame-v4\n", "NameThisGameDeterministic-v4\n", "NameThisGameNoFrameskip-v4\n", "NameThisGame-ram-v0\n", "NameThisGame-ramDeterministic-v0\n", "NameThisGame-ramNoFrameskip-v0\n", "NameThisGame-ram-v4\n", "NameThisGame-ramDeterministic-v4\n", "NameThisGame-ramNoFrameskip-v4\n", "Phoenix-v0\n", "PhoenixDeterministic-v0\n", "PhoenixNoFrameskip-v0\n", "Phoenix-v4\n", "PhoenixDeterministic-v4\n", "PhoenixNoFrameskip-v4\n", "Phoenix-ram-v0\n", "Phoenix-ramDeterministic-v0\n", "Phoenix-ramNoFrameskip-v0\n", "Phoenix-ram-v4\n", "Phoenix-ramDeterministic-v4\n", "Phoenix-ramNoFrameskip-v4\n", "Pitfall-v0\n", "PitfallDeterministic-v0\n", "PitfallNoFrameskip-v0\n", "Pitfall-v4\n", "PitfallDeterministic-v4\n", "PitfallNoFrameskip-v4\n", "Pitfall-ram-v0\n", "Pitfall-ramDeterministic-v0\n", "Pitfall-ramNoFrameskip-v0\n", "Pitfall-ram-v4\n", "Pitfall-ramDeterministic-v4\n", "Pitfall-ramNoFrameskip-v4\n", "Pong-v0\n", "PongDeterministic-v0\n", "PongNoFrameskip-v0\n", "Pong-v4\n", "PongDeterministic-v4\n", "PongNoFrameskip-v4\n", "Pong-ram-v0\n", "Pong-ramDeterministic-v0\n", "Pong-ramNoFrameskip-v0\n", "Pong-ram-v4\n", "Pong-ramDeterministic-v4\n", "Pong-ramNoFrameskip-v4\n", "Pooyan-v0\n", "PooyanDeterministic-v0\n", "PooyanNoFrameskip-v0\n", "Pooyan-v4\n", "PooyanDeterministic-v4\n", "PooyanNoFrameskip-v4\n", "Pooyan-ram-v0\n", "Pooyan-ramDeterministic-v0\n", "Pooyan-ramNoFrameskip-v0\n", "Pooyan-ram-v4\n", "Pooyan-ramDeterministic-v4\n", "Pooyan-ramNoFrameskip-v4\n", "PrivateEye-v0\n", "PrivateEyeDeterministic-v0\n", "PrivateEyeNoFrameskip-v0\n", "PrivateEye-v4\n", "PrivateEyeDeterministic-v4\n", "PrivateEyeNoFrameskip-v4\n", "PrivateEye-ram-v0\n", "PrivateEye-ramDeterministic-v0\n", "PrivateEye-ramNoFrameskip-v0\n", "PrivateEye-ram-v4\n", "PrivateEye-ramDeterministic-v4\n", "PrivateEye-ramNoFrameskip-v4\n", "Qbert-v0\n", "QbertDeterministic-v0\n", 
"QbertNoFrameskip-v0\n", "Qbert-v4\n", "QbertDeterministic-v4\n", "QbertNoFrameskip-v4\n", "Qbert-ram-v0\n", "Qbert-ramDeterministic-v0\n", "Qbert-ramNoFrameskip-v0\n", "Qbert-ram-v4\n", "Qbert-ramDeterministic-v4\n", "Qbert-ramNoFrameskip-v4\n", "Riverraid-v0\n", "RiverraidDeterministic-v0\n", "RiverraidNoFrameskip-v0\n", "Riverraid-v4\n", "RiverraidDeterministic-v4\n", "RiverraidNoFrameskip-v4\n", "Riverraid-ram-v0\n", "Riverraid-ramDeterministic-v0\n", "Riverraid-ramNoFrameskip-v0\n", "Riverraid-ram-v4\n", "Riverraid-ramDeterministic-v4\n", "Riverraid-ramNoFrameskip-v4\n", "RoadRunner-v0\n", "RoadRunnerDeterministic-v0\n", "RoadRunnerNoFrameskip-v0\n", "RoadRunner-v4\n", "RoadRunnerDeterministic-v4\n", "RoadRunnerNoFrameskip-v4\n", "RoadRunner-ram-v0\n", "RoadRunner-ramDeterministic-v0\n", "RoadRunner-ramNoFrameskip-v0\n", "RoadRunner-ram-v4\n", "RoadRunner-ramDeterministic-v4\n", "RoadRunner-ramNoFrameskip-v4\n", "Robotank-v0\n", "RobotankDeterministic-v0\n", "RobotankNoFrameskip-v0\n", "Robotank-v4\n", "RobotankDeterministic-v4\n", "RobotankNoFrameskip-v4\n", "Robotank-ram-v0\n", "Robotank-ramDeterministic-v0\n", "Robotank-ramNoFrameskip-v0\n", "Robotank-ram-v4\n", "Robotank-ramDeterministic-v4\n", "Robotank-ramNoFrameskip-v4\n", "Seaquest-v0\n", "SeaquestDeterministic-v0\n", "SeaquestNoFrameskip-v0\n", "Seaquest-v4\n", "SeaquestDeterministic-v4\n", "SeaquestNoFrameskip-v4\n", "Seaquest-ram-v0\n", "Seaquest-ramDeterministic-v0\n", "Seaquest-ramNoFrameskip-v0\n", "Seaquest-ram-v4\n", "Seaquest-ramDeterministic-v4\n", "Seaquest-ramNoFrameskip-v4\n", "Skiing-v0\n", "SkiingDeterministic-v0\n", "SkiingNoFrameskip-v0\n", "Skiing-v4\n", "SkiingDeterministic-v4\n", "SkiingNoFrameskip-v4\n", "Skiing-ram-v0\n", "Skiing-ramDeterministic-v0\n", "Skiing-ramNoFrameskip-v0\n", "Skiing-ram-v4\n", "Skiing-ramDeterministic-v4\n", "Skiing-ramNoFrameskip-v4\n", "Solaris-v0\n", "SolarisDeterministic-v0\n", "SolarisNoFrameskip-v0\n", "Solaris-v4\n", "SolarisDeterministic-v4\n", 
"SolarisNoFrameskip-v4\n", "Solaris-ram-v0\n", "Solaris-ramDeterministic-v0\n", "Solaris-ramNoFrameskip-v0\n", "Solaris-ram-v4\n", "Solaris-ramDeterministic-v4\n", "Solaris-ramNoFrameskip-v4\n", "SpaceInvaders-v0\n", "SpaceInvadersDeterministic-v0\n", "SpaceInvadersNoFrameskip-v0\n", "SpaceInvaders-v4\n", "SpaceInvadersDeterministic-v4\n", "SpaceInvadersNoFrameskip-v4\n", "SpaceInvaders-ram-v0\n", "SpaceInvaders-ramDeterministic-v0\n", "SpaceInvaders-ramNoFrameskip-v0\n", "SpaceInvaders-ram-v4\n", "SpaceInvaders-ramDeterministic-v4\n", "SpaceInvaders-ramNoFrameskip-v4\n", "StarGunner-v0\n", "StarGunnerDeterministic-v0\n", "StarGunnerNoFrameskip-v0\n", "StarGunner-v4\n", "StarGunnerDeterministic-v4\n", "StarGunnerNoFrameskip-v4\n", "StarGunner-ram-v0\n", "StarGunner-ramDeterministic-v0\n", "StarGunner-ramNoFrameskip-v0\n", "StarGunner-ram-v4\n", "StarGunner-ramDeterministic-v4\n", "StarGunner-ramNoFrameskip-v4\n", "Tennis-v0\n", "TennisDeterministic-v0\n", "TennisNoFrameskip-v0\n", "Tennis-v4\n", "TennisDeterministic-v4\n", "TennisNoFrameskip-v4\n", "Tennis-ram-v0\n", "Tennis-ramDeterministic-v0\n", "Tennis-ramNoFrameskip-v0\n", "Tennis-ram-v4\n", "Tennis-ramDeterministic-v4\n", "Tennis-ramNoFrameskip-v4\n", "TimePilot-v0\n", "TimePilotDeterministic-v0\n", "TimePilotNoFrameskip-v0\n", "TimePilot-v4\n", "TimePilotDeterministic-v4\n", "TimePilotNoFrameskip-v4\n", "TimePilot-ram-v0\n", "TimePilot-ramDeterministic-v0\n", "TimePilot-ramNoFrameskip-v0\n", "TimePilot-ram-v4\n", "TimePilot-ramDeterministic-v4\n", "TimePilot-ramNoFrameskip-v4\n", "Tutankham-v0\n", "TutankhamDeterministic-v0\n", "TutankhamNoFrameskip-v0\n", "Tutankham-v4\n", "TutankhamDeterministic-v4\n", "TutankhamNoFrameskip-v4\n", "Tutankham-ram-v0\n", "Tutankham-ramDeterministic-v0\n", "Tutankham-ramNoFrameskip-v0\n", "Tutankham-ram-v4\n", "Tutankham-ramDeterministic-v4\n", "Tutankham-ramNoFrameskip-v4\n", "UpNDown-v0\n", "UpNDownDeterministic-v0\n", "UpNDownNoFrameskip-v0\n", "UpNDown-v4\n", 
"UpNDownDeterministic-v4\n", "UpNDownNoFrameskip-v4\n", "UpNDown-ram-v0\n", "UpNDown-ramDeterministic-v0\n", "UpNDown-ramNoFrameskip-v0\n", "UpNDown-ram-v4\n", "UpNDown-ramDeterministic-v4\n", "UpNDown-ramNoFrameskip-v4\n", "Venture-v0\n", "VentureDeterministic-v0\n", "VentureNoFrameskip-v0\n", "Venture-v4\n", "VentureDeterministic-v4\n", "VentureNoFrameskip-v4\n", "Venture-ram-v0\n", "Venture-ramDeterministic-v0\n", "Venture-ramNoFrameskip-v0\n", "Venture-ram-v4\n", "Venture-ramDeterministic-v4\n", "Venture-ramNoFrameskip-v4\n", "VideoPinball-v0\n", "VideoPinballDeterministic-v0\n", "VideoPinballNoFrameskip-v0\n", "VideoPinball-v4\n", "VideoPinballDeterministic-v4\n", "VideoPinballNoFrameskip-v4\n", "VideoPinball-ram-v0\n", "VideoPinball-ramDeterministic-v0\n", "VideoPinball-ramNoFrameskip-v0\n", "VideoPinball-ram-v4\n", "VideoPinball-ramDeterministic-v4\n", "VideoPinball-ramNoFrameskip-v4\n", "WizardOfWor-v0\n", "WizardOfWorDeterministic-v0\n", "WizardOfWorNoFrameskip-v0\n", "WizardOfWor-v4\n", "WizardOfWorDeterministic-v4\n", "WizardOfWorNoFrameskip-v4\n", "WizardOfWor-ram-v0\n", "WizardOfWor-ramDeterministic-v0\n", "WizardOfWor-ramNoFrameskip-v0\n", "WizardOfWor-ram-v4\n", "WizardOfWor-ramDeterministic-v4\n", "WizardOfWor-ramNoFrameskip-v4\n", "YarsRevenge-v0\n", "YarsRevengeDeterministic-v0\n", "YarsRevengeNoFrameskip-v0\n", "YarsRevenge-v4\n", "YarsRevengeDeterministic-v4\n", "YarsRevengeNoFrameskip-v4\n", "YarsRevenge-ram-v0\n", "YarsRevenge-ramDeterministic-v0\n", "YarsRevenge-ramNoFrameskip-v0\n", "YarsRevenge-ram-v4\n", "YarsRevenge-ramDeterministic-v4\n", "YarsRevenge-ramNoFrameskip-v4\n", "Zaxxon-v0\n", "ZaxxonDeterministic-v0\n", "ZaxxonNoFrameskip-v0\n", "Zaxxon-v4\n", "ZaxxonDeterministic-v4\n", "ZaxxonNoFrameskip-v4\n", "Zaxxon-ram-v0\n", "Zaxxon-ramDeterministic-v0\n", "Zaxxon-ramNoFrameskip-v0\n", "Zaxxon-ram-v4\n", "Zaxxon-ramDeterministic-v4\n", "Zaxxon-ramNoFrameskip-v4\n", "ALE/Adventure-v5\n", "ALE/Adventure-ram-v5\n", "ALE/AirRaid-v5\n", 
"ALE/AirRaid-ram-v5\n", "ALE/Alien-v5\n", "ALE/Alien-ram-v5\n", "ALE/Amidar-v5\n", "ALE/Amidar-ram-v5\n", "ALE/Assault-v5\n", "ALE/Assault-ram-v5\n", "ALE/Asterix-v5\n", "ALE/Asterix-ram-v5\n", "ALE/Asteroids-v5\n", "ALE/Asteroids-ram-v5\n", "ALE/Atlantis-v5\n", "ALE/Atlantis-ram-v5\n", "ALE/Atlantis2-v5\n", "ALE/Atlantis2-ram-v5\n", "ALE/Backgammon-v5\n", "ALE/Backgammon-ram-v5\n", "ALE/BankHeist-v5\n", "ALE/BankHeist-ram-v5\n", "ALE/BasicMath-v5\n", "ALE/BasicMath-ram-v5\n", "ALE/BattleZone-v5\n", "ALE/BattleZone-ram-v5\n", "ALE/BeamRider-v5\n", "ALE/BeamRider-ram-v5\n", "ALE/Berzerk-v5\n", "ALE/Berzerk-ram-v5\n", "ALE/Blackjack-v5\n", "ALE/Blackjack-ram-v5\n", "ALE/Bowling-v5\n", "ALE/Bowling-ram-v5\n", "ALE/Boxing-v5\n", "ALE/Boxing-ram-v5\n", "ALE/Breakout-v5\n", "ALE/Breakout-ram-v5\n", "ALE/Carnival-v5\n", "ALE/Carnival-ram-v5\n", "ALE/Casino-v5\n", "ALE/Casino-ram-v5\n", "ALE/Centipede-v5\n", "ALE/Centipede-ram-v5\n", "ALE/ChopperCommand-v5\n", "ALE/ChopperCommand-ram-v5\n", "ALE/CrazyClimber-v5\n", "ALE/CrazyClimber-ram-v5\n", "ALE/Crossbow-v5\n", "ALE/Crossbow-ram-v5\n", "ALE/Darkchambers-v5\n", "ALE/Darkchambers-ram-v5\n", "ALE/Defender-v5\n", "ALE/Defender-ram-v5\n", "ALE/DemonAttack-v5\n", "ALE/DemonAttack-ram-v5\n", "ALE/DonkeyKong-v5\n", "ALE/DonkeyKong-ram-v5\n", "ALE/DoubleDunk-v5\n", "ALE/DoubleDunk-ram-v5\n", "ALE/Earthworld-v5\n", "ALE/Earthworld-ram-v5\n", "ALE/ElevatorAction-v5\n", "ALE/ElevatorAction-ram-v5\n", "ALE/Enduro-v5\n", "ALE/Enduro-ram-v5\n", "ALE/Entombed-v5\n", "ALE/Entombed-ram-v5\n", "ALE/Et-v5\n", "ALE/Et-ram-v5\n", "ALE/FishingDerby-v5\n", "ALE/FishingDerby-ram-v5\n", "ALE/FlagCapture-v5\n", "ALE/FlagCapture-ram-v5\n", "ALE/Freeway-v5\n", "ALE/Freeway-ram-v5\n", "ALE/Frogger-v5\n", "ALE/Frogger-ram-v5\n", "ALE/Frostbite-v5\n", "ALE/Frostbite-ram-v5\n", "ALE/Galaxian-v5\n", "ALE/Galaxian-ram-v5\n", "ALE/Gopher-v5\n", "ALE/Gopher-ram-v5\n", "ALE/Gravitar-v5\n", "ALE/Gravitar-ram-v5\n", "ALE/Hangman-v5\n", "ALE/Hangman-ram-v5\n", 
"ALE/HauntedHouse-v5\n", "ALE/HauntedHouse-ram-v5\n", "ALE/Hero-v5\n", "ALE/Hero-ram-v5\n", "ALE/HumanCannonball-v5\n", "ALE/HumanCannonball-ram-v5\n", "ALE/IceHockey-v5\n", "ALE/IceHockey-ram-v5\n", "ALE/Jamesbond-v5\n", "ALE/Jamesbond-ram-v5\n", "ALE/JourneyEscape-v5\n", "ALE/JourneyEscape-ram-v5\n", "ALE/Kaboom-v5\n", "ALE/Kaboom-ram-v5\n", "ALE/Kangaroo-v5\n", "ALE/Kangaroo-ram-v5\n", "ALE/KeystoneKapers-v5\n", "ALE/KeystoneKapers-ram-v5\n", "ALE/KingKong-v5\n", "ALE/KingKong-ram-v5\n", "ALE/Klax-v5\n", "ALE/Klax-ram-v5\n", "ALE/Koolaid-v5\n", "ALE/Koolaid-ram-v5\n", "ALE/Krull-v5\n", "ALE/Krull-ram-v5\n", "ALE/KungFuMaster-v5\n", "ALE/KungFuMaster-ram-v5\n", "ALE/LaserGates-v5\n", "ALE/LaserGates-ram-v5\n", "ALE/LostLuggage-v5\n", "ALE/LostLuggage-ram-v5\n", "ALE/MarioBros-v5\n", "ALE/MarioBros-ram-v5\n", "ALE/MiniatureGolf-v5\n", "ALE/MiniatureGolf-ram-v5\n", "ALE/MontezumaRevenge-v5\n", "ALE/MontezumaRevenge-ram-v5\n", "ALE/MrDo-v5\n", "ALE/MrDo-ram-v5\n", "ALE/MsPacman-v5\n", "ALE/MsPacman-ram-v5\n", "ALE/NameThisGame-v5\n", "ALE/NameThisGame-ram-v5\n", "ALE/Othello-v5\n", "ALE/Othello-ram-v5\n", "ALE/Pacman-v5\n", "ALE/Pacman-ram-v5\n", "ALE/Phoenix-v5\n", "ALE/Phoenix-ram-v5\n", "ALE/Pitfall-v5\n", "ALE/Pitfall-ram-v5\n", "ALE/Pitfall2-v5\n", "ALE/Pitfall2-ram-v5\n", "ALE/Pong-v5\n", "ALE/Pong-ram-v5\n", "ALE/Pooyan-v5\n", "ALE/Pooyan-ram-v5\n", "ALE/PrivateEye-v5\n", "ALE/PrivateEye-ram-v5\n", "ALE/Qbert-v5\n", "ALE/Qbert-ram-v5\n", "ALE/Riverraid-v5\n", "ALE/Riverraid-ram-v5\n", "ALE/RoadRunner-v5\n", "ALE/RoadRunner-ram-v5\n", "ALE/Robotank-v5\n", "ALE/Robotank-ram-v5\n", "ALE/Seaquest-v5\n", "ALE/Seaquest-ram-v5\n", "ALE/SirLancelot-v5\n", "ALE/SirLancelot-ram-v5\n", "ALE/Skiing-v5\n", "ALE/Skiing-ram-v5\n", "ALE/Solaris-v5\n", "ALE/Solaris-ram-v5\n", "ALE/SpaceInvaders-v5\n", "ALE/SpaceInvaders-ram-v5\n", "ALE/SpaceWar-v5\n", "ALE/SpaceWar-ram-v5\n", "ALE/StarGunner-v5\n", "ALE/StarGunner-ram-v5\n", "ALE/Superman-v5\n", "ALE/Superman-ram-v5\n", 
"ALE/Surround-v5\n", "ALE/Surround-ram-v5\n", "ALE/Tennis-v5\n", "ALE/Tennis-ram-v5\n", "ALE/Tetris-v5\n", "ALE/Tetris-ram-v5\n", "ALE/TicTacToe3D-v5\n", "ALE/TicTacToe3D-ram-v5\n", "ALE/TimePilot-v5\n", "ALE/TimePilot-ram-v5\n", "ALE/Trondead-v5\n", "ALE/Trondead-ram-v5\n", "ALE/Turmoil-v5\n", "ALE/Turmoil-ram-v5\n", "ALE/Tutankham-v5\n", "ALE/Tutankham-ram-v5\n", "ALE/UpNDown-v5\n", "ALE/UpNDown-ram-v5\n", "ALE/Venture-v5\n", "ALE/Venture-ram-v5\n", "ALE/VideoCheckers-v5\n", "ALE/VideoCheckers-ram-v5\n", "ALE/VideoChess-v5\n", "ALE/VideoChess-ram-v5\n", "ALE/VideoCube-v5\n", "ALE/VideoCube-ram-v5\n", "ALE/VideoPinball-v5\n", "ALE/VideoPinball-ram-v5\n", "ALE/WizardOfWor-v5\n", "ALE/WizardOfWor-ram-v5\n", "ALE/WordZapper-v5\n", "ALE/WordZapper-ram-v5\n", "ALE/YarsRevenge-v5\n", "ALE/YarsRevenge-ram-v5\n", "ALE/Zaxxon-v5\n", "ALE/Zaxxon-ram-v5\n" ] } ], "source": [ "for env in gym.envs.registry.items():\n", " print(env[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `render_mode` argument defines how you will see the environment:\n", "\n", "* `None` (default): allows to train a DRL algorithm without wasting computational resources rendering it.\n", "* `rgb_array_list`: allows to get numpy arrays corresponding to each frame. Will be useful when generating videos.\n", "* `ansi`: string representation of each state. Only available for the \"Toy text\" environments.\n", "* `human`: graphical window displaying the environment live." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main interest of gym(nasium) is that all problems have a common interface defined by the class `gym.Env`. 
There are only three methods that have to be used when interacting with an environment:\n", "\n", "* `state, info = env.reset()` restarts the environment and returns an initial state $s_0$.\n", "\n", "* `state, reward, terminal, truncated, info = env.step(action)` takes an action $a_t$ and returns:\n", "    * the new state $s_{t+1}$, \n", "    * the reward $r_{t+1}$, \n", "    * two boolean flags indicating whether the new state is terminal (won/lost) or the episode was truncated (timeout),\n", "    * a dictionary containing additional info for debugging (you can ignore it most of the time).\n", "\n", "* `env.render()` displays the current state of the MDP. When the render mode is set to `rgb_array_list` or `human`, it does not even have to be called explicitly at each step (since gym 0.25)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this interface, we can interact with the environment in a standardized way:\n", "\n", "* We first create the environment.\n", "* For a fixed number of episodes:\n", "    * We pick an initial state with `reset()`.\n", "    * Until the episode ends (terminal state or truncation):\n", "        * We select an action using our RL algorithm or randomly.\n", "        * We take that action (`step()`), observe the new state and the reward.\n", "        * We move to the new state.\n", "\n", "The following cell shows how to interact with the CartPole environment using a random policy. Note that it will only work on your computer, not in Colab." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "env = gym.make('CartPole-v1', render_mode=\"human\")\n", "\n", "for episode in range(10):\n", " state, info = env.reset()\n", " done = False\n", " while not done:\n", " # Select an action randomly\n", " action = env.action_space.sample()\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = env.step(action)\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", "\n", "env.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On colab (or whenever you want to record videos of the episodes instead of watching them live), you need to create the environment with the rendering mode `rgb_array_list`. \n", "\n", "You then create a `GymRecorder` object (defined in the first cell of this notebook). \n", "\n", "```python\n", "recorder = GymRecorder(env)\n", "```\n", "\n", "At the end of each episode, you tell the recorder to record all frames generated during the episode. The frames returned by `env.render()` are (width, height, 3) numpy arrays which are accumulated by the environment during the episode and flushed when `env.reset()` is called.\n", "\n", "```python\n", "recorder.record(env.render())\n", "```\n", "\n", "You can then generate a gif at the end of the simulation with:\n", "\n", "```python\n", "recorder.make_video('videos/CartPole-v1.gif')\n", "```\n", "\n", "Finally, you can render the gif in the notebook by calling **at the very last line of the cell**:\n", "\n", "```python\n", "ipython_display('videos/CartPole-v1.gif')\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MoviePy - Building file videos/CartPole-v1.gif with imageio.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "env = gym.make('CartPole-v1', render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "for episode in range(10):\n", " state, info = env.reset()\n", "\n", " done = False\n", " while not done:\n", " # Select an action randomly\n", " action = env.action_space.sample()\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = env.step(action)\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", "\n", " # Record at the end of the episode \n", " recorder.record(env.render())\n", "\n", "recorder.make_video('videos/CartPole-v1.gif')\n", "ipython_display('videos/CartPole-v1.gif', autoplay=1, loop=0)" ] }, { "cell_type": "markdown", "metadata": { "id": "WENOr5atoGr1" }, "source": [ "Each environment defines its state space (`env.observation_space`) and action space (`env.action_space`). \n", "\n", "State and action spaces can either be :\n", "\n", "* discrete (`gym.spaces.Discrete(nb_states)`), with states being an integer between 0 and `nb_states` -1.\n", "\n", "* feature-based (`gym.spaces.Box(low=0, high=255, shape=(SCREEN_HEIGHT, SCREEN_WIDTH, 3))`) for pixel frames.\n", "\n", "* continuous. 
Example for two joints of a robotic arm limited between -180 and 180 degrees:\n", "\n", "```python\n", "gym.spaces.Box(-180.0, 180.0, (2, ))\n", "```\n", "\n", "You can sample a state or action randomly from these spaces:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[107.88474 76.81408]\n" ] } ], "source": [ "action_space = gym.spaces.Box(-180.0, 180.0, (2, ))\n", "action = action_space.sample()\n", "print(action)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Sampling the action space is particularly useful for exploration. We use it here to perform random (but valid) actions:\n", "\n", "```python\n", "action = env.action_space.sample()\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a method `random_interaction(env, number_episodes, recorder=None)` that takes as arguments:\n", "\n", "* The environment.\n", "* The number of episodes to be performed.\n", "* An optional `GymRecorder` object that records the frames of the environment if it is not None (`if recorder is not None:`). Otherwise, do nothing.\n", "\n", "The method should return the list of undiscounted returns ($\\gamma=1$, i.e. just the sum of rewards obtained during each episode) for all episodes." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def random_interaction(env, number_episodes, recorder=None):\n", "\n", " returns = []\n", "\n", " # Sample episodes\n", " for episode in range(number_episodes):\n", "\n", " # Sample the initial state\n", " state, info = env.reset()\n", "\n", " return_episode = 0.0\n", " done = False\n", " while not done:\n", "\n", " # Select an action randomly\n", " action = env.action_space.sample()\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = env.step(action)\n", "\n", " # Update the return\n", " return_episode += reward\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", "\n", " # Record at the end of the episode\n", " if recorder is not None:\n", " recorder.record(env.render())\n", "\n", " returns.append(return_episode)\n", "\n", " return returns\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Use that method to visualize all the available simple environments for a few episodes:\n", "\n", "* CartPole-v1\n", "* MountainCar-v0\n", "* Pendulum-v1\n", "* Acrobot-v1\n", "* LunarLander-v2\n", "* BipedalWalker-v3\n", "* CarRacing-v2\n", "* Blackjack-v1\n", "* FrozenLake-v1\n", "* CliffWalking-v0\n", "* Taxi-v3\n", "\n", "If you do many episodes (CarRacing or Taxi have very long episodes with a random policy), plot the obtained returns to see how they vary. \n", "\n", "If you managed to install the mujoco and atari dependencies, feel free to visualize them too. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "MoviePy - Building file videos/CartPole-v1.gif with imageio.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "envname = 'CartPole-v1'\n", "env = gym.make(envname, render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "returns = random_interaction(env, 10, recorder)\n", "\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(returns)\n", "plt.xlabel(\"Episodes\")\n", "plt.ylabel(\"Return\")\n", "plt.show()\n", "\n", "video = \"videos/\" + envname + \".gif\"\n", "recorder.make_video(video)\n", "ipython_display(video)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MoviePy - Building file videos/CarRacing-v2.gif with imageio.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "envname = 'CarRacing-v2'\n", "env = gym.make(envname, render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "returns = random_interaction(env, 1, recorder)\n", "\n", "video = \"videos/\" + envname + \".gif\"\n", "recorder.make_video(video)\n", "ipython_display(video)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MoviePy - Building file videos/Taxi-v3.gif with imageio.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "envname = 'Taxi-v3'\n", "env = gym.make(envname, render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "returns = random_interaction(env, 1, recorder)\n", "\n", "video = \"videos/\" + envname + \".gif\"\n", "recorder.make_video(video)\n", "ipython_display(video)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating your own environment\n", "\n", "### Random environment\n", "\n", "You can create your own environment using the gym interface:\n", "\n", "\n", "\n", "Here is an example of a dummy environment with discrete states and actions, where the transition probabilities and rewards are completely random:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "B5u7Z8tjoGr3" }, "outputs": [], "source": [ "class RandomEnv(gym.Env):\n", " \"Random discrete environment that does nothing.\"\n", " \n", " metadata = {\"render_modes\": [\"ansi\"], \"render_fps\": 1}\n", "\n", " def __init__(self, nb_states, nb_actions, max_episode_steps=10, render_mode=\"ansi\"):\n", "\n", " self.nb_states = nb_states\n", " self.nb_actions = nb_actions\n", " self.max_episode_steps = max_episode_steps\n", " self.render_mode = render_mode\n", "\n", " # State space, can be discrete or continuous.\n", " self.observation_space = gym.spaces.Discrete(nb_states)\n", " \n", " # Action space, can be discrete or continuous.\n", " self.action_space = gym.spaces.Discrete(nb_actions) \n", "\n", " # Reset\n", " self.reset()\n", "\n", "\n", " def reset(self, seed=None, options=None):\n", "\n", " # Re-initialize time\n", " self.current_step = 0\n", " \n", " # Sample one state randomly \n", " self.state = self.observation_space.sample()\n", " \n", " return self.state, info\n", "\n", " def step(self, action):\n", "\n", " # Random transition to another state\n", " self.state = self.observation_space.sample() \n", " \n", " # Random 
reward\n", " reward = np.random.uniform(0, 1, 1)[0] \n", " \n", " # Terminate the episode after 10 steps\n", " terminal = False \n", " truncated = False\n", "\n", " self.current_step +=1\n", " if self.current_step % self.max_episode_steps == 0:\n", " truncated = True \n", "\n", " info = {} # No info\n", "\n", " return self.state, reward, terminal, truncated, info\n", "\n", "\n", " def render(self):\n", " if self.render_mode == \"ansi\":\n", " description = \"Step \" + str(self.current_step) + \": state \" + str(self.state)\n", " return description\n", " return None\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The different methods should be quite self-explanatory.\n", "\n", "`metadata` defines which render modes are available for this environment (here only the text mode \"ansi\").\n", "\n", "The constructor accepts the size of the state and action spaces as arguments, the duration of the episode and the render mode. \n", "\n", "`reset()` samples an initial state randomly.\n", "\n", "`step()` ignores the action, samples a new state and a reward, and truncates an episode after `max_episode_steps`.\n", "\n", "`render()` returns a string with the current state.\n", "\n", "**Q:** Interact with the random environment for a couple of episodes.\n", "\n", "As the mode is `ansi` (text-based), you will need to print the string returned by `render()` after each step:\n", "\n", "```python\n", "while not done:\n", "\n", " action = env.action_space.sample()\n", " \n", " next_state, reward, terminal, truncated, info = env.step(action)\n", "\n", " print(env.render())\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Episode 0\n", "Step 0: state 9\n", "Step 1: state 8\n", "Step 2: state 3\n", "Step 3: state 0\n", "Step 4: state 5\n", "Step 5: state 4\n", "Step 6: state 3\n", "Step 7: state 0\n", "Step 8: state 6\n", "Step 9: state 8\n", "Step 10: state 4\n", "Return of 
the episode: 5.007583939302424\n", "----------\n", "Episode 1\n", "Step 0: state 6\n", "Step 1: state 4\n", "Step 2: state 0\n", "Step 3: state 4\n", "Step 4: state 4\n", "Step 5: state 5\n", "Step 6: state 2\n", "Step 7: state 6\n", "Step 8: state 2\n", "Step 9: state 2\n", "Step 10: state 2\n", "Return of the episode: 3.3300428387017016\n", "----------\n" ] } ], "source": [ "# Create the environment\n", "env = RandomEnv(nb_states=10, nb_actions=4)\n", "\n", "# Sample episodes\n", "for episode in range(2):\n", "\n", "    print(\"Episode\", episode)\n", "\n", "    # Sample the initial state\n", "    state, info = env.reset()\n", "\n", "    # Render the initial state\n", "    print(env.render())\n", "\n", "    # Episode\n", "    return_episode = 0.0\n", "    done = False\n", "    while not done:\n", "        # Select an action randomly\n", "        action = env.action_space.sample()\n", "\n", "        # Sample a single transition\n", "        next_state, reward, terminal, truncated, info = env.step(action)\n", "\n", "        # Go in the next state\n", "        state = next_state\n", "\n", "        # Update return\n", "        return_episode += reward\n", "\n", "        # Render the current state\n", "        print(env.render())\n", "\n", "        # End of the episode\n", "        done = terminal or truncated\n", "\n", "    print(\"Return of the episode:\", return_episode)\n", "    print('-'*10)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "m7JjS7uwoGse" }, "source": [ "### Recycling robot\n", "\n", "**Q:** Create a `RecyclingRobot` gym-like environment using last week's exercise.\n", "\n", "The parameters `alpha`, `beta`, `r_wait` and `r_search` should be passed to the constructor of the environment and saved as attributes.\n", "\n", "The state space is discrete, with two states `high` and `low` which will have indices 0 and 1. 
The three discrete actions `search`, `wait` and `recharge` have indices 0, 1, and 2.\n", "\n", "The initial state of the MDP (`reset()`) should be the high state.\n", "\n", "The `step()` method should generate transitions according to the dynamics of the MDP. Depending on the current state and the chosen action, make a transition to another state. For the actions `search` and `wait`, sample the reward from the normal distribution with mean `r_search` (resp. `r_wait`) and standard deviation 0.5. \n", "\n", "If the random agent selects `recharge` in `high`, do nothing (next state is high, reward is 0).\n", "\n", "Rendering is just printing the current state. There is nothing to close, so you do not even need to redefine the function.\n", "\n", "Although the recycling robot is a continuing task, limit the number of steps per episode to 10, as in the previous random environment.\n", "\n", "Interact randomly with the MDP for several episodes and observe the returns. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "HRhkICMBoGsf" }, "outputs": [], "source": [ "class RecyclingRobot(gym.Env):\n", "    \"Recycling robot environment.\"\n", "\n", "    metadata = {\"render_modes\": [\"ansi\"], \"render_fps\": 1}\n", "\n", "    def __init__(self, alpha, beta, r_search, r_wait, max_episode_steps=10, render_mode=\"ansi\"):\n", "\n", "        # Store parameters\n", "        self.alpha = alpha\n", "        self.beta = beta\n", "        self.r_search = r_search\n", "        self.r_wait = r_wait\n", "        self.max_episode_steps = max_episode_steps\n", "        self.render_mode = render_mode\n", "\n", "        # State space, can be discrete or continuous.\n", "        self.observation_space = gym.spaces.Discrete(2)\n", "        self.states = ['high', 'low']\n", "\n", "        # Action space, can be discrete or continuous.\n", "        self.action_space = gym.spaces.Discrete(3)\n", "        self.actions = ['search', 'wait', 'recharge']\n", "\n", "        # Reset\n", "        self.reset()\n", "\n", "    def reset(self, seed=None, options=None):\n", "\n", "        # Re-initialize 
time\n", " self.current_step = 0\n", "\n", " # Start in the high state\n", " self.state = 0\n", " \n", " return self.state, {}\n", " \n", " def step(self, action):\n", " \n", " if self.state == 0: # high\n", " if action == 0: # search\n", " p = np.random.rand()\n", " if p < self.alpha:\n", " self.state = 0 # high\n", " else:\n", " self.state = 1 # low\n", " self.reward = float(np.random.normal(self.r_search, 0.5, 1))\n", " elif action == 1: # wait\n", " self.state = 0 # high\n", " self.reward = float(np.random.normal(self.r_wait, 0.5, 1))\n", " elif action == 2: # recharge\n", " self.state = 0 # high\n", " self.reward = 0.0\n", " \n", " elif self.state == 1: # low\n", " if action == 0: # search\n", " p = np.random.rand()\n", " if p < self.beta:\n", " self.state = 1 # low\n", " self.reward = float(np.random.normal(self.r_search, 0.5, 1))\n", " else:\n", " self.state = 0 # high\n", " self.reward = -3.0\n", " elif action == 1: # wait\n", " self.state = 1 # low\n", " self.reward = float(np.random.normal(self.r_wait, 0.5, 1))\n", " elif action == 2: # recharge\n", " self.state = 0 # high\n", " self.reward = 0.0\n", " \n", " terminal = False\n", " truncated = False\n", " self.current_step +=1\n", " if self.current_step % self.max_episode_steps == 0:\n", " truncated = True \n", "\n", " info = {} # No info\n", "\n", " return self.state, self.reward, terminal, truncated, info\n", "\n", " def render(self):\n", " \n", " if self.render_mode == \"ansi\":\n", " description = \"Step \" + str(self.current_step) + \": state \" + self.states[self.state]\n", " return description\n", " \n", " return None" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "KmYJr3nVoGsl" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Episode: 0\n", "Step 0: state high\n", "high + search -> low : 5.824371578099047\n", "Step 1: state low\n", "low + wait -> low : 2.9450113775797497\n", "Step 2: state low\n", "low + recharge -> high : 0.0\n", "Step 3: state 
high\n", "high + wait -> high : 1.19425009588391\n", "Step 4: state high\n", "high + wait -> high : 1.1899989446304127\n", "Step 5: state high\n", "high + wait -> high : 1.9428395745820943\n", "Step 6: state high\n", "high + recharge -> high : 0.0\n", "Step 7: state high\n", "high + wait -> high : 2.3322524158366855\n", "Step 8: state high\n", "high + search -> low : 5.892699540496995\n", "Step 9: state low\n", "low + wait -> low : 1.4083503027532602\n", "Step 10: state low\n", "Return of the episode: 22.729773829862154\n", "----------\n", "Episode: 1\n", "Step 0: state high\n", "high + search -> low : 5.437227719900913\n", "Step 1: state low\n", "low + wait -> low : 1.845777038847942\n", "Step 2: state low\n", "low + wait -> low : 2.489897166433866\n", "Step 3: state low\n", "low + wait -> low : 2.4018187411534972\n", "Step 4: state low\n", "low + recharge -> high : 0.0\n", "Step 5: state high\n", "high + wait -> high : 2.402843421805779\n", "Step 6: state high\n", "high + wait -> high : 1.8639466909583808\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + recharge -> high : 0.0\n", "Step 9: state high\n", "high + search -> low : 5.871762172601576\n", "Step 10: state low\n", "Return of the episode: 22.313272951701954\n", "----------\n", "Episode: 2\n", "Step 0: state high\n", "high + search -> low : 5.653305608734461\n", "Step 1: state low\n", "low + wait -> low : 1.260637353066509\n", "Step 2: state low\n", "low + recharge -> high : 0.0\n", "Step 3: state high\n", "high + recharge -> high : 0.0\n", "Step 4: state high\n", "high + recharge -> high : 0.0\n", "Step 5: state high\n", "high + search -> low : 6.179745460703129\n", "Step 6: state low\n", "low + search -> high : -3.0\n", "Step 7: state high\n", "high + wait -> high : 1.7318126590102751\n", "Step 8: state high\n", "high + search -> low : 6.189826308355821\n", "Step 9: state low\n", "low + recharge -> high : 0.0\n", "Step 10: state high\n", "Return of the 
episode: 18.015327389870194\n", "----------\n", "Episode: 3\n", "Step 0: state high\n", "high + wait -> high : 0.3116973832342118\n", "Step 1: state high\n", "high + wait -> high : 1.5890528296212245\n", "Step 2: state high\n", "high + recharge -> high : 0.0\n", "Step 3: state high\n", "high + search -> low : 6.1770715710684865\n", "Step 4: state low\n", "low + wait -> low : 1.9950025916418077\n", "Step 5: state low\n", "low + wait -> low : 2.288732539726356\n", "Step 6: state low\n", "low + search -> high : -3.0\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + search -> low : 6.114749510159829\n", "Step 9: state low\n", "low + recharge -> high : 0.0\n", "Step 10: state high\n", "Return of the episode: 15.476306425451915\n", "----------\n", "Episode: 4\n", "Step 0: state high\n", "high + search -> high : 5.677653252270961\n", "Step 1: state high\n", "high + search -> low : 6.270961762981069\n", "Step 2: state low\n", "low + search -> high : -3.0\n", "Step 3: state high\n", "high + recharge -> high : 0.0\n", "Step 4: state high\n", "high + recharge -> high : 0.0\n", "Step 5: state high\n", "high + recharge -> high : 0.0\n", "Step 6: state high\n", "high + wait -> high : 2.167628510196825\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + recharge -> high : 0.0\n", "Step 9: state high\n", "high + search -> high : 6.00958550093353\n", "Step 10: state high\n", "Return of the episode: 17.125829026382384\n", "----------\n", "Episode: 5\n", "Step 0: state high\n", "high + recharge -> high : 0.0\n", "Step 1: state high\n", "high + wait -> high : 1.0072443461744696\n", "Step 2: state high\n", "high + wait -> high : 1.1770716510613748\n", "Step 3: state high\n", "high + recharge -> high : 0.0\n", "Step 4: state high\n", "high + search -> high : 6.498953899017299\n", "Step 5: state high\n", "high + search -> high : 6.084695387366945\n", "Step 6: state high\n", "high + wait -> high : 
2.523957175613759\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + wait -> high : 2.2205143495052986\n", "Step 9: state high\n", "high + wait -> high : 1.565489159288247\n", "Step 10: state high\n", "Return of the episode: 21.077925968027394\n", "----------\n", "Episode: 6\n", "Step 0: state high\n", "high + wait -> high : 2.909574962101888\n", "Step 1: state high\n", "high + recharge -> high : 0.0\n", "Step 2: state high\n", "high + wait -> high : 1.5723174882068984\n", "Step 3: state high\n", "high + recharge -> high : 0.0\n", "Step 4: state high\n", "high + wait -> high : 1.4266076247041248\n", "Step 5: state high\n", "high + wait -> high : 2.5804443044435845\n", "Step 6: state high\n", "high + recharge -> high : 0.0\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + wait -> high : 0.9842375744951053\n", "Step 9: state high\n", "high + recharge -> high : 0.0\n", "Step 10: state high\n", "Return of the episode: 9.4731819539516\n", "----------\n", "Episode: 7\n", "Step 0: state high\n", "high + wait -> high : 1.3162358550374915\n", "Step 1: state high\n", "high + search -> high : 5.6548000306144\n", "Step 2: state high\n", "high + wait -> high : 2.263322213479556\n", "Step 3: state high\n", "high + wait -> high : 1.3881496262912796\n", "Step 4: state high\n", "high + search -> high : 6.636954022870793\n", "Step 5: state high\n", "high + search -> high : 6.610354453744501\n", "Step 6: state high\n", "high + search -> high : 4.750619854167663\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + search -> high : 5.5069516395763545\n", "Step 9: state high\n", "high + search -> high : 5.805846615859272\n", "Step 10: state high\n", "Return of the episode: 39.93323431164131\n", "----------\n", "Episode: 8\n", "Step 0: state high\n", "high + search -> high : 5.695885508830509\n", "Step 1: state high\n", "high + recharge -> high : 0.0\n", 
"Step 2: state high\n", "high + wait -> high : 2.3396804906667272\n", "Step 3: state high\n", "high + recharge -> high : 0.0\n", "Step 4: state high\n", "high + wait -> high : 2.0632140879809655\n", "Step 5: state high\n", "high + wait -> high : 2.207471429939306\n", "Step 6: state high\n", "high + wait -> high : 1.1333787694607529\n", "Step 7: state high\n", "high + recharge -> high : 0.0\n", "Step 8: state high\n", "high + search -> low : 6.595384487237137\n", "Step 9: state low\n", "low + wait -> low : 1.1887366998910425\n", "Step 10: state low\n", "Return of the episode: 21.223751474006438\n", "----------\n", "Episode: 9\n", "Step 0: state high\n", "high + search -> low : 6.317901393973506\n", "Step 1: state low\n", "low + search -> high : -3.0\n", "Step 2: state high\n", "high + recharge -> high : 0.0\n", "Step 3: state high\n", "high + search -> low : 6.372994582022667\n", "Step 4: state low\n", "low + wait -> low : 2.0416319726753738\n", "Step 5: state low\n", "low + wait -> low : 2.6438194473841232\n", "Step 6: state low\n", "low + search -> high : -3.0\n", "Step 7: state high\n", "high + search -> high : 5.970627774061589\n", "Step 8: state high\n", "high + recharge -> high : 0.0\n", "Step 9: state high\n", "high + recharge -> high : 0.0\n", "Step 10: state high\n", "Return of the episode: 17.34697517011726\n", "----------\n" ] } ], "source": [ "# Create the environment\n", "env = RecyclingRobot(alpha=0.3, beta=0.2, r_search=6, r_wait=2)\n", "\n", "# Sample episodes\n", "for episode in range(10):\n", "\n", " print(\"Episode:\", episode)\n", "\n", " # Sample the initial state\n", " state, info = env.reset()\n", " print(env.render())\n", "\n", " return_episode = 0.0\n", " done = False\n", " while not done:\n", "\n", " # Select an action randomly\n", " action = env.action_space.sample()\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = env.step(action)\n", " \n", " print(env.states[state], \"+\", 
env.actions[action], \"->\", env.states[next_state], \":\", reward)\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", " # Update return\n", " return_episode += reward\n", "\n", " # Render the current state\n", " print(env.render())\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", " \n", " print(\"Return of the episode:\", return_episode)\n", " print('-'*10)" ] }, { "cell_type": "markdown", "metadata": { "id": "udFjupHmoGso" }, "source": [ "### Random agent\n", "\n", "To be complete, let's implement the random agent as a class. The class should look like:\n", "\n", "```python\n", "class RandomAgent:\n", " \"\"\"\n", " Random agent exploring uniformly the environment.\n", " \"\"\"\n", " \n", " def __init__(self, env):\n", " self.env = env\n", " \n", " def act(self, state):\n", " \"Returns a random action by sampling the action space.\"\n", " action = # TODO\n", " return action\n", " \n", " def update(self, state, action, reward, next_state):\n", " \"Updates the agent using the transition (s, a, r, s').\"\n", " pass\n", " \n", " def train(self, nb_episodes, render=False):\n", " \"Runs the agent on the environment for nb_episodes. Returns the list of obtained returns.\"\n", " \n", " # List of returns\n", " returns = []\n", "\n", " # TODO\n", " \n", " return returns\n", "```\n", "\n", "The environment is passed to the constructor. `act(state)` should sample a random action. `update(state, action, reward, next_state)` does nothing for the random agent (`pass` is a Python command doing nothing), but we will implement it in the next exercises. \n", "\n", "`train(nb_episodes, render)` implements the interaction loop between the agent and the environment for a fixed number of episodes. It should return the list of obtained returns. `render` defines whether you print the state at each step or not.\n", "\n", "**Q:** Implement the random agent and have it interact with the environment for a fixed number of episodes." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "or3R2eIpoGsp" }, "outputs": [], "source": [ "class RandomAgent:\n", " \"\"\"\n", " Random agent exploring uniformly the environment.\n", " \"\"\"\n", " \n", " def __init__(self, env):\n", " self.env = env\n", " \n", " def act(self, state):\n", " \"Returns a random action by sampling the action space.\"\n", " return self.env.action_space.sample()\n", " \n", " def update(self, state, action, reward, next_state):\n", " \"Updates the agent using the transition (s, a, r, s').\"\n", " pass\n", " \n", " def train(self, nb_episodes, render=False):\n", " \"Runs the agent on the environment for nb_episodes. Returns the list of obtained rewards.\"\n", " # List of returns\n", " returns = []\n", "\n", " for episode in range(nb_episodes):\n", " if render:\n", " print(\"Episode:\", episode)\n", "\n", " # Sample the initial state\n", " state, info = self.env.reset()\n", " if render:\n", " print(self.env.render())\n", "\n", " return_episode = 0.0\n", " done = False\n", " while not done:\n", "\n", " # Select an action randomly\n", " action = self.act(state)\n", " \n", " # Sample a single transition\n", " next_state, reward, terminal, truncated, info = self.env.step(action)\n", " \n", " # Go in the next state\n", " state = next_state\n", "\n", " # Update return\n", " return_episode += reward\n", "\n", " # Render the current state\n", " if render:\n", " print(env.render())\n", "\n", " # End of the episode\n", " done = terminal or truncated\n", " \n", " returns.append(return_episode)\n", "\n", " return returns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "5eN9ChPaoGst" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the environment\n", "env = RecyclingRobot(alpha=0.3, beta=0.2, r_search=6, r_wait=2)\n", "\n", "# Creating the random agent\n", "agent = RandomAgent(env)\n", "\n", "# Train the agent for 10 episodes\n", "returns = agent.train(10)\n", "\n", "\n", "# Plot the rewards\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(returns)\n", "plt.xlabel(\"Steps\")\n", "plt.ylabel(\"Return\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "HM17OH2joGsw" }, "source": [ "That's it! We now \"only\" need to define classes for all the sampling-based RL algorithms (MC, TD, deep RL) and we can interact with any environment with a single line!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mujoco and Atari environments\n", "\n", "Note: both mujoco and atari environments will not work on colab. \n", "\n", "You may have to install non-Python packages on your computer, such as openGL. A lot of debugging in sight...\n", "\n", "The environments should work under Linux and MacOS, but I am not sure about windows. \n", "\n", "### Mujoco\n", "\n", "To install the mujoco environments of gymnasium, this should work:\n", "\n", "```bash\n", "pip install mujoco\n", "pip install gymnasium[mujoco]\n", "```\n", "\n", "Interaction should work as usual. See all environments here: " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MoviePy - Building file videos/Walker2d-v4.gif with imageio.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "envname = 'Walker2d-v4'\n", "env = gym.make(envname, render_mode=\"rgb_array_list\")\n", "recorder = GymRecorder(env)\n", "\n", "returns = random_interaction(env, 10, recorder)\n", "\n", "video = \"videos/\" + envname + \".gif\"\n", "recorder.make_video(video)\n", "ipython_display(video)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Atari\n", "\n", "The atari games are available as binary ROM files, which have to be downloaded separately. The AutoROM package can do that for you: \n", "\n", "```bash\n", "pip install autorom\n", "AutoROM --accept-license\n", "```\n", "\n", "You can then install the atari submodules of gym (in particular ale_py):\n", "\n", "```bash\n", "pip install gymnasium[atari]\n", "```\n", "\n", "Check out the list of Atari games here: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "env = gym.make('ALE/Breakout-v5', render_mode='human')\n", "\n", "returns = random_interaction(env, 1, None)\n", "\n", "env.close()" ] } ], "metadata": { "colab": { "name": "7-Gym-solution.ipynb", "provenance": [] }, "kernelspec": { "display_name": "ANNarchy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 4 }