{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "S0lz4fO49hX-" }, "source": [ "# HW1 Solutions: Introduction to RL \n", "\n", "\n", "This notebook is designed to provide hands-on experience with RL modeling, algorithm implementation, and performance evaluation. Students will explore RL concepts through predefined environments and custom-designed settings.\n", "\n", "Follow the instructions in each section to complete the homework." ] }, { "cell_type": "markdown", "metadata": { "id": "_MVMgXuu9hX_" }, "source": [ "## Setup Instructions\n", "Setting up RL dependencies for the first time may be challenging. Some torch or gymnasium environments (gymnasium is to RL what scikit-learn is to supervised learning!) need additional setup on your system. If you still run into errors after hours of searching and trying, feel free to contact the TAs. Run the following commands to install the dependencies before starting the notebook:\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "PwJlcM67zK6e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)\n", "E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?\n" ] } ], "source": [ "!apt-get install x11-utils > /dev/null 2>&1\n", "!pip install pyglet > /dev/null 2>&1\n", "!apt-get install -y xvfb python-opengl > /dev/null 2>&1\n", "!apt-get install xvfb" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vuzy-9Ky-adf" }, "outputs": [], "source": [ "!pip install pyvirtualdisplay > /dev/null 2>&1\n", "!pip install swig\n", "!pip install stable-baselines3 \"gymnasium[all]\" pygame matplotlib numpy pandas" ] }, { "cell_type": "markdown", "metadata": { "id": "L-wa1iuOt4am" }, "source": [ "To save the game as a **video**, he defined a function (it's okay if you don't understand it yet; just run the code, look at the output, and then try modifying the envs!):" ] }, { 
"cell_type": "code", "execution_count": 8, "metadata": { "id": "bhc87xchyeHl" }, "outputs": [], "source": [ "import logging\n", "from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo\n", "\n", "training_period = 250 # record the agent's episode every 250\n", "num_training_episodes = 1000 # total number of training episodes\n", "\n", "env = gym.make(\"MountainCar-v0\", render_mode=\"rgb_array\")\n", "env = RecordVideo(env, video_folder=\"MountainCar-v0-agent\", name_prefix=\"training\",\n", " episode_trigger=lambda x: x % training_period == 0)\n", "env = RecordEpisodeStatistics(env)\n", "\n", "for episode_num in range(num_training_episodes):\n", " obs, info = env.reset()\n", "\n", " episode_over = False\n", " while not episode_over:\n", " action = env.action_space.sample() # replace with actual agent\n", " obs, reward, terminated, truncated, info = env.step(action)\n", "\n", " episode_over = terminated or truncated\n", "\n", " logging.info(f\"episode-{episode_num}\", info[\"episode\"])\n", "env.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "1ZKGl8Ql35k_" }, "source": [ "The videos are in MountainCar-v0-agent folder of your colab folder." ] }, { "cell_type": "markdown", "metadata": { "id": "n5URzrgeBjA2" }, "source": [ "**Loading saved model**\n", "\n", "After training model using PPO and saving it, Hamid started to load the model with the name he saved in cell above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qHRHFr9-6cgt" }, "outputs": [], "source": [ "model = PPO.load(\"ppo_MountainCar\")\n", "\n", "obs = vec_env.reset()\n", "while True:\n", " action, _states = model.predict(obs)\n", " obs, rewards, dones, info = vec_env.step(action)\n", " print(obs, rewards, dones, info)\n", " if dones[0]:\n", " break" ] }, { "cell_type": "markdown", "metadata": { "id": "n-Jjj_0Q9hX_" }, "source": [ "# **Task 1: Solving Predefined Environments (45 points)**\n", "1.1. 
Choose two environments from the list below, all implemented by other developers and communities, and train RL agents on them using stable-baselines3. Don't forget to check the workshop notebook.\n", "\n", "**Environments:**\n", "- [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/)\n", "- [FrozenLake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n", "- [Taxi](https://gymnasium.farama.org/environments/toy_text/taxi/)\n", "- Flappy Bird (a custom environment; search online for an implementation)" ] }, { "cell_type": "markdown", "metadata": { "id": "ryQkECp8hr9v" }, "source": [ "📊 1.2. Algorithm Comparison:\n", "\n", "\n", " Compare at least two RL algorithms (e.g., PPO, DQN) and their results based on:\n", "- Total reward over time\n", "- Hyperparameters (check the docs)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext tensorboard" ] }, { "cell_type": "markdown", "metadata": { "id": "Kis3E_faEs6H" }, "source": [ "## 1.1 CartPole Solution: \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "ZHfbghaAEcp5", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cpu device\n", "Wrapping the env with a `Monitor` wrapper\n", "Wrapping the env in a DummyVecEnv.\n", "Logging to ./a2c_CartPole_tensorboard/A2C_3\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 41.4 |\n", "| ep_rew_mean | 41.4 |\n", "| time/ | |\n", "| fps | 657 |\n", "| iterations | 100 |\n", "| time_elapsed | 0 |\n", "| total_timesteps | 500 |\n", "| train/ | |\n", "| entropy_loss | -0.583 |\n", "| explained_variance | 0.248 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 99 |\n", "| policy_loss | 1.32 |\n", "| value_loss | 4.74 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 37.3 |\n", "| ep_rew_mean | 37.3 |\n", "| time/ | |\n", "| fps | 708 
|\n", "| iterations | 200 |\n", "| time_elapsed | 1 |\n", "| total_timesteps | 1000 |\n", "| train/ | |\n", "| entropy_loss | -0.614 |\n", "| explained_variance | -0.921 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 199 |\n", "| policy_loss | 0.991 |\n", "| value_loss | 10.1 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 36.3 |\n", "| ep_rew_mean | 36.3 |\n", "| time/ | |\n", "| fps | 718 |\n", "| iterations | 300 |\n", "| time_elapsed | 2 |\n", "| total_timesteps | 1500 |\n", "| train/ | |\n", "| entropy_loss | -0.595 |\n", "| explained_variance | 0.218 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 299 |\n", "| policy_loss | 1.39 |\n", "| value_loss | 6.2 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 37.9 |\n", "| ep_rew_mean | 37.9 |\n", "| time/ | |\n", "| fps | 720 |\n", "| iterations | 400 |\n", "| time_elapsed | 2 |\n", "| total_timesteps | 2000 |\n", "| train/ | |\n", "| entropy_loss | -0.617 |\n", "| explained_variance | -0.0144 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 399 |\n", "| policy_loss | 1.11 |\n", "| value_loss | 6.31 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 39 |\n", "| ep_rew_mean | 39 |\n", "| time/ | |\n", "| fps | 816 |\n", "| iterations | 500 |\n", "| time_elapsed | 3 |\n", "| total_timesteps | 2500 |\n", "| train/ | |\n", "| entropy_loss | -0.467 |\n", "| explained_variance | 0.0353 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 499 |\n", "| policy_loss | 0.594 |\n", "| value_loss | 5.39 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 41.6 |\n", "| ep_rew_mean | 41.6 |\n", "| time/ | |\n", "| fps | 891 |\n", "| iterations | 600 |\n", "| time_elapsed | 3 |\n", "| total_timesteps | 3000 |\n", 
"| train/ | |\n", "| entropy_loss | -0.557 |\n", "| explained_variance | 0.0422 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 599 |\n", "| policy_loss | 1.25 |\n", "| value_loss | 4.68 |\n", "------------------------------------\n", "------------------------------------\n", "| rollout/ | |\n", "| ep_len_mean | 46.1 |\n", "| ep_rew_mean | 46.1 |\n", "| time/ | |\n", "| fps | 957 |\n", "| iterations | 700 |\n", "| time_elapsed | 3 |\n", "| total_timesteps | 3500 |\n", "| train/ | |\n", "| entropy_loss | -0.538 |\n", "| explained_variance | 0.00111 |\n", "| learning_rate | 0.0007 |\n", "| n_updates | 699 |\n", "| policy_loss | 1.32 |\n", "| value_loss | 4.25 |\n", "------------------------------------\n" ] }, { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[8], line 11\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[38;5;66;03m# Train env 1\u001b[39;00m\n\u001b[1;32m 10\u001b[0m model \u001b[38;5;241m=\u001b[39m A2C(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMlpPolicy\u001b[39m\u001b[38;5;124m\"\u001b[39m, env, verbose\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m, tensorboard_log\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./a2c_CartPole_tensorboard/\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m---> 11\u001b[0m \u001b[43mmodel\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlearn\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtotal_timesteps\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1000_000\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 12\u001b[0m model\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma2c_cartpole\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "File 
\u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/a2c/a2c.py:201\u001b[0m, in \u001b[0;36mA2C.learn\u001b[0;34m(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)\u001b[0m\n\u001b[1;32m 192\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mlearn\u001b[39m(\n\u001b[1;32m 193\u001b[0m \u001b[38;5;28mself\u001b[39m: SelfA2C,\n\u001b[1;32m 194\u001b[0m total_timesteps: \u001b[38;5;28mint\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 199\u001b[0m progress_bar: \u001b[38;5;28mbool\u001b[39m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m,\n\u001b[1;32m 200\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m SelfA2C:\n\u001b[0;32m--> 201\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlearn\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 202\u001b[0m \u001b[43m \u001b[49m\u001b[43mtotal_timesteps\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtotal_timesteps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 203\u001b[0m \u001b[43m \u001b[49m\u001b[43mcallback\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcallback\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 204\u001b[0m \u001b[43m \u001b[49m\u001b[43mlog_interval\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlog_interval\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 205\u001b[0m \u001b[43m \u001b[49m\u001b[43mtb_log_name\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtb_log_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 206\u001b[0m \u001b[43m \u001b[49m\u001b[43mreset_num_timesteps\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreset_num_timesteps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 207\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mprogress_bar\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mprogress_bar\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 208\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py:336\u001b[0m, in \u001b[0;36mOnPolicyAlgorithm.learn\u001b[0;34m(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)\u001b[0m\n\u001b[1;32m 333\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mep_info_buffer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 334\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_dump_logs(iteration)\n\u001b[0;32m--> 336\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 338\u001b[0m callback\u001b[38;5;241m.\u001b[39mon_training_end()\n\u001b[1;32m 340\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/a2c/a2c.py:150\u001b[0m, in \u001b[0;36mA2C.train\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 146\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maction_space, spaces\u001b[38;5;241m.\u001b[39mDiscrete):\n\u001b[1;32m 147\u001b[0m \u001b[38;5;66;03m# Convert discrete action from float to long\u001b[39;00m\n\u001b[1;32m 148\u001b[0m actions \u001b[38;5;241m=\u001b[39m actions\u001b[38;5;241m.\u001b[39mlong()\u001b[38;5;241m.\u001b[39mflatten()\n\u001b[0;32m--> 150\u001b[0m values, log_prob, entropy \u001b[38;5;241m=\u001b[39m 
\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpolicy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mevaluate_actions\u001b[49m\u001b[43m(\u001b[49m\u001b[43mrollout_data\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mobservations\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mactions\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 151\u001b[0m values \u001b[38;5;241m=\u001b[39m values\u001b[38;5;241m.\u001b[39mflatten()\n\u001b[1;32m 153\u001b[0m \u001b[38;5;66;03m# Normalize advantage (not present in the original implementation)\u001b[39;00m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/common/policies.py:732\u001b[0m, in \u001b[0;36mActorCriticPolicy.evaluate_actions\u001b[0;34m(self, obs, actions)\u001b[0m\n\u001b[1;32m 730\u001b[0m features \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mextract_features(obs)\n\u001b[1;32m 731\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mshare_features_extractor:\n\u001b[0;32m--> 732\u001b[0m latent_pi, latent_vf \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmlp_extractor\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfeatures\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 733\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 734\u001b[0m pi_features, vf_features \u001b[38;5;241m=\u001b[39m features\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1739\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1737\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: 
ignore[misc]\u001b[39;00m\n\u001b[1;32m 1738\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1739\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1750\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1745\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1746\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1747\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1748\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1749\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1750\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1752\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1753\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/common/torch_layers.py:257\u001b[0m, in \u001b[0;36mMlpExtractor.forward\u001b[0;34m(self, features)\u001b[0m\n\u001b[1;32m 252\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, features: th\u001b[38;5;241m.\u001b[39mTensor) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mtuple\u001b[39m[th\u001b[38;5;241m.\u001b[39mTensor, th\u001b[38;5;241m.\u001b[39mTensor]:\n\u001b[1;32m 253\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 254\u001b[0m \u001b[38;5;124;03m :return: latent_policy, latent_value of the specified network.\u001b[39;00m\n\u001b[1;32m 255\u001b[0m \u001b[38;5;124;03m If all layers are shared, then ``latent_policy == latent_value``\u001b[39;00m\n\u001b[1;32m 256\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 257\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mforward_actor\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfeatures\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mforward_critic(features)\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/stable_baselines3/common/torch_layers.py:260\u001b[0m, in \u001b[0;36mMlpExtractor.forward_actor\u001b[0;34m(self, features)\u001b[0m\n\u001b[1;32m 259\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m 
\u001b[39m\u001b[38;5;21mforward_actor\u001b[39m(\u001b[38;5;28mself\u001b[39m, features: th\u001b[38;5;241m.\u001b[39mTensor) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m th\u001b[38;5;241m.\u001b[39mTensor:\n\u001b[0;32m--> 260\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpolicy_net\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfeatures\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1739\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1737\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1738\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1739\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1750\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1745\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1746\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1747\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m 
(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1748\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1749\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1750\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1752\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1753\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/container.py:250\u001b[0m, in \u001b[0;36mSequential.forward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 248\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, \u001b[38;5;28minput\u001b[39m):\n\u001b[1;32m 249\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m module \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m:\n\u001b[0;32m--> 250\u001b[0m \u001b[38;5;28minput\u001b[39m \u001b[38;5;241m=\u001b[39m \u001b[43mmodule\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43minput\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 251\u001b[0m 
\u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28minput\u001b[39m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1739\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1737\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1738\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1739\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1750\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1745\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1746\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1747\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1748\u001b[0m 
\u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1749\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1750\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1752\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1753\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n", "File \u001b[0;32m~/Self_learn/RL_Course/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py:125\u001b[0m, in \u001b[0;36mLinear.forward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 124\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, \u001b[38;5;28minput\u001b[39m: Tensor) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Tensor:\n\u001b[0;32m--> 125\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mF\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlinear\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43minput\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mweight\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbias\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# Import CartPole env \n", "from stable_baselines3 import A2C , PPO , DQN\n", "import gymnasium as gym\n", "from gymnasium import spaces\n", "import numpy as np\n", "\n", "env = 
gym.make(\"CartPole-v1\")\n", "# Train env With A2C\n", "model = A2C(\"MlpPolicy\", env, verbose=1, tensorboard_log=\"./a2c_CartPole_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"a2c_cartpole\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train env With PPO\n", "model = PPO(\"MlpPolicy\", env, verbose=1, tensorboard_log=\"./PPO_CartPole_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"ppo_cartpole\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TensorFlow installation not found - running with reduced feature set.\n", "\n", "NOTE: Using experimental fast data loading logic. To disable, pass\n", " \"--load_fast=false\" and report issues on GitHub. More details:\n", " https://github.com/tensorflow/tensorboard/issues/4784\n", "\n", "Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all\n", "TensorBoard 2.18.0 at http://localhost:6007/ (Press CTRL+C to quit)\n", "^C\n" ] } ], "source": [ "# Plot CartPole (env 1) total reward curves. This log directory holds the A2C runs;\n", "# point --logdir at ./PPO_CartPole_tensorboard to see the PPO runs.\n", "!tensorboard --logdir ./a2c_CartPole_tensorboard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modify hyperparameters for CartPole" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Modify hyperparameters and log reward curves (one TensorBoard run per setting)\n", "learning_rates = [0.0001, 0.01]\n", "gammas = [0.95, 0.99]\n", "\n", "for lr in learning_rates:\n", "    for gamma in gammas:\n", "        model = PPO(\"MlpPolicy\",\n", "                    env, verbose=0,\n", "                    learning_rate=lr,\n", "                    gamma=gamma,\n", "                    tensorboard_log=f'./PPO_CartPole_tensorboard/ppo_lr{lr}_gamma{gamma}')\n", "        model.learn(total_timesteps=200_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapper for the reward function of CartPole" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gymnasium import RewardWrapper\n", "class DoubledReward(RewardWrapper):\n", " def reward(self, reward):\n", " return reward * 2\n", "\n", "wrapped_env = DoubledReward(env)\n", "\n", "model = PPO(\"MlpPolicy\", wrapped_env, verbose=0, tensorboard_log=\"./PPO_CartPole_tensorboard/reward_wrapped_env_ppo\")\n", "model.learn(total_timesteps=500_000)" ] }, { "cell_type": "markdown", "metadata": { "id": "KQ5mY77EE9o3" }, "source": [ "## 1.2 FrozenLake implementation." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from stable_baselines3 import A2C , PPO , DQN\n", "import gymnasium as gym\n", "from gymnasium import spaces\n", "import numpy as np\n", "\n", "\n", "env = gym.make(\"FrozenLake-v1\" , is_slippery=True)\n", "# Train env With DQN\n", "model = DQN(\"MlpPolicy\" ,env , verbose=0, tensorboard_log=\"./DQN_FrozenLake_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"DQN_FrozenLake\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Train env With A2C\n", "model = A2C(\"MlpPolicy\" ,env , verbose=0, tensorboard_log=\"./A2C_FrozenLake_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"A2C_FrozenLake\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modify DQN hyperparameters for FrozenLake" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# modify hyperparameters and plot reward curves\n", "learning_rates = [0.0001, 0.01] \n", "gammas = [0.95, 0.99]\n", "\n", "for lr in learning_rates:\n", " for gamma in gammas:\n", " model = DQN(\"MlpPolicy\",\n", " env, verbose=0,\n", " learning_rate=lr, \n", " gamma=gamma ,\n", " tensorboard_log=f'./DQN_FrozenLake_tensorboard/DQN_lr{lr}_gamma{gamma}')\n", " model.learn(total_timesteps=300_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"## wrapper for reward function of FrozenLake" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from gymnasium import RewardWrapper\n", "class ModifiedFrozenLakeReward(RewardWrapper):\n", " def reward(self, reward):\n", " if reward == 0: \n", " return -0.01\n", " elif reward == 1: \n", " return 2\n", " else:\n", " return reward\n", "\n", "wrapped_env = ModifiedFrozenLakeReward(env)\n", "\n", "model = DQN(\"MlpPolicy\", wrapped_env, verbose=0, tensorboard_log=\"./DQN_FrozenLake_tensorboard/reward_wrapped_env_DQN\")\n", "model.learn(total_timesteps=500_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3 Taxi Env implementation." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from stable_baselines3 import A2C , PPO , DQN\n", "import gymnasium as gym\n", "from gymnasium import spaces\n", "import numpy as np\n", "\n", "env = gym.make(\"Taxi-v3\")\n", "# Train env With DQN\n", "model = DQN(\"MlpPolicy\" ,env , verbose=0, tensorboard_log=\"./DQN_Taxi_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"DQN_Taxi\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Train env With PPO\n", "model = PPO(\"MlpPolicy\", env, verbose=0, tensorboard_log=\"./PPO_Taxi_tensorboard/\")\n", "model.learn(total_timesteps=500_000)\n", "model.save(\"ppo_Taxi\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "seems we should solve taxi using better algorithms. 
we use q learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q-Learning for Taxi " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "import gymnasium as gym\n", "from torch.utils.tensorboard import SummaryWriter\n", "\n", "\n", "\n", "def qlearning_train(taxi_env , learning_rate , discount_factor ):\n", " \n", " # we use the tensorboard writer\n", " writer = SummaryWriter(f\"Qlearning_Taxi/taxi_experiment_lr_{learning_rate}_{discount_factor}\")\n", " \n", " # q table dimensions based on state and action space\n", " num_states = taxi_env.observation_space.n\n", " num_actions = taxi_env.action_space.n\n", " q_values = np.zeros((num_states, num_actions))\n", " \n", " # hyperparameters for the problem\n", " alpha = learning_rate # learning rate\n", " gamma = discount_factor # discount factor\n", " eps = 1.0 # initial exploration probability\n", " decay = 0.005 # epsilon decay rate per episode\n", " episodes = 5000\n", " max_steps_per_ep = 99\n", " \n", " \n", " episode_rewards = []\n", " print(\"training...\")\n", " \n", " for ep in range(episodes):\n", " # reseting the environment\n", " state, _ = taxi_env.reset()\n", " total_reward = 0\n", " \n", " for step in range(max_steps_per_ep):\n", " \n", " if random.random() < eps:\n", " chosen_action = taxi_env.action_space.sample()\n", " else:\n", " chosen_action = np.argmax(q_values[state, :])\n", " \n", " next_state, reward, done, _, _ = taxi_env.step(chosen_action)\n", " \n", " q_values[state, chosen_action] += alpha * (\n", " reward + gamma * np.max(q_values[next_state, :]) - q_values[state, chosen_action]\n", " )\n", " \n", " state = next_state\n", " total_reward += reward\n", " \n", " if done:\n", " break\n", " \n", " # decay epsilon over episodes\n", " eps = 1 / (1 + decay * ep)\n", " episode_rewards.append(total_reward)\n", " \n", " avg_reward = np.mean(episode_rewards[-100:]) if ep >= 100 else 
np.mean(episode_rewards)\n", "        writer.add_scalar(\"Mean_Episode_Reward\", avg_reward, ep)\n", "\n", "    print(f\"Training completed over {episodes} episodes.\")\n", "    writer.close()\n", "    taxi_env.close()\n", "    return q_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We test with different hyperparameters:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n" ] } ], "source": [ "# create Taxi environment\n", "taxi_env = gym.make(\"Taxi-v3\", render_mode=\"ansi\")\n", "\n", "learning_rates = [0.1, 0.9]\n", "gammas = [0.95, 0.99]\n", "\n", "for lr in learning_rates:\n", "    for gamma in gammas:\n", "        q_values = qlearning_train(taxi_env=taxi_env, learning_rate=lr, discount_factor=gamma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can render the result here:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Run 1\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "|\u001b[43m \u001b[0m: | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[34;1mY\u001b[0m| : |B: |\n", "+---------+\n", "\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "|\u001b[43m \u001b[0m: : : : |\n", "| | : | : |\n", "|\u001b[34;1mY\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "|\u001b[43m \u001b[0m| : | : |\n", "|\u001b[34;1mY\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n",
"|\u001b[34;1m\u001b[43mY\u001b[0m\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[42mY\u001b[0m| : |B: |\n", "+---------+\n", " (Pickup)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "|\u001b[42m_\u001b[0m| : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "|\u001b[42m_\u001b[0m: : : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| :\u001b[42m_\u001b[0m: : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (East)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : :\u001b[42m_\u001b[0m: : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (East)\n", "\n", "+---------+\n", "|R: | : :\u001b[35mG\u001b[0m|\n", "| : |\u001b[42m_\u001b[0m: : |\n", "| : : : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|R: |\u001b[42m_\u001b[0m: :\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|R: | :\u001b[42m_\u001b[0m:\u001b[35mG\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (East)\n", "\n", "+---------+\n", "|R: | : :\u001b[35m\u001b[42mG\u001b[0m\u001b[0m|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|Y| : |B: |\n", "+---------+\n", " (East)\n", "\n", "Run completed in 13 steps.\n", "\n", "Run 2\n", "\n", "+---------+\n", "|\u001b[34;1mR\u001b[0m: | : :G|\n", "| : | : : |\n", "| :\u001b[43m \u001b[0m: : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", "\n", "\n", "+---------+\n", "|\u001b[34;1mR\u001b[0m: | : :G|\n", "| : | : : 
|\n", "|\u001b[43m \u001b[0m: : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (West)\n", "\n", "+---------+\n", "|\u001b[34;1mR\u001b[0m: | : :G|\n", "|\u001b[43m \u001b[0m: | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|\u001b[34;1m\u001b[43mR\u001b[0m\u001b[0m: | : :G|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (North)\n", "\n", "+---------+\n", "|\u001b[42mR\u001b[0m: | : :G|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (Pickup)\n", "\n", "+---------+\n", "|R: | : :G|\n", "|\u001b[42m_\u001b[0m: | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :G|\n", "| : | : : |\n", "|\u001b[42m_\u001b[0m: : : : |\n", "| | : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :G|\n", "| : | : : |\n", "| : : : : |\n", "|\u001b[42m_\u001b[0m| : | : |\n", "|\u001b[35mY\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "+---------+\n", "|R: | : :G|\n", "| : | : : |\n", "| : : : : |\n", "| | : | : |\n", "|\u001b[35m\u001b[42mY\u001b[0m\u001b[0m| : |B: |\n", "+---------+\n", " (South)\n", "\n", "Run completed in 9 steps.\n", "\n" ] } ], "source": [ "from time import sleep\n", "\n", "def simulate_agent(environment, q_matrix, num_runs=5, max_steps_per_run=100):\n", "    for run in range(num_runs):\n", "        current_state, _ = environment.reset()\n", "        finished = False\n", "        print(f\"Run {run + 1}\\n\")\n", "        sleep(1)\n", "\n", "        for step in range(max_steps_per_run):\n", "            print(environment.render())\n", "            sleep(0.5)\n", "\n", "            chosen_action = np.argmax(q_matrix[current_state, :])\n", "            next_state, reward, finished, _, _ = environment.step(chosen_action)\n", "            current_state
= next_state\n", "\n", "            if finished:\n", "                print(f\"Run completed in {step + 1} steps.\\n\")\n", "                sleep(2)\n", "                break\n", "\n", "taxi_env = gym.make(\"Taxi-v3\", render_mode=\"ansi\")\n", "simulate_agent(taxi_env, q_values, num_runs=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping Taxi reward" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training...\n", "Training completed over 5000 episodes.\n" ] } ], "source": [ "from gymnasium import RewardWrapper\n", "class CustomTaxiReward(RewardWrapper):\n", "    def reward(self, reward):\n", "        if reward == -10:  # illegal pickup/drop-off\n", "            return -40  # penalize illegal actions more heavily\n", "        else:\n", "            return reward\n", "\n", "taxi_env.reset()\n", "wrapped_taxi_env = CustomTaxiReward(taxi_env)\n", "\n", "qlearning_train(taxi_env=wrapped_taxi_env, learning_rate=0.9, discount_factor=.99)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.4 Implementing Flappy Bird" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we use the third-party Flappy Bird environment from this [link](https://github.com/markub3327/flappy-bird-gymnasium).\n", "\n", "First we install it:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install flappy-bird-gymnasium" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we check rendering in human mode:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import flappy_bird_gymnasium\n", "import gymnasium\n", "env = gymnasium.make(\"FlappyBird-v0\", render_mode=\"human\", use_lidar=True)\n", "\n", "obs, _ = env.reset()\n", "while True:\n", "    action = env.action_space.sample()\n", "\n", "    # Processing:\n", "    obs, reward, terminated, _, info = env.step(action)\n", "\n", "    # Checking if the player is still alive\n", "    if terminated:\n", "        break\n", "\n", "env.close()" ] }, {
"cell_type": "markdown", "metadata": {}, "source": [ "Now we check the action and observation space" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Space Type: \n", "Action Space: Discrete(2)\n", "\n", "Observation Space Type: \n", "Observation Space: Box(0.0, 1.0, (180,), float64)\n", "\n", "Example Step Results:\n", "Action Taken: 1\n", "New State: [1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 0.99428291 0.9916614 0.98996685 0.98850642 0.98692514 0.98601476\n", " 0.9851696 0.98481283 0.98469388 0.98474675 0.98502427 0.98576393\n", " 0.98659539 0.98807182 0.98945397 0.99107061 0.99358888 0.99651883\n", " 0.99985682 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.\n", " 1. 1. 
0.98991426 0.97514403 0.96365112 0.95245687\n", " 0.94157192 0.93100712 0.92584822 0.91578435 0.90606844 0.8967118\n", " 0.88772578 0.88123633 0.87496653 0.86892114 0.86310487 0.85752239]\n", "Reward Received: 0.1\n" ] } ], "source": [ "env = gymnasium.make(\"FlappyBird-v0\", render_mode=\"rgb_array\", use_lidar=True)\n", "# Display action space\n", "print(f\"Action Space Type: {type(env.action_space)}\")\n", "print(f\"Action Space: {env.action_space}\")\n", "\n", "# Display observation space\n", "print(f\"\\nObservation Space Type: {type(env.observation_space)}\")\n", "print(f\"Observation Space: {env.observation_space}\")\n", "\n", "# Take a random action\n", "action = env.action_space.sample()\n", "env.reset()\n", "next_state, reward, done, truncated, info = env.step(action)\n", "\n", "# Display step results\n", "print(\"\\nExample Step Results:\")\n", "print(f\"Action Taken: {action}\")\n", "print(f\"New State: {next_state}\")\n", "print(f\"Reward Received: {reward}\")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# Train env with A2C\n", "model = A2C(\"MlpPolicy\", env, verbose=0, tensorboard_log=\"./A2C_FlappyBird_tensorboard/\")\n", "model.learn(total_timesteps=5_000_000)\n", "model.save(\"A2C_FlappyBird\")" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "# Train env with PPO\n", "model = PPO(\"MlpPolicy\", env, verbose=0, tensorboard_log=\"./PPO_mlp_FlappyBird_tensorboard/\")\n", "model.learn(total_timesteps=5_000_000)\n", "model.save(\"PPO_FlappyBird\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping FlappyBird reward" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gymnasium import RewardWrapper\n", "class ModifiedFlappyBirdReward(RewardWrapper):\n", "    def reward(self, reward):\n", "        if reward == -0.5:  # touching the top of the screen\n", "            return -2  # penalize touching the top of the screen more heavily\n",
"        elif reward == 1:  # passing a pipe\n", "            return 2\n", "        else:\n", "            return reward\n", "\n", "wrapped_env = ModifiedFlappyBirdReward(env)\n", "\n", "model = DQN(\"MlpPolicy\", wrapped_env, verbose=0, tensorboard_log=\"./DQN_FlappyBird_tensorboard/reward_wrapped_env_DQN\")\n", "model.learn(total_timesteps=500_000)" ] }, { "cell_type": "markdown", "metadata": { "id": "Py94B6gw9hYA" }, "source": [ "# Task 2 Solution: Creating Custom Environment (45 points)\n", "In this question, you were required to model **a custom 4x4 gridworld problem** as a Markov Decision Process (MDP).\n", "\n", "![gridworld_4x4.png]()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **State Space ($S$)**: The state space is all 4x4 = 16 cells in the grid, so each state is a tuple (row, column):\n", "\n", "    - The agent starts at (0,0), the top-left “S.”\n", "    - The goal is at (3,3), the bottom-right “G.”\n", "    - Holes are at (1,1) and (2,2).\n", "    - The other 12 cells are regular cells.\n", "\n", "    So, S = {(0,0), (0,1), ..., (3,2), (3,3)}: 16 possible states the agent can occupy at any time.\n", "\n", "- **Action Space ($A$)**:\n", "The agent can move in four directions: up, down, left, or right. We keep it simple and define:\n", "\n", "    A = {0: up, 1: down, 2: left, 3: right}. In code, we use the integers 0 to 3, which map directly onto a gymnasium `Discrete(4)` space.\n", "\n", "\n", "- **Reward Function ($R$)**:\n", "\n", "    - Reaching the goal (G) at (3,3): +10 reward.\n", "    - Falling into a hole at (1,1) or (2,2): -1 reward.\n", "    - Moving anywhere else (white cells or hitting a wall): 0 reward.\n", "\n", "\n", "\n", "- **Transition Probability ($P$)**: We assume the environment is deterministic: each action moves the agent one cell in the chosen direction, and a move into a wall leaves the agent in place.\n" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "id": "hgkvovqQiB8c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A . . .\n", ". H . .\n", ". . H .\n", ". . .
G\n", "\n", "Starting state: (0, {})\n", "After moving right: 1 Reward: 0 Done: False\n" ] } ], "source": [ "import gymnasium as gym\n", "from gymnasium import spaces\n", "import numpy as np\n", "\n", "class GridWorldEnv(gym.Env):\n", "    def __init__(self):\n", "        self.grid_size = 4\n", "        self.start = (0, 0)\n", "        self.goal = (3, 3)\n", "        self.holes = [(1, 1), (2, 2)]\n", "        self.state = self.start\n", "\n", "\n", "        self.action_space = spaces.Discrete(4)  # 0: up, 1: down, 2: left, 3: right\n", "        self.observation_space = spaces.Discrete(16)  # from 0 to 15 for the problem states\n", "\n", "    def get_observation(self):\n", "        \"\"\"we convert internal state (i, j) to integer observation using state = 4*i + j.\"\"\"\n", "        i, j = self.state\n", "        return 4 * i + j\n", "\n", "    def step(self, action):\n", "        row, col = self.state\n", "\n", "\n", "        if action == 0:  # up\n", "            next_row = max(row - 1, 0)\n", "            next_col = col\n", "        elif action == 1:  # down\n", "            next_row = min(row + 1, 3)\n", "            next_col = col\n", "        elif action == 2:  # left\n", "            next_row = row\n", "            next_col = max(col - 1, 0)\n", "        elif action == 3:  # right\n", "            next_row = row\n", "            next_col = min(col + 1, 3)\n", "\n", "        next_state = (next_row, next_col)\n", "\n", "        # rewards\n", "        if next_state in self.holes:\n", "            reward = -1\n", "            done = True\n", "        elif next_state == self.goal:\n", "            reward = 10\n", "            done = True\n", "        else:\n", "            reward = 0\n", "            done = False\n", "\n", "        # Update the internal state\n", "        self.state = next_state\n", "        # Return observation as integer, along with reward and status\n", "        return self.get_observation(), reward, done, False, {}\n", "\n", "    def render(self):\n", "        \"\"\"Render the grid world with the agent's position.\"\"\"\n", "        # Initialize a 4x4 grid with empty cells\n", "        grid = [['.'
for _ in range(4)] for _ in range(4)]\n", "\n", "        grid[0][0] = 'S'  # Start\n", "        grid[3][3] = 'G'  # Goal\n", "        for hole in self.holes:  # Holes\n", "            grid[hole[0]][hole[1]] = 'H'\n", "\n", "        # Mark the agent: 'A(H)' if it stands on a hole, 'A' otherwise\n", "        agent_row, agent_col = self.state\n", "        if grid[agent_row][agent_col] == 'H':\n", "            grid[agent_row][agent_col] = 'A(H)'\n", "        else:\n", "            grid[agent_row][agent_col] = 'A'\n", "\n", "        # Print the grid\n", "        for row in grid:\n", "            print(' '.join(row))\n", "        print()  # empty line\n", "\n", "    def reset(self, seed=None, options=None):\n", "        \"\"\"Reset the environment to the start state and return initial observation.\"\"\"\n", "        super().reset(seed=seed)  # follow the gymnasium reset signature\n", "        self.state = self.start\n", "        return self.get_observation(), {}\n", "\n", "# quick test\n", "env = GridWorldEnv()\n", "env.render()\n", "print(\"Starting state:\", env.reset())\n", "action = 3  # move right action for test\n", "next_state, reward, done, truncated, _ = env.step(action)\n", "print(\"After moving right:\", next_state, \"Reward:\", reward, \"Done:\", done)" ] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [], "source": [ "def qlearning_train_gridworld(env, learning_rate, discount_factor):\n", "\n", "    # we use the tensorboard writer\n", "    writer = SummaryWriter(f\"Qlearning_GridWorld/experiment_lr_{learning_rate}_{discount_factor}\")\n", "\n", "    # q table dimensions based on state and action space\n", "    num_states = env.observation_space.n\n", "    num_actions = env.action_space.n\n", "    q_values = np.zeros((num_states, num_actions))\n", "\n", "    # hyperparameters for the problem\n", "    alpha = learning_rate  # learning rate\n", "    gamma = discount_factor  # discount factor\n", "    eps = 1.0  # initial exploration probability\n", "    decay = 0.005  # epsilon decay rate per episode\n", "    episodes = 5000\n", "
max_steps_per_ep = 99\n", "\n", "\n", "    episode_rewards = []\n", "    print(\"training...\")\n", "\n", "    for ep in range(episodes):\n", "        # resetting the environment\n", "        state, _ = env.reset()\n", "        total_reward = 0\n", "\n", "        for step in range(max_steps_per_ep):\n", "\n", "            # epsilon-greedy action selection\n", "            if random.random() < eps:\n", "                chosen_action = env.action_space.sample()\n", "            else:\n", "                chosen_action = np.argmax(q_values[state, :])\n", "\n", "            next_state, reward, done, _, _ = env.step(chosen_action)\n", "\n", "            # Q-learning update\n", "            q_values[state, chosen_action] += alpha * (\n", "                reward + gamma * np.max(q_values[next_state, :]) - q_values[state, chosen_action]\n", "            )\n", "\n", "            state = next_state\n", "            total_reward += reward\n", "\n", "            if done:\n", "                break\n", "\n", "        # decay epsilon over episodes\n", "        eps = 1 / (1 + decay * ep)\n", "        episode_rewards.append(total_reward)\n", "\n", "        avg_reward = np.mean(episode_rewards[-100:]) if ep >= 100 else np.mean(episode_rewards)\n", "        writer.add_scalar(\"Mean_Episode_Reward\", avg_reward, ep)\n", "\n", "    print(f\"Training completed over {episodes} episodes.\")\n", "    writer.close()\n", "    env.close()\n", "\n", "    return q_values" ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [], "source": [ "def simulate_gridworld_agent(environment, q_matrix, max_steps_per_run=100):\n", "    print(\"environment\" , environment)\n", "\n", "    current_state, _ = environment.reset()\n", "    print(current_state)\n", "    finished = False\n", "    sleep(1)\n", "\n", "    for step in range(max_steps_per_run):\n", "        environment.render()\n", "        sleep(0.5)\n", "\n", "        # act greedily with respect to the learned Q-table\n", "        chosen_action = np.argmax(q_matrix[current_state, :])\n", "        next_state, reward, finished, _, _ = environment.step(chosen_action)\n", "        current_state = next_state\n", "\n", "        if finished:\n", "            environment.render()\n", "            print(f\"Run completed in {step + 1} steps.\\n\")\n", "            sleep(2)\n", "            break" ] }, { "cell_type": "code", "execution_count": 212, "metadata": {}, "outputs": [ { "name":
"stdout", "output_type": "stream", "text": [ "training...\n", "Training completed over 5000 episodes.\n" ] } ], "source": [ "env = GridWorldEnv()\n", "q_values = qlearning_train_gridworld(env = env , learning_rate = 0.9 , discount_factor = .99 )" ] }, { "cell_type": "code", "execution_count": 214, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 9.41480149, 9.32065348, 9.41480149, 9.5099005 ],\n", " [ 9.5099005 , -1. , 9.41480149, 9.6059601 ],\n", " [ 9.6059601 , 9.70299 , 9.5099005 , 9.70299 ],\n", " [ 9.70299 , 9.801 , 9.6059601 , 9.70299 ],\n", " [ 9.41480149, 9.22737868, 9.32065347, -1. ],\n", " [ 0. , 0. , 0. , 0. ],\n", " [ 9.6059601 , -1. , -1. , 9.801 ],\n", " [ 9.70299 , 9.9 , 9.70299 , 9.801 ],\n", " [ 9.32065234, 0. , 7.45897358, 0. ],\n", " [-0.99 , 0. , 0. , -0.999 ],\n", " [ 0. , 0. , 0. , 0. ],\n", " [ 9.801 , 10. , -1. , 9.9 ],\n", " [ 0. , 0. , 0. , 0. ],\n", " [ 0. , 0. , 0. , 0. ],\n", " [-0.999 , 0. , 0. , 0. ],\n", " [ 0. , 0. , 0. , 0. ]])" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q_values" ] }, { "cell_type": "code", "execution_count": 217, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "environment \n", "0\n", "A . . .\n", ". H . .\n", ". . H .\n", ". . . G\n", "\n", "S A . .\n", ". H . .\n", ". . H .\n", ". . . G\n", "\n", "S . A .\n", ". H . .\n", ". . H .\n", ". . . G\n", "\n", "S . . .\n", ". H A .\n", ". . H .\n", ". . . G\n", "\n", "S . . .\n", ". H . A\n", ". . H .\n", ". . . G\n", "\n", "S . . .\n", ". H . .\n", ". . H A\n", ". . . G\n", "\n", "S . . .\n", ". H . .\n", ". . H .\n", ". . . 
A\n", "\n", "Run completed in 6 steps.\n", "\n" ] } ], "source": [ "simulate_gridworld_agent(environment=env, q_matrix=q_values, max_steps_per_run=100)" ] }, { "cell_type": "markdown", "metadata": { "id": "fAzGwDFc56Bv" }, "source": [ "## Modify Q-Learning hyperparameters for gridworld" ] }, { "cell_type": "code", "execution_count": 218, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n", "training...\n", "Training completed over 5000 episodes.\n" ] } ], "source": [ "learning_rates = [0.1, 0.9]\n", "gammas = [0.95, 0.99]\n", "\n", "for lr in learning_rates:\n", "    for gamma in gammas:\n", "        qlearning_train_gridworld(env=env, learning_rate=lr, discount_factor=gamma)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }