# Q-learning 

In [None]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !pip install -U gymnasium pygame moviepy
    !pip install gymnasium[box2d]

In [None]:
import numpy as np
rng = np.random.default_rng()
import matplotlib.pyplot as plt
import os

import gymnasium as gym
print("gym version:", gym.__version__)

from moviepy.editor import ImageSequenceClip, ipython_display

class GymRecorder(object):
    """
    Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.

    The environment must have the render_mode `rgb_array_list`.
    """
    def __init__(self, env):
        self.env = env
        self._frames = []

    def record(self, frames):
        "To be called at the end of an episode."
        for frame in frames:
            self._frames.append(np.array(frame))

    def make_video(self, filename):
        "Generates the gif video."
        directory = os.path.dirname(os.path.abspath(filename))
        # Create the target directory (and any missing parents) if needed
        os.makedirs(directory, exist_ok=True)
        self.clip = ImageSequenceClip(list(self._frames), fps=self.env.metadata["render_fps"])
        self.clip.write_gif(filename, fps=self.env.metadata["render_fps"], loop=1)
        # Reset the frame buffer for the next recording
        self._frames = []

def running_average(x, N):
    "Smooths x with a moving average over a window of N samples."
    kernel = np.ones(N) / N
    return np.convolve(x, kernel, mode='same')

In this short exercise, we are going to apply **Q-learning** to the Taxi environment used last time for MC control.

As a reminder, Q-learning updates the Q-value of a state-action pair **after each transition**, using the update rule:

$$\Delta Q(s_t, a_t) = \alpha \, (r_{t+1} + \gamma \, \max_{a'} \, Q(s_{t+1}, a') - Q(s_t, a_t))$$
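
As a rough illustration of the rule above, the following sketch applies a single update to a small NumPy Q-table. The function name `q_learning_update`, the table shape, and the numeric values are hypothetical and chosen only for the example; terminal transitions are deliberately ignored here.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the TD error."""
    # Target: immediate reward plus the discounted value of the greedy action in s_next
    target = r + gamma * Q[s_next].max()
    td_error = target - Q[s, a]
    # Move Q(s, a) a fraction alpha of the way towards the target
    Q[s, a] += alpha * td_error
    return td_error

# Toy table with 3 states and 2 actions (hypothetical values)
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]

delta = q_learning_update(Q, s=0, a=1, r=-1.0, s_next=1)
print("TD error:", delta, "updated Q(0, 1):", Q[0, 1])
```

Note that the target uses the maximum over the actions in $s_{t+1}$, regardless of which action the agent actually takes next.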

**Q:** Update the class you designed for online MC in the last exercise so that it implements Q-learning. 

The main difference is that the `update()` method has to be called after each step of the episode, not at the end. This also simplifies the code considerably, as there is no need to iterate backwards over the episode.

You can use the following parameters at the beginning, but feel free to change them:

* Discount factor $\gamma = 0.9$. 
* Learning rate $\alpha = 0.1$.
* Epsilon-greedy action selection, with an initial exploration parameter of 1.0 and an exponential decay of $10^{-5}$ after each update (i.e. every step!).
* A total number of episodes of 20000.
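
The parameters listed above could be wired together as in the sketch below. It is only one possible reading: the helper `epsilon_greedy` is a hypothetical name, the decay is assumed to be multiplicative (`epsilon *= 1 - decay`), and the table size assumes the 500 states and 6 actions of Taxi-v3.

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Break ties between equally good actions at random
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

gamma, alpha = 0.9, 0.1      # discount factor and learning rate
epsilon, decay = 1.0, 1e-5   # initial exploration and decay per update
nb_episodes = 20000

Q = np.zeros((500, 6))       # assumed Taxi-v3 dimensions

# Simulate 1000 action selections to see how slowly epsilon decays
for step in range(1000):
    a = epsilon_greedy(Q, 0, epsilon, rng)
    epsilon *= 1.0 - decay   # assumed multiplicative decay after each step
print("epsilon after 1000 steps:", epsilon)
```

With such a small decay, epsilon stays close to 1 for many thousands of steps, so early episodes are almost entirely exploratory.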

Keep the general structure of the class: `train()` for the main loop, `test()` to run one episode without exploration, etc. 

Plot the training and test performance at the end, and render the learned deterministic policy for one episode.

*Note:* if $s_{t+1}$ is terminal (`done` is true after the transition), the target should not be $r_{t+1} + \gamma \, \max_{a'} \, Q(s_{t+1}, a')$, but simply $r_{t+1}$ as there is no next action.
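
A minimal sketch of how the note above might translate into code, assuming a NumPy Q-table indexed as `Q[state, action]`. The helper name `td_target` and the numeric values are hypothetical.

```python
import numpy as np

def td_target(Q, r, s_next, done, gamma=0.9):
    """Return the Q-learning target, without bootstrapping from a terminal state."""
    if done:
        # No next action exists, so the target is the reward alone
        return r
    return r + gamma * Q[s_next].max()

# Toy table with 2 states and 2 actions (hypothetical values)
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]

terminal_target = td_target(Q, r=20.0, s_next=1, done=True)
regular_target = td_target(Q, r=-1.0, s_next=1, done=False)
print("terminal:", terminal_target, "non-terminal:", regular_target)
```

With Gymnasium, `env.step()` returns separate `terminated` and `truncated` flags, so `done` would have to be derived from them in the training loop.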

**Q:** Compare the performance of Q-learning to online MC. Experiment with parameters (gamma, epsilon, alpha, etc.).