{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

# CSCI-UA 9472 - Artificial Intelligence

\n", "\n", "

## Assignment 4: Reinforcement Learning

\n", "\n", "
Given date: December 7

Due date: December 20/21

Total: 20 pts
In this fourth assignment, we will continue our quest for the optimal agent and replace our heavy logical reasoning model with a faster implementation based on Q-learning.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 1. A simple moving agent (10pts)\n", "\n", "We consider a simple environment as shown below. Our simple agent has the possibility to move West, East, North and South (except on the borders of the environment). Moreover it has two additional actions:\n", " \n", " - It can hit with its sword (which is essentially useful when facing the skeleton)\n", " \n", " - It can open the door and escape the room (useful when being in the upper rightmost cell)\n", " \n", "As you can imagine, hitting while in an empty cell is a loss of energy and should be penalized. We will thus associate useless behaviors (like hitting when alone and opening non existing doors) with a negative reward of -10. Hitting the skeleton and opening the door _when the skeleton has been killed_ should be associated with higher rewards of +30. Escaping without the kill will be associated with a negative reward of -10 (The agent should not exit before getting rid of all the evil in the room)\n", "\n", "There are a total of 48 cells. Each cell can contain the skeleton, be empty or contain the door (exclusively) which makes a total of $48\\times3 = 144$ states for your environment. Moreover, we want to keep track of whether or not the skeleton has been killed so we will take into account an additional variable encoding whether the kill has been completed. This thus leads to a total of $48\\times 3\\times2 = 288$.\n", "\n", "Any move will be associated with a small penalty of -1 point " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by coding a simple TD Q-learning agent which stores the Q-table entirely as a $288\\times 6$ numpy array. Use an exploration function to avoid being stuck in a limited region of the environment. Recall that such a function can be defined as \n", "\n", "\\begin{align}\n", "f(Q_\\theta[s,a], N[s,a]) = \\left\\{\\begin{array}{ll}\n", "R^+& \\text{if $N[s,a]