{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bandits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part, we will investigate the properties of the action selection schemes seen in the lecture and compare their properties:\n", "\n", "1. greedy action selection\n", "2. $\\epsilon$-greedy action selection\n", "3. softmax action selection\n", "\n", "Let's re-use the definitions of the last exercise:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "rng = np.random.default_rng()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class Bandit:\n", " \"\"\"\n", " n-armed bandit.\n", " \"\"\"\n", " def __init__(self, nb_actions, mean=0.0, std_Q=1.0, std_r=1.0):\n", " \"\"\"\n", " :param nb_actions: number of arms.\n", " :param mean: mean of the normal distribution for $Q^*$.\n", " :param std_Q: standard deviation of the normal distribution for $Q^*$.\n", " :param std_r: standard deviation of the normal distribution for the sampled rewards.\n", " \"\"\"\n", " # Store parameters\n", " self.nb_actions = nb_actions\n", " self.mean = mean\n", " self.std_Q = std_Q\n", " self.std_r = std_r\n", " \n", " # Initialize the true Q-values\n", " self.Q_star = rng.normal(self.mean, self.std_Q, self.nb_actions)\n", " \n", " # Optimal action\n", " self.a_star = self.Q_star.argmax()\n", " \n", " def step(self, action):\n", " \"\"\"\n", " Sampled a single reward from the bandit.\n", " \n", " :param action: the selected action.\n", " :return: a reward.\n", " \"\"\"\n", " return float(rng.normal(self.Q_star[action], self.std_r, 1))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAABJwAAAFzCAYAAAB7BBMsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAuYUlEQVR4nO3de7BmZ10n+u+PDhlRGAPSQEjSdgpbnQwzBOwKaEZFbpOLY4OIJ+EImUhNE028jFpOq3WOlzp1KuMFRjQmBoiEIxqpg5Ee0hpChovOGEgHk5AQIl0xkiZ9khYkXCJgk9/5Y6/G7c7u7t3da7/vu/v9fKre2ms961nr/T3VnX6yv+tW3R0AAAAAGMtjpl0AAAAAAMcWgRMAAAAAoxI4AQAAADAqgRMAAAAAoxI4AQAAADAqgRMAAAAAozpu2gVMwpOf/OTeuHHjtMsAmEm33HLL33X3+mnXMU3mCYDlmSMWmCcAlneweWIuAqeNGzdm586d0y4DYCZV1d9Ou4ZpM08ALM8cscA8AbC8g80TbqkDAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGddy0C2A2bdx23bRLGMW9l5477RIAjknmCQAADkbgBAAAACNwQgb+iVvqAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAACAmVRVZ1XV3VW1q6q2LbO9quoNw/bbq+o5S7avq6q/qqp3LWp7UlXdUFUfH34+cRJjAZg3AicAAGDmVNW6JJclOTvJaUnOr6rTlnQ7O8mm4bM1yeVLtv9EkruWtG1LcmN3b0py47AOwMgETgAAwCw6I8mu7r6nu7+c5JokW5b02ZLkrb3gpiQnVNWJSVJVJyc5N8mbltnn6mH56iQvXaX6AeaawAkAAJhFJyW5b9H67qFtpX3+W5KfTfLIkn2e2t17kmT4+ZTlvryqtlbVzqrauXfv3iMaAMA8EzgBAACzqJZp65X0qarvTfJgd99ypF/e3Vd29+bu3rx+/fojPQzA3BI4AQAAs2h3klMWrZ+c5P4V9jkzyfdV1b1ZuBXvBVX1+0OfBxbddndikgfHLx0AgRMAADCLbk6yqapOrarjk5yXZPuSPtuTvHp4W93zkjzU3Xu6++e6++Tu3jjs9z+6+4cW7XPBsHxBkneu+kgA5tBx0y4AAABgqe7eV1WXJLk+ybokV3X3nVV10bD9iiQ7kpyTZFeSh5NcuIJDX5rk7VX1miSfSPKK1agfYN4JnAAAgJnU3TuyECotbrti0XInufgQx3hfkvctWv9UkheOWScAjyZwgiU2brtu2iUctXsvPXfaJQAAADDHPMMJAAAAgFEJnAAAAAAYlcAJAAAAgFEJnAAAAAAYlcAJAAAAgFEJnAAAAAAY1XHTLmCpqjoryW8mWZfkTd196ZLtz0/yziR/MzT9cXf/yiRrBADm18Zt1027hKN276XnTrsEAOAYN1OBU1WtS3JZkhcn2Z3k5qra3t0fXdL1z7v7eydeIAAAAACHNGu31J2RZFd339PdX05yTZItU64JgBlSVWdV1d1Vtauqti2z/flV9VBV3Tp8/s9p1AkAAPNspq5wSnJSkvsWre9O8txl+n17Vd2W5P4kP9Pdd06iOACmy5WwAACwNszaFU61TFsvWf9wkm/s7mcl+a0kf7Lsgaq2VtXOqtq5d+/ecasEYFpcCQsAAGvArAVOu5Ocsmj95CxcxfRV3f3Z7v78sLwjyWOr6slLD9TdV3b35u7evH79+tWsGYDJWe5K2JOW6fftVXVbVf1pVf3ryZQGAADsN2uB081JNlXVqVV1fJLzkmxf3KGqnlZVNSyfkYUxfGrilQIwDa6EBQCANWCmAqfu3pfkkiTXJ7krydu7+86quqiqLhq6/UCSO4ZnOL0hyXndvfSXDQCOTa6EBQCANWDWHhq+/5eDHUvarli0/NtJfnvSdQEwE756JWyST2bhSthXLu5QVU9L8kB3tythAQBgOmYucAKAA+nufVW1/0rYdUmu2n8l7LD9iixcCfsjVbUvyT/ElbAAADBxAicA1hRXwgIAwOybqWc4AQAAALD2CZwAAAAAGJXACQAAAIBRCZwAAAAAGJXACQAAAIBRCZwAAAAAGJXACQAAAIBRCZwAAAAAGJXACQAAAIBRCZwAAAAAGJXACQAAmElVdVZV3V1Vu6pq2zLbq6reMGy/vaqeM7R/TVV9qKpuq6o7q+qXF+3zS1X1yaq6dficM8kxAcyL46ZdAAAAwFJVtS7JZUlenGR3kpurant3f3RRt7OTbBo+z01y+fDzS0le0N2fr6rHJvmLqvrT7r5p2O/13f3rkxoLwDxyhRMAADCLzkiyq7vv6e4vJ7kmyZYlfbYkeWsvuCnJCVV14rD++aHPY4dPT6xyAFzhBMy3jduum3YJo7j30nOnXQIAjO2kJPctWt+dhauXDtXnpCR7hiukbknyTUku6+4PLup3SVW9OsnOJD/d3X+/9MuramuSrUmyYcOGoxwKwPxxhRMAADCLapm2pVcpHbBPd3+lu09PcnKSM6rqmcP2y5M8I8npSfYk+Y3lvry7r+zuzd29ef369YdfPcCcEzgBAACzaHeSUxatn5zk/sPt092fSfK+JGcN6w8MYdQjSd6YhVv3ABiZwAkAAJhFNyfZVFWnVtXxSc5Lsn1Jn+1JXj28re55SR7q7j1Vtb6qTkiSqnpckhcl+diwfuKi/V+W5I5VHgfAXPIMJwAAYOZ0976quiTJ9UnWJbmqu++sqouG7Vck2ZHknCS7kjyc5MJh9xOTXD08x+kxSd7e3e8atv1qVZ2ehVvv7k3y2smMCGC+CJwAAOAAvFxiurp7RxZCpcVtVyxa7iQXL7Pf7UmefYBjvmrkMgFYhlvqAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABjVzAVOVXVWVd1dVbuqatsy26uq3jBsv72qnjONOgEAAABY3kwFTlW1LsllSc5OclqS86vqtCXdzk6yafhsTXL5RIsEYKqcmAAAgNk3U4FTkjOS7Orue7r7y0muSbJlSZ8tSd7aC25KckJVnTjpQgGYPCcmAABgbThu2gUscVKS+xat707y3BX0OSnJnsWdqmprFn7RyIYNG464oI3brjvifWfJvZeeu6r9jyXzOnZ/11kjvnpiIkmqav+JiY8u6vPVExNJbqqqE6rqxO7e8+jDAQAAq2HWAqdapq2PoE+6+8okVybJ5s2bH7UdgDXJiYlVIqxduXkd+7Hwd/1I/uzm9c8bAI7WrN1StzvJKYvWT05y/xH0AeDYNOqJie7e3N2b169fP0pxAIzrSJ/bV1VfU1UfqqrbqurOqvrlRfs8qapuqKqPDz+fOMkxAcyLWQucbk6yqapOrarjk5yXZPuSPtuTvHqYXJ6X5CG3SQDMDScmAObEUT6370tJXtDdz0pyepKzht8dkmRbkhu7e1OSG4d1AEY2U4FTd+9LckmS65PcleTt3X1nVV1UVRcN3XYkuSfJriRvTPKjUykWgGlwYgJgfhzxC4WG9c8PfR47fHrRPlcPy1cneelqDgJgXs3aM5zS3TuyECotbrti0XInuXjSdQEwfd29r6r2n5hYl+Sq/Scmhu1XZGEOOScLJyYeTnLhtOoF4Kgc1XP7hiukbknyTUku6+4PDn2euv9ERHfvqaqnLPflYz3rD2BezVzgBAAH48QEwNw4quf2dfdXkpxeVSckubaqntndd6z0y72ECODozNQtdQAAAINRntvX3Z9J8r4kZw1ND1TViUky/HxwtIoB+CqBEwAAMIuO+Ll9VbV+uLIpVfW4JC9K8rFF+1wwLF+Q5J2rPA6AueSWOgAAYOYc5XP7Tkxy9fAcp8dk4WVE7xq2XZrk7VX1miSfSPKKSY0JYJ4InAAAgJl0pM/t6+7bkzz7AMf8VJIXjlspAEu5pQ4AAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABiVwAkAAACAUQmcAAAAABjVcdMuAAAAAGAt2rjtummXcNTuvfTcVTmuK5wAAAAAGJXACQAAAIBRCZwAAAAAGJXACQAAWBVVta6q3jPtOgCYPIETAACwKrr7K0kerqqvn3YtAEyWt9QBAACr6YtJPlJVNyT5wv7G7v7x6ZUEwGoTOAFJVu9VmADA3Ltu+AAwRwROAADAqunuq6vq+CTfPDTd3d3/OM2aAFh9AicAAA7JlbAcqap6fpKrk9ybpJKcUlUXdPcHplgWAKtM4AQAAKym30jyku6+O0mq6puT/GGSb5tqVQCsKm+pAwAAVtNj94dNSdLdf53ksVOsB4AJEDgBAACr6ZaqenNVPX/4vDHJLSvZsarOqqq7q2pXVW1bZntV1RuG7bdX1XOG9lOq6r1VdVdV3VlVP7Fon1+qqk9W1a3D55zRRgrAV7mlDgAAWE0XJbk4yY9n4RlOH0jyO4faqarWJbksyYuT7E5yc1Vt7+6PLup2dpJNw+e5SS4ffu5L8tPd/eGqekIWQq8bFu37+u7+9VFGB8CyBE4AAMCqqKrHJLmlu5+Z5HWHufsZSXZ19z3Dsa5JsiXJ4sBpS5K3dncnuamqTqiqE7t7T5I9SdLdn6uqu5KctGRfAFaRW+oAAIBV0d2PJLmtqjYcwe4nJblv0fruoe2w+lTVxiTPTvLBRc2XDLfgXVVVTzyC2gA4BIETAACwmk5McmdV3VhV2/d/VrBfLdPWh9Onqh6f5B1JfrK7Pzs0X57kGUlOz8JVUL+x7JdXba2qnVW1c+/evSsoF4DF3FIHAACspl8+wv12Jzll0frJSe5faZ+qemwWwqa3dfcf7+/Q3Q/sXx4eYP6u5b68u69McmWSbN68eWnQBcAhCJwAAIBVMTzD6bLhGU6H6+Ykm6rq1CSfTHJeklcu6bM9C7fHXZOFh4U/1N17qqqSvDnJXd39z54dtegZT0nysiR3HEFtAByCwAkAAFgV3f1IVd1WVRu6+xOHue++qrokyfVJ1iW5qrvvrKqLhu1XJNmR5Jwku5I8nOTCYfczk7wqyUeq6tah7ee7e0eSX62q07Nw6929SV57FEME4AAETgAAwGra/wynDyX5wv7G7v6+Q+04BEQ7lrRdsWi5k1y8zH5/keWf75TuftWKKwfgiAmcAACA1XSkz3ACYA2bmcCpqp6U5I+SbMzCpa0/2N1/v0y/e5N8LslXkuzr7s2TqxKAaTFPAKxN3f3+qvrGJJu6+z1V9bVZuEWOY9jGbddNu4RR3HvpudMuAdasx0y7gEW2JbmxuzcluXFYP5Dv6e7T/RIBMFfMEwBrUFX9pyT/b5LfHZpOSvInUysIgIk4osCpqr6uqsY+K7ElydXD8tVJXjry8QFY28wTAGvTxVl4iPdnk6S7P57kKVOtCIBVt6LAqaoeU1WvrKrrqurBJB9Lsqeq7qyqX6uqTSPU8tT9rycdfh5oEuok766qW6pq60Fq3lpVO6tq5969e0coD4ApG3WeAGBivtTdX96/UlXHZeHfagCOYSt9htN7k7wnyc8luaO7H0m++jyN70lyaVVd292/f7CDVNV7kjxtmU2/sPKSc2Z3319VT0lyQ1V9rLs/sLRTd1+Z5Mok2bx5swkNYA2Y5DwxhFFbk2TDhg1HVC8AK/L+qvr5JI+rqhcn+dEk/33KNQGwylYaOL2ou/9xaWN3fzrJO5K8o6oee6iDdPeLDrStqh6oqhO7e09VnZjkwQMc4/7h54NVdW2SM5I86hcJAKanqr4uyRe7+yuHs98k5wknJgAmZluS1yT5SJLXJtmR5E1TrQiAVbeiW+r2h01V9Q1V9SNVdWFVnVFVj1va5yhsT3LBsHxBkncu7TA8O+oJ+5eTvCTJHUf5vQAcpQndem2eAFiDuvuR7n5jd7+iu39gWBb0AxzjDveh4dcmWZ/k/07ya0keqqqPjVTLpUleXFUfT/LiYT1V9fSq2jH0eWqSv6iq25J8KMl13f1nI30/AEfuvUmekYVbr5/W3ad091OSfGeSm7Jw6/UPHeV3mCcAAGCNWOktdfs9obt/paq+v7u/u6penuSbxiikuz+V5IXLtN+f5Jxh+Z4kzxrj+wAY1Si3Xh+MeQIAANaOw73C6YvDzy9V1eO6+x0Z/icfgPk1oVuvAVjDhludAZgThxs4/frwZro/SnJVVf1YkpPGLwuANWo1b70GYA2qqu+oqo8muWtYf1ZV/c6UywJglR1W4NTd7+juT3f367LwdolTkmxZlcoAWIue0N2/kuSB7v7uJOcn+b0p1wTAdL0+yb9P8qkk6e7bknzXVCsCYNWt6BlOVVVL3yTR3f/PofoAMHcedet1Vb0/yX+dZlEATFd331dVi5u+Mq1aAJiMlV7h9N6q+rGq2rC4saqOr6oXVNXV+adXVQMwv9x6DcBS91XVdyTp4feHn8lwex0Ax66VvqXurCQ/nOQPq+rUJJ9J8rgsBFbvTvL67r51NQoEYO0YXiaRJK+rqlcl+Tdx6zXAvLsoyW9m4QTE7iz8/vCjU60IgFW3osCpu7+Y5HeS/M7wWusnJ/mH7v7MKtYGwBrh1msADuJbuvt/X9xQVWcm+Z9TqgeACTjct9Slu/+xu/cImwBYxK3XABzIb62wDYBjyCGvcKqqr+vuL1TV47v785MoCoA1x63XAPwzVfXtSb4jyfqq+qlFm/5lknXTqQqASVnJLXVPrKoLk+xK8merXA8Aa5BbrwFYxvFJHp+F3zmesKj9s0l+YCoVATAxKwmcXpjkP2bhbUNP6e4HV7ckANay7v7HJHuGs9mvS5Kq+pbuvnu6lQEwSd39/iTvr6q3dPffTrseACZrJYHTh7Jwm8QpwiYADqWqTkjy+iTfWlVfTHJ7ktckuXCadQEwNW+pqke9NKK7XzCNYgCYjEMGTt1917B4+yrXAsAaVlVPz8JVsV+b5C1ZeIbT3iT/NskfT68yAKbsZxYtf02SlyfZN6VaAJiQlVzhBAAHVVUvSXJ1kvcl+VKSi7IQPF3Y3b83xdIAmLLuvmVJ0/+sqvdPpRgAJmbFgdOSM9cfG+7JBoAk+b+SfGd379rfMLyd6Mqqek2Sh7v7jqlVB8DUVNWTFq0+Jsm3JXnalMoBYEJWFDgtd+a6qvafuf5fq1ceAGvE8YvDpiTp7r+sqpcneVcW5o5/M5XKAJi2W5J0ksrCrXR/k4Vn+wFwDFvpFU4HOnP9xuHM9RecuQaYa1+sqvXdvXdxY3f/dVV9JQtXyAIwh7r71GnXAMDkrTRwOtCZ6++PM9cAJL+W5E+q6hXdff/+xqp6cpIvecspwPwZflc4oO6eixdKbNx23bRLGMW9l5477RKANWalgZMz1wAcUHe/o6r+RZK/rKpbktyW5PgkP5iFq2QBmD//4SDbOit4g2lVnZXkN5OsS/Km7r50yfYatp+T5OEk/7G7P1xVpyR5axaeFfVIkiu7+zeHfZ6U5I+SbExyb5If7O6/P6yRAXBIKw2cnLkG4KC6+w+q6k+SnJfkmUk+m+SV3X3zVAsDYCq6+8Kj2b+q1iW5LMmLk+xOcnNVbe/ujy7qdnaSTcPnuUkuH37uS/LTQ/j0hCS3VNUNw77bktzY3ZdW1bZh/b8cTa0APNqKAidnrgFYie5+OMlV064DgNlRVV+f5BeTfNfQ9P4kv9LdDx1i1zOS7Orue4bjXJNkS5LFgdOWJG/t7k5yU1WdUFUndveeJHuSpLs/V1V3JTlp2HdLkucP++9/MZLACWBkj1lpx+7+gyT/KgvPbPr6JP+YhTPXV69SbQAAwNp3VZLPZeFk9Q9m4QrY31vBficluW/R+u6h7bD6VNXGJM9O8sGh6alDIJXh51NWMggADs9Kb6lL4sw1AABw2J7R3S9ftP7LVXXrCvarZdr6cPpU1eOTvCPJT3b3Z1fwnf904KqtSbYmyYYNGw5nVwByGFc4AQAAHIF/qKp/t3+lqs5M8g8r2G93klMWrZ+c5P6V9qmqx2YhbHrbkjfiPVBVJw59Tkyy7PNou/vK7t7c3ZvXr1+/gnIBWEzgBAAArKYfSXJZVd1bVX+b5LeTXLSC/W5OsqmqTq2q47PwUortS/psT/LqWvC8JA91957h7XVvTnJXd79umX0uGJYvSPLOIxsWAAdzWLfUAQAAHI7uvjXJs6rqXw7rK7q1rbv3VdUlSa5Psi7JVd19Z1VdNGy/IsmOJOck2ZXk4ST734x3ZpJXJfnIotv3fr67dyS5NMnbq+o1ST6R5BVHPUgAHkXgBAAArJqq+oksPCT8c0neWFXPSbKtu999qH2HgGjHkrYrFi13kouX2e8vsvzzndLdn0rywsMZAwCHzy11AADAavrh4aqml2ThjXAXZuEqIwCOYQInAABgNe2/0uicJL/X3bflAFcfAXDsEDgBAACr6ZaqencWAqfrq+oJSR6Zck0ArDLPcAIAAFbTa5KcnuSe7n64qr4h//RwbwCOUQInAABg1XT3I1W1MckPVVUn+YvuvnbKZQGwytxSBwAArJqq+p0kFyX5SJI7kry2qi6bblUArDZXOAEAAKvpu5M8s7s7Sarq6iyETwAcw1zhBAAArKa7k2xYtH5KktunVAsAE+IKJwAAYHRV9d+TdJKvT3JXVX1oWH9ukv81zdoAWH0CJwAAYDX8+kG29cSqAGAqBE4AAMDouvv9y7VX1ZlJXpnkA5OtCIBJEjgBAACrqqpOz0LI9INJ/ibJO6ZaEACrTuAEAACMrqq+Ocl5Sc5P8qkkf5Skuvt7ploYABMhcAIAAFbDx5L8eZL/0N27kqSq/vN0SwJgUh4z7QIAAIBj0suT/H9J3ltVb6yqFyapKdcEwIQInAAAgNF197Xd/b8l+dYk70vyn5M8taour6qXTLU4AFbdzAROVfWKqrqzqh6pqs0H6XdWVd1dVbuqatskawRgeswTAGtTd3+hu9/W3d+b5OQktybx7zPAMW5mAqckdyT5/hzk9ahVtS7JZUnOTnJakvOr6rTJlAfAlJknANa47v50d/9ud79g2rUAsLpm5qHh3X1XklQd9LbuM5Ls6u57hr7XJNmS5KOrXiAAU2WeAACAtWOWrnBaiZOS3LdofffQBgDJYcwTVbW1qnZW1c69e/dOpDgAAJgXE73Cqarek+Rpy2z6he5+50oOsUxbH+C7tibZmiQbNmxYcY0ATM8k54nuvjLJlUmyefPmZfsAAABHZqKBU3e/6CgPsTvJKYvWT05y/wG+yy8SAGvMJOcJAABg9ay1W+puTrKpqk6tquOTnJdk+5RrAmB2mCcAAGAGzEzgVFUvq6rdSb49yXVVdf3Q/vSq2pEk3b0vySVJrk9yV5K3d/ed06oZgMkxTwAAwNoxS2+puzbJtcu035/knEXrO5LsmGBpAMwA8wQAAKwdM3OFEwAAAADHBoETAAAAAKMSOAEAAAAwKoETAAAwk6rqrKq6u6p2VdW2ZbZXVb1h2H57VT1n0barqurBqrpjyT6/VFWfrKpbh885S48LwNETOAEAADOnqtYluSzJ2UlOS3J+VZ22pNvZSTYNn61JLl+07S1JzjrA4V/f3acPHy+aAFgFAicAAGAWnZFkV3ff091fTnJNki1L+mxJ8tZecFOSE6rqxCTp7g8k+fREKwbgqwROAADALDopyX2L1ncPbYfbZzmXDLfgXVVVTzy6MgFYjsAJAACYRbVMWx9Bn6UuT/KMJKcn2ZPkN5b98qqtVbWzqnbu3bv3EIcEYCmBEwAAMIt2Jzll0frJSe4/gj7/THc/0N1f6e5HkrwxC7fuLdfvyu7e3N2b169ff9jFA8w7gRMAADCLbk6yqapOrarjk5yXZPuSPtuTvHp4W93zkjzU3XsOdtD9z3gavCzJHQfqC8CRO27aBQAAACzV3fuq6pIk1ydZl+Sq7r6zqi4atl+RZEeSc5LsSvJwkgv3719Vf5jk+UmeXFW7k/xid785ya9W1elZuPXu3iSvndSYAOaJwAkAAJhJ3b0jC6HS4rYrFi13kosPsO/5B2h/1Zg1ArA8t9QBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAAMCqBEwAAAACjEjgBAAAzqarOqqq7q2pXVW1bZntV1RuG7bdX1XMWbbuqqh6sqjuW7POkqrqhqj4+/HziJMYCMG8ETgAAwMypqnVJLktydpLTkpxfVact6XZ2kk3DZ2uSyxdte0uSs5Y59LYkN3b3piQ3DusAjEzgBMCaUFWvqKo7q+qRqtp8kH73VtVHqurWqto5yRoBGNUZSXZ19z3d/eUk1yTZsqTPliRv7QU3JTmhqk5Mku7+QJJPL3PcLUmuHpavTvLS1SgeYN4JnABYK+5I8v1JPrCCvt/T3ad39wGDKQBm3klJ7lu0vntoO9w+Sz21u/ckyfDzKUdZJwDLmJnAyZlrAA6mu+/q7runXQcAE1PLtPUR9DmyL6/aWlU7q2rn3r17xzgkwFyZmcApzlwDMI5O8u6quqWqtk67GACO2O4kpyxaPznJ/UfQZ6kH9t92N/x8cLlO3X1ld2/u7s3r168/rMIBmKHAyZlrAKrqPVV1xzKfpc/sOJgzu/s5WXiQ7MVV9V0H+C5nrgFm281JNlXVqVV1fJLzkmxf0md7klcPb6t7XpKH9t8udxDbk1wwLF+Q5J1jFg3AguOmXcAR2H/mupP8bndfuVyn4az21iTZsGHDBMsD4Eh194tGOMb9w88Hq+raLDx09lFXzw7zx5VJsnnz5lFuvwBgPN29r6ouSXJ9knVJruruO6vqomH7FUl2JDknya4kDye5cP/+VfWHSZ6f5MlVtTvJL3b3m5NcmuTtVfWaJJ9I8orJjQpgfkw0cKqq9yR52jKbfqG7V3pm4czuvr+qnpLkhqr62PAGin/GLxIA86eqvi7JY7r7c8PyS5L8ypTLAuAIdfeOLIRKi9uuWLTcSS4+wL7nH6D9U0leOGKZACxjooHTJM9cA3BsqaqXJfmtJOuTXFdVt3b3v6+qpyd5U3efk+SpSa6tqmRhjvuD7v6zqRUNAABzak3dUufMNcD86u5rk1y7TPv9WbidIt19T5JnTbg0AABgiZkJnJy5BgAAgLVp47brpl3CUbv30nOnXcIxZWYCJ2euAQAAAI4Nj5l2AQAAAAAcWwROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAADAqAROAAAAAIxK4AQAAMykqjqrqu6uql1VtW2Z7VVVbxi2315VzznUvlX1S1X1yaq6dficM6nxAMwTgRMAADBzqmpdksuSnJ3ktCTnV9VpS7qdnWTT8Nma5PIV7vv67j59+OxY3ZEAzCeBEwAAMIvOSLKru+/p7i8nuSbJliV9tiR5ay+4KckJVXXiCvcFYBUJnAAAgFl0UpL7Fq3vHtpW0udQ+14y3IJ3VVU9cbySAdhP4AQAAMyiWqatV9jnYPtenuQZSU5PsifJbyz75VVbq2pnVe3cu3fvigoG4J8InAAAgFm0O8kpi9ZPTnL/CvsccN/ufqC7v9LdjyR5YxZuv3uU7r6yuzd39+b169cf1UAA5pHACQAAmEU3J9lUVadW1fFJzkuyfUmf7UlePbyt7nlJHuruPQfbd3jG034vS3LHag8EYB4dN+0CAAAAlurufVV1SZLrk6xLclV331lVFw3br0iyI8k5SXYleTjJhQfbdzj0r1bV6Vm4xe7eJK+d2KAA5ojACQAAmEndvSMLodLitisWLXeSi1e679D+qpHLBGAZbqkDAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACYE2oql+rqo9V1e1VdW1VnXCAfmdV1d1Vtauqtk24TAAAIAInANaOG5I8s7v/bZK/TvJzSztU1boklyU5O8lpSc6vqtMmWiUAADA7gZMz1wAcTHe/u7v3Das3JTl5mW5nJNnV3fd095eTXJNky6RqBAAAFsxM4BRnrgFYuR9O8qfLtJ+U5L5F67uHtkepqq1VtbOqdu7du3cVSgQAgPk1M4GTM9cAVNV7quqOZT5bFvX5hST7krxtuUMs09bLfVd3X9ndm7t78/r168cZAAAAkCQ5btoFHMAPJ/mjZdqXO3P93OUOUFVbk2xNkg0bNoxdHwCroLtfdLDtVXVBku9N8sLuXi5I2p3klEXrJye5f7wKAQCAlZjoFU7OXANwpKrqrCT/Jcn3dffDB+h2c5JNVXVqVR2f5Lwk2ydVIwAAsGCiVzg5cw3AUfjtJP8iyQ1VlSQ3dfdFVfX0JG/q7nO6e19VXZLk+iTrklzV3XdOr2QAAJhPM3NL3aIz19+9kjPXST6ZhTPXr5xQiQBMUXd/0wHa709yzqL1HUl2TKouAADg0WYmcMqMnrm+99JzV/PwAKxx5gkAAHi0mQmcnLkGAAAAODZM9KHhAAAAABz7BE4AAAAAjErgBAAAAMCoBE4AAAAAjErgBAAAAMCoBE4AAAAAjErgBAAAAMCoBE4AAAAAjErgBAAAAMCoBE4AAMBMqqqzquruqtpVVduW2V5V9YZh++1V9ZxD7VtVT6qqG6rq48PPJ05qPADzROAEAADMnKpal+SyJGcnOS3J+VV12pJuZyfZNHy2Jrl8BftuS3Jjd29KcuOwDsDIBE4AAMAsOiPJru6+p7u/nOSaJFuW9NmS5K294KYkJ1TViYfYd0uSq4flq5O8dJXHATCXBE4AAMAsOinJfYvWdw9tK+lzsH2f2t17kmT4+ZQRawZgcNy0C5iEW2655e+q6m+nXcdBPDnJ3027iCkw7vkzr2Of9XF/47QLmDbzxMwy7vkzr2Of5XFPc46oZdp6hX1Wsu/Bv7xqaxZu00uSz1fV3Yez/4St+t+h+q+refQjZtyrxLhnzqqO/SjHfcB5Yi4Cp+5eP+0aDqaqdnb35mnXMWnGPX/mdezzOu61xDwxm4x7/szr2Od13CuwO8kpi9ZPTnL/Cvscf5B9H6iqE7t7z3D73YPLfXl3X5nkyiMvf3Lm9e+Qcc+XeR13snbH7pY6AABgFt2cZFNVnVpVxyc5L8n2JX22J3n18La65yV5aLhN7mD7bk9ywbB8QZJ3rvZAAObRXFzhBAAArC3dva+qLklyfZJ1Sa7q7jur6qJh+xVJdiQ5J8muJA8nufBg+w6HvjTJ26vqNUk+keQVExwWwNwQOM2GNXGp7iow7vkzr2Of13Eznnn9O2Tc82dexz6v4z6k7t6RhVBpcdsVi5Y7ycUr3Xdo/1SSF45b6dTN698h454v8zruZI2OvRb+jQYAAACAcXiGEwAAAACjEjhNWVWdVVV3V9Wuqto27XomoaquqqoHq+qOadcySVV1SlW9t6ruqqo7q+onpl3TJFTV11TVh6rqtmHcvzztmiapqtZV1V9V1bumXQtrzzzOEYl5wjxhnoCVMk+YJ6Zd0ySYJ9buPCFwmqKqWpfksiRnJzktyflVddp0q5qItyQ5a9pFTMG+JD/d3f8qyfOSXDwnf95fSvKC7n5WktOTnDW8RWZe/ESSu6ZdBGvPHM8RiXnCPGGegEMyT5gnYp6YF2t2nhA4TdcZSXZ19z3d/eUk1yTZMuWaVl13fyDJp6ddx6R1957u/vCw/Lks/KNx0nSrWn294PPD6mOHz1w8PK6qTk5ybpI3TbsW1qS5nCMS88SwbJ6YA+YJjpJ5Ys6YJ5KYJ9YUgdN0nZTkvkXruzMH/2CQVNXGJM9O8sEplzIRw2WgtyZ5MMkN3T0X407y35L8bJJHplwHa5M5Yo6ZJ8wTsALmiTlmnjBPrAUCp+mqZdrmIqmdZ1X1+CTvSPKT3f3ZadczCd39le4+PcnJSc6oqmdOuaRVV1Xfm+TB7r5l2rWwZpkj5pR5wjwBK2SemFPmCfPEWiFwmq7dSU5ZtH5ykvunVAsTUFWPzcLk8Lbu/uNp1zNp3f2ZJO/LfNxzf2aS76uqe7NwifsLqur3p1sSa4w5Yg6ZJ8wT0y2JNcY8MYfME+aJ6ZZ0eARO03Vzkk1VdWpVHZ/kvCTbp1wTq6SqKsmbk9zV3a+bdj2TUlXrq+qEYflxSV6U5GNTLWoCuvvnuvvk7t6Yhf+2/0d3/9CUy2JtMUfMGfOEecI8wWEyT8wZ84R5Yq3NEwKnKerufUkuSXJ9Fh749vbuvnO6Va2+qvrDJH+Z5FuqandVvWbaNU3ImUlelYVk+tbhc860i5qAE5O8t6puz8L/GN3Q3WvulZ4wafM6RyTmiZgnzBOwAuYJ84R5gllX3W7zBQAAAGA8rnACAAAAYFQCJwAAAABGJXACAAAAYFQCJwAAAABGJXACAAAAYFQCJzhMVfWyquqq+tZD9PvJqvraRes7quqEVS8QgKkyTwBwMOYJ5kV197RrgDWlqt6e5MQkN3b3Lx2k371JNnf3302oNABmgHkCgIMxTzAvXOEEh6GqHp/kzCSvSXLe0Lauqn69qj5SVbdX1Y9V1Y8neXqS91bVe4d+91bVk4fln6qqO4bPTw5tG6vqrqp6Y1XdWVXvrqrHDdt+vKo+Ohz/msmPHICVME8AcDDmCebJcdMuANaYlyb5s+7+66r6dFU9J8lzk5ya5Nndva+qntTdn66qn0ryPUvPSFTVtyW5cNivknywqt6f5O+TbEpyfnf/p+HMx8uT/H6SbUlO7e4vuYwWYKa9NOYJAA7spTFPMCdc4QSH5/wk+88IXDOsvyjJFd29L0m6+9OHOMa/S3Jtd3+huz+f5I+TfOew7W+6+9Zh+ZYkG4fl25O8rap+KMm+EcYBwOowTwBwMOYJ5oYrnGCFquobkrwgyTOrqpOsS9JZ+If8cB6GVgfZ9qVFy19J8rhh+dwk35Xk+5L8H1X1r/dPSADMBvMEAAdjnmDeuMIJVu4Hkry1u7+xuzd29ylJ/ibJh5NcVFXHJUlVPWno/7kkT1jmOB9I8tKq+tqq+rokL0vy5wf60qp6TJJTuvu9SX42yQlJHj/SmAAYj3kCgIMxTzBXBE6wcucnuXZJ2zuy8DC/TyS5vapuS/LKYduVSf50/0P+9uvuDyd5S5IPJflgkjd1918d5HvXJfn9qvpIkr9K8vru/szRDQWAVWCeAOBgzBPMleo+nCv3AAAAAODgXOEEAAAAwKgETgAAAACMSuAEAAAAwKgETgAAAACMSuAEAAAAwKgETgAAAACMSuAEAAAAwKgETgAAAACM6v8HSnsmsdGJ6HMAAAAASUVORK5CYII=", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "nb_actions = 5\n", "bandit = Bandit(nb_actions)\n", "\n", "all_rewards = []\n", "for t in range(1000):\n", " rewards = []\n", " for a in range(nb_actions):\n", " rewards.append(bandit.step(a))\n", " all_rewards.append(rewards)\n", " \n", "mean_reward = np.mean(all_rewards, axis=0)\n", "\n", "plt.figure(figsize=(20, 6))\n", "plt.subplot(131)\n", "plt.bar(range(nb_actions), bandit.Q_star)\n", "plt.xlabel(\"Actions\")\n", "plt.ylabel(\"$Q^*(a)$\")\n", "plt.subplot(132)\n", "plt.bar(range(nb_actions), mean_reward)\n", "plt.xlabel(\"Actions\")\n", "plt.ylabel(\"$Q_t(a)$\")\n", "plt.subplot(133)\n", "plt.bar(range(nb_actions), np.abs(bandit.Q_star - mean_reward))\n", "plt.xlabel(\"Actions\")\n", "plt.ylabel(\"Absolute error\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Greedy action selection\n", "\n", "In **greedy action selection**, we systematically chose the action with the highest estimated Q-value at each play (or randomly when there are ties):\n", "\n", "$$a_t = \\text{argmax}_a Q_t(a)$$\n", "\n", "We maintain estimates $Q_t$ of the action values (initialized to 0) using the online formula:\n", "\n", "$$Q_{t+1}(a_t) = Q_t(a_t) + \\alpha \\, (r_{t} - Q_t(a_t))$$\n", "\n", "when receiving the sampled reward $r_t$ after taking the action $a_t$. The learning rate $\\alpha$ can be set to 0.1 at first.\n", "\n", "The algorithm simply alternates between these two steps for 1000 plays (or steps): take an action, update its Q-value. \n", "\n", "**Q:** Implement the greedy algorithm on the 5-armed bandit.\n", "\n", "Your algorithm will look like this:\n", "\n", "* Create a 5-armed bandit (mean of zero, variance of 1).\n", "* Initialize the estimated Q-values to 0 with an array of the same size as the bandit.\n", "* **for** 1000 plays:\n", " * Select the greedy action $a_t^*$ using the current estimates.\n", " * Sample a reward from $\\mathcal{N}(Q^*(a_t^*), 1)$.\n", " * Update the estimated Q-value of the action taken.\n", " \n", "Additionally, you will store the received rewards at each step in an initially empty list or a numpy array of the correct size and plot it in the end. You will also plot the true Q-values and the estimated Q-values at the end of the 1000 plays. \n", "\n", "*Tip:* to implement the argmax, do not rely on `np.argmax()`. If there are ties in the array, for example at the beginning:\n", "\n", "```python\n", "x = np.array([0, 0, 0, 0, 0])\n", "```\n", "\n", "`x.argmax()` will return you the **first occurrence** of the maximum 0.0 of the array. In this case it will be the index 0, so you will always select the action 0 first. \n", "\n", "It is much more efficient to retrieve the indices of **all** maxima and randomly select one of them:\n", "\n", "```python\n", "a = rng.choice(np.where(x == x.max())[0])\n", "```\n", "\n", "`np.where(x == x.max())` returns a list of indices where `x` is maximum. `rng.choice()` randomly selects one of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Re-run your algorithm multiple times with different values of $Q^*$ (simply recreate the `Bandit`) and observe:\n", "\n", "1. How much reward you get.\n", "2. How your estimated Q-values in the end differ from the true Q-values.\n", "3. Whether greedy action action selection finds the optimal action or not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before going further, let's turn the agent into a class for better reusability. \n", "\n", "**Q:** Create a `GreedyAgent` class taking the bandit as an argument as well as the learning rate `alpha=0.1`:\n", "\n", "```python\n", "bandit = Bandit(nb_actions)\n", "\n", "agent = GreedyAgent(bandit, alpha=0.1)\n", "```\n", "\n", "The constructor should initialize the array of estimated Q-values `Q_t` and store it as an attribute.\n", "\n", "Define a method `act(self)` that returns the index of the greedy action based on the current estimates, as well as a method `update(self, action, reward)` that allows to update the estimated Q-value of the action given the obtained reward. Define also a `train(self, nb_steps)` method that implements the complete training process for `nb_steps=1000` plays and returns the list of obtained rewards.\n", "\n", "```python\n", "class GreedyAgent:\n", " def __init__(self, bandit, alpha):\n", " # TODO\n", " \n", " def act(self): \n", " action = # TODO\n", " return action\n", " \n", " def update(self, action, reward):\n", " # TODO\n", " \n", " def train(self, nb_steps):\n", " # TODO\n", "```\n", "\n", "Re-run the experiment using this Greedy agent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Modify the `train()` method so that it also returns a list of binary values (0 and 1) indicating for each play whether the agent chose the optimal action. Plot this list and observe the lack of exploration.\n", "\n", "*Hint:* the index of the optimal action is already stored in the bandit: `bandit.a_star`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The evolution of the received rewards and optimal actions does not give a clear indication of the successful learning, as it is strongly dependent on the true Q-values. To truly estimate the performance of the algorithm, we have to average these results over many runs, e.g. 200.\n", "\n", "**Q:** Run the learning procedure 200 times (new bandit and agent every time) and average the results. Give a unique name to these arrays (e.g. `rewards_greedy` and `optimal_greedy`) as we will do comparisons later. Compare the results with the lecture, where a 10-armed bandit was used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $\\epsilon$-greedy action selection\n", "\n", "The main drawback of greedy action selection is that it does not explore: as soon as it finds an action better than the others (with a sufficiently positive true Q-value, i.e. where the sampled rewards are mostly positive), it will keep selecting that action and avoid exploring the other options. \n", "\n", "The estimated Q-value of the selected action will end up being quite correct, but those of the other actions will stay at 0.\n", "\n", "In $\\epsilon$-greedy action selection, the greedy action $a_t^*$ (with the highest estimated Q-value) will be selected with a probability $1-\\epsilon$, the others with a probability of $\\epsilon$ altogether. \n", "\n", "$$\n", " \\pi(a) = \\begin{cases} 1 - \\epsilon \\; \\text{if} \\; a = a_t^* \\\\ \\frac{\\epsilon}{|\\mathcal{A}| - 1} \\; \\text{otherwise.} \\end{cases}\n", "$$\n", "\n", "If you have $|\\mathcal{A}| = 5$ actions, the four non-greedy actions will be selected with a probability of $\\frac{\\epsilon}{4}$.\n", "\n", "**Q:** Create a `EpsilonGreedyAgent` (possibly inheriting from `GreedyAgent` to reuse code) to implement $\\epsilon$-greedy action selection (with $\\epsilon=0.1$ at first). Do not overwrite the arrays previously calculated (mean reward and optimal actions), as you will want to compare the two methods in a single plot.\n", "\n", "To implement $\\epsilon-$greedy, you need to:\n", "\n", "1. Select the greedy action $a = a^*_t$.\n", "2. Draw a random number between 0 and 1 (`rng.random()`).\n", "3. If this number is smaller than $\\epsilon$, you need to select another action randomly in the remaining ones (`rng.choice()`).\n", "4. Otherwise, keep the greedy action." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Compare the properties of greedy and $\\epsilon$-greedy (speed, optimality, etc). Vary the value of the parameter $\\epsilon$ (0.0001 until 0.5) and conclude." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Softmax action selection\n", "\n", "To avoid exploring actions which are clearly not optimal, another useful algorithm is **softmax action selection**. In this scheme, the estimated Q-values are ransformed into a probability distribution using the softmax opertion:\n", "\n", "$$\n", " \\pi(a) = \\frac{\\exp \\frac{Q_t(a)}{\\tau}}{ \\sum_b \\exp \\frac{Q_t(b)}{\\tau}}\n", "$$ \n", "\n", "For each action, the term $\\exp \\frac{Q_t(a)}{\\tau}$ is proportional to $Q_t(a)$ but made positive. These terms are then normalized by the denominator in order to obtain a sum of 1, i.e. they are the parameters of a discrete probability distribution. The temperature $\\tau$ controls the level of exploration just as $\\epsilon$ for $\\epsilon$-greedy.\n", "\n", "In practice, $\\exp \\frac{Q_t(a)}{\\tau}$ can be very huge if the Q-values are high or the temperature is small, creating numerical instability (NaN). It is much more stable to substract the maximal Q-value from all Q-values before applying the softmax:\n", "\n", "$$\n", " \\pi(a) = \\frac{\\exp \\displaystyle\\frac{Q_t(a) - \\max_a Q_t(a)}{\\tau}}{ \\sum_b \\exp \\displaystyle\\frac{Q_t(b) - \\max_b Q_t(b)}{\\tau}}\n", "$$ \n", "\n", "This way, $Q_t(a) - \\max_a Q_t(a)$ is always negative, so its exponential is between 0 and 1.\n", "\n", "**Q:** Implement the softmax action selection (with $\\tau=0.5$ at first) and compare its performance to greedy and $\\epsilon$-greedy. Vary the temperature $\\tau$ and find the best possible value. Conclude.\n", "\n", "*Hint:* To select actions with different probabilities, check the doc of `rng.choice()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploration scheduling\n", "\n", "The problem with this version of softmax (with a constant temperature) is that even after it has found the optimal action, it will still explore the other ones (although more rarely than at the beginning). The solution is to **schedule** the exploration parameter so that it explores a lot at the beginning (high temperature) and gradually switches to more exploitation (low temperature).\n", "\n", "Many schemes are possible for that, the simplest one (**exponential decay**) being to multiply the value of $\\tau$ by a number very close to 1 after **each** play:\n", "\n", "$$\\tau = \\tau \\times (1 - \\tau_\\text{decay})$$\n", "\n", "**Q:** Implement in a class `SoftmaxScheduledAgent` temperature scheduling for the softmax algorithm ($\\epsilon$-greedy would be similar) with $\\tau=1$ initially and $\\tau_\\text{decay} = 0.01$ (feel free to change these values). Plot the evolution of `tau` and of the standard deviation of the choices of the optimal action. Conclude." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Experiment with different schedules (initial values, decay rate) and try to find the best setting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "3d24234067c217f49dc985cbc60012ce72928059d528f330ba9cb23ce737906d" } } }, "nbformat": 4, "nbformat_minor": 4 }