{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ranking Ads (multi-armed bandits)\n", "** *\n", "*Note: if you are visualizing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/optimization/ranking_ads_multiarmed_bandits.ipynb)*\n", "\n", "This project consists of comparing different policies for how to choose an ad to show amongst a large pool of ads in different situations, using only information about their historical clicks and impressions (that is, no features). The objective is, intuitively, to show the ads that would give the most clicks every time.\n", "\n", "This is known as the multi-armed bandits problem, and the challenge is that, if an algorithm/policy is constantly showing new ads to see how well they do, it would get a low average click rate, whereas if it always shows the same ads, it might miss the fact that there are other, better ads to show (known as the exploration/exploitation dilemma).\n", "\n", "Here I’ll compare different approaches by making simulations under different situations, assuming that each ad has an inherent probability of being clicked when shown (this probability is not known) and changing some of the conditions (e.g. equal vs. different payment per click, few vs. many, permanent vs. expiring, fixed vs. changing probabilities).\n", "** *\n", "## Sections\n", "\n", "[1. Common algorithms](#p1)\n", "* [1.1 Classical setting](#p11)\n", "* [1.2 Different prices and multiple ads shown](#p12)\n", "\n", "[2. Infinite pool of ads (many-armed bandits)](#p2)\n", "\n", "[3. Changing click probabilities (restless bandits)](#p3)\n", "* [3.1 Slowly changing probabilities](#p31)\n", "* [3.2 Not-so-slowly changing probabilities](#p32)\n", "\n", "[4. Expiring ads (mortal multi-armed bandits)](#p4)\n", "** *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1 Common algorithms\n", "** *\n", "The most successful and widely studied families of algorithms for this problem have consistently being those that create upper confidence bounds on the click probability of each ad based on observed clicks and impressions, and those that randomly or greedily alternate steps of exploitation (showing what has historically been the best ad) and exploration (showing a different ad).\n", "\n", "In the most general setting of the multi-armed bandits problem, when the rewards of each arm (here: the click probabilities of each ad) can follow arbitrary distributions (e.g. not just click/no-click, but perhaps something more), upper bounds can be calculated using Hoeffdings inequality and decaying the probability of the sum of variables being less than something with time (e.g.$p=t^{-4}$) – giving the following formula at each time step $T$ for each ad $i$ (known as UCB1):\n", "\n", "$$ Upper\\: bound \\:\\hat{P_i} = \\frac{\\sum_t Clicks_{i,t}}{\\sum_t Displays_{i,t}} + \\sqrt{ \\frac{2 \\times ln \\,T}{\\sum_t Displays_{i,t}} }$$\n", "\n", "However, as in this case the clicks are known to follow a Bernoulli distribution (i.e. 
However, as in this case the clicks are known to follow a Bernoulli distribution (i.e. we know that each display can only result in a click or no click, each with a certain probability), it's possible to build better upper confidence bounds using classical statistical approximations for the confidence interval of a rate.

Let $\hat{P_i} = \frac{\sum_t Clicks_{i,t}}{\sum_t Displays_{i,t}}$ and let $Z_{inv}(x)$ be the z-score of a standard normal distribution whose upper-tail probability is $x$ (i.e. $P(Z \ge Z_{inv}(x)) = x$); the upper limit of the confidence interval is then:

$$ Upper \: bound \: (P_i, \alpha) = \hat{P_i} + Z_{inv}(\alpha) \times \sqrt{\frac{\hat{P_i} \times (1-\hat{P_i})}{\sum_t Displays_{i,t}}} $$
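A minimal sketch of this bound, using the same made-up counts as in the UCB1 example above (here `scipy.stats.norm.ppf` plays the role of $Z_{inv}$, and the function name is again only illustrative):

```python
import numpy as np
from scipy.stats import norm

def normal_upper_bound(clicks, displays, alpha=0.05):
    """Upper limit of the normal-approximation confidence interval
    for each ad's click rate, at significance level alpha."""
    p_hat = clicks / displays
    z = norm.ppf(1.0 - alpha)  # Z_inv(alpha): z-score with upper-tail probability alpha
    return p_hat + z * np.sqrt(p_hat * (1.0 - p_hat) / displays)

# Same hypothetical counts as in the UCB1 sketch
clicks = np.array([12.0, 5.0, 30.0])
displays = np.array([200.0, 50.0, 1000.0])

next_ad = np.argmax(normal_upper_bound(clicks, displays))
```

Note that both formulas divide by the number of displays, so every ad has to be shown at least once before either bound can be computed; moreover, an ad with $\hat{P_i} = 0$ makes the normal-approximation radius collapse to zero, which is why this kind of bound is usually paired with some initial exploration.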