{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is the Bandit problem?\n",
"## Introduction\n",
"Ex. slot machine\n",
"\n",
"* You try to earn the return from **K=5** slot machines. \n",
"* The chance is **N=100** times.\n",
"* You don't know the probability function of each return.\n",
"\n",
"From the example, you can find\n",
"* Your goal is to reach the maximum return under **the trade-off between exploration(探索) and exploitation(知識利用)**.\n",
" - Explore an option that looks inferior\n",
" - Exploit the currently best looking option\n",
"* Bandit problem consists of **policys(方策)**, which player can select, and **returns(報酬)** from each policy.\n",
"* we call the example multi-armed bandit problem, especially K-armed bandit problem.\n",
"\n",
"Other examples are below.\n",
"```1.crinical trial, 2.internet advertising, 3.recommender system, 4.game tree search, 5.online routing\n",
"```\n",
"\n",
"## classification\n",
"### problem classification and its level\n",
"- stochastic bandit [Level 1]\n",
"- adversarial bandit\n",
" * oblivious adversary [Level 2]\n",
" * adaptive adversary [Level 3]\n",
"\n",
"### evaluation of player's policy - rewords\n",
"- cumulative reword on finite horizon (reently main)\n",
"- geometric discounted reword on infinite horizon\n",
"- anytime stoppable\n",
"\n",
"### evaluation of player's policy - regret\n",
"- regret\n",
"- expected regret\n",
"- pseudo-regret\n",
"\n",
"To evaluate player's policy, the difference between the ideal target policy and player's policy is calculated, which called **regret**.\n",
"\n",
"$$ \\mbox{Regret}(T) = \\max_{i \\in 1, ..., K} \\sum^T_{t=1} X_i (t) - \\sum^T_{t=1} X_{i(t)}(t) $$\n",
"$$ \\mathbb{E} [ \\mbox{Regret}(T)] = \\mathbb{E} \\left[ \\max_{i \\in 1, ..., K} \\sum^T_{t=1} X_i (t) - \\sum^T_{t=1} X_{i(t)}(t) \\right] $$\n",
"$$ \\overline{\\mbox{Regret}}(T) = \\max_{i \\in 1, ..., K} \\mathbb{E} \\left[ \\sum^T_{t=1} X_i (t) - \\sum^T_{t=1} X_{i(t)}(t) \\right] $$\n",
"\n",
"## History\n",
"Year | Person | Content | study\n",
"--- | --- | --- | ---\n",
"*1933* | William R. Thompson | Multi-Armed Bandits | clinical trial\n",
"** | | |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Basic mathmathecis complehension\n",
"Example of stochastic bandit problem is like below.\n",
"\n",
"「ある公告の現在のクリック率$\\hat{\\mu}$が5%以下のとき、真のクリック率$\\mu$が$10\\%$である可能性は?」\n",
"\n",
"we explain 4 approximations and show the example about Bernoulli probability with a mean $\\mu$.\n",
"\n",
"## 1. Center LImit Theorem: 中心極限定理\n",
"In probability theory, the central limit theorem states that, under certain circumstances, the probability distribution of the scaled mean of a random sample converges to a normal distribution as the sample size increases to infinity.\n",
"\n",
"標準化された標本平均の分布は、標準正規分布に弱収束する。任意の$x \\in \\mathbb{R}$ で次が成り立つ。\n",
"$$ \\lim_{n \\rightarrow \\infty} \\mathbb{P} \\left[ \\frac{\\sqrt{n}(\\hat{\\mu} - \\mu) }{\\sigma} \\leq x \\right] = \\Phi (x) $$\n",
"\n",
" Berry–Esseen theorem [link](https://en.wikipedia.org/wiki/Berry%E2%80%93Esseen_theorem) shows the relative error on the gaussian approximation for CLT
\n",
"Under stronger assumptions, **the Berry–Esseen theorem**, or **Berry–Esseen inequality**, gives a more quantitative result, because it also specifies the rate at which this convergence takes place by **giving a bound on the maximal error of approximation between the normal distribution and the true distribution of the scaled sample mean**. The approximation is measured by **the Kolmogorov–Smirnov distance**. In the case of independent samples, the convergence rate is $n^{-1/2}$, where n is the sample size, and the constant is estimated in terms of the third absolute normalized moments.\n",
"$${\\displaystyle \\left|F_{n}(x)-\\Phi (x)\\right|\\leq {C\\rho \\over \\sigma ^{3}{\\sqrt {n}}}}$$\n",
"Calculated values of the constant C have decreased markedly over the years, from the original value of **7.59** by Esseen (1942)
, to **0.7882** by *van Beek (1972)*, then **0.7655** by *Shiganov (1986)*, then **0.7056** by *Shevtsova (2007)*, then **0.7005** by *Shevtsova (2008)*, then **0.5894** by *Tyurin (2009)*, then **0.5129** by *Korolev & Shevtsova (2009)*, then **0.4785** by *Tyurin (2010)*. The detailed review can be found in the papers *Korolev & Shevtsova (2009)*, 8Korolev & Shevtsova (2010)*. The best estimate as of 2012, **$C < 0.4748$**, follows from the inequality\n",
"$$ {\\displaystyle \\sup _{x\\in \\mathbb {R} }\\left|F_{n}(x)-\\Phi (x)\\right|\\leq {0.33554(\\rho +0.415\\sigma ^{3}) \\over \\sigma ^{3}{\\sqrt {n}}},} $$\n",
"due to Shevtsova (2011), since $σ^3 ≤ ρ$ and $0.33554 · 1.415 < 0.4748$.\n",
"**Sanov's theorem** [link](https://en.wikipedia.org/wiki/Sanov%27s_theorem)
\n",
"In information theory, **Sanov's theorem** gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. if A is the closure of its interior,\n",
"\n",
"$$ {\\displaystyle \\lim _{n\\to \\infty }{\\frac {1}{n}}\\log q^{n}(x^{n})=-D_{\\mathrm {KL} }(p^{*}||q).} \\lim _{{n\\to \\infty }}{\\frac {1}{n}}\\log q^{n}(x^{n})=-D_{{{\\mathrm {KL}}}}(p^{*}||q). $$\n",
"\n",
"「分布Pからのサンプルn個があたかも分布Qからのものであるように振る舞う」事象の確率は、\n",
"$$ \\mathbb{P} [\\hat{P}_n \\approx Q] \\approx e^{-n D(Q||P)} $$\n",
"\n",
"
\n", " | epsilon | \n", "sim_nums | \n", "times | \n", "chosen_arms | \n", "rewards | \n", "cumulative_rewards | \n", "
---|---|---|---|---|---|---|
0 | \n", "0.1 | \n", "1 | \n", "1 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "
1 | \n", "0.1 | \n", "1 | \n", "2 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "
2 | \n", "0.1 | \n", "1 | \n", "3 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "
3 | \n", "0.1 | \n", "1 | \n", "4 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "
4 | \n", "0.1 | \n", "1 | \n", "5 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "