{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Contextual Bandits\n", "\n", "## Overview\n", "This tutorial includes an overview of the contextual bandits approach to reinforcement learning and describes how to approach a contextual bandit problem using Vowpal Wabbit. You will learn how to use Vowpal Wabbit in a contextual bandit setting with the Python tutorial—including when and how to work with different contextual bandits approaches, how to format data, and understand the results. No prior knowledge of contextual bandits or Vowpal Wabbit is required.\n", "\n", "## The contextual bandit problem\n", "In the contextual bandit problem, a learner repeatedly observes a context, chooses an action, and observes a loss/cost/reward for the chosen action only. Contextual bandit algorithms use additional side information (or context) to aid real world decision-making. They work well for choosing actions in dynamic environments where options change rapidly, and the set of available actions is limited.\n", "\n", "## Working with contextual bandits in Vowpal Wabbit\n", "\n", "To introduce a Vowpal Wabbit approach to the contextual bandit problem and explore the capabilities of this approach to reinforcement learning, this guide uses a hypothetical application called **APP**.\n", "\n", "**APP** interacts with the context of a user's behavior (search history, visited pages, or geolocation) in a dynamic environment–such as a news website or a cloud controller. **APP** differs from MAB because we have some information available to the **APP**, which is the context.\n", "\n", "**APP** performs the following functions:\n", "\n", "* Some context **x** arrives and is observed by **APP**.\n", "* **APP** chooses an action **a** from a set of actions **A**, i.e., **a** ∈ **A** (**A** may depend on **x**).\n", "* Some reward **r** for the chosen **a** is observed by **APP**.\n", "\n", "For example:\n", "\n", "**APP** news website:\n", " - **Decision to optimize**: articles to display to user.\n", " - **Context**: user data (browsing history, location, device, time of day)\n", " - **Actions**: available news articles\n", " - **Reward**: user engagement (click or no click)\n", "\n", "**APP** cloud controller:\n", " - **Decision to optimize**: the wait time before reboot of unresponsive machine.\n", " - **Context**: the machine hardware specs (SKU, OS, failure history, location, load).\n", " - **Actions**: time in minutes - {1 ,2 , ...N}\n", " - **Reward**: - the total downtime\n", "\n", "You want **APP** to take actions that provide the highest possible reward. In machine learning parlance, we want a **model** that tells us which action to take.\n", "\n", "### Policy vs. model\n", "\n", "We use the term **policy** many times in this tutorial. In reinforcement learning, the policy is roughly equivalent to **model**. In machine learning, the model means **learned function**. When someone says policy, it is more specific than model because it indicates this is a model that acts in the world.\n", "\n", "Contexts and actions are typically represented as feature vectors in contextual bandit algorithms. For example, **APP** chooses actions by applying a policy **π** that takes a context as input and returns an action. The goal is to find a policy that maximizes the average reward over a sequence of interactions.\n", "\n", "### Specifying the contextual bandit approach\n", "\n", "There are multiple policy evaluation approaches available to optimize a policy. Vowpal Wabbit offers four approaches to specify a contextual bandit approach using `--cb_type`:\n", "\n", "- **Inverse Propensity Score**: `--cb_type ips`\n", "- **Doubly Robust**: `--cb_type dr`\n", "- **Direct Method**: `--cb_type dm`\n", "- **Multi Task Regression/Importance Weighted Regression**: `--cb_type mtr`\n", "\n", ">**Note:** The focal point of contextual bandit learning research is efficient exploration algorithms. For more details, see the [Contextual Bandit bake-off paper](https://arxiv.org/pdf/1802.04064.pdf).\n", "\n", "### Specifying exploration algorithms\n", "\n", "Vowpal Wabbit offers five exploration algorithms:\n", "\n", "- **Explore-First**: `--first`\n", "- **Epsilon-Greedy**: `--epsilon`\n", "- **Bagging Explorer**: `--bag`\n", "- **Online Cover**: `--cover`\n", "- **Softmax Explorer**: `--softmax` (only supported for `--cb_explore_adf`)\n", "\n", ">**Note:** For more details on contextual bandits algorithms and Vowpal Wabbit, please refer to the [Vowpal Wabbit Github Wiki](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms).\n", "\n", "## Contextual bandit algorithms and input formats\n", "\n", "There are four main components to a contextual bandit problem:\n", "\n", "- **Context (x)**: the additional information which helps in choosing action.\n", "- **Action (a)**: the action chosen from a set of possible actions **A**.\n", "- **Probability (p)**: the probability of choosing **a** from **A**.\n", "- **Cost/Reward (r)**: the reward received for action **a**.\n", "\n", "Vowpal Wabbit provides three contextual bandits algorithms:\n", "\n", "1. `--cb` \n", " The contextual bandit module which allows you to optimize predictor based on already collected data, or contextual bandits without exploration.\n", "2. `--cb_explore` \n", " The contextual bandit learning algorithm for when the maximum number of actions is known ahead of time and semantics of actions stays the same across examples.\n", "3. `--cb_explore_adf` \n", " The contextual bandit learning algorithm for when the set of actions changes over time or you have rich information for each action.\n", "\n", "### Input format for `--cb`\n", "\n", "```text\n", "--cb \n", "```\n", "\n", "The `--cb 4` command specifies that we want to use the contextual bandit module and our data has a total of four actions.\n", "\n", "Each example is represented as a separate line in your data file and must follow the following format:\n", "\n", "```text\n", "action:cost:probability | features\n", "```\n", "\n", "Sample data file **train.dat** with five examples:\n", "\n", "```text\n", "1:2:0.4 | a c\n", "3:0.5:0.2 | b d\n", "4:1.2:0.5 | a b c\n", "2:1:0.3 | b c\n", "3:1.5:0.7 | a d\n", "```\n", "\n", "Use the command\n", "```text\n", "vw -d train.dat --cb 4\n", "```\n", "\n", ">**Note:** This usage is for the Vowpal Wabbit command line. See below for a Python tutorial.\n", "\n", "### Input format for `--cb_explore`\n", "\n", "```text\n", "--cb_explore \n", "```\n", "\n", "The command `--cb_explore 4` specifies our examples explore a total of four actions.\n", "\n", ">**Note:** This format explores the action space so you must specify which algorithm you want to use for exploration.\n", "\n", "#### Usage\n", "\n", "The following examples use the input format for the `--cb` command example above:\n", "\n", "```text\n", "vw -d train.dat --cb_explore 4 --first 2\n", "```\n", "\n", " In this case, on the first two actions, you take each of the four actions with probability 1/4.\n", "\n", "```text\n", "vw -d train.dat --cb_explore 4 --epsilon 0.2\n", "```\n", "\n", "In this case, the prediction of the current learned policy takes with probability **1 - epsilon** 80% of the time, and with the remaining 20% epsilon probability, an action is chosen uniformly at random.\n", "\n", "```text\n", "vw -d train.dat --cb_explore 4 --bag 5\n", "```\n", "\n", "```text\n", "vw -d train.dat --cb_explore 4 --cover 3\n", "```\n", "\n", "This algorithm is a theoretically optimal exploration algorithm. Similar to the previous bagging **m** example, different policies are trained in this case. Unlike bagging, the training of these policies is explicitly optimized to result in a diverse set of predictions—choosing all the actions which are not already learned to be bad in a given context.\n", "\n", "For more information and research on this theoretically optimal exploration algorithm see this [paper](http://arxiv.org/abs/1402.0555).\n", "\n", "### Input format for `--cb_explore_adf`\n", "\n", "```text\n", "--cb_explore_adf\n", "```\n", "\n", "The command `--cb_explore_adf` is different from the other two example cases because the action set changes over time (or we have rich information for each action):\n", "\n", "- Each example now spans multiple lines, with one line per action\n", "- For each action, we have the label information (action, cost, probability), if known.\n", "- The action field **a** is ignored now since line numbers identify actions and typically set to the 0.\n", "- The semantics of cost and probability are the same as before.\n", "- Each example is also allowed to specify the label information on precisely one action.\n", "- A new line signals end of a multiline example.\n", "\n", "It best to create features for every (context, action) pair rather than features associated only with context and shared across all actions.\n", "\n", ">**Note:** This format explores the action space so you must specify which algorithm you want to use for exploration.\n", "\n", "### Shared contextual features\n", "\n", "You can specify contextual features which share all line actions at the beginning of an example, which always has a `shared` label, as in the second multiline example below.\n", "\n", "Since the shared line is not associated with any action, it should never contain the label information.\n", "\n", "Sample data file **train.dat** with two examples:\n", "\n", "```text\n", "| a:1 b:0.5\n", "0:0.1:0.75 | a:0.5 b:1 c:2\n", "\n", "shared | s_1 s_2\n", "0:1.0:0.5 | a:1 b:1 c:1\n", "| a:0.5 b:2 c:1\n", "```\n", "In the first example, we have two actions, one line for each. The first line represents the first action, and it has two action dependent features **a** and **b**.\n", "\n", "```text\n", "| a:1 b:0.5\n", "```\n", "\n", "The second line represents the second action, and it has three action dependent features **a**, **b**, and **c**.\n", "\n", "```text\n", "0:0.1:0.75 | a:0.5 b:1 c:2\n", "```\n", "\n", "If the second action is the chosen action it follows the following format:\n", "\n", "```text\n", "action:cost:probability | features\n", "0:0.1:0.75 |\n", "```\n", "Action 0 is ignored, has cost 0.1 and a probability of 0.75.\n", "\n", "#### Usage\n", "\n", "In the case of the softmax explorer, which uses the policy not only to predict an action but also predict a score indicating the quality of each action. The probability of action **a** creates distribution proportional to **exp(lambda * score(x, a))**.\n", "\n", "```text\n", "vw -d train_adf.dat --cb_explore_adf\n", "```\n", "\n", "```text\n", "vw -d train.dat --cb_explore_adf --first 2\n", "```\n", "\n", "```text\n", "vw -d train.dat --cb_explore_adf --epsilon 0.1\n", "```\n", "\n", "```text\n", "vw -d train.dat --cb_explore_adf --bag 5\n", "```\n", "\n", "```text\n", "vw -d train.dat --cb_explore_adf --softmax --lambda 10\n", "```\n", "\n", "Here **lambda** is a parameter, which leads to uniform exploration for **lambda = 0**, and stops exploring as **lambda** approaches infinity. In general, this provides an excellent knob for controlled exploration based on the uncertainty in the learned policy.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create contextual bandit data\n", "\n", "Begin by loading the required Python packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, generate some sample training data that could originate from previous random trial (for example A/B test) for the contextual bandit to explore:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = [\n", " {\n", " \"action\": 1,\n", " \"cost\": 2,\n", " \"probability\": 0.4,\n", " \"feature1\": \"a\",\n", " \"feature2\": \"c\",\n", " \"feature3\": \"\",\n", " },\n", " {\n", " \"action\": 3,\n", " \"cost\": 0,\n", " \"probability\": 0.2,\n", " \"feature1\": \"b\",\n", " \"feature2\": \"d\",\n", " \"feature3\": \"\",\n", " },\n", " {\n", " \"action\": 4,\n", " \"cost\": 1,\n", " \"probability\": 0.5,\n", " \"feature1\": \"a\",\n", " \"feature2\": \"b\",\n", " \"feature3\": \"\",\n", " },\n", " {\n", " \"action\": 2,\n", " \"cost\": 1,\n", " \"probability\": 0.3,\n", " \"feature1\": \"a\",\n", " \"feature2\": \"b\",\n", " \"feature3\": \"c\",\n", " },\n", " {\n", " \"action\": 3,\n", " \"cost\": 1,\n", " \"probability\": 0.7,\n", " \"feature1\": \"a\",\n", " \"feature2\": \"d\",\n", " \"feature3\": \"\",\n", " },\n", "]\n", "\n", "train_df = pd.DataFrame(train_data)\n", "\n", "# Add index to data frame\n", "train_df[\"index\"] = range(1, len(train_df) + 1)\n", "train_df = train_df.set_index(\"index\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">**Note:** The data here is equivalent to the [VW wiki example](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Logged-Contextual-Bandit-Example).\n", "\n", "Next, create data for the contextual bandit to exploit to make decisions (for example features describing new users):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data = [\n", " {\"feature1\": \"b\", \"feature2\": \"c\", \"feature3\": \"\"},\n", " {\"feature1\": \"a\", \"feature2\": \"\", \"feature3\": \"b\"},\n", " {\"feature1\": \"b\", \"feature2\": \"b\", \"feature3\": \"\"},\n", " {\"feature1\": \"a\", \"feature2\": \"\", \"feature3\": \"b\"},\n", "]\n", "\n", "test_df = pd.DataFrame(test_data)\n", "\n", "# Add index to data frame\n", "test_df[\"index\"] = range(1, len(test_df) + 1)\n", "test_df = test_df.set_index(\"index\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your dataframes are:\n", "train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contextual bandits Python tutorial\n", "\n", "First, create the Python model store the model parameters in the Python `vw` object.\n", "\n", "Use the following command for a contextual bandit with four possible actions:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import vowpalwabbit\n", "\n", "vw = vowpalwabbit.Workspace(\"--cb 4\", quiet=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", ">**Note:** Use `--quiet` command to turn off diagnostic information in Vowpal Wabbit.\n", "\n", "Now, call learn for each trained example on your Vowpal Wabbit model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in train_df.index:\n", " action = train_df.loc[i, \"action\"]\n", " cost = train_df.loc[i, \"cost\"]\n", " probability = train_df.loc[i, \"probability\"]\n", " feature1 = train_df.loc[i, \"feature1\"]\n", " feature2 = train_df.loc[i, \"feature2\"]\n", " feature3 = train_df.loc[i, \"feature3\"]\n", "\n", " # Construct the example in the required vw format.\n", " learn_example = (\n", " str(action)\n", " + \":\"\n", " + str(cost)\n", " + \":\"\n", " + str(probability)\n", " + \" | \"\n", " + str(feature1)\n", " + \" \"\n", " + str(feature2)\n", " + \" \"\n", " + str(feature3)\n", " )\n", "\n", " # Here we do the actual learning.\n", " vw.learn(learn_example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the model that was just trained on the train set to perform predictions on the test set. Construct the example like before but don't include the label and pass it into **predict** instead of **learn**. For example:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for j in test_df.index:\n", " feature1 = test_df.loc[j, \"feature1\"]\n", " feature2 = test_df.loc[j, \"feature2\"]\n", " feature3 = test_df.loc[j, \"feature3\"]\n", "\n", " test_example = \"| \" + str(feature1) + \" \" + str(feature2) + \" \" + str(feature3)\n", "\n", " choice = vw.predict(test_example)\n", " print(j, choice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">**Note:** The contextual bandit assigns every instance to the third action as it should per the cost structure of the train data. You can save and load the model you train from a file.\n", "\n", "Finally, experiment with the cost structure to see that the contextual bandit updates its predictions accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vw.save(\"cb.model\")\n", "del vw\n", "\n", "vw = vowpalwabbit.Workspace(\"--cb 4 -i cb.model\", quiet=True)\n", "print(vw.predict(\"| a b\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `-i` argument means input regressor, telling Vowpal Wabbit to load a model from that file instead of starting from scratch.\n", "\n", "## More to explore\n", "\n", "- Review the [example Python notebooks](https://vowpalwabbit.org/docs/vowpal_wabbit/python/latest/examples/).\n", "- Explore the [tutorials section of the GitHub wiki](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Tutorial#more-tutorials).\n", "- Browse [examples on the GitHub wiki](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Examples).\n", "- Learn various [VW commands](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 4 }