{ "cells": [ { "cell_type": "markdown", "metadata": { "_cell_guid": "403f8973-1fb5-422c-bd82-fb5d7041d8b2", "_uuid": "091313c11c12aaeef8bc25b39a205f79bb2588a6", "collapsed": true }, "source": [ "
\n", "\n", "## Open Machine Learning Course\n", "
Author: [Yury Kashnitsky](https://yorko.github.io/). Translated and edited by [Serge Oreshkov](https://www.linkedin.com/in/sergeoreshkov/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/).\n", "\n", "This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.\n", "\n", "#
Topic 8. Vowpal Wabbit: Learning with Gigabytes of Data\n", "This week, we’ll cover two reasons for Vowpal Wabbit’s exceptional training speed, namely, online learning and the hashing trick, in both theory and practice. We will try it out with news, movie reviews, and StackOverflow questions." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a783ea16-a0be-4a81-9d91-72c9ada412d0", "_uuid": "aba8923437f1bb67ff9615f218dbe3cb19145f4e" }, "source": [ "# Outline\n", "1. Stochastic gradient descent and online learning\n", " - 1.1. SGD\n", " - 1.2. Online approach to learning\n", "2. Categorical data processing: Label Encoding, One-Hot Encoding, Hashing trick\n", " - 2.1. Label Encoding\n", " - 2.2. One-Hot Encoding\n", " - 2.3. Hashing trick\n", "3. Vowpal Wabbit\n", " - 3.1. News. Binary classification\n", " - 3.2. News. Multiclass classification\n", " - 3.3. IMDB reviews\n", " - 3.4. Classifying gigabytes of StackOverflow questions\n", "4. VW and Spooky Author Identification " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "_cell_guid": "64efb6ef-3591-4036-8461-b7b467f404b7", "_uuid": "bd1f11458028d021571d4a77e3de846bf0dee992" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import os\n", "import re\n", "import numpy as np\n", "import pandas as pd\n", "from tqdm import tqdm_notebook\n", "from sklearn.datasets import fetch_20newsgroups, load_files\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report, accuracy_score, log_loss\n", "from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix\n", "from scipy.sparse import csr_matrix\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ddb2276a-dd17-4229-8a0b-1860056dc08a", 
"_uuid": "8c3581ddac23e8621f7688aec41bbce23e91bc6c" }, "source": [ "# 1. Stochastic gradient descent and online learning\n", "## 1.1. Stochastic gradient descent\n", "\n", "Despite the fact that gradient descent is one of the first things learned in machine learning and optimization courses, it is one of its modifications, Stochastic Gradient Descent (SGD), that is hard to top.\n", "\n", "Recall that the idea of gradient descent is to minimize some function by making small steps in the direction of the fastest decrease. The method takes its name from the following fact from calculus: the vector $\\nabla f = (\\frac{\\partial f}{\\partial x_1}, \\ldots \\frac{\\partial f}{\\partial x_n})^\\text{T}$ of partial derivatives of the function $f(x) = f(x_1, \\ldots x_n)$ points in the direction of the fastest growth of the function. It means that, by moving in the opposite direction (antigradient), it is possible to decrease the function value at the fastest rate.\n", "\n", "\n", "\n", "Here is a snowboarder (me) in Sheregesh, Russia's most popular winter resort. (I highly recommend it if you like skiing or snowboarding.) In addition to advertising the beautiful landscapes, this picture depicts the idea of gradient descent. If you want to ride as fast as possible, you need to choose the path of steepest descent. Calculating antigradients can be seen as evaluating the slope at various spots." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "eda14ceb-ab19-4c75-a1ad-bbbbf3acb83d", "_uuid": "7c41b31a7549e4dbcd0c31824d425d18ec6b9fbb" }, "source": [ "**Example**\n", "\n", "The paired regression problem can be solved with gradient descent. Let us predict one variable using another: height from weight. Assume that these variables are linearly dependent. We will use the [SOCR](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data) dataset. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "_cell_guid": "84022022-7ce2-4fac-90ae-3cefc5a12961", "_uuid": "0873cf8a28158f2ebdd5b3b42d3026efcd8e6b58" }, "outputs": [], "source": [ "PATH_TO_ALL_DATA = '../../data/'\n", "data_demo = pd.read_csv(os.path.join(PATH_TO_ALL_DATA,\n", " 'weights_heights.csv'))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "_cell_guid": "094aa0fc-8af4-4a13-88a7-4eedf53fd712", "_uuid": "a60148fd610f423b517a38dad8768f0f6c76d192" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(data_demo['Weight'], data_demo['Height']);\n", "plt.xlabel('Weight in lb')\n", "plt.ylabel('Height in inches');" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "1140d0cf-9aef-4d4b-8d74-86795ef71737", "_uuid": "ad436575e565a80eecc4a9fde161065ff1f22f9f" }, "source": [ "Here we have a vector $x$ of dimension $\\ell$ (weight of every person i.e. training sample) and $y$, a vector containing the height of every person in the dataset. 
\n", "\n", "The task is the following: find weights $w_0$ and $w_1$ such that predicting height as $y_i = w_0 + w_1 x_i$ (where $y_i$ is the $i$-th height value and $x_i$ is the $i$-th weight value) minimizes the squared error (equivalently, the mean squared error, since the constant factor $\\frac{1}{\\ell}$ does not change the minimizer):\n", "$$SE(w_0, w_1) = \\frac{1}{2}\\sum_{i=1}^\\ell(y_i - (w_0 + w_1x_{i}))^2 \\rightarrow \\min_{w_0,w_1}$$\n", "\n", "We will use gradient descent, utilizing the partial derivatives of $SE(w_0, w_1)$ with respect to the weights $w_0$ and $w_1$.\n", "An iterative training procedure is then defined by simple update formulas (we change the model weights in small steps, proportional to a small constant $\\eta$, in the direction of the antigradient of the function $SE(w_0, w_1)$):\n", "\n", "$$\\begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_0} |_{t} \\\\ w_1^{(t+1)} = w_1^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_1} |_{t} \\end{array}$$\n", "\n", "Computing the partial derivatives, we get the following: \n", "\n", "$$\\begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \\eta \\sum_{i=1}^{\\ell}(y_i - w_0^{(t)} - w_1^{(t)}x_i) \\\\ w_1^{(t+1)} = w_1^{(t)} + \\eta \\sum_{i=1}^{\\ell}(y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \\end{array}$$\n", "\n", "This math works quite well as long as the amount of data is not large (we will not discuss issues with local minima, saddle points, choosing the learning rate, momentum, and so on; these topics are covered very thoroughly in [the Numerical Computation chapter](http://www.deeplearningbook.org/contents/numerical.html) of \"Deep Learning\"). \n", "There is an issue with batch gradient descent: evaluating the gradient requires summing a term for every object in the training set. In other words, the algorithm requires a lot of iterations, and every iteration recomputes the weights with a formula that contains a sum $\\sum_{i=1}^\\ell$ over the whole training set. 
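
As a rough sketch of the batch update formulas above (this is not the course's own code): the function name `batch_gradient_descent`, the toy data, and the values of `eta` and `n_iter` are all assumptions chosen for illustration. Note that every single iteration touches the whole training set.

```python
import numpy as np


def batch_gradient_descent(x, y, eta=0.01, n_iter=1000):
    # Fit y ~ w0 + w1 * x by batch gradient descent on SE(w0, w1).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w0, w1 = 0.0, 0.0
    for _ in range(n_iter):
        # Residuals y_i - w0 - w1 * x_i for ALL samples (the full sum).
        residuals = y - (w0 + w1 * x)
        grad_w0 = -residuals.sum()        # dSE/dw0
        grad_w1 = -(residuals * x).sum()  # dSE/dw1
        # Step against the gradient, updating both weights simultaneously.
        w0, w1 = w0 - eta * grad_w0, w1 - eta * grad_w1
    return w0, w1


# Toy data made up for illustration: y = 1 + 2 * x with no noise.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy
w0, w1 = batch_gradient_descent(x_toy, y_toy, eta=0.05, n_iter=2000)
print(w0, w1)  # the estimates should be close to 1 and 2
```

The step size matters here: with unscaled features such as weight in pounds, a much smaller `eta` (or feature scaling) would likely be needed to keep the updates from diverging.
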
What happens when we have billions of training samples?\n", "\n", "\n", "\n", "Hence the motivation for stochastic gradient descent! Simply put, we throw away the summation sign and update the weights using a single training sample (or a small batch of them) at a time. In our case, we have the following:\n", "\n", "$$\\begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i) \\\\ w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \\end{array}$$\n", "\n", "With this approach, there is no guarantee that we will move in the best possible direction at every iteration. Therefore, we may need many more iterations, but we get much faster weight updates." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "dc9f67f3-4110-40b6-b7bd-a2736233d10f", "_uuid": "7d61f90cd77f5fc64646a22cdcaaa3e9ee012ebc" }, "source": [ "Andrew Ng has a good illustration of this in his [machine learning course](https://www.coursera.org/learn/machine-learning). Let's take a look.\n", "\n", "\n", "\n", "These are the contour plots for some function, and we want to find the global minimum of this function. The red curve shows weight changes (in this picture, $\\theta_0$ and $\\theta_1$ correspond to our $w_0$ and $w_1$). According to the properties of a gradient, the direction of change at every point is orthogonal to the contour lines. With stochastic gradient descent, weights change in a less predictable manner, and it may even seem that some steps are wrong because they lead away from the minimum; however, both procedures arrive at (approximately) the same solution." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "3583f288-f80e-4853-82d6-d3bdc5da50f5", "_uuid": "fbf630784df546a50ad03c698537e2b0980057a0" }, "source": [ "## 1.2. 
Online approach to learning\n", "Stochastic gradient descent gives us practical guidance for training both classifiers and regressors with large amounts of data up to hundreds of GBs (depending on computational resources).\n", "\n", "Considering the case of paired regression, we can store the training data set $(X,y)$ on a hard drive without loading it into RAM (where it simply won't fit), read objects one by one, and update the weights of our model:\n", "\n", "$$\\begin{array}{rcl} w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i) \\\\ w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i \\end{array}$$\n", "\n", "After working through the whole training dataset, our loss function (for example, squared error in regression or logistic loss in classification) will decrease, but it usually takes dozens of passes over the training set to make the loss small enough. \n", "\n", "This approach to learning is called **online learning**, and this name emerged even before machine learning MOOCs turned mainstream.\n", "\n", "We did not discuss many specifics about SGD here. If you want to dive into the theory, I highly recommend [\"Convex Optimization\" by Stephen Boyd](https://www.amazon.com/Convex-Optimization-Stephen-Boyd/dp/0521833787). Now, we will introduce the Vowpal Wabbit library, which is good for training simple models with huge data sets thanks to stochastic optimization and another trick, feature hashing." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "abaadfeb-a49c-4fc9-8cc4-c7739e4f9edc", "_uuid": "23782a78c5bac799ed50cc994f3768163fc70a7c" }, "source": [ "In scikit-learn, classifiers and regressors trained with SGD are named `SGDClassifier` and `SGDRegressor` in `sklearn.linear_model`. These are nice implementations of SGD, but we'll focus on VW since it is more performant than sklearn's SGD models in many respects." 
] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "9bb0708a-2e81-4040-816b-9dee68595d3f", "_uuid": "56b085b90c338a0e436544c2c5345d7a2716f612" }, "source": [ "# 2. Categorical feature processing: Label Encoding, One-Hot Encoding, and Hashing trick\n", "\n", "## 2.1. Label Encoding\n", "Many classification and regression algorithms operate in Euclidean or metric space, implying that data is represented with vectors of real numbers. However, in real data, we often have categorical features with discrete values such as yes/no or January/February/.../December. We will see how to process this kind of data, particularly with linear models, and how to deal with many categorical features even when they have many unique values." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "44ebb1ca-8d55-47a1-bfa1-0c1bee8ce99f", "_uuid": "29058442a542088bb018bcf8158fcbbf4e79cef8" }, "source": [ "Let's explore the [UCI bank marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) where most of the features are categorical." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_cell_guid": "be6b51fd-ad36-4265-9086-fde116c6d5e4", "_uuid": "c8f015ed0ce62cdfc232529b9db5e17f1ebf55f5" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age job marital education default housing loan \\\n", "0 26 student single high.school no no no \n", "1 46 admin. married university.degree no yes no \n", "2 49 blue-collar married basic.4y unknown yes yes \n", "3 31 technician married university.degree no no no \n", "4 42 housemaid married university.degree no yes no \n", "\n", " contact month day_of_week duration campaign pdays previous \\\n", "0 telephone jun mon 901 1 999 0 \n", "1 cellular aug tue 208 2 999 0 \n", "2 telephone jun tue 131 5 999 0 \n", "3 cellular jul tue 404 1 999 0 \n", "4 telephone nov mon 85 1 999 0 \n", "\n", " poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m \\\n", "0 nonexistent 1.4 94.465 -41.8 4.961 \n", "1 nonexistent 1.4 93.444 -36.1 4.963 \n", "2 nonexistent 1.4 94.465 -41.8 4.864 \n", "3 nonexistent -2.9 92.469 -33.6 1.044 \n", "4 nonexistent -0.1 93.200 -42.0 4.191 \n", "\n", " nr.employed \n", "0 5228.1 \n", "1 5228.1 \n", "2 5228.1 \n", "3 5076.2 \n", "4 5195.8 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(os.path.join(PATH_TO_ALL_DATA, 'bank_train.csv'))\n", "labels = pd.read_csv(os.path.join(PATH_TO_ALL_DATA,\n", " 'bank_train_target.csv'), \n", " header=None)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "d81909b1-6b0f-47f5-9d10-f983fe1edcc7", "_uuid": "d900d6aa25e49a150388a9618b8f36bb76ac8667" }, "source": [ "We can see that most of the features are not represented by numbers. This poses a problem because we cannot use most machine learning methods (at least those implemented in scikit-learn) out-of-the-box.\n", "\n", "Let's dive into the \"education\" feature." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "_cell_guid": "d5cb0f31-1e52-49a0-9bdb-ae3207541710", "_uuid": "136b05d444ff42d4e8521e51066bf9cd22c023fa" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df['education'].value_counts().plot.barh();" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ca94e540-92a0-41bb-a67b-906275c778e9", "_uuid": "284a387b80dba61981db40dabd3fc2f895616011" }, "source": [ "The most straightforward solution is to map each value of this feature into a unique number. For example, we can map `university.degree` to 0, `basic.9y` to 1, and so on. You can use `sklearn.preprocessing.LabelEncoder` to perform this mapping." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "_cell_guid": "fb569cfb-711b-428d-9a29-5393dff4b348", "_uuid": "05c9087139ca3fcdf3b715b9b30f566bdaf0ad1c" }, "outputs": [], "source": [ "label_encoder = LabelEncoder()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "e75f24ed-b8ed-41dd-96c2-dcd1f42d7014", "_uuid": "276ccbe10a9744f1cca107fb5b7542a314b961fa" }, "source": [ "The `fit` method of this class finds all unique values and builds the actual mapping between categories and numbers, and the `transform` method converts the categories into numbers. After `fit` is executed, `label_encoder` will have the `classes_` attribute with all unique values of the feature. Let us count them to make sure the transformation was correct." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "_cell_guid": "c692cf31-3233-4b92-8bb9-de7d7be15315", "_uuid": "efe26525483e1c3ba137e8f1c55f92c76ab4c799" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{0: 'basic.4y', 1: 'basic.6y', 2: 'basic.9y', 3: 'high.school', 4: 'illiterate', 5: 'professional.course', 6: 'university.degree', 7: 'unknown'}\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWkAAAD3CAYAAADfYKXJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAADiBJREFUeJzt3X+MZfVZx/H3LLuw0sxaTC9VIgWj5TH2D2iLgQKFDRGh6EKtWk1TLJC2oakRDCldCNTQUBMtq7EWLALLYqVRC6UUImUTBClUioWa0CjPBoqpSTXiCl1+tbjL+Me5A9PZuXMPd8+Z++zm/UpI9t4cznzy3Tmf+93vOfecmbm5OSRJNa2adgBJ0miWtCQVZklLUmGWtCQVZklLUmGru97hzp275p5++oWud9uJgw46ELNNpnI+s03GbJPrI99gMDuz1Pudz6RXr96v6112xmyTq5zPbJMx2+RWMp/LHZJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUWKuvhUfEwcDDwCmZ+Vi/kSRJ88bOpCNiDXAN8GL/cSRJC7VZ7rgS+BzwvZ6zSJIWmVnuGYcRcTbw05l5RUTcC5zXYrnDhyZK0mu35F3wxpX0fTSlOwccBWwDzsjM/1rmB8099dSze5CzP4PBLGabTOV8ZpuM2SbXR75Rtypd9sRhZp44/+cFM+nlClqS1CEvwZOkwlo/mSUz1/eYQ5K0BGfSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklRY6/tJt7Xhwtu63mXvNm88edoRJGlJzqQlqTBLWpIKs6QlqbCxa9IRcTZw9vDlWuAo4Ccz85n+YkmSoEVJZ+YWYAtARFwFbLagJWlltF7uiIijgbdk5l/2mEeStMBruQTvEuDyvoJM02AwO+0IQJ0co1TOZ7bJmG1yK5WvVUlHxOuByMx7es4zFU899ey0IzAYzJbIMUrlfGabjNkm10e+UaXfdrnjRODuztJIklppW9IBfKfPIJKk3bVa7sjMT/cdRJK0O7/MIkmFWdKSVFjnd8G7fdOZZc/KVj9jLEmLOZOWpMIsaUkqzJKWpMIsaUkqzJKWpMIsaUkqzJKWpMIsaUkqzJKWpMIsaUkqzJKWpMIsaUkqzJKWpMI6vwveh/7+ka53Wdof/uKbpx1B0j7MmbQkFWZJS1JhlrQkFdaqpCPimIi4t+cskqRFxp44jIiLgLOA5/uPI0laqM3VHU8A7wE+33OWvdJgMFtyX32onM9skzHb5FYq39iSzsxbIuLwFciyV+rqwbbVH5JbOZ/ZJmO2yfWRb1Tpe+JQkgqzpCWpMEtakgpr9bXwzPx34Nh+o0iSFnMmLUmFdX6DpWtPf1vZs7LVzxhL0mLOpCWpMEtakgqzpCWpMEtakgqzpCWpMEtakgqzpCWpMEtakgqzpCWpMEtakgqzpCWpMEtakgqzpCWpsM7vgvfw1o91vcvOfHfaAZYxzWxveusnpvjTJS3HmbQ
kFWZJS1JhlrQkFTZ2TToiVgFXA0cCPwQ+mJmP9x1MktRuJv1uYG1mvgPYCGzqN5IkaV6bqztOAL4KkJkPRsTR/UbSShsMZjvdbhrMNhmzTW6l8rUp6XXA9xe83hURqzNzZ0+ZtMLaPJy38kN8zTYZs02uj3yjSr/NcscOYOH/vcqClqSV0aakHwBOB4iIY4FHe00kSXpFm+WOW4FTIuLrwAxwTr+RJEnzxpZ0Zr4MnLcCWSRJi/hlFkkqrPMbLL39lz9d9qxs5TPGlbNJmh5n0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUWOd3wfvkhbd3vUvtYz6ycf20I0h7DWfSklSYJS1JhVnSklRYqzXpiHgE2DF8+WRm+jBaSVoBY0s6ItYCM5m5vv84kqSF2sykjwQOjIitw+0vycwH+40lSYJ2Jf0CcCVwHfBm4M6IiMzc2Wsy7bMGg9m9Yp9dMdtkKmeDlcvXpqS3AY9n5hywLSK2Az8F/EevybTP6vqp6JWftG62yVTOBv3kG1X6ba7uOBfYBBARhwDrgP/sLJkkaaQ2M+nrgS0RcT8wB5zrUockrYyxJZ2ZLwHvW4EskqRF/DKLJBXW+Q2WPrFpQ9kF/8onIypng/r5pH2VM2lJKsySlqTCLGlJKsySlqTCLGlJKsySlqTCLGlJKsySlqTCLGlJKsySlqTCLGlJKsySlqTCLGlJKqzzu+A9cOavd73LzmybdoBlVM4GtfPNZzviui3TjCH1wpm0JBVmSUtSYZa0JBW27Jp0RKwBNgOHAwcAV2TmV1YglySJ8TPp9wPbM/OdwGnAZ/uPJEmaN+7qji8CNw//PAPs7DeOJGmhZUs6M58DiIhZmrK+dCVCSZMYDGanHWFJVXOB2fbESuUbe510RBwK3ApcnZlf6D+SNJmKTzOv/JR1s02uj3yjSn/cicM3AluB383MuztNJEkaa9xM+hLgIOCyiLhs+N67MvPFfmNJkmD8mvT5wPkrlEWStIhfZpGkwixpSSqs87vgHX/bLWXPylY+Y1w5G9TOVzmbtKecSUtSYZa0JBVmSUtSYZa0JBVmSUtSYZa0JBVmSUtSYZa0JBVmSUtSYZa0JBVmSUtSYZa0JBXW+Q2W3vu3H+l6l5KKuurkP552hH2eM2lJKsySlqTCLGlJKmzsmnRE7AdcCwQwB5yXmd/uO5gkqd1MegNAZh4PXAp8qtdEkqRXjC3pzPwy8OHhy8OAZ3pNJEl6RatL8DJzZ0TcCPwa8Bv9RpK0txgMZvfKfXdhpfK1vk46Mz8QER8HvhERv5CZz/eYS9JeoK8HAFd/uHAf+UaV/tjljog4KyIuHr58AXh5+J8kqWdtZtJfAm6IiPuANcAFmfliv7EkSdCipIfLGu9dgSySpEX8MoskFWZJS1Jhnd8F7+9+6y/KnpWtfMa4cjaonc9sk6mcTa9yJi1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklSYJS1JhVnSklRY5zdY2nDhbV3vUpLK27zx5F7260xakgqzpCWpMEtakgqzpCWpsFYnDiPiYuAMYH/g6sy8vtdUkiSgxUw6ItYDxwHHAycBh/acSZI01GYmfSrwKHArsA74WK+JJGkvNBjM9rLfNiX9BuAw4FeBnwG+EhE/n5lzvSSSpL3Qnj7Ud1TJtynp7cBjmfkSkBHxA2AA/PceJZIkjdXm6o77gdMiYiYiDgFeR1PckqSejS3pzLwD+BbwEHA78NHM3NV3MElSy0vwMvOivoNIknbnl1kkqbCZubnOL9KY29OznH0ZDGb3+AxsXypng9r5zDYZs02uj3yDwezMUu87k5akwixpSSrMkpakwixpSSrMkpakwixpSSqsj0vwJEkdcSYtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYW1uun/OBGxCrgaOBL4IfD
BzHy8i32/hgzHAH+Umesj4ueALcAc8G2ap8m8HBF/APwKsBO4IDMfGrVth7nWAJuBw4EDgCuAf62QLyL2A64FYrj/84AfVMi2IOPBwMPAKcOfXSJbRDwC7Bi+fBK4BvizYYatmXn5qOMiIo5dvG1XuRbkuxg4A9h/mOEfKTB2EXE2cPbw5VrgKGA9BcZueKzeSHOs7gI+RIHfua5m0u8G1mbmO4CNwKaO9ttKRFwEXEfzlw7wJ8ClmflOYAY4MyLeBpwEHAP8NnDVqG07jvd+YPtw/6cBny2UbwNAZh4PXAp8qlC2+YPmGuDFUT9vGtkiYi0wk5nrh/+dA3wOeB9wAnBMRLyV0cfFUtt2JiLWA8cBx9OMzaEUGbvM3DI/bjQfvr9HnbE7HVidmccBn6TI8dBVSZ8AfBUgMx8Eju5ov209Abxnweu308wcAO4Efokm49bMnMvM7wKrI2IwYtsufRG4bPjnGZpP3hL5MvPLwIeHLw8DnqmSbehKmoPye8PXVbIdCRwYEVsj4h8i4kTggMx8IjPngLsWZPuR4yIi1o3YtkunAo8Ct9I8l/QO6owdABFxNPAW4G+oM3bbaMZgFbAO+D8KjFtXJb0O+P6C17siopOllDYy8xaaAZ03M/xLBHgW+HF2zzj//lLbdpntucx8NiJmgZtpZqyV8u2MiBuBPwduqpJt+M/ipzLzrgVvl8gGvEDzAXIqzRLRDcP3FmfY7bgYvrdjiW279AaaidJvDvPdBKwqMnbzLgEuZ/R4TGPsnqNZ6niMZhnwMxT4neuqpHcAswv3m5k7O9r3JBauA83SzBAXZ5x/f6ltOxURhwL3AJ/PzC9Uy5eZHwCOoPnF/LEi2c4FTomIe2nWLf8KOLhItm3AXw9nUttoDtifaJFt1TJ5u7QduCszX8rMpDnPsLAwpvo7FxGvByIz71kmwzTG7vdpxu0Imn8t3Uizpj8uW6/j1lVJP0CznsNwYf/RjvY7qW8N1+UA3gV8jSbjqRGxKiLeRPNB8j8jtu1MRLwR2Ap8PDM3V8oXEWcNTzBBMxN8GfhmhWyZeWJmnjRcu/wX4HeAOytko/kA2QQQEYcABwLPR8TPRsQMzQx7PtuPHBeZuQN4aYltu3Q/cFpEzAzzvQ64u8jYAZwI3A2wzHhMY+ye5tUZ8v8CayhwrHa1JHErzazn6zTrrud0tN9JXQhcGxH7A/8G3JyZuyLia8A/0Xw4fXTUth1nuQQ4CLgsIubXps8HPlMg35eAGyLiPppfyAuGP6PK2C1W5e/1emBLRNxPcyb/XJoPuJuA/WjWK78REf/M0sfFeYu37TAbmXnHcJ38IV4dkyepMXbQXE30nQWvdxuPKY3dnwKbh2OyP82x+02mPG7eqlSSCvPLLJJUmCUtSYVZ0pJUmCUtSYVZ0pJUmCUtSYVZ0pJU2P8DasLQJysna+gAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mapped_education = pd.Series(label_encoder.fit_transform(\n", " df['education']))\n", "mapped_education.value_counts().plot.barh()\n", "print(dict(enumerate(label_encoder.classes_)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "_cell_guid": "4658ebf0-6610-4c6f-88d0-3b1027a5c212", "_uuid": "68fba174918a7c498100b8bb2c4f9822a4c9a659" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age job marital education default housing loan contact \\\n", "0 26 student single 3 no no no telephone \n", "1 46 admin. married 6 no yes no cellular \n", "2 49 blue-collar married 0 unknown yes yes telephone \n", "3 31 technician married 6 no no no cellular \n", "4 42 housemaid married 6 no yes no telephone \n", "\n", " month day_of_week duration campaign pdays previous poutcome \\\n", "0 jun mon 901 1 999 0 nonexistent \n", "1 aug tue 208 2 999 0 nonexistent \n", "2 jun tue 131 5 999 0 nonexistent \n", "3 jul tue 404 1 999 0 nonexistent \n", "4 nov mon 85 1 999 0 nonexistent \n", "\n", " emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed \n", "0 1.4 94.465 -41.8 4.961 5228.1 \n", "1 1.4 93.444 -36.1 4.963 5228.1 \n", "2 1.4 94.465 -41.8 4.864 5228.1 \n", "3 -2.9 92.469 -33.6 1.044 5076.2 \n", "4 -0.1 93.200 -42.0 4.191 5195.8 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['education'] = mapped_education\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ead4109a-0bc3-476c-be13-59c152e0cb1c", "_uuid": "4c02df6d2afbcd5a47d591004fe9a068e95ecb65" }, "source": [ "Let's apply the transformation to other columns of type `object`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "_cell_guid": "f6b6056e-3823-4caf-a617-d99ab490512d", "_uuid": "8079ecf58d4fa138d92c21f46b001cb3ed050a0a" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age job marital education default housing loan contact month \\\n", "0 26 8 2 3 0 0 0 1 4 \n", "1 46 0 1 6 0 2 0 0 1 \n", "2 49 1 1 0 1 2 2 1 4 \n", "3 31 9 1 6 0 0 0 0 3 \n", "4 42 3 1 6 0 2 0 1 7 \n", "\n", " day_of_week duration campaign pdays previous poutcome emp.var.rate \\\n", "0 1 901 1 999 0 1 1.4 \n", "1 3 208 2 999 0 1 1.4 \n", "2 3 131 5 999 0 1 1.4 \n", "3 3 404 1 999 0 1 -2.9 \n", "4 1 85 1 999 0 1 -0.1 \n", "\n", " cons.price.idx cons.conf.idx euribor3m nr.employed \n", "0 94.465 -41.8 4.961 5228.1 \n", "1 93.444 -36.1 4.963 5228.1 \n", "2 94.465 -41.8 4.864 5228.1 \n", "3 92.469 -33.6 1.044 5076.2 \n", "4 93.200 -42.0 4.191 5195.8 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical_columns = df.columns[df.dtypes \n", "                                 == 'object'].union(['education'])\n", "for column in categorical_columns:\n", "    df[column] = label_encoder.fit_transform(df[column])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a79eddaa-26d4-4316-a4d9-ce1a8f024e65", "_uuid": "89931fdf8b27f8e8ba8be53dc33368dc1dc3ad46" }, "source": [ "The main issue with this approach is that we have now introduced some relative ordering where it might not exist. \n", "\n", "For example, we implicitly introduced algebra over the values of the job feature, so we can now subtract the job of client #2 from the job of client #1:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "_cell_guid": "9e85d727-a23c-4fda-859e-9cfa2491b044", "_uuid": "10efbea02831df6e1e8476a8813aedd227f9d4ae" }, "outputs": [ { "data": { "text/plain": [ "-1.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1].job - df.loc[2].job" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "16ea6d8f-8d03-427b-a9f8-f8827cb9b629", "_uuid": "f91a2a86a06fb9b36dd932d4afb08dc7c178983f" }, "source": [ "Does this operation make any sense? Not really. 
Let's try to train logistic regression with this feature transformation." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "_cell_guid": "7dede4d6-e787-4232-a1ac-01191a88ec88", "_uuid": "15c8fdb284d4e5166f3f56c4e9d8f1810186ff7c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.88 1.00 0.94 6104\n", " 1 0.50 0.00 0.00 795\n", "\n", "avg / total 0.84 0.88 0.83 6899\n", "\n" ] } ], "source": [ "def logistic_regression_accuracy_on(dataframe, labels):\n", "    # DataFrame.as_matrix() was removed in pandas 1.0; .values gives the same array\n", "    features = dataframe.values\n", "    train_features, test_features, train_labels, test_labels = \\\n", "        train_test_split(features, labels)\n", "\n", "    logit = LogisticRegression()\n", "    logit.fit(train_features, train_labels)\n", "    return classification_report(test_labels, \n", "                                 logit.predict(test_features))\n", "\n", "print(logistic_regression_accuracy_on(df[categorical_columns], \n", "                                      labels))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "7ac2b754-b7e9-4bed-b90e-d27db1b72b01", "_uuid": "b2982316ff21f4585188cbb6987df0114cbe4b3f" }, "source": [ "We can see that logistic regression almost never predicts class 1 (its recall for this class is 0.00). In order to use linear models with categorical features, we will use a different approach: One-Hot Encoding.\n", "\n", "## 2.2. One-Hot Encoding\n", "\n", "Suppose that some feature can have one of 10 unique values. One-hot encoding creates 10 new features corresponding to these unique values; for each sample, all of them *except one* are zero." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "_cell_guid": "df17447f-6ff8-4342-b755-c856fb1c7bd6", "_uuid": "6ca824f36f66c8e6889688e8f64c2ca6ec7605df" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9\n", "0 0 0 0 0 0 0 1 0 0 0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_example = pd.DataFrame([{i: 0 for i in range(10)}])\n", "one_hot_example.loc[0, 6] = 1\n", "one_hot_example" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f4c2d551-d511-4b6a-a161-e80b167b6607", "_uuid": "ea9c96c96b528b9fda95fb0ebc53089e05de3a58" }, "source": [ "This idea is implemented in the `OneHotEncoder` class from `sklearn.preprocessing`. By default, `OneHotEncoder` returns a sparse matrix to save memory, because most of the values are zeros. However, in this particular example, memory is not a concern, so we are going to use a \"dense\" matrix representation." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "_cell_guid": "29d1f40e-cfb4-4ee0-b9ee-8c779b7c803b", "_uuid": "fa7ad1f51154182f467a5df5dd7d43ec111132b8" }, "outputs": [], "source": [ "onehot_encoder = OneHotEncoder(sparse=False)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "_cell_guid": "0d70a254-f732-40c9-b738-67ca6b8d2d24", "_uuid": "ecfb22058b460d751779df0f9492928a317914de" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 ... 43 44 45 46 \\\n", "0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 \n", "3 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 \n", "4 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 \n", "\n", " 47 48 49 50 51 52 \n", "0 0.0 0.0 0.0 0.0 1.0 0.0 \n", "1 0.0 0.0 0.0 0.0 1.0 0.0 \n", "2 0.0 0.0 0.0 0.0 1.0 0.0 \n", "3 0.0 0.0 0.0 0.0 1.0 0.0 \n", "4 1.0 0.0 0.0 0.0 1.0 0.0 \n", "\n", "[5 rows x 53 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_categorical_columns = \\\n", "pd.DataFrame(onehot_encoder.fit_transform(\n", " df[categorical_columns]))\n", "encoded_categorical_columns.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "9db91f18-2a19-4c54-ae15-34cc7df16216", "_uuid": "0c162f2cc7d813cef43971aec666aaccc2aaa7b8" }, "source": [ "We have 53 columns that correspond to the number of unique values of categorical features in our data set. When transformed with One-Hot Encoding, this data can be used with linear models:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "_cell_guid": "84c611df-8c2e-45dc-a108-aab6bf39a530", "_uuid": "cebd716350bc29dabb07f797db16e693fa92a76a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.90 0.99 0.94 6102\n", " 1 0.67 0.18 0.29 797\n", "\n", "avg / total 0.88 0.90 0.87 6899\n", "\n" ] } ], "source": [ "print(logistic_regression_accuracy_on(encoded_categorical_columns, labels))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "361416b2-4279-434a-98bf-82f79de11f3d", "_uuid": "e7976bb1de18e2f63e78c9e7ec72266ca158b9c1" }, "source": [ "## 2.3. 
Hashing trick\n", "Real data can be volatile: we cannot guarantee that new values of categorical features will never occur, which makes it hard to apply a trained model to new data. Besides that, `LabelEncoder` requires a preliminary pass over the whole dataset and keeps the constructed mappings in memory, which makes it difficult to work with large datasets.\n", "\n", "There is a simple approach to vectorizing categorical data that is based on hashing and is known, not so surprisingly, as the hashing trick.\n", "\n", "Hash functions can help us assign codes (distinct up to collisions) to different feature values, for example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "_cell_guid": "d9eb2e54-18e7-4056-8f28-f22a044596d8", "_uuid": "01064911e1e0af0b3829b0e0be2a965aa3e8eb3e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "university.degree -> -6241459093488141593\n", "high.school -> 7728198035707179500\n", "illiterate -> -7360093633803373451\n" ] } ], "source": [ "for s in ('university.degree', 'high.school', 'illiterate'):\n", "    print(s, '->', hash(s))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "381e8137-cf66-418c-835a-84b5882cfdc2", "_uuid": "b93b5bcf1a9ad4a2dc92704bab63f831ea5e19bf" }, "source": [ "We will not use negative values or values of high magnitude, so we restrict the range of values for the hash function. Note that Python's built-in `hash` is salted per process for strings unless `PYTHONHASHSEED` is set, so the exact codes shown here will differ between runs:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "_cell_guid": "943fc564-80f0-4350-980a-cce6c6a2e7cd", "_uuid": "b92f7eabc498b8339abed0f8fd400bba5750209a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "university.degree -> 7\n", "high.school -> 0\n", "illiterate -> 24\n" ] } ], "source": [ "hash_space = 25\n", "for s in ('university.degree', 'high.school', 'illiterate'):\n", "    print(s, '->', hash(s) % hash_space)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ae21a334-e53f-4cb1-abfc-fa7ed35ba1ae", "_uuid": "6eda4bc3ccce5ca01700bc95878fc4996b7aabaa" }, 
"source": [ "Imagine that our data set contains a single (i.e. not married) student, who received a call on Monday. His feature vectors will be created similarly as in the case of One-Hot Encoding but in the space with fixed range for all features:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "_cell_guid": "c94253b2-3981-4bbd-817f-96c3f660d464", "_uuid": "af62a2ed3770a537bc3168d0a06c6d20b737f887" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "job=student -> 20\n", "marital=single -> 23\n", "day_of_week=mon -> 9\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 \n", "\n", " 19 20 21 22 23 24 \n", "0 0.0 1.0 0.0 0.0 1.0 0.0 \n", "\n", "[1 rows x 25 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hashing_example = pd.DataFrame([{i: 0.0 for i in range(hash_space)}])\n", "for s in ('job=student', 'marital=single', 'day_of_week=mon'):\n", " print(s, '->', hash(s) % hash_space)\n", " hashing_example.loc[0, hash(s) % hash_space] = 1\n", "hashing_example" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "4913a8ad-2ecc-4f50-b887-f19b51afdd01", "_uuid": "51d6391d6c180365161865070a0ebf589d2a344e" }, "source": [ "We want to point out that we hash not only feature values but also pairs of **feature name + feature value**. It is important to do this so that we can distinguish the same values of different features." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "_cell_guid": "48540748-405a-413d-ac0c-5aa0422b9087", "_uuid": "e025f8486b3091090e93c27c082145b5f3e11967" }, "outputs": [], "source": [ "assert hash('no') == hash('no')\n", "assert hash('housing=no') != hash('loan=no')" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "bfb6bdc6-9593-4a8f-a961-24221567c70f", "_uuid": "4ec7bcce511995a7d71a6542bbd71c9086f8f9e0" }, "source": [ "Is it possible to have a collision when using hash codes? Sure, it is possible, but it is a rare case with large enough hashing spaces. Even if collision occurs, regression or classification metrics will not suffer much. In this case, hash collisions work as a form of regularization.\n", "\n", "\n", "\n", "You may be saying \"WTF?\"; hashing seems counterintuitive. This is true, but these heuristics sometimes are, in fact, the only plausible approach to work with categorial data. Moreover, this technique has proven to just work. As you work more with data, you may see this for yourself." 
] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "d2361062-61d9-4dd6-bac2-cbf2bd0ccebb", "_uuid": "1a565b5d7413123af2b5997cd9f74266187effca" }, "source": [ "# 3. Vowpal Wabbit" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "e6a2e0f9-785e-4cf6-9013-b114099daddd", "_uuid": "01e502198f14273eadcab0f10e090ddbaba07b87" }, "source": [ "[Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) (VW) is one of the most widely used machine learning libraries in industry. It is known for its training speed and its support for many training modes; online learning with large, high-dimensional data is one of its major merits. With the hashing trick built in, Vowpal Wabbit is also a good fit for text data.\n", "\n", "The main interface to VW is the shell." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "_cell_guid": "24fd028f-416e-43ff-9530-5c8f47bde176", "_uuid": "4cc17ec6551c7e7dcf224a3b03794720d90cb966" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num weight bits = 18\r\n", "learning rate = 0.5\r\n", "initial_t = 0\r\n", "power_t = 0.5\r\n", "using no cache\r\n", "Reading datafile = \r\n", "num sources = 1\r\n", "\r\n", "\r\n", "VW options:\r\n", " --random_seed arg seed random number generator\r\n", " --ring_size arg size of example ring\r\n", "\r\n", "Update options:\r\n", " -l [ --learning_rate ] arg Set learning rate\r\n", " --power_t arg t power value\r\n", " --decay_learning_rate arg Set Decay factor for learning_rate \r\n", " between passes\r\n", " --initial_t arg initial t value\r\n", " --feature_mask arg Use existing regressor to determine \r\n", " which parameters may be updated. 
If no\r\n", " initial_regressor given, also used for \r\n", " initial weights.\r\n", "\r\n", "Weight options:\r\n", " -i [ --initial_regressor ] arg Initial regressor(s)\r\n", " --initial_weight arg Set all weights to an initial value of \r\n", " arg.\r\n", " --random_weights arg make initial weights random\r\n", " --normal_weights arg make initial weights normal\r\n", " --truncated_normal_weights arg make initial weights truncated normal\r\n", " --sparse_weights Use a sparse datastructure for weights\r\n", " --input_feature_regularizer arg Per feature regularization input file\r\n", "\r\n", "Parallelization options:\r\n", " --span_server arg Location of server for setting up \r\n", " spanning tree\r\n", " --threads Enable multi-threading\r\n", " --unique_id arg (=0) unique id used for cluster parallel \r\n", " jobs\r\n", " --total arg (=1) total number of nodes used in cluster \r\n", " parallel job\r\n", " --node arg (=0) node number in cluster parallel job\r\n", "\r\n", "Diagnostic options:\r\n", " --version Version information\r\n", " -a [ --audit ] print weights of features\r\n", " -P [ --progress ] arg Progress update frequency. int: \r\n", " additive, float: multiplicative\r\n", " --quiet Don't output disgnostics and progress \r\n", " updates\r\n", " -h [ --help ] Look here: http://hunch.net/~vw/ and \r\n", " click on Tutorial.\r\n", "\r\n", "Feature options:\r\n", " --hash arg how to hash the features. Available \r\n", " options: strings, all\r\n", " --ignore arg ignore namespaces beginning with \r\n", " character \r\n", " --ignore_linear arg ignore namespaces beginning with \r\n", " character for linear terms only\r\n", " --keep arg keep namespaces beginning with \r\n", " character \r\n", " --redefine arg redefine namespaces beginning with \r\n", " characters of string S as namespace N. \r\n", " shall be in form 'N:=S' where := \r\n", " is operator. Empty N or S are treated \r\n", " as default namespace. 
Use ':' as a \r\n", " wildcard in S.\r\n", " -b [ --bit_precision ] arg number of bits in the feature table\r\n", " --noconstant Don't add a constant feature\r\n", " -C [ --constant ] arg Set initial value of constant\r\n", " --ngram arg Generate N grams. To generate N grams \r\n", " for a single namespace 'foo', arg \r\n", " should be fN.\r\n", " --skips arg Generate skips in N grams. This in \r\n", " conjunction with the ngram tag can be \r\n", " used to generate generalized \r\n", " n-skip-k-gram. To generate n-skips for \r\n", " a single namespace 'foo', arg should be\r\n", " fN.\r\n", " --feature_limit arg limit to N features. To apply to a \r\n", " single namespace 'foo', arg should be \r\n", " fN\r\n", " --affix arg generate prefixes/suffixes of features;\r\n", " argument '+2a,-3b,+1' means generate \r\n", " 2-char prefixes for namespace a, 3-char\r\n", " suffixes for b and 1 char prefixes for \r\n", " default namespace\r\n", " --spelling arg compute spelling features for a give \r\n", " namespace (use '_' for default \r\n", " namespace)\r\n", " --dictionary arg read a dictionary for additional \r\n", " features (arg either 'x:file' or just \r\n", " 'file')\r\n", " --dictionary_path arg look in this directory for \r\n", " dictionaries; defaults to current \r\n", " directory or env{PATH}\r\n", " --interactions arg Create feature interactions of any \r\n", " level between namespaces.\r\n", " --permutations Use permutations instead of \r\n", " combinations for feature interactions \r\n", " of same namespace.\r\n", " --leave_duplicate_interactions Don't remove interactions with \r\n", " duplicate combinations of namespaces. \r\n", " For ex. 
this is a duplicate: '-q ab -q \r\n", " ba' and a lot more in '-q ::'.\r\n", " -q [ --quadratic ] arg Create and use quadratic features\r\n", " --q: arg : corresponds to a wildcard for all \r\n", " printable characters\r\n", " --cubic arg Create and use cubic features\r\n", "\r\n", "Example options:\r\n", " -t [ --testonly ] Ignore label information and just test\r\n", " --holdout_off no holdout data in multiple passes\r\n", " --holdout_period arg holdout period for test only, default \r\n", " 10\r\n", " --holdout_after arg holdout after n training examples, \r\n", " default off (disables holdout_period)\r\n", " --early_terminate arg Specify the number of passes tolerated \r\n", " when holdout loss doesn't decrease \r\n", " before early termination, default is 3\r\n", " --passes arg Number of Training Passes\r\n", " --initial_pass_length arg initial number of examples per pass\r\n", " --examples arg number of examples to parse\r\n", " --min_prediction arg Smallest prediction to output\r\n", " --max_prediction arg Largest prediction to output\r\n", " --sort_features turn this on to disregard order in \r", "\r\n", " which features have been defined. This \r\n", " will lead to smaller cache sizes\r\n", " --loss_function arg (=squared) Specify the loss function to be used, \r\n", " uses squared by default. Currently \r\n", " available ones are squared, classic, \r\n", " hinge, logistic, quantile and poisson.\r\n", " --quantile_tau arg (=0.5) Parameter \\tau associated with Quantile\r\n", " loss. 
Defaults to 0.5\r\n", " --l1 arg l_1 lambda\r\n", " --l2 arg l_2 lambda\r\n", " --no_bias_regularization arg no bias in regularization\r\n", " --named_labels arg use names for labels (multiclass, etc.)\r\n", " rather than integers, argument \r\n", " specified all possible labels, \r\n", " comma-sep, eg \"--named_labels \r\n", " Noun,Verb,Adj,Punc\"\r\n", "\r\n", "Output model:\r\n", " -f [ --final_regressor ] arg Final regressor\r\n", " --readable_model arg Output human-readable final regressor \r\n", " with numeric features\r\n", " --invert_hash arg Output human-readable final regressor \r\n", " with feature names. Computationally \r\n", " expensive.\r\n", " --save_resume save extra state so learning can be \r\n", " resumed later with new data\r\n", " --preserve_performance_counters reset performance counters when \r\n", " warmstarting\r\n", " --save_per_pass Save the model after every pass over \r\n", " data\r\n", " --output_feature_regularizer_binary arg\r\n", " Per feature regularization output file\r\n", " --output_feature_regularizer_text arg Per feature regularization output file,\r\n", " in text\r\n", " --id arg User supplied ID embedded into the \r\n", " final regressor\r\n", "\r\n", "Output options:\r\n", " -p [ --predictions ] arg File to output predictions to\r\n", " -r [ --raw_predictions ] arg File to output unnormalized predictions\r\n", " to\r\n", "\r\n", "Reduction options, use [option] --help for more info:\r\n", "\r\n", " --audit_regressor arg stores feature names and their \r\n", " regressor values. 
Same dataset must be \r\n", " used for both regressor training and \r\n", " this mode.\r\n", "\r\n", " --search arg Use learning to search, \r\n", " argument=maximum action id or 0 for LDF\r\n", "\r", "\r\n", " --replay_c arg use experience replay at a specified \r\n", " level [b=classification/regression, \r\n", " m=multiclass, c=cost sensitive] with \r\n", " specified buffer size\r\n", "\r\n", " --explore_eval Evaluate explore_eval adf policies\r\n", "\r\n", " --cbify arg Convert multiclass on classes into \r\n", " a contextual bandit problem\r\n", "\r\n", " --cb_explore_adf Online explore-exploit for a contextual\r\n", " bandit problem with multiline action \r\n", " dependent features\r\n", "\r\n", " --cb_explore arg Online explore-exploit for a action\r\n", " contextual bandit problem\r\n", "\r\n", " --multiworld_test arg Evaluate features as a policies\r\n", "\r\n", " --cb_adf Do Contextual Bandit learning with \r\n", " multiline action dependent features.\r\n", "\r\n", " --cb arg Use contextual bandit learning with \r\n", " costs\r\n", "\r\n", " --csoaa_ldf arg Use one-against-all multiclass learning\r\n", " with label dependent features. 
Specify\r\n", " singleline or multiline.\r\n", "\r\n", " --wap_ldf arg Use weighted all-pairs multiclass \r\n", " learning with label dependent features.\r\n", " Specify singleline or multiline.\r\n", "\r\n", " --interact arg Put weights on feature products from \r\n", " namespaces and \r\n", "\r\n", " --csoaa arg One-against-all multiclass with \r\n", " costs\r\n", "\r\n", " --cs_active arg Cost-sensitive active learning with \r\n", " costs\r\n", "\r\n", " --multilabel_oaa arg One-against-all multilabel with \r\n", " labels\r\n", "\r\n", " --classweight arg importance weight multiplier for class\r\n", "\r\n", " --recall_tree arg Use online tree for multiclass\r\n", "\r\n", " --log_multi arg Use online tree for multiclass\r\n", "\r\n", " --ect arg Error correcting tournament with \r\n", " labels\r\n", "\r\n", " --boosting arg Online boosting with weak learners\r\n", "\r\n", " --oaa arg One-against-all multiclass with \r\n", " labels\r\n", "\r\n", " --top arg top k recommendation\r\n", "\r\n", " --replay_m arg use experience replay at a specified \r\n", " level [b=classification/regression, \r\n", " m=multiclass, c=cost sensitive] with \r\n", " specified buffer size\r\n", "\r\n", " --binary report loss as binary classification on\r\n", " -1,1\r\n", "\r\n", " --bootstrap arg k-way bootstrap by online importance \r\n", " resampling\r\n", "\r\n", " --link arg (=identity) Specify the link function: identity, \r\n", " logistic, glf1 or poisson\r\n", "\r\n", " --stage_poly use stagewise polynomial feature \r\n", " learning\r\n", "\r\n", " --lrqfa arg use low rank quadratic features with \r\n", " field aware weights\r\n", "\r\n", " --lrq arg use low rank quadratic features\r\n", "\r\n", " --autolink arg create link function with polynomial d\r\n", "\r\n", " --marginal arg substitute marginal label estimates for\r\n", " ids\r\n", "\r\n", " --new_mf arg rank for reduction-based matrix \r\n", " factorization\r\n", "\r\n", " --nn arg Sigmoidal feedforward network with \r\n", " 
hidden units\r\n", "\r\n", "confidence options:\r\n", " --confidence_after_training Confidence after training\r\n", "\r\n", " --confidence Get confidence for binary predictions\r\n", "\r\n", " --active_cover enable active learning with cover\r\n", "\r\n", " --active enable active learning\r\n", "\r\n", " --replay_b arg use experience replay at a specified \r\n", " level [b=classification/regression, \r\n", " m=multiclass, c=cost sensitive] with \r\n", " specified buffer size\r\n", "\r\n", " --baseline Learn an additive baseline (from \r\n", " constant features) and a residual \r\n", " separately in regression.\r\n", "\r\n", " --OjaNewton Online Newton with Oja's Sketch\r\n", "\r\n", " --bfgs use bfgs optimization\r\n", "\r\n", " --conjugate_gradient use conjugate gradient based \r\n", " optimization\r\n", "\r\n", " --lda arg Run lda with topics\r\n", "\r\n", " --noop do no learning\r\n", "\r\n", " --print print examples\r\n", "\r\n", " --rank arg rank for matrix factorization.\r\n", "\r\n", " --sendto arg send examples to \r\n", "\r\n", " --svrg Streaming Stochastic Variance Reduced \r\n", " Gradient\r\n", "\r\n", " --ftrl FTRL: Follow the Proximal Regularized \r\n", " Leader\r\n", "\r\n", " --pistol FTRL: Parameter-free Stochastic \r\n", " Learning\r\n", "\r\n", " --ksvm kernel svm\r\n", "\r\n", "Gradient Descent options:\r\n", " --sgd use regular stochastic gradient descent\r\n", " update.\r\n", " --adaptive use adaptive, individual learning \r\n", " rates.\r\n", " --adax use adaptive learning rates with x^2 \r\n", " instead of g^2x^2\r\n", " --invariant use safe/importance aware updates.\r\n", " --normalized use per feature normalized updates\r\n", " --sparse_l2 arg (=0) use per feature normalized updates\r\n", "\r\n", "Input options:\r\n", " -d [ --data ] arg Example Set\r\n", " --daemon persistent daemon mode on port 26542\r\n", " --foreground in persistent daemon mode, do not run \r\n", " in the background\r\n", " --port arg port to listen on; use 0 to pick 
unused\r\n", " port\r\n", " --num_children arg number of children for persistent \r\n", " daemon mode\r\n", " --pid_file arg Write pid file in persistent daemon \r\n", " mode\r\n", " --port_file arg Write port used in persistent daemon \r\n", " mode\r\n", " -c [ --cache ] Use a cache. The default is \r\n", " .cache\r\n", " --cache_file arg The location(s) of cache_file.\r\n", " --json Enable JSON parsing.\r\n", " --dsjson Enable Decision Service JSON parsing.\r\n", " -k [ --kill_cache ] do not reuse existing cache: create a \r\n", " new one always\r\n", " --compressed use gzip format whenever possible. If a\r\n", " cache file is being created, this \r\n", " option creates a compressed cache file.\r\n", " A mixture of raw-text & compressed \r\n", " inputs are supported with \r\n", " autodetection.\r\n", " --no_stdin do not default to reading from stdin\r\n", "\r\n" ] } ], "source": [ "!vw --help" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "62f9eb36-c321-43bb-823b-16e0dfa807fd", "_uuid": "e1af0fd07d8be18a75d911a2ca939b8af9839cf2" }, "source": [ "Vowpal Wabbit reads data from files or from the standard input stream (stdin) in the following format:\n", "\n", "`[Label] [Importance] [Tag]|Namespace Features |Namespace Features ... |Namespace Features`\n", "\n", "`Namespace=String[:Value]`\n", "\n", "`Features=(String[:Value] )*`\n", "\n", "Here, [] denotes optional elements, and (...)\\* means that the element may be repeated. \n", "\n", "- **Label** is a number. In the case of classification, it is usually 1 or -1; for regression, it is a real number.\n", "- **Importance** is a number. It denotes the sample weight during training. Setting this helps when working with imbalanced data.\n", "- **Tag** is a string without spaces. It is the \"name\" of the sample that VW saves upon prediction. In order to separate Tag from Importance, it is better to start Tag with the ' character.\n", "- **Namespace** is for creating different feature spaces. 
\n", "- **Features** are object features inside a given **Namespace**. Features have weight 1.0 by default, but this can be changed, for example `feature:0.1`. \n", "\n", "\n", "The following string matches the VW format:\n", "\n", "```\n", "1 1.0 |Subject WHAT car is this |Organization University of Maryland:0.5 College Park\n", "```\n", "\n", "\n", "Let's check the format by running VW with this training sample:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "_cell_guid": "5cbedd43-8d8e-4577-9a67-4f2879f0666c", "_uuid": "250fa571012a53fbe15f9b16bd88244dae4f805a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num weight bits = 18\r\n", "learning rate = 0.5\r\n", "initial_t = 0\r\n", "power_t = 0.5\r\n", "using no cache\r\n", "Reading datafile = \r\n", "num sources = 1\r\n", "average since example example current current current\r\n", "loss last counter weight label predict features\r\n", "1.000000 1.000000 1 1.0 1.0000 0.0000 10\r\n", "\r\n", "finished run\r\n", "number of examples per pass = 1\r\n", "passes used = 1\r\n", "weighted example sum = 1.000000\r\n", "weighted label sum = 1.000000\r\n", "average loss = 1.000000\r\n", "best constant = 1.000000\r\n", "best constant's loss = 0.000000\r\n", "total feature number = 10\r\n" ] } ], "source": [ "! echo '1 1.0 |Subject WHAT car is this |Organization University of Maryland:0.5 College Park' | vw" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "250ad844-c8bc-4cc6-a1dd-cf4770d7c840", "_uuid": "1d754e1b0d24913b137f42b84ccdf9d46ce53f05" }, "source": [ "VW is a wonderful tool for working with text data. We'll illustrate it with the [20newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html), which contains posts from 20 different newsgroups.\n", "\n", "\n", "## 3.1. News. Binary classification." 
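, "\n",
"\n",
"As a warm-up for this section, the input format described above can also be assembled from Python. The sketch below is only an illustration: `vw_line` is a hypothetical helper name, not part of VW or of this notebook, and it assumes that feature strings contain no spaces, `|`, or `:` characters.\n",
"\n",
"```python\n",
"def vw_line(label, namespaces, importance=None):\n",
"    # Hypothetical helper (an assumption, not a VW API): build one VW input line.\n",
"    # namespaces is a list of (name, features) pairs; each feature is either\n",
"    # a plain token or a (token, value) pair for a non-default weight.\n",
"    head = str(label) if importance is None else '%s %s' % (label, importance)\n",
"    parts = [head]\n",
"    for name, features in namespaces:\n",
"        tokens = [f if isinstance(f, str) else '%s:%s' % f for f in features]\n",
"        parts.append('|%s %s' % (name, ' '.join(tokens)))\n",
"    return ' '.join(parts)\n",
"\n",
"example = vw_line(\n",
"    1,\n",
"    [('Subject', ['WHAT', 'car', 'is', 'this']),\n",
"     ('Organization', ['University', 'of', ('Maryland', 0.5), 'College', 'Park'])],\n",
"    importance=1.0)\n",
"print(example)\n",
"```\n",
"\n",
"With these inputs the result is the same string that was piped to `vw` above, so lines for a whole dataset can be produced by calling such a helper once per sample."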
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "_cell_guid": "a37611a4-5bdb-40bb-9b89-bfddd8726ba9", "_uuid": "0f527d54e1a759be5fb53665ae4b768116094b27" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading 20news dataset. This may take a few minutes.\n", "Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n" ] } ], "source": [ "# load data with sklearn's function \n", "newsgroups = fetch_20newsgroups(PATH_TO_ALL_DATA)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "_cell_guid": "7fad47bd-c5fc-4b56-b024-0e866137d07b", "_uuid": "a582c83f4d792d71d12e5063d7539c040b017a67" }, "outputs": [ { "data": { "text/plain": [ "['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups['target_names']" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "9eab187f-162a-4a1c-bcaa-6e96bbaeb807", "_uuid": "307c3d13c7525e41cba581816a9ef75660d54fb2" }, "source": [ "Let's look at the first document in this collection:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "_cell_guid": "d571d3d0-fba9-4d2f-a4a9-3a36e80fe456", "_uuid": "039c32b72a84277884c788de8fe2a5e692a4dfb5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-----\n", "rec.autos\n", "-----\n", "From: lerxst@wam.umd.edu (where's my thing)\n", "Subject: WHAT car is this!?\n", "Nntp-Posting-Host: rac3.wam.umd.edu\n", "Organization: University of 
Maryland, College Park\n", "Lines: 15\n", "\n", " I was wondering if anyone out there could enlighten me on this car I saw\n", "the other day. It was a 2-door sports car, looked to be from the late 60s/\n", "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", "the front bumper was separate from the rest of the body. This is \n", "all I know. If anyone can tellme a model name, engine specs, years\n", "of production, where this car is made, history, or whatever info you\n", "have on this funky looking car, please e-mail.\n", "\n", "Thanks,\n", "- IL\n", " ---- brought to you by your neighborhood Lerxst ----\n", "----\n" ] } ], "source": [ "text = newsgroups['data'][0]\n", "target = newsgroups['target_names'][newsgroups['target'][0]]\n", "\n", "print('-----')\n", "print(target)\n", "print('-----')\n", "print(text.strip())\n", "print('----')" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "bb940e9e-e2bf-4f39-b8a2-467100548f40", "_uuid": "a464de9c01ff461f95ef1cba04acbe0211723c58" }, "source": [ "Now we convert the data into something Vowpal Wabbit can understand. We will throw away words shorter than 3 characters. Here, we will skip some important NLP stages such as stemming and lemmatization; however, we will later see that VW solves the problem even without these steps." 
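, "\n",
"\n",
"To make the word filter concrete, here is a minimal sketch of the tokenization step applied to one sentence from the post above. The name `tokenize` is introduced only for this illustration (an assumption, not a function used elsewhere in this notebook); the regular expression `\\w{3,}` keeps runs of at least three word characters.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def tokenize(document):\n",
"    # Lowercase the text, then keep only runs of 3 or more word characters;\n",
"    # punctuation, whitespace and shorter words are dropped.\n",
"    return re.findall(r'\\w{3,}', document.lower())\n",
"\n",
"sample = 'It was a 2-door sports car, looked to be from the late 60s/early 70s.'\n",
"print(tokenize(sample))\n",
"```\n",
"\n",
"A side effect worth noting is that `\\w` never matches `|`, `:` or whitespace, which are the characters with special meaning in the VW input format, so the resulting tokens can be written into a VW line without further escaping."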
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "_cell_guid": "61c25a92-1b0a-45cf-904f-6c35de89575f", "_uuid": "9b480cff8801ad51f371b05f6f63225c758883ca" }, "outputs": [ { "data": { "text/plain": [ "'1 |text from lerxst wam umd edu where thing subject what car this nntp posting host rac3 wam umd edu organization university maryland college park lines was wondering anyone out there could enlighten this car saw the other day was door sports car looked from the late 60s early 70s was called bricklin the doors were really small addition the front bumper was separate from the rest the body this all know anyone can tellme model name engine specs years production where this car made history whatever info you have this funky looking car please mail thanks brought you your neighborhood lerxst\\n'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def to_vw_format(document, label=None):\n", "    # raw string avoids an invalid-escape warning for \\w; unlabeled (test) lines get an empty label\n", "    return str(label or '') + ' |text ' + ' '.join(re.findall(r'\\w{3,}', \n", "                                                document.lower())) + '\\n'\n", "\n", "to_vw_format(text, 1 if target == 'rec.autos' else -1)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a9ff9575-96b8-421c-9ebb-fe5de97b4d79", "_uuid": "0fada3ac4275dc108ea54a0be7efa8ab176ca06f" }, "source": [ "We split the dataset into training and test sets and write them to separate files. We will consider a document as positive if it corresponds to **rec.autos**. 
Thus, we are constructing a model which distinguishes articles about cars from other topics: " ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "_cell_guid": "297c4ed7-3157-412e-b3af-1bc59084defd", "_uuid": "f5754e324f2165a47ab603e2e5fafa5b681ea1e9" }, "outputs": [], "source": [ "all_documents = newsgroups['data']\n", "all_targets = [1 if newsgroups['target_names'][target] == 'rec.autos' \n", " else -1 for target in newsgroups['target']]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "_cell_guid": "d0cd644c-fdd2-4b46-9ff5-377ef12a3d6e", "_uuid": "623b5b239dda745dc0e79a7b225825eed8f98d09" }, "outputs": [], "source": [ "train_documents, test_documents, train_labels, test_labels = \\\n", " train_test_split(all_documents, all_targets, random_state=7)\n", " \n", "with open(os.path.join(PATH_TO_ALL_DATA, '20news_train.vw'), 'w') as vw_train_data:\n", " for text, target in zip(train_documents, train_labels):\n", " vw_train_data.write(to_vw_format(text, target))\n", "with open(os.path.join(PATH_TO_ALL_DATA, '20news_test.vw'), 'w') as vw_test_data:\n", " for text in test_documents:\n", " vw_test_data.write(to_vw_format(text))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "601ec250-c847-4c9e-8d47-ca00b8b659af", "_uuid": "6330fc49440e8801cbfee9aa075df523e845ae23" }, "source": [ "Now, we pass the created training file to Vowpal Wabbit. We solve the classification problem with a hinge loss function (linear SVM). 
The trained model will be saved in the `20news_model.vw` file:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "_cell_guid": "80f2960a-2389-4810-9b04-1928b613b4b2", "_uuid": "7a5a15cc11e801a98b493bab026136c9eeb33816" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "final_regressor = ../../data//20news_model.vw\n", "Num weight bits = 18\n", "learning rate = 0.5\n", "initial_t = 0\n", "power_t = 0.5\n", "using no cache\n", "Reading datafile = ../../data//20news_train.vw\n", "num sources = 1\n", "average since example example current current current\n", "loss last counter weight label predict features\n", "1.000000 1.000000 1 1.0 -1.0000 0.0000 157\n", "0.911276 0.822551 2 2.0 -1.0000 -0.1774 159\n", "0.605793 0.300311 4 4.0 -1.0000 -0.3994 92\n", "0.419594 0.233394 8 8.0 -1.0000 -0.8167 129\n", "0.313998 0.208402 16 16.0 -1.0000 -0.6509 108\n", "0.196014 0.078029 32 32.0 -1.0000 -1.0000 115\n", "0.183158 0.170302 64 64.0 -1.0000 -0.7072 114\n", "0.261046 0.338935 128 128.0 1.0000 -0.7900 110\n", "0.262910 0.264774 256 256.0 -1.0000 -0.6425 44\n", "0.216663 0.170415 512 512.0 -1.0000 -1.0000 160\n", "0.176710 0.136757 1024 1024.0 -1.0000 -1.0000 194\n", "0.134541 0.092371 2048 2048.0 -1.0000 -1.0000 438\n", "0.104403 0.074266 4096 4096.0 -1.0000 -1.0000 644\n", "0.081329 0.058255 8192 8192.0 -1.0000 -1.0000 174\n", "\n", "finished run\n", "number of examples per pass = 8485\n", "passes used = 1\n", "weighted example sum = 8485.000000\n", "weighted label sum = -7555.000000\n", "average loss = 0.079837\n", "best constant = -1.000000\n", "best constant's loss = 0.109605\n", "total feature number = 2048932\n" ] } ], "source": [ "!vw -d $PATH_TO_ALL_DATA/20news_train.vw \\\n", " --loss_function hinge -f $PATH_TO_ALL_DATA/20news_model.vw" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "8d4e9c7a-c55b-4e18-a26a-d9b63db8ad47", "_uuid": "efc30d9b5fa31f0e39eb486e85f6775ac52cb229" }, "source": [ "VW prints a lot of interesting 
info while training (it can be suppressed with the `--quiet` flag). The diagnostic output is documented on [GitHub](https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information). Note how the average loss drops during training. VW computes the loss on each example before using that example to update the weights (progressive validation), so on a single pass this estimate is usually a good proxy for the loss on unseen data. Now, we apply our trained model to the test set, saving predictions into a file with the `-p` flag: " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "_cell_guid": "17c99751-4e30-4251-b18f-fe6ed8a34e58", "_uuid": "276f1431b121543e67f668bb5e9d601dd882e98e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "only testing\r\n", "predictions = ../../data//20news_test_predictions.txt\r\n", "Num weight bits = 18\r\n", "learning rate = 0.5\r\n", "initial_t = 0\r\n", "power_t = 0.5\r\n", "using no cache\r\n", "Reading datafile = ../../data//20news_test.vw\r\n", "num sources = 1\r\n", "average since example example current current current\r\n", "loss last counter weight label predict features\r\n", " n.a. n.a. 1 1.0 unknown 1.0000 349\r\n", " n.a. n.a. 2 2.0 unknown -1.0000 50\r\n", " n.a. n.a. 4 4.0 unknown -1.0000 251\r\n", " n.a. n.a. 8 8.0 unknown -1.0000 237\r\n", " n.a. n.a. 16 16.0 unknown -0.8978 106\r\n", " n.a. n.a. 32 32.0 unknown -1.0000 964\r\n", " n.a. n.a. 64 64.0 unknown -1.0000 261\r\n", " n.a. n.a. 128 128.0 unknown 0.4621 82\r\n", " n.a. n.a. 256 256.0 unknown -1.0000 186\r\n", " n.a. n.a. 512 512.0 unknown -1.0000 162\r\n", " n.a. n.a. 1024 1024.0 unknown -1.0000 283\r\n", " n.a. n.a. 
2048 2048.0 unknown -1.0000 104\r\n", "\r\n", "finished run\r\n", "number of examples per pass = 2829\r\n", "passes used = 1\r\n", "weighted example sum = 2829.000000\r\n", "weighted label sum = 0.000000\r\n", "average loss = n.a.\r\n", "total feature number = 642215\r\n" ] } ], "source": [ "!vw -i $PATH_TO_ALL_DATA/20news_model.vw -t -d $PATH_TO_ALL_DATA/20news_test.vw \\\n", "-p $PATH_TO_ALL_DATA/20news_test_predictions.txt" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "b3c2b322-058b-4051-a29e-961e9691947f", "_uuid": "e478d52c8fd3c52b8a46d6e1dd5ebccfd691748e" }, "source": [ "Now we load our predictions, compute AUC, and plot the ROC curve:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "_cell_guid": "93d9f296-f06b-4d18-94a5-e18ccf88b094", "_uuid": "40b4112f1afa764db10b58b0ab05adfed48ab388" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, \n", "                       '20news_test_predictions.txt')) as pred_file:\n", "    test_prediction = [float(label) \n", "                       for label in pred_file.readlines()]\n", "\n", "auc = roc_auc_score(test_labels, test_prediction)\n", "# unpack into new names so the imported roc_curve function is not overwritten\n", "fpr, tpr, _ = roc_curve(test_labels, test_prediction)\n", "\n", "with plt.xkcd():\n", "    plt.plot(fpr, tpr);\n", "    plt.plot([0,1], [0,1])\n", "    plt.xlabel('FPR'); plt.ylabel('TPR'); \n", "    plt.title('test AUC = %f' % (auc)); \n", "    plt.axis([-0.05,1.05,-0.05,1.05]);" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a302f698-1f6f-4653-a1a1-ef467919ca1e", "_uuid": "5aa3b9722f255cb9db25c8fe5991c1e2c7a17f71" }, "source": [ "The AUC value we get shows that we have achieved high classification quality." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6cc41756-d83c-48a5-93f6-56c238487b51", "_uuid": "47ea9632a74d49a003c1d876661af0113efd6bd4" }, "source": [ "# 3.2. News. 
Multiclass classification" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ad900d77-3eb6-413b-ba95-7d33d036e655", "_uuid": "2fd54a3102db4cf3d3665877acffb3cead417a2f" }, "source": [ "We will use the same news dataset, but, this time, we will solve a multiclass classification problem. `Vowpal Wabbit` is a little picky – it wants labels to run from 1 to K, where K is the number of classes in the classification task (20 in our case). So we will use `LabelEncoder` and add 1 afterwards (recall that `LabelEncoder` maps labels into the range from 0 to K-1)." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "_cell_guid": "e94dfe85-61d6-42d2-b19e-89923b511535", "_uuid": "30fca7585b9d841b5925ec9b3db9b12ff1bb5142" }, "outputs": [], "source": [ "all_documents = newsgroups['data']\n", "topic_encoder = LabelEncoder()\n", "all_targets_mult = topic_encoder.fit_transform(newsgroups['target']) + 1" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "0d4b5845-3bc5-4c2d-ba00-5977b5f108cc", "_uuid": "93289fedab1875334df6632138311299dfe7dc4d" }, "source": [ "**The data is the same, but we have changed the labels: `train_labels_mult` and `test_labels_mult` are now label vectors with values from 1 to 20.**" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "_cell_guid": "90e22502-efe9-4a2c-9868-fa6a18ed91b2", "_uuid": "f90ea2c247690e680d26f6d30992daf45ffd0315" }, "outputs": [], "source": [ "train_documents, test_documents, train_labels_mult, test_labels_mult = \\\n", "    train_test_split(all_documents, all_targets_mult, random_state=7)\n", "\n", "with open(os.path.join(PATH_TO_ALL_DATA, \n", "                       '20news_train_mult.vw'), 'w') as vw_train_data:\n", "    for text, target in zip(train_documents, train_labels_mult):\n", "        vw_train_data.write(to_vw_format(text, target))\n", "with open(os.path.join(PATH_TO_ALL_DATA, \n", "                       '20news_test_mult.vw'), 'w') as vw_test_data:\n", "    for text in test_documents:\n", "        vw_test_data.write(to_vw_format(text))" ] }, { "cell_type": 
"markdown", "metadata": { "_cell_guid": "478ead84-c063-4383-baac-7deaa627d83f", "_uuid": "c0f488bb8182487c1393e5b9f0fdbc301de58a02" }, "source": [ "We train Vowpal Wabbit in multiclass classification mode by passing the `--oaa` (\"one against all\") parameter with the number of classes. Model quality also depends on the following parameters (more info can be found in the [official Vowpal Wabbit tutorial](https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial)):\n", " - learning rate (`-l`, 0.5 by default) – the rate of weight change at every step\n", " - learning rate decay (`--power_t`, 0.5 by default) – in practice, stochastic gradient descent gets closer to the minimum of the loss when the learning rate decreases with the number of steps\n", " - loss function (`--loss_function`) – the entire training algorithm depends on it. See the [docs](https://github.com/JohnLangford/vowpal_wabbit/wiki/Loss-functions) for the available loss functions\n", " - regularization (`--l1`) – note that VW applies regularization on every example. That is why we usually set regularization values to about $10^{-20}$.\n", " \n", "Additionally, we can try automatic Vowpal Wabbit parameter tuning with [Hyperopt](https://github.com/hyperopt/hyperopt)."
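, "\n", "\n", "To make the roles of these parameters concrete, here is a schematic SGD step. It is an illustrative sketch rather than VW's exact update rule, which differs in detail. The symbols $\\eta_0$ (initial learning rate), $p$ (decay power), $\\lambda_1$ (L1 strength), and $t$ (step counter) are notation introduced here, standing in for the value of `-l`, the value of `--power_t`, the L1 regularization value, and the number of examples seen so far:\n", "\n", "$$w_{t+1} = w_t - \\eta_t \\, g_t, \\qquad g_t \\in \\partial_w \\Big[ \\ell\\big(y_t, w_t^\\top x_t\\big) + \\lambda_1 \\lVert w_t \\rVert_1 \\Big], \\qquad \\eta_t = \\frac{\\eta_0}{t^{\\,p}}$$\n", "\n", "Under this schedule, $p = 0.5$ shrinks the step size as $1/\\sqrt{t}$, while $p = 0$ keeps it constant. The penalty term enters every single step, which is one way to see why the regularization value is typically chosen to be very small."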
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "_cell_guid": "80ae364e-8332-41db-bdc3-7f642bb24d7a", "_uuid": "97ea9ebcb6ed46b5512f5a9364680e501c2925cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "final_regressor = ../../data//20news_model_mult.vw\n", "Num weight bits = 18\n", "learning rate = 0.5\n", "initial_t = 0\n", "power_t = 0.5\n", "using no cache\n", "Reading datafile = ../../data//20news_train_mult.vw\n", "num sources = 1\n", "average since example example current current current\n", "loss last counter weight label predict features\n", "1.000000 1.000000 1 1.0 15 1 157\n", "1.000000 1.000000 2 2.0 2 15 159\n", "1.000000 1.000000 4 4.0 15 10 92\n", "1.000000 1.000000 8 8.0 16 15 129\n", "1.000000 1.000000 16 16.0 13 12 108\n", "0.937500 0.875000 32 32.0 2 9 115\n", "0.906250 0.875000 64 64.0 16 16 114\n", "0.867188 0.828125 128 128.0 8 4 110\n", "0.816406 0.765625 256 256.0 7 15 44\n", "0.646484 0.476562 512 512.0 13 9 160\n", "0.502930 0.359375 1024 1024.0 3 4 194\n", "0.388672 0.274414 2048 2048.0 1 1 438\n", "0.300293 0.211914 4096 4096.0 11 11 644\n", "0.225098 0.149902 8192 8192.0 5 5 174\n", "\n", "finished run\n", "number of examples per pass = 8485\n", "passes used = 1\n", "weighted example sum = 8485.000000\n", "weighted label sum = 0.000000\n", "average loss = 0.222392\n", "total feature number = 2048932\n", "CPU times: user 10.2 ms, sys: 11.7 ms, total: 21.9 ms\n", "Wall time: 396 ms\n" ] } ], "source": [ "%%time\n", "!vw --oaa 20 $PATH_TO_ALL_DATA/20news_train_mult.vw -f $PATH_TO_ALL_DATA/20news_model_mult.vw \\\n", "--loss_function=hinge" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "_cell_guid": "4d78df8b-f6d2-477a-b5be-a5f7364c0e56", "_uuid": "d30d194b2be9521d609d703385f528c24a557aa0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "only testing\n", "predictions = ../../data//20news_test_predictions_mult.txt\n", "Num weight bits = 18\n", "learning rate = 
0.5\n", "initial_t = 0\n", "power_t = 0.5\n", "using no cache\n", "Reading datafile = ../../data//20news_test_mult.vw\n", "num sources = 1\n", "average since example example current current current\n", "loss last counter weight label predict features\n", " n.a. n.a. 1 1.0 unknown 8 349\n", " n.a. n.a. 2 2.0 unknown 6 50\n", " n.a. n.a. 4 4.0 unknown 18 251\n", " n.a. n.a. 8 8.0 unknown 18 237\n", " n.a. n.a. 16 16.0 unknown 4 106\n", " n.a. n.a. 32 32.0 unknown 15 964\n", " n.a. n.a. 64 64.0 unknown 4 261\n", " n.a. n.a. 128 128.0 unknown 8 82\n", " n.a. n.a. 256 256.0 unknown 10 186\n", " n.a. n.a. 512 512.0 unknown 1 162\n", " n.a. n.a. 1024 1024.0 unknown 11 283\n", " n.a. n.a. 2048 2048.0 unknown 14 104\n", "\n", "finished run\n", "number of examples per pass = 2829\n", "passes used = 1\n", "weighted example sum = 2829.000000\n", "weighted label sum = 0.000000\n", "average loss = n.a.\n", "total feature number = 642215\n", "CPU times: user 5.29 ms, sys: 9.01 ms, total: 14.3 ms\n", "Wall time: 182 ms\n" ] } ], "source": [ "%%time\n", "!vw -i $PATH_TO_ALL_DATA/20news_model_mult.vw -t -d $PATH_TO_ALL_DATA/20news_test_mult.vw \\\n", "-p $PATH_TO_ALL_DATA/20news_test_predictions_mult.txt" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "_cell_guid": "bf4c66cf-3039-4ac4-a129-e52652c16d8d", "_uuid": "f6911717662ade8c466cc2f3681f356619d8676c" }, "outputs": [], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, \n", " '20news_test_predictions_mult.txt')) as pred_file:\n", " test_prediction_mult = [float(label) \n", " for label in pred_file.readlines()]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "_cell_guid": "c8cd9521-6c23-412b-959d-19e9f95bd5ee", "_uuid": "c110cfb692d0c280009ca815efa504f5889e28a6" }, "outputs": [ { "data": { "text/plain": [ "0.8734535171438671" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(test_labels_mult, test_prediction_mult)" ] }, { 
"cell_type": "markdown", "metadata": { "_cell_guid": "365da3c4-6bcb-40fa-8786-d57415595491", "_uuid": "ab8ac32cc2357207807e8c262177e0347e9f6d71" }, "source": [ "Here is how often the model misclassifies atheism posts as other topics:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "_cell_guid": "5b40bd39-4895-4713-be3a-2fea6fa37062", "_uuid": "fefe7d01f98e8cd3af7ce5a3b7d0f859755e5047" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rec.autos 1\n", "rec.sport.baseball 1\n", "sci.med 1\n", "soc.religion.christian 3\n", "talk.religion.misc 5\n" ] } ], "source": [ "M = confusion_matrix(test_labels_mult, test_prediction_mult)\n", "for i in np.where(M[0,:] > 0)[0][1:]:\n", "    print(newsgroups['target_names'][i], M[0,i])" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "cab1c6f7-515d-4bf3-9054-5bc25d81be8e", "_uuid": "dc891b0f4124cdc76a2f9db9340d49055a91e1f8" }, "source": [ "# 3.3. IMDB movie reviews\n", "In this part we will do binary classification of [IMDB](http://www.imdb.com) (Internet Movie Database) movie reviews. We will see how fast Vowpal Wabbit performs.\n", "\n", "Using the `load_files` function from `sklearn.datasets`, we load the movie reviews, which can be downloaded [here](https://drive.google.com/file/d/1xq4l5c0JrcxJdyBwJWvy0u9Ad_pvkJ1l/view?usp=sharing). If you want to reproduce the results, please download the archive, unzip it, and set the path to `imdb_reviews` (it already contains *train* and *test* subdirectories). Unpacking can take several minutes as there are 100k files. The train and test sets each hold 12.5k good and 12.5k bad movie reviews. First, we separate the texts from the labels."
] }, { "cell_type": "code", "execution_count": 39, "metadata": { "_cell_guid": "aa499b93-4331-4dce-8b9d-ab3c77afec52", "_uuid": "4c219655c397260e2264555513dd27fb91515454" }, "outputs": [], "source": [ "import pickle" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "_cell_guid": "a1827aa1-5bc7-4838-bb01-68792de451b6", "_uuid": "e719b5b6c976cf9daf10a9f3b7c35b4cde79b057" }, "outputs": [], "source": [ "# change this for your path to imdb_reviews\n", "path_to_movies = os.path.expanduser('/Users/y.kashnitsky/Documents/Machine_learning/datasets/imdb_reviews')\n", "reviews_train = load_files(os.path.join(path_to_movies, 'train'))\n", "text_train, y_train = reviews_train.data, reviews_train.target" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "_cell_guid": "db550541-630a-47a5-8b98-246d94a3fb33", "_uuid": "593032d9be6e7aff583e0c7ba40e3e801b136edf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of documents in training data: 25000\n", "[12500 12500]\n" ] } ], "source": [ "print(\"Number of documents in training data: %d\" % len(text_train))\n", "print(np.bincount(y_train))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "50ea99f8-6680-4a77-b3fb-a9fb18fd11b3", "_uuid": "6cd9d12d23bec99ee3df859934378ed1a911be22" }, "source": [ "Do the same for the test set." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": { "_cell_guid": "ce60fd02-e5d4-40aa-a5db-33556ef7063f", "_uuid": "8fea469dd838a26671f234d54cf82d6f57188e73" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of documents in test data: 25000\n", "[12500 12500]\n" ] } ], "source": [ "reviews_test = load_files(os.path.join(path_to_movies, 'test'))\n", "# take the labels from the test set, not from reviews_train\n", "text_test, y_test = reviews_test.data, reviews_test.target\n", "print(\"Number of documents in test data: %d\" % len(text_test))\n", "print(np.bincount(y_test))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a7f87c80-8d74-47f1-9d47-cd31542da37d", "_uuid": "d37600cecaa2fd716479ed286ca9985380722c40" }, "source": [ "Take a look at examples of reviews and their corresponding labels." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "_cell_guid": "213017e1-25f0-4d08-8d9c-5341b449b8f6", "_uuid": "aea5acfcef80c4006dbcab233d237d2cd091ab48" }, "outputs": [ { "data": { "text/plain": [ "b\"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.

It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see.

Flawed but honest with a terrible honesty.\"" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_train[0]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "_cell_guid": "23c8ca4d-7287-406a-8381-f9c9c095f1b7", "_uuid": "976c29d60e5fc1da27e36b5c34dd1fa4185cd083" }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train[0] # good review" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "_cell_guid": "b4cd5590-5952-4388-9960-37702e78be4f", "_uuid": "2fe24a84eb18599fb67664a63e3b2178ee29ee26" }, "outputs": [ { "data": { "text/plain": [ "b'Words can\\'t describe how bad this movie is. I can\\'t explain it by writing only. You have too see it for yourself to get at grip of how horrible a movie really can be. Not that I recommend you to do that. There are so many clich\\xc3\\xa9s, mistakes (and all other negative things you can imagine) here that will just make you cry. To start with the technical first, there are a LOT of mistakes regarding the airplane. I won\\'t list them here, but just mention the coloring of the plane. They didn\\'t even manage to show an airliner in the colors of a fictional airline, but instead used a 747 painted in the original Boeing livery. Very bad. The plot is stupid and has been done many times before, only much, much better. There are so many ridiculous moments here that i lost count of it really early. Also, I was on the bad guys\\' side all the time in the movie, because the good guys were so stupid. \"Executive Decision\" should without a doubt be you\\'re choice over this one, even the \"Turbulence\"-movies are better. 
In fact, every other movie in the world is better than this one.'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_train[1]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "_cell_guid": "0faa88f7-c96d-4073-9b16-b6d68d31d354", "_uuid": "f64122c2e93648b75494df26b1de4bf01358df35" }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train[1] # bad review" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "_cell_guid": "23032bfe-25d1-4f79-b24e-a4cb69f44127", "_uuid": "0ce9e75db41712f6f9161f9d6cd8000381a5b3ba" }, "outputs": [ { "data": { "text/plain": [ "'-1 |text words can describe how bad this movie can explain writing only you have too see for yourself get grip how horrible movie really can not that recommend you that there are many clich xc3 xa9s mistakes and all other negative things you can imagine here that will just make you cry start with the technical first there are lot mistakes regarding the airplane won list them here but just mention the coloring the plane they didn even manage show airliner the colors fictional airline but instead used 747 painted the original boeing livery very bad the plot stupid and has been done many times before only much much better there are many ridiculous moments here that lost count really early also was the bad guys side all the time the movie because the good guys were stupid executive decision should without doubt you choice over this one even the turbulence movies are better fact every other movie the world better than this one\\n'" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to_vw_format(str(text_train[1]), 1 if y_train[1] == 1 else -1)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "886409fe-bdbf-4e09-9091-1912143715b9", "_uuid": "844598ae3aa0064ecfb4cb778df0c555a6b91403" }, "source": [ 
"Now, we prepare training (`movie_reviews_train.vw`), validation (`movie_reviews_valid.vw`), and test (`movie_reviews_test.vw`) sets for Vowpal Wabbit. We will use 70% of the labeled training data for training and the remaining 30% as the hold-out (validation) set." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "_cell_guid": "c08edd2b-7290-4e70-9072-895e33923d44", "_uuid": "7309769413ef059831390aea012334a5d7482817" }, "outputs": [], "source": [ "train_share = int(0.7 * len(text_train))\n", "train, valid = text_train[:train_share], text_train[train_share:]\n", "train_labels, valid_labels = y_train[:train_share], y_train[train_share:]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "_cell_guid": "a30cddd4-fceb-46ca-87a8-1b7b7d3b36d1", "_uuid": "9154377a268f463e98c17432a84a77d03e68052e" }, "outputs": [ { "data": { "text/plain": [ "(17500, 7500)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train_labels), len(valid_labels)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "_cell_guid": "6d25382f-b135-47f8-a82c-2568ad02f9ff", "_uuid": "cce438177377598771a890dc94651de5a6a8495e" }, "outputs": [], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_reviews_train.vw'), 'w') as vw_train_data:\n", "    for text, target in zip(train, train_labels):\n", "        vw_train_data.write(to_vw_format(str(text), 1 if target == 1 else -1))\n", "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_reviews_valid.vw'), 'w') as vw_valid_data:\n", "    for text, target in zip(valid, valid_labels):\n", "        vw_valid_data.write(to_vw_format(str(text), 1 if target == 1 else -1))\n", "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_reviews_test.vw'), 'w') as vw_test_data:\n", "    for text in text_test:\n", "        vw_test_data.write(to_vw_format(str(text)))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "_cell_guid": "cb999742-8ad7-428d-8f66-f015d38de4f8", "_uuid": "8d87247a0d044e81b545e37ffaa59a96712a252f" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "1 |text zero day leads you think even think why two boys young men would what they did commit mutual suicide via slaughtering their classmates captures what must beyond bizarre mode being for two humans who have decided withdraw from common civility order define their own mutual world via coupled destruction not perfect movie but given what money time the filmmaker and actors had remarkable product terms explaining the motives and actions the two young suicide murderers better than elephant terms being film that gets under our rationalistic skin far far better film than almost anything you are likely see flawed but honest with terrible honesty\r\n", "-1 |text words can describe how bad this movie can explain writing only you have too see for yourself get grip how horrible movie really can not that recommend you that there are many clich xc3 xa9s mistakes and all other negative things you can imagine here that will just make you cry start with the technical first there are lot mistakes regarding the airplane won list them here but just mention the coloring the plane they didn even manage show airliner the colors fictional airline but instead used 747 painted the original boeing livery very bad the plot stupid and has been done many times before only much much better there are many ridiculous moments here that lost count really early also was the bad guys side all the time the movie because the good guys were stupid executive decision should without doubt you choice over this one even the turbulence movies are better fact every other movie the world better than this one\r\n" ] } ], "source": [ "!head -2 $PATH_TO_ALL_DATA/movie_reviews_train.vw" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "_cell_guid": "f25cd0d3-1767-41f1-b7bf-096224619bc5", "_uuid": "b63440f7ffebc472ecbe54d8e3aef8c3a3b75fa1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 |text matter life and death what can you really 
say that would properly justice the genius and beauty this film powell and pressburger visual imagination knows bounds every frame filled with fantastically bold compositions the switches between the bold colours the real world the stark black and white heaven ingenious showing visually just how much more vibrant life the final court scene also fantastic the judge and jury descend the stairway heaven hold court over peter david niven operation all the performances are spot roger livesey being standout and the romantic energy the film beautiful never has there been more romantic film than this there has haven seen matter life and death all about the power love and just how important life and jack cardiff cinematography reason enough watch the film alone the way lights kim hunter face makes her all the more beautiful what genius can make simple things such game table tennis look exciting and the sound design also impeccable the way the sound mutes vital points was decision way ahead its time this true classic that can restore anyone faith cinema under appreciated its initial release and today audiences but one all time favourites which why give this film word beautiful\r\n", "1 |text while this was better movie than 101 dalmations live action not animated version think still fell little short what disney could was well filmed the music was more suited the action and the effects were better done compared 101 the acting was perhaps better but then the human characters were given far more appropriate roles this sequel and glenn close really not missed the first movie she makes shine her poor lackey and the overzealous furrier sidekicks are wonderful characters play off and they add the spectacle disney has given this great family film with little objectionable material and yet remains fun and interesting for adults and children alike bound classic many disney films are here hoping the third will even better still because you know they probably want make one\r\n" ] } ], 
"source": [ "!head -2 $PATH_TO_ALL_DATA/movie_reviews_valid.vw" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "_cell_guid": "eaa1a658-e6e0-42f1-9c30-14be762fc7d9", "_uuid": "fd4e364bad19a64110f7ec839a56a2776222306c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " |text don hate heather graham because she beautiful hate her because she fun watch this movie like the hip clothing and funky surroundings the actors this flick work well together casey affleck hysterical and heather graham literally lights the screen the minor characters goran visnjic sigh and patricia velazquez are talented they are gorgeous congratulations miramax director lisa krueger\r\n", " |text don know how this movie has received many positive comments one can call artistic and beautifully filmed but those things don make for the empty plot that was filled with sexual innuendos wish had not wasted time watch this movie rather than being biographical was poor excuse for promoting strange and lewd behavior was just another hollywood attempt convince that that kind life normal and from the very beginning asked self what was the point this movie and continued watching hoping that would change and was quite disappointed that continued the same vein glad did not spend the money see this theater\r\n" ] } ], "source": [ "!head -2 $PATH_TO_ALL_DATA/movie_reviews_test.vw" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "0cedab5e-5896-4439-beeb-96b268382587", "_uuid": "6403759f3d06322ee2b8e4017f181bc9681661ea" }, "source": [ "**Now, launch Vowpal Wabbit with the following arguments:**\n", "\n", " - -d – path to the training set (the corresponding .vw file)\n", " - --loss_function – hinge (feel free to experiment here)\n", " - -f – path to the file where the trained model is saved (it can also have the .vw extension)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "_cell_guid": "89094f9e-7453-4137-87b0-21df78627ae6", "_uuid": "4e27dea292bfde2224e802d60c00aad0fbac3274" 
}, "outputs": [], "source": [ "!vw -d $PATH_TO_ALL_DATA/movie_reviews_train.vw --loss_function hinge \\\n", "-f $PATH_TO_ALL_DATA/movie_reviews_model.vw --quiet" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "7653ab5e-de84-47c0-9981-80fbd6a04783", "_uuid": "66ccbe8a95fea95703c28f21af5f6b9a3dcd87f6" }, "source": [ "Next, make the hold-out prediction with the following VW arguments:\n", " - -i – path to the trained model (.vw file)\n", " - -t – test-only mode (the model is not updated)\n", " - -d – path to the hold-out set (.vw file)\n", " - -p – path to a text file where the predictions will be stored" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "_cell_guid": "7fbac967-4a81-4eff-9d31-721a197ed568", "_uuid": "c8ae7e58aeacc3e88910b62aff684ab12fb970e1" }, "outputs": [], "source": [ "!vw -i $PATH_TO_ALL_DATA/movie_reviews_model.vw -t \\\n", "-d $PATH_TO_ALL_DATA/movie_reviews_valid.vw -p $PATH_TO_ALL_DATA/movie_valid_pred.txt --quiet" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5fa99f83-de1f-42d7-97ba-52da84c086f8", "_uuid": "9664438878f6d1c633bc67ee13266b929bab7dd0" }, "source": [ "Read the predictions from the text file and compute the accuracy and ROC AUC. Note that, with the hinge loss, VW outputs real-valued scores for the +1 class rather than probabilities. These scores lie between -1 and 1, so we convert them into binary answers by assigning positive values to class 1." 
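, "\n", "\n", "For reference, the hinge loss used for training and the decision rule applied to the scores can be written as follows. Here $f(x)$ denotes the real-valued score that VW writes to the predictions file and $y \\in \\{-1, +1\\}$ is the label; both symbols are notation introduced for this note:\n", "\n", "$$\\ell\\big(y, f(x)\\big) = \\max\\big(0,\\; 1 - y\\, f(x)\\big), \\qquad \\hat{y} = \\operatorname{sign} f(x)$$\n", "\n", "ROC AUC depends only on how the scores rank the examples, so it can be computed from $f(x)$ directly, whereas accuracy requires thresholding the scores at zero first."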
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "_cell_guid": "609497c1-de20-436d-9910-018316f36e8e", "_uuid": "8cce125db3a615ff84b0892819f1b5f7c680a3e2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.885\n", "AUC: 0.942\n" ] } ], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_valid_pred.txt')) as pred_file:\n", " valid_prediction = [float(label) \n", " for label in pred_file.readlines()]\n", "print(\"Accuracy: {}\".format(round(accuracy_score(valid_labels, \n", " [int(pred_prob > 0) for pred_prob in valid_prediction]), 3)))\n", "print(\"AUC: {}\".format(round(roc_auc_score(valid_labels, valid_prediction), 3)))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "699aac8c-8b17-4df1-9258-45f997ae0f79", "_uuid": "d19e8357504987994c53d41ea373891aab41afa6" }, "source": [ "Again, do the same for the test set." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "_cell_guid": "aaa8c5c7-aa05-4cdd-a06e-bf7325972691", "_uuid": "36f5d2d1336d537c95947e884207ecffe18e2d91" }, "outputs": [], "source": [ "!vw -i $PATH_TO_ALL_DATA/movie_reviews_model.vw -t \\\n", "-d $PATH_TO_ALL_DATA/movie_reviews_test.vw \\\n", "-p $PATH_TO_ALL_DATA/movie_test_pred.txt --quiet" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "_cell_guid": "7a61e739-89f4-44d1-a290-9ad2be4bdffd", "_uuid": "b3cf5f3efa4dbaf05baff4f4900c61665f2d9fee" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.88\n", "AUC: 0.94\n" ] } ], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_test_pred.txt')) as pred_file:\n", " test_prediction = [float(label) \n", " for label in pred_file.readlines()]\n", "print(\"Accuracy: {}\".format(round(accuracy_score(y_test, \n", " [int(pred_prob > 0) for pred_prob in test_prediction]), 3)))\n", "print(\"AUC: {}\".format(round(roc_auc_score(y_test, test_prediction), 3)))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": 
"b5ecbd7f-1bf9-496a-a35f-f82a1ea2db10", "_uuid": "45c4a3badafdf39fc53a6bbc32d2c43b15b88c98" }, "source": [ "Let's try to achieve higher accuracy by adding bigrams with the --ngram 2 option." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "_cell_guid": "c5b25dce-c730-4031-8938-1b9fa64575b6", "_uuid": "98fbb3a6f858b362aef8f797469075eec9072aaf" }, "outputs": [], "source": [ "!vw -d $PATH_TO_ALL_DATA/movie_reviews_train.vw \\\n", "--loss_function hinge --ngram 2 -f $PATH_TO_ALL_DATA/movie_reviews_model2.vw --quiet" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "_cell_guid": "17ceb519-ab1f-4184-b9c2-9eb925b32ea8", "_uuid": "48aa4c0a6c58be2781ed607a2393de37083345ae" }, "outputs": [], "source": [ "!vw -i $PATH_TO_ALL_DATA/movie_reviews_model2.vw -t -d $PATH_TO_ALL_DATA/movie_reviews_valid.vw \\\n", "-p $PATH_TO_ALL_DATA/movie_valid_pred2.txt --quiet" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "_cell_guid": "d1c01554-e2b6-4a55-97b9-0d17c8b26686", "_uuid": "be78a33c70c03e1bc9a48a25e804248930fba2c2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.894\n", "AUC: 0.954\n" ] } ], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_valid_pred2.txt')) as pred_file:\n", " valid_prediction = [float(label) \n", " for label in pred_file.readlines()]\n", "print(\"Accuracy: {}\".format(round(accuracy_score(valid_labels, \n", " [int(pred_prob > 0) for pred_prob in valid_prediction]), 3)))\n", "print(\"AUC: {}\".format(round(roc_auc_score(valid_labels, valid_prediction), 3)))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "_cell_guid": "3015a9e6-f814-4beb-9d37-f8c6a2e2c546", "_uuid": "f1aa2121a7ee92577ad4058b0688f8290d24e1d5" }, "outputs": [], "source": [ "!vw -i $PATH_TO_ALL_DATA/movie_reviews_model2.vw -t -d $PATH_TO_ALL_DATA/movie_reviews_test.vw \\\n", "-p $PATH_TO_ALL_DATA/movie_test_pred2.txt --quiet" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "_cell_guid": 
"527e0327-b208-47a0-9853-2e1bd06beb38", "_uuid": "22b7b6feb376fcbd44faf942914afaade9368b30" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.888\n", "AUC: 0.952\n" ] } ], "source": [ "with open(os.path.join(PATH_TO_ALL_DATA, 'movie_test_pred2.txt')) as pred_file:\n", " test_prediction2 = [float(label) \n", " for label in pred_file.readlines()]\n", "print(\"Accuracy: {}\".format(round(accuracy_score(y_test, \n", " [int(pred_prob > 0) for pred_prob in test_prediction2]), 3)))\n", "print(\"AUC: {}\".format(round(roc_auc_score(y_test, test_prediction2), 3)))" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a8c85011-e7d2-42c5-8294-34177d6de387", "_uuid": "63990db5f343a0e9f44c0069091a66d9f35b954b" }, "source": [ "Adding bigrams helped: accuracy rose from 0.885 to 0.894 on the validation set and from 0.88 to 0.888 on the test set, and ROC AUC improved by about 0.012 on both." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "22c8f4bc-07db-44ab-aed5-cab6db73689d", "_uuid": "b6c7d300b2f4c9d50358345c5145780f201745ea" }, "source": [ "# 3.4. Classifying gigabytes of StackOverflow questions" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f3e9b72a-cc49-49c4-b663-b7df4632cf69", "_uuid": "a84523499443d63d08abe83c7a6ebe5bd7c63bf2" }, "source": [ "Now, let's see Vowpal Wabbit work on large datasets. There is a 10GB dataset of StackOverflow questions [here](https://drive.google.com/file/d/1ZU4J3KhJDrHVMj48fROFcTsTZKorPGlG/view?usp=sharing). The original dataset comprises 10 million questions, and each question can have several tags. The data is quite clean, so don't call it \"Big Data\", even in a pub. :)\n", "\n", "\n", "\n", "We chose only 10 tags: 'javascript', 'java', 'python', 'ruby', 'php', 'c++', 'c#', 'go', 'scala' and 'swift'. Let's solve the 10-class classification problem: given only the text of a question, we want to predict which of these 10 popular programming languages it is tagged with." 
] }, { "cell_type": "code", "execution_count": 70, "metadata": { "_cell_guid": "ab910dd7-922e-4d85-825e-68fee0005a31", "_uuid": "c01e42904278e7f21df1109e8afed96515538e36" }, "outputs": [], "source": [ "# change the path to data\n", "PATH_TO_STACKOVERFLOW_DATA = '/Users/y.kashnitsky/Documents/stackoverflow_10mln/'" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "299a2038-eda6-4cac-96ff-638b39993c56", "_uuid": "2cce7434f48c59e765ca04980921d26bd51087ff" }, "source": [ "Print the first 3 lines from a sample of the dataset." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "_cell_guid": "6477773a-6537-46b4-a97e-b65228ae7028", "_uuid": "88d65c52c98bf7134a2a4a86e31a5e5c76b093a3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 | i ve got some code in window scroll that checks if an element is visible then triggers another function however only the first section of code is firing both bits of code work in and of themselves if i swap their order whichever is on top fires correctly my code is as follows fn isonscreen function use strict var win window viewport top win scrolltop left win scrollleft bounds this offset viewport right viewport left + win width viewport bottom viewport top + win height bounds right bounds left + this outerwidth bounds bottom bounds top + this outerheight return viewport right lt bounds left viewport left gt bounds right viewport bottom lt bounds top viewport top gt bounds bottom window scroll function use strict var load_more_results ajax load_more_results isonscreen if load_more_results true loadmoreresults var load_more_staff ajax load_more_staff isonscreen if load_more_staff true loadmorestaff what am i doing wrong can you only fire one event from window scroll i assume not\r\n", "4 | redefining some constant in ruby ex foo bar generates the warning already initialized constant i m trying to write a sort of reallyconstants module where this code should have this behaviour reallyconstants 
define_constant foo bar gt sets the constant reallyconstants foo to bar reallyconstants foo gt bar reallyconstants foo foobar gt this should raise an exception that is constant redefinition should generate an exception is that possible\r\n", "1 | in my form panel i added a checkbox setting stateful true stateid loginpanelremeberme then before sending form i save state calling this savestate on the panel all other componenets save their state and whe i reload the page they recall the previous state but checkbox alway start in unchecked state is there any way to force save value\r\n" ] } ], "source": [ "!head -3 $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_train.vw" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c6f3878f-14dc-4db7-8a7e-342bd50c2064", "_uuid": "ede9d5f834bcafe602ede57580d43cc9c66bf367" }, "source": [ "After selecting our 10 tags, we have a 4.7 GiB dataset, which we split into training and test sets." ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "_cell_guid": "0af42cdc-e0a2-4f14-a01b-c06814555a31", "_uuid": "f129a969cb8f794c741ed2ae83189155389d976e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4,7G\t/Users/y.kashnitsky/Documents/stackoverflow_10mln//stackoverflow_10mln.vw\r\n", "1,6G\t/Users/y.kashnitsky/Documents/stackoverflow_10mln//stackoverflow_test.vw\r\n", "3,1G\t/Users/y.kashnitsky/Documents/stackoverflow_10mln//stackoverflow_train.vw\r\n" ] } ], "source": [ "!du -hs $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_*.vw" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "d4d62394-a317-49b2-bcf8-143ad1088803", "_uuid": "b13bba2f5490e6032a262f22812f8ebc71e20df7" }, "source": [ "We will process the training set (3.1 GiB) with Vowpal Wabbit and the following arguments: \n", "- --oaa 10 – one-against-all multiclass classification with 10 classes\n", "- -d – path to data\n", "- -f – path to output file of the trained model\n", "- -b 28 – we will use 28 bits for hashing, resulting in a $2^{28}$-sized feature 
space\n", "- --random_seed 17 – fix the random seed for reproducibility" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "_cell_guid": "b442e9c2-f882-4a50-8d86-5f20af65680f", "_uuid": "e958864c6d5f006d01b846a1af40f28943303369" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 559 ms, sys: 171 ms, total: 730 ms\n", "Wall time: 38.5 s\n" ] } ], "source": [ "%%time\n", "!vw --oaa 10 -d $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_train.vw \\\n", "-f vw_model1_10mln.vw -b 28 --random_seed 17 --quiet" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "_cell_guid": "084d7a47-841b-48ea-8a99-2a73f6fc623d", "_uuid": "966f6361ac0b979a32f769f10b2bfa885b04f497" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 322 ms, sys: 97.4 ms, total: 420 ms\n", "Wall time: 22.8 s\n" ] } ], "source": [ "%%time\n", "!vw -t -i vw_model1_10mln.vw -d $PATH_TO_STACKOVERFLOW_DATA/stackoverflow_test.vw \\\n", "-p $PATH_TO_STACKOVERFLOW_DATA/vw_test_pred.csv --random_seed 17 --quiet" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "_cell_guid": "5b687d6d-2184-447b-83d3-d9d821afc7ed", "_uuid": "23ba328602f55cb4e93103c46f2f4866645dbd94" }, "outputs": [ { "data": { "text/plain": [ "0.91728604842865913" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vw_pred = np.loadtxt(os.path.join(PATH_TO_STACKOVERFLOW_DATA, \n", " 'vw_test_pred.csv'))\n", "test_labels = np.loadtxt(os.path.join(PATH_TO_STACKOVERFLOW_DATA, \n", " 'stackoverflow_test_labels.txt'))\n", "accuracy_score(test_labels, vw_pred)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "832a61d7-5d5e-411a-b6e3-e0817b2db0b1", "_uuid": "623eef2b8d8b5afe558fd92279ef6e847071d247" }, "source": [ "The model was trained and made predictions in about a minute: 38.5 s to train and 22.8 s to predict (check it yourself; these timings are for a MacBook Pro, mid 2015, 2.2 GHz Intel Core i7, 16GB RAM). Its accuracy is almost 92%. All of this without a Hadoop cluster! 
:) Impressive, isn't it?" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "91cbf31e-43a1-482d-af7a-02de3826b00f", "_uuid": "bf8903c882ff92061cb8ad971f097734254ed866" }, "source": [ "\n", "## 4. Useful links\n", "- Official VW [documentation](https://github.com/JohnLangford/vowpal_wabbit/wiki) on GitHub\n", "- [\"Numerical Computation\" chapter](http://www.deeplearningbook.org/contents/numerical.html) of the [Deep Learning book](http://www.deeplearningbook.org/)\n", "- [\"Convex Optimization\" by Stephen Boyd and Lieven Vandenberghe](https://www.amazon.com/Convex-Optimization-Stephen-Boyd/dp/0521833787)\n", "- \"Command-line Tools can be 235x Faster than your Hadoop Cluster\" [post](https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)\n", "- Benchmarking various ML algorithms on the Criteo 1TB dataset on [GitHub](https://github.com/rambler-digital-solutions/criteo-1tb-benchmark)\n", "- [VW on FastML.com](http://fastml.com/blog/categories/vw/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }