{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GraphLab Create v the Criteo Dataset\n", "## or, How to Classify 1TB of Clicks using Gradient Boosted Decision Trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In June 2014, Criteo Labs have released a [1TB dataset of feature values and click feedback for millions of display\n", "ads](http://labs.criteo.com/downloads/download-terabyte-click-logs/). The data consists of 24 days of click feedback data. It was soon followed by a [Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge/data).\n", "\n", "The Kaggle competition was dedicated to predicting whether a user will click on an ad displayed to her. The data was 7 days of labelled data: 39 features per ad, and the label 1 if the ad was clicked, 0 otherwise. The full criteo dataset contains the data of 24 days, and weighs 1TB in this raw form. This full version is the dataset we are going to tackle today." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Data\n", "\n", "Below are the first 5 rows of the data. As you can see, it has gone through complete anonymization. The first, leftmost column - which we'll denote here as **X1** - is the target column. '1' means a displayed ad was clicked - '0' means it wasn't. Nice-and-binary.\n", "\n", "The next 13 columns (**X2-X14**) have numeric, integer values. These are straightforward to use as well. The last 26 columns (**X15-X40**) are categorical." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X1X2X3X4X5X6X7X8X9X10X11X12X13X14X15X16
15110None16None101471None306None62770d79e21f5d58
03235None1006150131575e5f3fd8da0aaffa6
0None2331146100997013101162770d79ad984203
0None24None1124None0563None220456None710103fd
06022361550018021582602e197c5c2ced437
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X17X18X19X20X21X22X23X24X25X26X27
afea442f945c7fcf38b027486fcd6dcb3580aa212880890346dedfa62e027dc10c7c423195981d1f00c5ffb7
6faa15d5da8a34213cd69f236fcd6dcbab16ed8143426c291df5e1547de9c0a96652dc6499eb4e2700c5ffb7
62bec60d386c49eee755064d6fcd6dcbb5f5eb62d1f2cc8b2e4e821f2e027dc10c7c42311271618400c5ffb7
c73d2eb50c758dfbf1738f486fcd6dcbe824fc1109f8a09de25a4c1112716184d49eb1df
a24276193f85ecaeb8c51ab76fcd6dcb26d0f5bb337bf7a5e25a4c116da2367ebf624fa3ec982ce0a77a4a56
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X28X29X30X31X32X33X34X35X36X37X38
be4ee5378a0b74cc4cdc3efad20856aab8170bba9512c20bc38e2f2814f65a5d25b1b089d7c1fc0b7caf609c
be4ee537f3bbfe994cdc3efad20856aaa1eb15119512c20bfebfd863a3323ca1c8e1ee561752e9e875350c8a
be4ee537f70f0d0b4cdc3efad20856aa628f1b8d9512c20bc38e2f2814f65a5d25b1b089d7c1fc0b34a9b905
b96f9e1a2b083b9610dd37441f7fc70ba1eb15119512c20bdc209cd3b8a81fb0
be4ee537eb24f5854cdc3efad20856aad9f758ff9512c20bc709ec072b07677ea89a92a5aa137169e619743b
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X39X40
30436bfced10571d
991321eab757e957
ff654802ed10571d
30436bfcb757e957
cdc3217eed10571d
\n", "[5 rows x 40 columns]
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CountFeaturizer - An Efficient Expansion of Categorical Features\n", "\n", "A funny thing about categorical features is that many algorithms have to expand them.\n", "\n", "By expansion we mean the following: suppose that this is my dataset:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categorical_feature_1categorical_feature_2label
ac0
ad0
aa1
bc1
bc0
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first categorical feature can take one of the values **{a, b}**.
\n", "The second categorical feature can take one of the values **{a, c, d}**.
\n", "\n", "Expansion of features means that instead of dealing with 2 features *per-se*, the algorithm will consider each categorical value to be a feature column of it's own, with the value of 1 if the data point (each row in our example) has it, 0 otherwise. You can read more about categorical feature expansion in [Srikrishna's notebook about feature engineering](https://dato.com/learn/gallery/notebooks/feature-engineering.html#Adding-Categorical-Features).\n", "\n", "For the purpose of expansion, the same categorical values in different feature columns (such as with value `a`) doesn't mean the algorithm will expand this categorical value into a single column. On the contrary - the expanded dataset will look like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categorical_feature_1.acategorical_feature_1.bcategorical_feature_2.acategorical_feature_2.ccategorical_feature_2.dlabel
100100
100010
101001
010101
010100
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With only 2 categorical features, one with 2 categories and the other with 3 categories, we suddenly have 5 numerical features. From 2 features to 5 in the most silly example we could produce! What would happen on the full dataset? This is not funny anymore...!\n", "\n", "To figure this out, we loaded 1000 rows into an SFrame named `sf1k` and checked the number of category-values using the following code:\n", "\n", "```python\n", "from collections import Counter\n", "c = Counter({\n", " colname: len(sf1k[colname].unique())\n", " for colname in [\"X%d\" % (i) for i in range(15, 41)]\n", " }\n", ")\n", "c.most_common()\n", "```\n", "\n", "We discovered that on the first 1000 rows, 26 categorical features will be expanded into 5537 features!\n", "\n", "```python\n", ">>> sum(c.itervalues()) # sum of number of unique values per categorical column\n", "5537\n", "\n", ">>> c.most_common() # distribution of values amongst the columns\n", "[('X17', 607),\n", " ('X26', 584),\n", " ('X38', 530),\n", " ('X21', 512),\n", " ('X16', 388),\n", " ('X29', 308),\n", " ('X34', 307),\n", " ('X15', 307),\n", " ('X36', 300),\n", " ('X24', 295),\n", " ('X35', 267),\n", " ('X19', 254),\n", " ('X25', 218),\n", " ('X18', 182),\n", " ('X22', 181),\n", " ('X37', 121),\n", " ('X32', 47),\n", " ('X28', 46),\n", " ('X40', 17),\n", " ('X23', 14),\n", " ('X39', 13),\n", " ('X33', 12),\n", " ('X30', 12),\n", " ('X27', 9),\n", " ('X31', 4),\n", " ('X20', 2)]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the context of Gradient Boosted Trees, each categorical value may have its own node in the tree.\n", "\n", "While GraphLab can handle this situation, this is a very heavy-duty task, when there's another way to represent these categorical values. In the benchmark notebook, we will build a model using only those categorical feature columns with as little categorical values as possible." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other way is to use our [Count Featurizer API](https://dato.com/products/create/docs/generated/graphlab.toolkits.feature_engineering.CountFeaturizer.html#graphlab.toolkits.feature_engineering.CountFeaturizer),\n", "which implements the [method described here](https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/).\n", "\n", "You will see this in practice in the benchmark notebook, and you're invited to [read more about it in our user guide](https://dato.com/learn/userguide/feature-engineering/count_featurizer.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tweaking Runtime Configuration\n", "\n", "Inspecting the code, you will see that before any data loading, model training, and fine-tuning, there are two changes to the runtime configuration:\n", "```python\n", "gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 20 * 1024 * 1024 * 1024)\n", "gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 20 * 1024 * 1024 * 1024)\n", "```\n", "\n", "When dealing with extremely large datasets on a sufficiently powerful machine, such tweaks can save a lot of time.\n", "\n", "**MAXIMUM_CACHE_CAPACITY** is the number of bytes to hold in memory before flushing to disk. SFrame and SArray are \"external memory\" data structures, but we would rather not flush every intermediate SFrame/SArray object to disk. So we have a cache IO layer acting as a disk buffer, and this constant is the capacity of that buffer. Once the size of new intermediate SFrames/SArrays exceeds the buffer, they are flushed to disk.\n", "\n", "**MAXIMUM_CACHE_CAPACITY_PER_FILE** is similar to **MAXIMUM_CACHE_CAPACITY**, but limits the cache for a single file.\n", "\n", "The goal of these changes is to improve disk IO efficiency, because the boosted trees model runs in external-memory mode. Before the first iteration, we preprocess and sort the columns and save them on disk. At each iteration, at every depth, these \"columns\" are loaded back into memory to search for the best split." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Note About Using a Validation Set in GraphLab Create\n", "\n", "You might notice that during model creation (a.k.a. training), the `create()` method is supplied with the test split as its `validation_set`:\n", "\n", "```python\n", "model = gl.boosted_trees_classifier.create(full_train,\n", "                                           target=target_feature,\n", "                                           validation_set=test,\n", "                                           features=(num_features + cat_features),\n", "                                           max_iterations=5,\n", "                                           random_seed=0)\n", "```\n", "\n", "Are we using test data during training? Are we breaking one of the ground rules of machine learning?\n", "\n", "Rest assured - we aren't. Supplying a validation set is equivalent to running `model.evaluate(validation_set)` after each iteration of the algorithm. It does not affect the training process, the direction of the gradients, etc. For our purposes, this is done only to track the progress of the algorithm and see how each iteration affects the final accuracy (as the validation accuracy here is really the test accuracy).\n", "\n", "With some algorithms, though, a validation set can be genuinely useful - so much so that GraphLab will set aside 5% of the supplied training data to be used as a validation set. If the algorithm implements an early-stopping option, it can stop training when there is no significant improvement in the validation metrics after some number of iterations. This option is available for boosted trees, but is turned off by default.\n", "\n", "For more information, read the documentation for the **early_stopping_rounds** parameter in the [boosted trees API documentation](https://dato.com/products/create/docs/generated/graphlab.boosted_trees_classifier.create.html#graphlab.boosted_trees_classifier.create)."
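, "\n", "\n",
"For instance, early stopping could be switched on at creation time. This is only a sketch, not the benchmark code: the `early_stopping_rounds` value is an arbitrary choice, and `validation` is a hypothetical name for whichever held-out set you pass as the validation set:\n", "\n",
"```python\n",
"model = gl.boosted_trees_classifier.create(full_train,\n",
"                                           target=target_feature,\n",
"                                           validation_set=validation,\n",
"                                           features=(num_features + cat_features),\n",
"                                           max_iterations=100,\n",
"                                           early_stopping_rounds=5,\n",
"                                           random_seed=0)\n",
"```"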
] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }