{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GraphLab Create v the Criteo Dataset\n",
"## or, How to Classify 1TB of Clicks using Gradient Boosted Decision Trees"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In June 2014, Criteo Labs have released a [1TB dataset of feature values and click feedback for millions of display\n",
"ads](http://labs.criteo.com/downloads/download-terabyte-click-logs/). The data consists of 24 days of click feedback data. It was soon followed by a [Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge/data).\n",
"\n",
"The Kaggle competition was dedicated to predicting whether a user will click on an ad displayed to her. The data was 7 days of labelled data: 39 features per ad, and the label 1 if the ad was clicked, 0 otherwise. The full criteo dataset contains the data of 24 days, and weighs 1TB in this raw form. This full version is the dataset we are going to tackle today."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Data\n",
"\n",
"Below are the first 5 rows of the data. As you can see, it has gone through complete anonymization. The first, leftmost column - which we'll denote here as **X1** - is the target column. '1' means a displayed ad was clicked - '0' means it wasn't. Nice-and-binary.\n",
"\n",
"The next 13 columns (**X2-X14**) have numeric, integer values. These are straightforward to use as well. The last 26 columns (**X15-X40**) are categorical."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" \n",
" X1 | \n",
" X2 | \n",
" X3 | \n",
" X4 | \n",
" X5 | \n",
" X6 | \n",
" X7 | \n",
" X8 | \n",
" X9 | \n",
" X10 | \n",
" X11 | \n",
" X12 | \n",
" X13 | \n",
" X14 | \n",
" X15 | \n",
" X16 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 110 | \n",
" None | \n",
" 16 | \n",
" None | \n",
" 1 | \n",
" 0 | \n",
" 14 | \n",
" 7 | \n",
" 1 | \n",
" None | \n",
" 306 | \n",
" None | \n",
" 62770d79 | \n",
" e21f5d58 | \n",
"
\n",
" \n",
" 0 | \n",
" 32 | \n",
" 3 | \n",
" 5 | \n",
" None | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 61 | \n",
" 5 | \n",
" 0 | \n",
" 1 | \n",
" 3157 | \n",
" 5 | \n",
" e5f3fd8d | \n",
" a0aaffa6 | \n",
"
\n",
" \n",
" 0 | \n",
" None | \n",
" 233 | \n",
" 1 | \n",
" 146 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 99 | \n",
" 7 | \n",
" 0 | \n",
" 1 | \n",
" 3101 | \n",
" 1 | \n",
" 62770d79 | \n",
" ad984203 | \n",
"
\n",
" \n",
" 0 | \n",
" None | \n",
" 24 | \n",
" None | \n",
" 11 | \n",
" 24 | \n",
" None | \n",
" 0 | \n",
" 56 | \n",
" 3 | \n",
" None | \n",
" 2 | \n",
" 20456 | \n",
" None | \n",
" | \n",
" 710103fd | \n",
"
\n",
" \n",
" 0 | \n",
" 60 | \n",
" 223 | \n",
" 6 | \n",
" 15 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 8 | \n",
" 0 | \n",
" 2 | \n",
" 1582 | \n",
" 6 | \n",
" 02e197c5 | \n",
" c2ced437 | \n",
"
\n",
"
\n",
"
\n",
" \n",
" X17 | \n",
" X18 | \n",
" X19 | \n",
" X20 | \n",
" X21 | \n",
" X22 | \n",
" X23 | \n",
" X24 | \n",
" X25 | \n",
" X26 | \n",
" X27 | \n",
"
\n",
" \n",
" afea442f | \n",
" 945c7fcf | \n",
" 38b02748 | \n",
" 6fcd6dcb | \n",
" 3580aa21 | \n",
" 28808903 | \n",
" 46dedfa6 | \n",
" 2e027dc1 | \n",
" 0c7c4231 | \n",
" 95981d1f | \n",
" 00c5ffb7 | \n",
"
\n",
" \n",
" 6faa15d5 | \n",
" da8a3421 | \n",
" 3cd69f23 | \n",
" 6fcd6dcb | \n",
" ab16ed81 | \n",
" 43426c29 | \n",
" 1df5e154 | \n",
" 7de9c0a9 | \n",
" 6652dc64 | \n",
" 99eb4e27 | \n",
" 00c5ffb7 | \n",
"
\n",
" \n",
" 62bec60d | \n",
" 386c49ee | \n",
" e755064d | \n",
" 6fcd6dcb | \n",
" b5f5eb62 | \n",
" d1f2cc8b | \n",
" 2e4e821f | \n",
" 2e027dc1 | \n",
" 0c7c4231 | \n",
" 12716184 | \n",
" 00c5ffb7 | \n",
"
\n",
" \n",
" c73d2eb5 | \n",
" 0c758dfb | \n",
" f1738f48 | \n",
" 6fcd6dcb | \n",
" e824fc11 | \n",
" 09f8a09d | \n",
" e25a4c11 | \n",
" | \n",
" | \n",
" 12716184 | \n",
" d49eb1df | \n",
"
\n",
" \n",
" a2427619 | \n",
" 3f85ecae | \n",
" b8c51ab7 | \n",
" 6fcd6dcb | \n",
" 26d0f5bb | \n",
" 337bf7a5 | \n",
" e25a4c11 | \n",
" 6da2367e | \n",
" bf624fa3 | \n",
" ec982ce0 | \n",
" a77a4a56 | \n",
"
\n",
"
\n",
"
\n",
" \n",
" X28 | \n",
" X29 | \n",
" X30 | \n",
" X31 | \n",
" X32 | \n",
" X33 | \n",
" X34 | \n",
" X35 | \n",
" X36 | \n",
" X37 | \n",
" X38 | \n",
"
\n",
" \n",
" be4ee537 | \n",
" 8a0b74cc | \n",
" 4cdc3efa | \n",
" d20856aa | \n",
" b8170bba | \n",
" 9512c20b | \n",
" c38e2f28 | \n",
" 14f65a5d | \n",
" 25b1b089 | \n",
" d7c1fc0b | \n",
" 7caf609c | \n",
"
\n",
" \n",
" be4ee537 | \n",
" f3bbfe99 | \n",
" 4cdc3efa | \n",
" d20856aa | \n",
" a1eb1511 | \n",
" 9512c20b | \n",
" febfd863 | \n",
" a3323ca1 | \n",
" c8e1ee56 | \n",
" 1752e9e8 | \n",
" 75350c8a | \n",
"
\n",
" \n",
" be4ee537 | \n",
" f70f0d0b | \n",
" 4cdc3efa | \n",
" d20856aa | \n",
" 628f1b8d | \n",
" 9512c20b | \n",
" c38e2f28 | \n",
" 14f65a5d | \n",
" 25b1b089 | \n",
" d7c1fc0b | \n",
" 34a9b905 | \n",
"
\n",
" \n",
" b96f9e1a | \n",
" 2b083b96 | \n",
" 10dd3744 | \n",
" 1f7fc70b | \n",
" a1eb1511 | \n",
" 9512c20b | \n",
" | \n",
" | \n",
" | \n",
" dc209cd3 | \n",
" b8a81fb0 | \n",
"
\n",
" \n",
" be4ee537 | \n",
" eb24f585 | \n",
" 4cdc3efa | \n",
" d20856aa | \n",
" d9f758ff | \n",
" 9512c20b | \n",
" c709ec07 | \n",
" 2b07677e | \n",
" a89a92a5 | \n",
" aa137169 | \n",
" e619743b | \n",
"
\n",
"
\n",
"
\n",
" \n",
" X39 | \n",
" X40 | \n",
"
\n",
" \n",
" 30436bfc | \n",
" ed10571d | \n",
"
\n",
" \n",
" 991321ea | \n",
" b757e957 | \n",
"
\n",
" \n",
" ff654802 | \n",
" ed10571d | \n",
"
\n",
" \n",
" 30436bfc | \n",
" b757e957 | \n",
"
\n",
" \n",
" cdc3217e | \n",
" ed10571d | \n",
"
\n",
"
\n",
"[5 rows x 40 columns]
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CountFeaturizer - An Efficient Expansion of Categorical Features\n",
"\n",
"A funny thing about categorical features is that many algorithms have to expand them.\n",
"\n",
"By expansion we mean the following: suppose that this is my dataset:\n",
"\n",
"\n",
" \n",
" categorical_feature_1 | \n",
" categorical_feature_2 | \n",
" label | \n",
"
\n",
"\n",
" \n",
" a | \n",
" c | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" a | \n",
" d | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" a | \n",
" a | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" b | \n",
" c | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" b | \n",
" c | \n",
" 0 | \n",
"
\n",
" \n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first categorical feature can take one of the values **{a, b}**.
\n",
"The second categorical feature can take one of the values **{a, c, d}**.
\n",
"\n",
"Expansion of features means that instead of dealing with 2 features *per-se*, the algorithm will consider each categorical value to be a feature column of it's own, with the value of 1 if the data point (each row in our example) has it, 0 otherwise. You can read more about categorical feature expansion in [Srikrishna's notebook about feature engineering](https://dato.com/learn/gallery/notebooks/feature-engineering.html#Adding-Categorical-Features).\n",
"\n",
"For the purpose of expansion, the same categorical values in different feature columns (such as with value `a`) doesn't mean the algorithm will expand this categorical value into a single column. On the contrary - the expanded dataset will look like this:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
" categorical_feature_1.a | \n",
" categorical_feature_1.b | \n",
" categorical_feature_2.a | \n",
" categorical_feature_2.c | \n",
" categorical_feature_2.d | \n",
" label | \n",
"
\n",
" \n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
"
]
},
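{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the way, this kind of expansion can be produced with GraphLab Create's `OneHotEncoder` transformer, which emits a compact, sparse dict column rather than literal 0/1 columns. A minimal sketch on the toy dataset above (the SFrame construction is our own):\n",
"\n",
"```python\n",
"import graphlab as gl\n",
"from graphlab.toolkits.feature_engineering import OneHotEncoder\n",
"\n",
"toy = gl.SFrame({'categorical_feature_1': ['a', 'a', 'a', 'b', 'b'],\n",
"                 'categorical_feature_2': ['c', 'd', 'a', 'c', 'c'],\n",
"                 'label': [0, 0, 1, 1, 0]})\n",
"\n",
"# Fit the encoder on the two categorical columns, then transform the data\n",
"encoder = gl.feature_engineering.create(\n",
"    toy, OneHotEncoder(features=['categorical_feature_1', 'categorical_feature_2']))\n",
"toy_encoded = encoder.transform(toy)\n",
"```"
]
},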
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With only 2 categorical features, one with 2 categories and the other with 3 categories, we suddenly have 5 numerical features. From 2 features to 5 in the most silly example we could produce! What would happen on the full dataset? This is not funny anymore...!\n",
"\n",
"To figure this out, we loaded 1000 rows into an SFrame named `sf1k` and checked the number of category-values using the following code:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"c = Counter({\n",
" colname: len(sf1k[colname].unique())\n",
" for colname in [\"X%d\" % (i) for i in range(15, 41)]\n",
" }\n",
")\n",
"c.most_common()\n",
"```\n",
"\n",
"We discovered that on the first 1000 rows, 26 categorical features will be expanded into 5537 features!\n",
"\n",
"```python\n",
">>> sum(c.itervalues()) # sum of number of unique values per categorical column\n",
"5537\n",
"\n",
">>> c.most_common() # distribution of values amongst the columns\n",
"[('X17', 607),\n",
" ('X26', 584),\n",
" ('X38', 530),\n",
" ('X21', 512),\n",
" ('X16', 388),\n",
" ('X29', 308),\n",
" ('X34', 307),\n",
" ('X15', 307),\n",
" ('X36', 300),\n",
" ('X24', 295),\n",
" ('X35', 267),\n",
" ('X19', 254),\n",
" ('X25', 218),\n",
" ('X18', 182),\n",
" ('X22', 181),\n",
" ('X37', 121),\n",
" ('X32', 47),\n",
" ('X28', 46),\n",
" ('X40', 17),\n",
" ('X23', 14),\n",
" ('X39', 13),\n",
" ('X33', 12),\n",
" ('X30', 12),\n",
" ('X27', 9),\n",
" ('X31', 4),\n",
" ('X20', 2)]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the context of Gradient Boosted Trees, each categorical value may have its own node in the tree.\n",
"\n",
"While GraphLab can handle this situation, this is a very heavy-duty task, when there's another way to represent these categorical values. In the benchmark notebook, we will build a model using only those categorical feature columns with as little categorical values as possible."
]
},
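{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reusing the `Counter` from above, picking only the low-cardinality columns could look like this (the cutoff of 50 is an arbitrary choice of ours):\n",
"\n",
"```python\n",
"# Keep only the categorical columns with at most 50 distinct values (arbitrary cutoff)\n",
"cat_features = [colname for colname, n_unique in c.most_common() if n_unique <= 50]\n",
"```"
]
},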
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other way is to use our [Count Featurizer API](https://dato.com/products/create/docs/generated/graphlab.toolkits.feature_engineering.CountFeaturizer.html#graphlab.toolkits.feature_engineering.CountFeaturizer),\n",
"which implements the [method described here](https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/).\n",
"\n",
"You will see this in practice in the benchmark notebook, and you're invited to [read more about in in our user guide](https://dato.com/learn/userguide/feature-engineering/count_featurizer.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tweaking Runtime Configuration\n",
"\n",
"Inspecting the code, you will see that before all the data loading, models training and fine tuning, there are two changes to the runtime config:\n",
"```python\n",
"gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 20 * 1024 * 1024 * 1024)\n",
"gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 20 * 1024 * 1024 * 1024)\n",
"```\n",
"\n",
"When dealing with extremely large datasets, on an amazingly powerful machine, such tweaks can save a lot of time.\n",
"\n",
"**MAXIMUM_CACHE_CAPACITY** is the number of bytes to hold in memory before flushing to disk. SFrame/SArray are all \"external memory\" datastructures, but we don't want to flush any SFrame/SArray objects to disk. So we have a cache IO layer acting as a disk buffer, and the constant is the capacity of the buffer. Once the size of new intermediate SFrame/SArray exceeds the buffer, they are flushed to disk.\n",
"\n",
"**MAXIMUM_CACHE_CAPACITY_PER_FILE** is similar to **MAXIMUM_CACHE_CAPACITY** but limit to a single file.\n",
"\n",
"The goal of these changes is to improve the disk IO efficiency because the boosted trees model is running in external memory mode. Before the first iteration, we preprocess and sort the columns and save on disk. Each iteration at every depth, these \"columns\" are loaded into memory to search the best split."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note about using a Validation Set in GraphLab Create\n",
"\n",
"You might notice that during model creation (AKA training), the `create()` method is supplied with the test split as its `validation set`:\n",
"\n",
"```python\n",
"model = gl.boosted_trees_classifier.create(full_train,\n",
" target=target_feature,\n",
" validation_set=test,\n",
" features=(num_features + cat_features),\n",
" max_iterations=5,\n",
" random_seed=0)\n",
"```\n",
"\n",
"Are we using test data during training? Are we breaking one of the ground rules of machine learning?\n",
"\n",
"Rest assured - we aren't doing that. Supplying a validation set is equivalent to running `model.evaludate(validation_set)` after each iteration of the algorithm. It does not affect the training process, the direction of the gradients, etc. For our purposes, this is only done to track the progress of the algorithm and see how each iteration affects the final accuracy (as the validation accuracy is really the test accuracy).\n",
"\n",
"With some algorithms, a validation set can be useful though, even to the point that GraphLab would take out 5% of the supplied train data to be used as the validation set. If the algorithm implements an early-stopping option, it can stop training if there is no significant improvement in the validation metrics after some iterations. This option is available for boosted trees but is turned off by default.\n",
"\n",
"For more information, read the documentation for the **early_stopping_rounds** parameter in the [boosted trees API documentation](https://dato.com/products/create/docs/generated/graphlab.boosted_trees_classifier.create.html#graphlab.boosted_trees_classifier.create)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}