# GraphLab Create Classification Benchmark - Criteo Terabyte Dataset
## AWS EC2 Benchmark Notebook

Set your GraphLab Create product key in the first code cell below before running the rest of the notebook.

The GraphLab Create product key should have been e-mailed to you after you [registered on the Dato website](https://dato.com/download/). If you have not registered yet, do so now.

The dataset in its original format can be found [here](http://labs.criteo.com/downloads/download-terabyte-click-logs/).

In [None]:
graphlab_create_product_key = 'YOUR_PRODUCT_KEY'
train_over_subset = True
train_over_transformed = True

This notebook is meant to be run on an EC2 instance set up [as described in the GraphLab Create PageRank Benchmark guide](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf). If you are running it on your own machine, change the following flag from `True` to `False`.

In [None]:
running_on_ec2 = True
# running_on_ec2 = False

## Initialize and mount SSDs that will be used as cache locations

**If you are not running on EC2, skip ahead to the [Initialize GraphLab Create](#init-glc) step.**

The following cell will initialize and mount the ephemeral SSD drives that are available on your instance.

In [None]:
%%bash
# initialize filesystem on SSD drives
sudo mkfs -t ext4 /dev/xvdb
sudo mkfs -t ext4 /dev/xvdc

# create mount points for SSD drives
sudo mkdir -p /mnt/tmp1
sudo mkdir -p /mnt/tmp2

# mount SSD drives on created points and temporary file locations
sudo mount /dev/xvdb /mnt/tmp1
sudo mount /dev/xvdc /mnt/tmp2
sudo mount /dev/xvdb /tmp
sudo mount /dev/xvdc /var/tmp

# set permissions for mounted locations
sudo chown ubuntu:ubuntu /mnt/tmp1
sudo chown ubuntu:ubuntu /mnt/tmp2

In [None]:
%%bash
# Mount the EBS data volume.
# Attach an EBS volume with at least 500 GB of space before running this cell.
# This cell assumes the volume is attached as /dev/xvdd.

sudo mkdir -p /mnt/data

if grep -qs '/mnt/data' /proc/mounts; then
    echo "EBS volume seems to be already mounted."
else
    sudo mount /dev/xvdd /mnt/data
    if [ $? -ne 0 ]; then
        sudo mkfs -t ext4 /dev/xvdd
        sudo mount /dev/xvdd /mnt/data
    fi
fi

sudo chown -R ubuntu:ubuntu /mnt/data
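
The `/proc/mounts` check used in the cell above can also be done from Python, for example before pointing GraphLab Create at the SSD cache locations. The following is a minimal sketch and not part of the original benchmark; the helper name `is_mounted` is hypothetical, and it assumes a Linux host where `/proc/mounts` lists one mount per line with the mount point in the second field.

```python
def is_mounted(mount_point, mounts_path='/proc/mounts'):
    """Return True if mount_point is listed as a mount target in mounts_path."""
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            # Each line reads: <device> <mount point> <fs type> <options> ...
            if len(fields) >= 2 and fields[1] == mount_point:
                return True
    return False
```

On the EC2 instance, `is_mounted('/mnt/data')` should return `True` once the EBS volume has been mounted.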

## Download the Dataset

In [None]:
%%bash
cd /mnt/data
for i in {0..23}; do
    wget --continue --timestamping http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_${i}.gz
done
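
Before moving on, it can be worth confirming that all 24 files arrived. The following is a minimal sketch and not part of the original benchmark; the helper name `missing_day_files` is hypothetical, and it only checks that each file exists and is non-empty, not that the archive is complete or uncorrupted.

```python
import os

def missing_day_files(data_dir, num_days=24):
    """Return the names of expected day_<i>.gz files that are absent or empty."""
    missing = []
    for i in range(num_days):
        name = 'day_%d.gz' % i
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

Calling `missing_day_files('/mnt/data')` returns an empty list when every file from `day_0.gz` to `day_23.gz` is present and non-empty.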

<span id="init-glc"></span>
## Initialize GraphLab Create and Set Runtime Configurations

In [None]:
import graphlab as gl

if gl.product_key.get_product_key() is None:
    gl.product_key.set_product_key(graphlab_create_product_key)

In [None]:
# Set the cache locations to the SSDs.
if running_on_ec2:
    gl.set_runtime_config("GRAPHLAB_CACHE_FILE_LOCATIONS", "/mnt/tmp1:/mnt/tmp2")

from multiprocessing import cpu_count
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', cpu_count())
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 20 * 1024 * 1024 * 1024)
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 20 * 1024 * 1024 * 1024)

## Load the data

There are 24 days' worth of data; we will use the last 2 days as the testing set.<br>
For the first part, we will use the first 22 days as the training set. For the second part, we will use the first 20 days to fit the feature engineering transformations and the following 2 days as the training set.
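
To make the split concrete, here is a small sketch of how the day indices divide up. It is an illustration only; the function name `split_days` and its parameters are hypothetical. Note that the files are numbered from zero (`day_0.gz` to `day_23.gz`), so "the first 20 days" corresponds to indices 0 through 19.

```python
def split_days(num_days=24, num_test=2, num_train=2):
    """Split zero-based day indices into fit, train, full-train and test sets."""
    days = list(range(num_days))
    test_days = days[num_days - num_test:]          # last days: testing set
    full_train_days = days[:num_days - num_test]    # everything before the test days
    split = len(full_train_days) - num_train
    fit_days = full_train_days[:split]              # used to fit the transformations
    train_days = full_train_days[split:]            # used to train on transformed data
    return fit_days, train_days, full_train_days, test_days
```

With the defaults, the fit set covers indices 0 to 19, the training set 20 and 21, the full training set 0 to 21, and the testing set 22 and 23.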

In [None]:
data_dir = '/mnt/data'

def load_days(start, end):
    data = gl.SFrame()
    for i in range(start, end + 1):
        data = data.append(gl.SFrame.read_csv('%s/day_%d.gz' % (data_dir, i),
                          delimiter='\t', header=False, verbose=False))
    return data

# Load the fit set
fit_set = load_days(0, 19)

# Load the training set
train = load_days(20, 21)

# Create the full training set
full_train = fit_set.append(train)

# Load the testing set
test = load_days(22, 23)

In [None]:
train.print_rows(3)

In [None]:
import time
from datetime import timedelta

## Training a GBT model on the full training set using a subset of the features

In [None]:
if train_over_subset:
    target_feature = 'X1'
    num_features = ['X%d' % (i) for i in xrange(2, 15)] # X2..X14
    cat_features = ['X20', 'X27', 'X31', 'X39']

    start = time.time()

    model = gl.boosted_trees_classifier.create(full_train,
                                               target=target_feature,
                                               validation_set=test,
                                               features=(num_features + cat_features),
                                               max_iterations=5,
                                               random_seed=0)

    print 'End-to-end training time:', timedelta(seconds=(time.time() - start))

## Improve the model performance by using the Count Featurizer on all categorical columns

With the Count Featurizer, the model performs slightly better and finishes training much faster, because the boosted trees are trained on only 2 days of transformed data rather than on all 22 days.
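
Conceptually, count featurization replaces each categorical value with statistics about how often that value occurred with each target class in the fit data. The sketch below illustrates the idea in plain Python on a toy column. It is a simplified illustration under that assumption, not GraphLab Create's implementation; the function names are hypothetical, and the toolkit's actual output columns and counting strategy may differ.

```python
from collections import defaultdict

def fit_counts(values, targets):
    """Count, for each categorical value, how often each target class occurs."""
    counts = defaultdict(lambda: defaultdict(int))
    for value, target in zip(values, targets):
        counts[value][target] += 1
    return counts

def transform_counts(values, counts, classes):
    """Replace each value with its per-class counts; unseen values map to zeros."""
    return [[counts[v][c] if v in counts else 0 for c in classes]
            for v in values]

# Toy example: fit on one column, then transform new rows.
counts = fit_counts(['a', 'b', 'a', 'a'], [1, 0, 0, 1])
rows = transform_counts(['a', 'b', 'z'], counts, classes=[0, 1])
# rows is [[1, 2], [1, 0], [0, 0]]
```

The counts are learned on one portion of the data (the fit set) and then applied to separate training and testing portions, which mirrors the fit/transform split used in the cells below.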

In [None]:
# Transform only categorical features
categorical_features = ['X' + str(i) for i in range(15, 41)]

In [None]:
if train_over_transformed:
    start = time.time()

    # Fit the count featurizer on the fit set (first 20 days)
    featurizer = gl.feature_engineering.CountFeaturizer(features=categorical_features, target='X1')
    featurizer.fit(fit_set)

    # Transform the training set (days 21 and 22, i.e. files day_20 and day_21) using the featurizer
    transformed_train = featurizer.transform(train)

    # Transform the testing set (days 23 and 24, i.e. files day_22 and day_23) using the featurizer
    transformed_test = featurizer.transform(test)

    fit_transform_time = time.time() - start
    print 'Fitting the count featurizer and transforming the data time:', timedelta(seconds=fit_transform_time)
    
    # See the transformed data
    transformed_train.print_rows(3)

In [None]:
if train_over_transformed:
    start = time.time()

    model = gl.boosted_trees_classifier.create(transformed_train,
                                               target='X1', 
                                               validation_set=transformed_test,
                                               max_iterations=5,
                                               random_seed=0)

    training_time = time.time() - start
    print 'Training time:', timedelta(seconds=training_time)
    print 'End-to-end fitting and training time:', timedelta(seconds=(fit_transform_time + training_time))