{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GraphLab Create Classification Benchmark - Criteo Terabyte Dataset\n",
"## AWS EC2 Benchmark Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should set the following key for the code to run.\n",
"\n",
"The GraphLab Product Key should have been e-mailed to you after you [registered on the Dato website](https://dato.com/download/). If you register yet, do it now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset in its original format can be found [here](http://labs.criteo.com/downloads/download-terabyte-click-logs/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"graphlab_create_product_key = 'YOUR_PRODUCT_KEY'\n",
"train_over_subset = True\n",
"train_over_transformed = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook should be used when running the GraphLab Create PageRank Benchmark [over an EC2 instance as described here](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf). If you are running this on your own machine, change the following flag from `True` to `False`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"running_on_ec2 = True\n",
"# running_on_ec2 = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize and mount SSDs that will be used as cache locations\n",
"\n",
"**If you are not running on EC2, skip this stage to the [Initialize GraphLab Create](#init_glc) step.**\n",
"\n",
"The following cell will initialize and mount the ephemeral SSD drives that are available on your instance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%bash\n",
"# initialize filesystem on SSD drives\n",
"sudo mkfs -t ext4 /dev/xvdb\n",
"sudo mkfs -t ext4 /dev/xvdc\n",
"\n",
"# create mount points for SSD drives\n",
"sudo mkdir -p /mnt/tmp1\n",
"sudo mkdir -p /mnt/tmp2\n",
"\n",
"# mount SSD drives on created points and temporary file locations\n",
"sudo mount /dev/xvdb /mnt/tmp1\n",
"sudo mount /dev/xvdc /mnt/tmp2\n",
"sudo mount /dev/xvdb /tmp\n",
"sudo mount /dev/xvdc /var/tmp\n",
"\n",
"# set permissions for mounted locations\n",
"sudo chown ubuntu:ubuntu /mnt/tmp1\n",
"sudo chown ubuntu:ubuntu /mnt/tmp2"
]
},
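{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the following cell lists the filesystems now backing the cache and temporary locations, so you can confirm the SSDs were mounted. The device names used above are typical for these instance types but may differ on yours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"# verify the SSD-backed mount points are in place\n",
"df -h /mnt/tmp1 /mnt/tmp2 /tmp /var/tmp"
]
},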
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%bash\n",
"# Mount EBS data volumn\n",
"# You should attach an EBS volume with at least 500G of space \n",
"# Assuming the disk is mounted at /dev/xvdd\n",
"\n",
"sudo mkdir -p /mnt/data\n",
"\n",
"if grep -qs '/mnt/data' /proc/mounts; then\n",
" echo \"EBS volume seems to be already mounted.\"\n",
"else\n",
" sudo mount /dev/xvdd /mnt/data\n",
" if [ $? -ne 0 ]; then\n",
" sudo mkfs -t ext4 /dev/xvdd\n",
" sudo mount /dev/xvdd /mnt/data\n",
" fi\n",
"fi\n",
"\n",
"sudo chown -R ubuntu:ubuntu /mnt/data"
]
},
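{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again optional: confirm the data volume is mounted and has enough free space for the compressed daily logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"# verify the EBS data volume is mounted and check its free space\n",
"df -h /mnt/data"
]
},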
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download the Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"cd /mnt/data\n",
"for i in {0..23}; do\n",
" wget --continue --timestamping http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_${i}.gz\n",
"done"
]
},
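{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, it is worth confirming that all 24 files (day_0.gz through day_23.gz) are present. This simple listing is only a sketch; it does not verify checksums."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"# confirm all 24 daily logs were downloaded\n",
"ls -lh /mnt/data/day_*.gz"
]
},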
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Initialize GraphLab Create and Set Runtime Configurations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import graphlab as gl\n",
"\n",
"if gl.product_key.get_product_key() is None:\n",
" gl.product_key.set_product_key(graphlab_create_product_key)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Set the cache locations to the SSDs.\n",
"if running_on_ec2:\n",
" gl.set_runtime_config(\"GRAPHLAB_CACHE_FILE_LOCATIONS\", \"/mnt/tmp1:/mnt/tmp2\")\n",
"\n",
"from multiprocessing import cpu_count\n",
"gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', cpu_count())\n",
"gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 20 * 1024 * 1024 * 1024)\n",
"gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 20 * 1024 * 1024 * 1024)"
]
},
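{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the settings took effect, you can read the runtime configuration back. A minimal sketch, assuming `gl.get_runtime_config()` returns the current settings as a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# print back the cache-related settings to confirm they took effect\n",
"conf = gl.get_runtime_config()\n",
"for key in ['GRAPHLAB_CACHE_FILE_LOCATIONS',\n",
"            'GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS',\n",
"            'GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY',\n",
"            'GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE']:\n",
"    print key, '=', conf.get(key)"
]
},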
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data\n",
"\n",
"There are 24 days worth of data, we will use last 2 days as a testing set.
\n",
"For the first part, we will use the first 22 days as a training set. For the second part, we will use the first 20 days for fitting feature engineering transformations, and 2 days for a training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data_dir = '/mnt/data'\n",
"\n",
"def load_days(start, end):\n",
" data = gl.SFrame()\n",
" for i in range(start, end + 1):\n",
" data = data.append(gl.SFrame.read_csv('%s/day_%d.gz' % (data_dir, i),\n",
" delimiter='\\t', header=False, verbose=False))\n",
" return data\n",
"\n",
"# Load the fit set\n",
"fit_set = load_days(0, 19)\n",
"\n",
"# Load the training set\n",
"train = load_days(20, 21)\n",
"\n",
"# Create the full training set\n",
"full_train = fit_set.append(train)\n",
"\n",
"# Load the testing set\n",
"test = load_days(22, 23)"
]
},
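{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, report the size of each split. SFrames are loaded lazily, so counting rows may trigger a full pass over the data and take a while."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# report the number of rows in each split (may trigger materialization)\n",
"print 'Fit set (day_0..day_19):      ', fit_set.num_rows()\n",
"print 'Training set (day_20, day_21):', train.num_rows()\n",
"print 'Full training set (22 days):  ', full_train.num_rows()\n",
"print 'Testing set (day_22, day_23): ', test.num_rows()"
]
},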
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train.print_rows(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import time\n",
"from datetime import timedelta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training a GBT model on the full training set using a subset of the features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if train_over_subset:\n",
" target_feature = 'X1'\n",
" num_features = ['X%d' % (i) for i in xrange(2, 15)] # X2..X14\n",
" cat_features = ['X20', 'X27', 'X31', 'X39']\n",
"\n",
" start = time.time()\n",
"\n",
" model = gl.boosted_trees_classifier.create(full_train,\n",
" target=target_feature,\n",
" validation_set=test,\n",
" features=(num_features + cat_features),\n",
" max_iterations=5,\n",
" random_seed=0)\n",
"\n",
" print 'End-to-end training time:', timedelta(seconds=(time.time() - start))"
]
},
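{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once training finishes, the model can be scored on the held-out test days with the classifier's built-in `evaluate` method. A minimal sketch (the default metrics include accuracy and a confusion matrix):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if train_over_subset:\n",
"    # evaluate the subset-features model on the testing set (day_22, day_23)\n",
"    print model.evaluate(test)"
]
},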
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improve the model performance by using the Count Featurizer on all categorical columns\n",
"\n",
"Using the Count Featurizer, the model will perform slightly better, and will finish training much faster using less data."
]
},
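{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the transformation does before applying it at full scale, here is a toy illustration on a few hypothetical rows: each categorical value is replaced by statistics derived from how often it co-occurs with each value of the target in the fitted data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy illustration of the Count Featurizer on a tiny, hypothetical SFrame\n",
"toy = gl.SFrame({'X1': [0, 1, 0, 1, 1],\n",
"                 'Xc': ['a', 'a', 'b', 'b', 'a']})\n",
"toy_featurizer = gl.feature_engineering.CountFeaturizer(features=['Xc'], target='X1')\n",
"toy_featurizer.fit(toy)\n",
"toy_featurizer.transform(toy).print_rows()"
]
},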
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Transform only categorical features\n",
"categorical_features = ['X' + str(i) for i in range(15, 41)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if train_over_transformed:\n",
" start = time.time()\n",
"\n",
" # Fit the count featurizer on the fit set (first 20 days)\n",
" featurizer = gl.feature_engineering.CountFeaturizer(features=categorical_features, target='X1')\n",
" featurizer.fit(fit_set)\n",
"\n",
" # Transform the training set (days 21, 22) using the featurizer\n",
" transformed_train = featurizer.transform(train)\n",
"\n",
" # Transform the testing set (days 23, 24) using the featurizer\n",
" transformed_test = featurizer.transform(test)\n",
"\n",
" fit_transform_time = time.time() - start\n",
" print 'Fitting the count featurizer and transforming the data time:', timedelta(seconds=fit_transform_time)\n",
" \n",
" # See the transformed data\n",
" transformed_train.print_rows(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if train_over_transformed:\n",
" start = time.time()\n",
"\n",
" model = gl.boosted_trees_classifier.create(transformed_train,\n",
" target='X1', \n",
" validation_set=transformed_test,\n",
" max_iterations=5,\n",
" random_seed=0)\n",
"\n",
" training_time = time.time() - start\n",
" print 'Training time:', timedelta(seconds=training_time)\n",
" print 'End-to-end fitting and training time', timedelta(seconds=(fit_transform_time + training_time))"
]
}
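,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the Count-Featurized model can be scored on the transformed testing set, mirroring the evaluation sketch above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if train_over_transformed:\n",
"    # evaluate the Count-Featurized model on the transformed testing set\n",
"    print model.evaluate(transformed_test)"
]
}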
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}