{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GraphLab Create PageRank Benchmark - CommonCrawl 2012 Dataset\n",
    "## AWS EC2 Benchmark Notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should set the 3 following keys for the code to run (even if you are not running this benchmark on EC2).\n",
    "\n",
    "The GraphLab Product Key should have been e-mailed to you after you [registered on the Dato website](https://dato.com/download/). If you register yet, do it now.\n",
    "\n",
    "The AWS keys should be available to you via the AWS website. [Follow their instructions](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html) to get these keys. You will need these keys to access the S3 bucket where the CommonCrawl SGraph is stored. Any pair of credentials will do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "graphlab_create_product_key = 'YOUR_PRODUCT_KEY'\n",
    "aws_access_key_id='YOUR_ACCESS_KEY'\n",
    "aws_secret_access_key='YOUR_SECRET_KEY'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook should be used when running the GraphLab Create PageRank Benchmark [over an EC2 instance as described here](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf). If you are running this on your own machine, change the following flag from `True` to `False`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "running_on_ec2 = True\n",
    "# running_on_ec2 = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Skip this stage if you are not running on EC2 and proceed to the **Initialize GraphLab Create** step.\n",
    "\n",
    "### Initialize and mount SSDs that will be used as cache locations\n",
    "\n",
    "The following cell will initialize and mount the ephemeral SSD drives that are available on your instance.\n",
    "\n",
    "**If you are not running this benchmark from an S3 instance, skip this step.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "# initialize filesystem on SSD drives\n",
    "sudo mkfs -t ext4 /dev/xvdb\n",
    "sudo mkfs -t ext4 /dev/xvdc\n",
    "\n",
    "# create mount points for SSD drives\n",
    "sudo mkdir -p /mnt/tmp1\n",
    "sudo mkdir -p /mnt/tmp2\n",
    "\n",
    "# mount SSD drives on created points and temporary file locations\n",
    "sudo mount /dev/xvdb /mnt/tmp1\n",
    "sudo mount /dev/xvdc /mnt/tmp2\n",
    "sudo mount /dev/xvdb /tmp\n",
    "sudo mount /dev/xvdc /var/tmp\n",
    "\n",
    "# set permissions for mounted locations\n",
    "sudo chown ubuntu:ubuntu /mnt/tmp1\n",
    "sudo chown ubuntu:ubuntu /mnt/tmp2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize GraphLab Create"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Fill in YOUR_PRODUCT_KEY which you got from Dato; and from your AWS credentials, YOUR_ACCESS_KEY and YOUR_SECRET_KEY \n",
    "import graphlab as gl\n",
    "\n",
    "if gl.product_key.get_product_key() is None:\n",
    "    gl.product_key.set_product_key(graphlab_create_product_key)\n",
    "\n",
    "try:\n",
    "    gl.aws.get_credentials()\n",
    "except KeyError:\n",
    "    gl.aws.set_credentials(access_key_id=aws_access_key_id, \n",
    "                       secret_access_key=aws_secret_access_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Set the cache locations to the SSDs.\n",
    "if running_on_ec2:\n",
    "    gl.set_runtime_config(\"GRAPHLAB_CACHE_FILE_LOCATIONS\", \"/mnt/tmp1:/mnt/tmp2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Benchmark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load the CommonCrawl 2012 SGraph\n",
    "s3_sgraph_path = \"s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph\"\n",
    "g = gl.load_sgraph(s3_sgraph_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run PageRank over the SGraph\n",
    "pr = gl.pagerank.create(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Review the Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print results\n",
    "print \"Done! Resulting PageRank model:\"\n",
    "print\n",
    "print pr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print timings\n",
    "from datetime import timedelta\n",
    "training_time_secs = pr['training_time']\n",
    "print \"Total training time:\", timedelta(seconds=training_time_secs)\n",
    "print \"Avg. time per iteration:\", timedelta(seconds=(training_time_secs / float(pr['num_iterations'])))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}