# GraphLab Create PageRank Benchmark - CommonCrawl 2012 Dataset
## AWS EC2 Benchmark Notebook

Set the following three keys before running the code, even if you are not running this benchmark on EC2.

The GraphLab Product Key should have been e-mailed to you after you [registered on the Dato website](https://dato.com/download/). If you have not registered yet, do so now.

The AWS keys are available to you via the AWS website. [Follow their instructions](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html) to get them. You need these keys to access the S3 bucket where the CommonCrawl SGraph is stored. Any valid pair of AWS credentials will do.

In [None]:
graphlab_create_product_key = 'YOUR_PRODUCT_KEY'
aws_access_key_id='YOUR_ACCESS_KEY'
aws_secret_access_key='YOUR_SECRET_KEY'
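
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the three values could instead be read from environment variables, falling back to the placeholders above when a variable is unset. The helper `key_from_env` is hypothetical, and the variable name `GRAPHLAB_PRODUCT_KEY` is an assumption made for this example; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the names AWS tools conventionally use.

```python
import os

def key_from_env(var_name, placeholder, environ=None):
    """Return var_name from the environment, or the placeholder if unset or empty.

    Hypothetical helper for this notebook; not part of GraphLab Create.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "")
    return value if value else placeholder

# GRAPHLAB_PRODUCT_KEY is an assumed variable name for illustration.
graphlab_create_product_key = key_from_env("GRAPHLAB_PRODUCT_KEY", "YOUR_PRODUCT_KEY")
aws_access_key_id = key_from_env("AWS_ACCESS_KEY_ID", "YOUR_ACCESS_KEY")
aws_secret_access_key = key_from_env("AWS_SECRET_ACCESS_KEY", "YOUR_SECRET_KEY")
```

With this variant, the variables would be exported in the shell that starts the notebook server, and the notebook itself would not contain any secrets.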

Use this notebook when running the GraphLab Create PageRank Benchmark [on an EC2 instance as described here](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf). If you are running it on your own machine, change the following flag from `True` to `False`.

In [None]:
running_on_ec2 = True
# running_on_ec2 = False

Skip this stage if you are not running on EC2 and proceed to the **Initialize GraphLab Create** step.

### Initialize and mount SSDs that will be used as cache locations

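The following cell initializes and mounts the ephemeral SSD drives that are available on your instance. It assumes the drives appear as `/dev/xvdb` and `/dev/xvdc`. Creating a filesystem with `mkfs` erases any data already on those devices.

**If you are not running this benchmark on an EC2 instance, skip this step.**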

In [None]:
%%bash
# initialize filesystem on SSD drives
sudo mkfs -t ext4 /dev/xvdb
sudo mkfs -t ext4 /dev/xvdc

# create mount points for SSD drives
sudo mkdir -p /mnt/tmp1
sudo mkdir -p /mnt/tmp2

# mount SSD drives on created points and temporary file locations
sudo mount /dev/xvdb /mnt/tmp1
sudo mount /dev/xvdc /mnt/tmp2
sudo mount /dev/xvdb /tmp
sudo mount /dev/xvdc /var/tmp

# set permissions for mounted locations
sudo chown ubuntu:ubuntu /mnt/tmp1
sudo chown ubuntu:ubuntu /mnt/tmp2

### Initialize GraphLab Create

In [None]:
# Uses the product key from Dato and the AWS credentials set in the first cell.
import graphlab as gl

# Register the product key only if none is configured yet.
if gl.product_key.get_product_key() is None:
    gl.product_key.set_product_key(graphlab_create_product_key)

# Reuse existing AWS credentials if present; otherwise set them from the keys above.
try:
    gl.aws.get_credentials()
except KeyError:
    gl.aws.set_credentials(access_key_id=aws_access_key_id,
                           secret_access_key=aws_secret_access_key)

In [None]:
# Set the cache locations to the SSDs.
if running_on_ec2:
    gl.set_runtime_config("GRAPHLAB_CACHE_FILE_LOCATIONS", "/mnt/tmp1:/mnt/tmp2")
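
If the mount step failed or was skipped, the cache paths in the cell above may not exist. The following is a minimal sketch of a pre-flight check that keeps only the directories that exist and are writable, then joins them in the same colon-separated format. The helper `usable_cache_dirs` is a hypothetical name introduced for this example and is not part of GraphLab Create.

```python
import os

def usable_cache_dirs(paths):
    """Return the subset of paths that are existing, writable directories."""
    return [p for p in paths if os.path.isdir(p) and os.access(p, os.W_OK)]

# Same mount points as in the SSD setup cell.
candidate_dirs = ["/mnt/tmp1", "/mnt/tmp2"]
cache_dirs = usable_cache_dirs(candidate_dirs)

# Colon-separated, matching the format used for GRAPHLAB_CACHE_FILE_LOCATIONS.
cache_locations = ":".join(cache_dirs)
print("Usable cache locations: %r" % cache_locations)
```

When `cache_locations` is non-empty, it could be passed to `gl.set_runtime_config("GRAPHLAB_CACHE_FILE_LOCATIONS", cache_locations)` in place of the hard-coded string. An empty result suggests revisiting the mount step before loading the graph.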

### Run the Benchmark

In [None]:
# Load the CommonCrawl 2012 SGraph
s3_sgraph_path = "s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph"
g = gl.load_sgraph(s3_sgraph_path)

In [None]:
# Run PageRank over the SGraph
pr = gl.pagerank.create(g)

### Review the Results

In [None]:
# Print results
print "Done! Resulting PageRank model:"
print
print pr

In [None]:
# Print timings
from datetime import timedelta
training_time_secs = pr['training_time']
print "Total training time:", timedelta(seconds=training_time_secs)
print "Avg. time per iteration:", timedelta(seconds=(training_time_secs / float(pr['num_iterations'])))
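
To compare runs across instance types, it can help to wrap the timing arithmetic above in a small function. This is a sketch only: `summarize_timings` is a hypothetical helper, and the numbers passed to it below are made up for illustration, not benchmark results.

```python
from datetime import timedelta

def summarize_timings(training_time_secs, num_iterations):
    """Return (total, per-iteration average) as timedelta objects."""
    if num_iterations <= 0:
        raise ValueError("num_iterations must be positive")
    total = timedelta(seconds=training_time_secs)
    # float() keeps the division exact under both Python 2 and Python 3.
    average = timedelta(seconds=training_time_secs / float(num_iterations))
    return total, average

# Made-up inputs for illustration only.
total, average = summarize_timings(training_time_secs=600.0, num_iterations=20)
print("Total training time: %s" % total)
print("Avg. time per iteration: %s" % average)
```

In the notebook, the same call would take `pr['training_time']` and `pr['num_iterations']` from the model returned by the benchmark cell.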