{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GraphLab Create PageRank Benchmark - CommonCrawl 2012 Dataset\n", "## AWS EC2 Benchmark Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the following three keys before running the code (they are required even if you are not running this benchmark on EC2).\n", "\n", "The GraphLab Product Key should have been e-mailed to you after you [registered on the Dato website](https://dato.com/download/). If you have not registered yet, do so now.\n", "\n", "The AWS keys are available to you via the AWS website. [Follow their instructions](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html) to get these keys. You need them to access the S3 bucket where the CommonCrawl SGraph is stored. Any valid pair of AWS credentials will do." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphlab_create_product_key = 'YOUR_PRODUCT_KEY'\n", "aws_access_key_id = 'YOUR_ACCESS_KEY'\n", "aws_secret_access_key = 'YOUR_SECRET_KEY'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use this notebook when running the GraphLab Create PageRank Benchmark [on an EC2 instance as described here](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf). If you are running this on your own machine, change the following flag from `True` to `False`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "running_on_ec2 = True\n", "# running_on_ec2 = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Skip this stage if you are not running on EC2 and proceed to the **Initialize GraphLab Create** step.\n", "\n", "### Initialize and mount SSDs that will be used as cache locations\n", "\n", "The following cell initializes and mounts the ephemeral SSD drives that are available on your instance.\n", "\n", "**If you are not running this benchmark on an EC2 instance, skip this step.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "# initialize filesystem on SSD drives\n", "sudo mkfs -t ext4 /dev/xvdb\n", "sudo mkfs -t ext4 /dev/xvdc\n", "\n", "# create mount points for SSD drives\n", "sudo mkdir -p /mnt/tmp1\n", "sudo mkdir -p /mnt/tmp2\n", "\n", "# mount SSD drives on the created mount points and on the temporary file locations\n", "sudo mount /dev/xvdb /mnt/tmp1\n", "sudo mount /dev/xvdc /mnt/tmp2\n", "sudo mount /dev/xvdb /tmp\n", "sudo mount /dev/xvdc /var/tmp\n", "\n", "# set permissions for mounted locations\n", "sudo chown ubuntu:ubuntu /mnt/tmp1\n", "sudo chown ubuntu:ubuntu /mnt/tmp2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize GraphLab Create" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Uses the product key and AWS credentials set in the first code cell of this notebook\n", "import graphlab as gl\n", "\n", "# Register the product key only if one is not already configured\n", "if gl.product_key.get_product_key() is None:\n", "    gl.product_key.set_product_key(graphlab_create_product_key)\n", "\n", "# Set the AWS credentials only if none are already configured\n", "try:\n", "    gl.aws.get_credentials()\n", "except KeyError:\n", "    gl.aws.set_credentials(access_key_id=aws_access_key_id,\n", "                           secret_access_key=aws_secret_access_key)" ] }, { "cell_type": "code",
"execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Set the cache locations to the SSDs.\n", "if running_on_ec2:\n", " gl.set_runtime_config(\"GRAPHLAB_CACHE_FILE_LOCATIONS\", \"/mnt/tmp1:/mnt/tmp2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the Benchmark" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load the CommonCrawl 2012 SGraph\n", "s3_sgraph_path = \"s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph\"\n", "g = gl.load_sgraph(s3_sgraph_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Run PageRank over the SGraph\n", "pr = gl.pagerank.create(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Review the Results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Print results\n", "print \"Done! Resulting PageRank model:\"\n", "print\n", "print pr" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Print timings\n", "from datetime import timedelta\n", "training_time_secs = pr['training_time']\n", "print \"Total training time:\", timedelta(seconds=training_time_secs)\n", "print \"Avg. time per iteration:\", timedelta(seconds=(training_time_secs / float(pr['num_iterations'])))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }