# GraphLab Create v Common Crawl 2012 WebGraph
## or, How I Learned to Stop Worrying about 128B edges and Love the PageRank

Here at Dato we highly value openness and transparency. Look at the [Dato Gallery](https://dato.com/learn/gallery/ "Dato Gallery"), filled with notebooks you can download and execute on your own machine, to see that GraphLab Create can really do what we say it does. So when it came to benchmarks, we said, why not upload a notebook about it?

This notebook describes our **PageRank benchmark**. The dataset for this benchmark is the web itself - compiled by the good people at [commoncrawl.org](http://commoncrawl.org/) in 2012. You can [download the dataset from here](http://webdatacommons.org/hyperlinkgraph/2012-08/download.html#toc0). We will run the **PageRank** algorithm over a network of **3.5 billion nodes** and **128 billion links**. Each node represents a web page, and each link represents a hyperlink between two pages. GraphLab should do it in **about 5 hours**.

Running this benchmark will show you how powerful and robust GraphLab is - we are not aware of any other general-purpose graph analytics system that can cope with this task, either on a single machine or distributed. However, unlike other notebooks in the gallery, we **don't recommend running this notebook on your laptop!** Instead, this notebook describes how to run the benchmark on an EC2 instance in the Amazon Web Services (AWS) cloud.

We'll be using an **r3.8xlarge** EC2 instance. That's a strong machine,
with 32 cores, 244 Gigabytes of RAM, and 2 SSD drives, each sized 320 GBs.
If you can access a physical machine of this calibre, expect similar results.

Here are the steps for running this benchmark (over EC2, of course):

1. Launch an EC2 instance,
2. Install GraphLab Create and Jupyter (formerly *IPython Notebook*) on it,
3. Connect to your Jupyter instance running on the EC2 machine and run the benchmark.

## Step 1: Launch an EC2 instance

We created a detailed guide for launching an EC2 instance, [which is available here](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf).

The specifications for this instance are as follows:
* The region should be US West (Oregon).
* The AMI should be **Ubuntu Server 14.04 LTS (HVM), SSD Volume Type - ami-9abea4fb**.
* The instance type should be **r3.8xlarge** (32 cores, 244 GBs of RAM, two 320 GB SSDs)
* The storage should include a **Root volume of 16 GBs**.
* The Security Group should allow everyone to access **ports 22 (SSH) and 8888 (Jupyter)**.

Again - if you are not sure how to launch an AWS instance, don't worry! [Use our guide](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark_ec2_instructions/guide.pdf), which includes the screenshots of all the stages you will go through to launch your instance.

Once you've set up your machine, connect to it via ssh (OS X, Linux) or a client such as PuTTY (Windows), and proceed to the next step.

## Step 2: Install GraphLab Create and Jupyter

Once connected to your machine, download [the installation script from here](https://raw.githubusercontent.com/guy4261/glc_pagerank_benchmark/master/install.sh) and run it.

```bash
$ wget https://raw.githubusercontent.com/guy4261/glc_pagerank_benchmark/master/install.sh
$ chmod u+x install.sh
$ ./install.sh
```

When the script finishes running, you will be able to access Jupyter via your browser and run code that executes on your EC2 instance.

The script will install Python, GraphLab Create and Jupyter, and will start Jupyter on port 8888. You can see the entire script here:

https://raw.githubusercontent.com/guy4261/glc_pagerank_benchmark/master/install.sh
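Before opening the browser, it can help to confirm from your own machine that the security group really exposes ports 22 and 8888. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` and the host name in the comment are placeholders of ours, not part of the install script.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Covers refused connections, timeouts and DNS failures.
        return False
    conn.close()
    return True


# Example with a hypothetical address; substitute your instance's public DNS name:
# for port in (22, 8888):
#     print(port, port_is_open("ec2-203-0-113-10.us-west-2.compute.amazonaws.com", port))
```

If port 22 answers but port 8888 does not, the likely causes are a missing security group rule or a Jupyter process that has not started yet.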

### Tip: Password protecting your Jupyter server
Since we assume you are only creating this instance for the purpose of running this benchmark, we allow the running Jupyter server to be open to the outside world. However, if you want to password-protect your instance, run the following lines in your shell (after connecting via SSH to your instance).
```bash
# Kill the previous jupyter instance
$ kill -9 `cat pid`

# Generate config files for jupyter
$ jupyter notebook --generate-config

# Set a password for the Jupyter instance
$ python -c "from notebook.auth import passwd; password = passwd(); open('/home/ubuntu/.jupyter/jupyter_notebook_config.py', 'a').write('c.NotebookApp.password = u\'%s\'' % (password))"

# Restart Jupyter in the background and save its process ID to the pid file
$ nohup jupyter notebook --no-browser --ip="*" &
$ echo $! > pid
```

You will be prompted for a password the next time you browse to the Jupyter address on your instance.

## Step 3: Connect to Jupyter and run the benchmark

If you used the install script, [the benchmark notebook](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark.ipynb) has already been downloaded to your EC2 instance and should be visible when you browse to Jupyter.

You can also download the notebook from this address: https://raw.githubusercontent.com/guy4261/glc_pagerank_benchmark/master/commoncrawl_benchmark.ipynb

The notebook does two main things: first, it prepares and mounts the two SSD volumes available on your instance. Then, it runs the benchmark. The benchmark essentially consists of the following four lines:

```python
import graphlab as gl
s3_sgraph_path = "s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph"
g = gl.load_sgraph(s3_sgraph_path)
pr = gl.pagerank.create(g)
```

But since this is presumably the first time you are running GraphLab Create, and the first time you are pulling anything off AWS, you will need to enter three keys: your GraphLab Create Product Key, your AWS Access Key ID, and your AWS Secret Access Key. You can see [the rendered notebook on GitHub](https://github.com/guy4261/glc_pagerank_benchmark/blob/master/commoncrawl_benchmark.ipynb).

While the benchmark is running, you can close your browser window. When you come back, create a new cell and run some Python code to check whether the benchmark is still running. If you see a (\*) mark next to your cell, that's your signal that a calculation is still in progress: the benchmark. If you see the output immediately, the benchmark is done, and you can examine the resulting object (`pr`).
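For intuition about what the PageRank step computes, here is a textbook power-iteration sketch in plain Python. It is not GraphLab Create's implementation: the function name, the default parameters, and the handling of pages without outgoing links are our own assumptions, and GraphLab may normalize scores differently.

```python
def pagerank(edges, damping=0.85, max_iterations=20, tolerance=1e-6):
    """Power-iteration PageRank over (source, destination) pairs.

    Returns a dict mapping each node to a score; the scores sum to 1.
    """
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        out_links.setdefault(src, []).append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every page passes an equal share of its rank along each outgoing link.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tolerance:
            break
    return rank
```

On a three-page cycle such as `pagerank([(0, 1), (1, 2), (2, 0)])`, every page ends up with the same score of one third. The benchmark applies the same idea to 3.5 billion pages, which is why it needs a machine of this size.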

## Data Overview

The [dataset's webpage says](http://webdatacommons.org/hyperlinkgraph/2012-08/download.html#toc0):
 
> Downloading the page graph: The page graph (arc and index files) are, due to their size split into in small files of around 500 MB. These files can be downloaded using `wget -i http://webdatacommons.org/hyperlinkgraph/2012-08/data/index.list.txt` for the index files and respectively `wget -i http://webdatacommons.org/hyperlinkgraph/2012-08/data/arc.list.txt` for the arc files.

Examining http://webdatacommons.org/hyperlinkgraph/2012-08/data/arc.list.txt , we will find a list of URLs pointing at .gz files.

```
http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/network/part-r-00000.gz
http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/network/part-r-00001.gz
...
http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/network/part-r-00696.gz
```

As CommonCrawl's own documentation says, each of these gzip files weighs ~500 MB. The entire edge dataset weighs around **330 GB**. Here is the `head` of the first file (`part-r-00000.gz`):

```
0	739935047
1	741742773
2	741745070
```

This is a very common format for storing graph edges: source id, TAB, destination id.
While GraphLab can load such a graph directly, we saved you the trouble of downloading the files to your EC2 instance: we loaded the data ourselves and uploaded the result to an open S3 bucket. To access it you only need AWS credentials.
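If you do want to inspect one of the raw arc files yourself, the format is easy to read with the standard library alone. The sketch below assumes a locally downloaded file such as `part-r-00000.gz`; the helper name `read_edges` and its `limit` parameter are ours, for illustration only.

```python
import gzip


def read_edges(path, limit=None):
    """Yield (source_id, destination_id) integer pairs from a gzipped, tab-separated edge file."""
    yielded = 0
    with gzip.open(path, "rb") as handle:
        for line in handle:
            if limit is not None and yielded >= limit:
                break
            if not line.strip():
                continue  # tolerate blank lines
            src, dst = line.split()
            yield int(src), int(dst)
            yielded += 1


# Example: peek at the first three edges of a downloaded file.
# for edge in read_edges("part-r-00000.gz", limit=3):
#     print(edge)
```

Reading lazily with a generator matters here: each file holds a large number of edges, so materializing a whole file as a Python list is rarely what you want.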

The SGraph created by loading the data is stored in binary form in this bucket; in that form it weighs around **218 GB**, which is less than the **330 GB of gzipped edge files**.
Also, since this data is stored on Amazon's S3, accessing it from your EC2 instance should be faster than downloading the raw data from CommonCrawl's servers.
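To put these sizes in perspective, here is a quick back-of-the-envelope calculation using only the figures quoted in this notebook. It assumes 1 GB = 10^9 bytes and treats the whole SGraph as edge storage, so the per-edge numbers are rough estimates rather than measurements.

```python
NUM_EDGES = 128e9          # 128 billion links
GZIPPED_EDGES_GB = 330.0   # raw gzipped edge files
SGRAPH_GB = 218.0          # binary SGraph on S3
SSD_COUNT, SSD_GB = 2, 320.0


def size_summary():
    """Derive rough per-edge and capacity figures from the sizes quoted above."""
    return {
        "bytes_per_edge_gzipped": GZIPPED_EDGES_GB * 1e9 / NUM_EDGES,
        "bytes_per_edge_sgraph": SGRAPH_GB * 1e9 / NUM_EDGES,
        "sgraph_saving_fraction": 1.0 - SGRAPH_GB / GZIPPED_EDGES_GB,
        "ssd_free_after_download_gb": SSD_COUNT * SSD_GB - SGRAPH_GB,
    }


for name, value in sorted(size_summary().items()):
    print("%s: %.2f" % (name, value))
```

Under these assumptions the SGraph costs roughly 1.7 bytes per edge against about 2.6 for the gzipped text, and it leaves about 422 GB free across the two SSDs.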

The bucket containing the SGraph is located at [s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph/](s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph/) .

For those who would like to download it via the AWS command-line tool, use the following command: 

`aws s3 cp s3://dato-datasets-oregon/webgraphs/sgraph/common_crawl_2012_sgraph ccg2012 --recursive`

This will create a new folder `ccg2012` in the directory where the command is executed. You can, of course, choose a different path.

## Summary

This notebook covers it all: where to find the detailed instructions for launching an EC2 instance, how to get the script that installs everything from Python to Jupyter on your instance, and how to access your instance to run the benchmark. Our goal was to help anyone run one of the most heavy-duty graph algorithm benchmarks, even without the equipment to support it (or previous knowledge of EC2). If you still run into any trouble, drop a line to [Guy Rapaport <guy@dato.com>](mailto:guy@dato.com) and ask for help.

Good luck and Happy Benchmarking!