{ "metadata": { "name": "", "signature": "sha256:89b31567699d26877d1a7406cc718f5609a31c4d05e95c8a8ec474b0f62daa56" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "RDD creation" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "[Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the **Resilient Distributed Dataset** or **RDD**. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reference book for these and other Spark-related topics is *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KDD Cup 1999 competition dataset is described in detail [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Getting the data files " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a *Gzip* file that we will download locally. 
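The download cell below uses the Python 2 `urllib` API, as in the original notebook. As a sketch only, assuming a Python 3 environment instead (where `urlretrieve` moved to `urllib.request`), the same download could be written as:

```python
# Python 3 sketch of the download below; the notebook itself targets Python 2.
import os
import urllib.request

URL = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
LOCAL = 'kddcup.data_10_percent.gz'

def fetch(url=URL, local=LOCAL):
    # Skip the download when the file is already present locally
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return local
```

Calling `fetch()` then leaves `kddcup.data_10_percent.gz` in the working directory, just like the Python 2 cell that follows.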
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "f = urllib.urlretrieve(\"http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz\", \"kddcup.data_10_percent.gz\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Creating an RDD from a file " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common way of creating an RDD is to load it from a file. Notice that Spark's `textFile` can handle compressed files directly. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "data_file = \"./kddcup.data_10_percent.gz\"\n", "raw_data = sc.textFile(data_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have our data file loaded into the `raw_data` RDD." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without getting into Spark *transformations* and *actions*, the most basic thing we can do to check that we got our RDD contents right is to `count()` the number of lines loaded from the file into the RDD. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data.count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "494021" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check the first few entries in our data. 
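Each interaction in this dataset is a single comma-separated line whose last field is the label. As a plain-Python sketch (no Spark involved), using the first record of the file, the fields can be pulled apart with `split`:

```python
# First record of the KDD Cup 1999 10-percent dataset
record = '0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.'

fields = record.split(',')   # 41 features plus the trailing label
duration = fields[0]         # '0'
protocol = fields[1]         # 'tcp'
service = fields[2]          # 'http'
label = fields[-1]           # 'normal.'
```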
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data.take(5)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following notebooks, we will use this raw data to learn about the different Spark transformations and actions. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Creating an RDD using `parallelize`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way of creating an RDD is to parallelize an already existing list. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = range(100)\n", "\n", "data = sc.parallelize(a)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did before, we can `count()` the number of elements in the RDD." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "data.count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "100" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we can access the first few elements of our RDD. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.take(5)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "[0, 1, 2, 3, 4]" ] } ], "prompt_number": 37 } ], "metadata": {} } ] }