{ "metadata": { "name": "", "signature": "sha256:89b31567699d26877d1a7406cc718f5609a31c4d05e95c8a8ec474b0f62daa56" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "RDD creation" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "[Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the **Resilient Distributed Dataset** or **RDD**. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reference book for these and other Spark-related topics is *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KDD Cup 1999 competition dataset is described in detail [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Getting the data files " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a *Gzip* file that we will download locally. 
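The download cell below uses the Python 2 `urllib` API, as in the original notebook. As a sketch only, assuming a Python 3 environment instead (where `urlretrieve` moved to `urllib.request`), the same download could be written as:

```python
# Python 3 sketch of the download below; the notebook itself targets Python 2.
import os
import urllib.request

URL = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
LOCAL = 'kddcup.data_10_percent.gz'

def fetch(url=URL, local=LOCAL):
    # Skip the download when the file is already present locally
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return local
```

Calling `fetch()` then leaves `kddcup.data_10_percent.gz` in the working directory, just like the Python 2 cell that follows.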
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "f = urllib.urlretrieve(\"http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz\", \"kddcup.data_10_percent.gz\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Creating an RDD from a file " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common way of creating an RDD is to load it from a file. Notice that Spark's `textFile` can handle compressed files directly. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "data_file = \"./kddcup.data_10_percent.gz\"\n", "raw_data = sc.textFile(data_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have our data file loaded into the `raw_data` RDD." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without getting into Spark *transformations* and *actions*, the most basic thing we can do to check that we got our RDD contents right is to `count()` the number of lines loaded from the file into the RDD. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data.count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "494021" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check the first few entries in our data. 
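Each interaction in this dataset is a single comma-separated line whose last field is the label. As a plain-Python sketch (no Spark involved), using the first record of the file, the fields can be pulled apart with `split`:

```python
# First record of the KDD Cup 1999 10-percent dataset
record = '0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.'

fields = record.split(',')   # 41 features plus the trailing label
duration = fields[0]         # '0'
protocol = fields[1]         # 'tcp'
service = fields[2]          # 'http'
label = fields[-1]           # 'normal.'
```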
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data.take(5)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',\n", " u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following notebooks, we will use this raw data to learn about the different Spark transformations and actions. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Creating an RDD using `parallelize`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way of creating an RDD is to parallelize an already existing list. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = range(100)\n", "\n", "data = sc.parallelize(a)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did before, we can `count()` the number of elements in the RDD." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "data.count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "100" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we can access the first few elements of our RDD. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.take(5)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "[0, 1, 2, 3, 4]" ] } ], "prompt_number": 37 } ], "metadata": {} } ] }