{ "metadata": { "celltoolbar": "Slideshow", "name": "", "signature": "sha256:613d7c0f204aea3f3eab4c0c2ec4158f0faa2cd88f47d82030477d35a94bdf9b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Imgur](http://i.imgur.com/AkRFP3U.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to (Py)Spark\n", "\n", "* Tim Hopper \u2013\u00a0[@tdhopper](http://www.twitter.com/tdhopper) \u2013\u00a0Raleigh, NC\n", "* Developer and Anecdote Scientist\n", "* IPython Notebook available at http://tinyurl.com/PySpark\n", "\n", "
`textFile`, which would return one record per line in each file\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`SparkContext.newAPIHadoopRDD`\n",
"\n",
"> PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, for both \u2018new\u2019 and \u2018old\u2019 Hadoop MapReduce APIs. \n",
"\n",
"For example, see the [Cassandra input format example](https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py)."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Saving Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`rdd.collect()` converts an RDD into a Python list on the driver (host) machine.\n",
"\n",
"`rdd.saveAsTextFile()` saves an RDD as text files, using the string representation of each element. See also `rdd.saveAsPickleFile()`.\n",
"\n",
"`rdd.saveAsNewAPIHadoopDataset()` saves an RDD of key-value pairs to a Hadoop data source (e.g. HDFS, [Cassandra](https://github.com/Parsely/pyspark-cassandra))."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Manipulating Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the last three presidents by last name."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rdd = sc.parallelize([\"Barack Hussein Obama\", \"George Walker Bush\", \"William Jefferson Clinton\"])\n",
"\n",
"rdd.sortBy(keyfunc=lambda k: k.split(\" \")[-1]).collect()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 70,
"text": [
"['George Walker Bush', 'William Jefferson Clinton', 'Barack Hussein Obama']"
]
}
],
"prompt_number": 70
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Manipulating Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Join Datasets"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rdd1 = sc.parallelize([(\"a\", 1), (\"b\", 2), (\"c\", 3)])\n",
"rdd2 = sc.parallelize([(\"a\", 6), (\"b\", 7), (\"b\", 8), (\"d\", 9)])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 73
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rdd1.join(rdd2).collect()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 74,
"text": [
"[('a', (1, 6)), ('b', (2, 7)), ('b', (2, 8))]"
]
}
],
"prompt_number": 74
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rdd1.fullOuterJoin(rdd2).collect()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 75,
"text": [
"[('a', (1, 6)),\n",
" ('c', (3, None)),\n",
" ('b', (2, 7)),\n",
" ('b', (2, 8)),\n",
" ('d', (None, 9))]"
]
}
],
"prompt_number": 75
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Manipulating Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pyspark.sql import SQLContext, Row, StructField, StructType, FloatType\n",
"import pandas as pd"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load in the [Wake County Real Estate Data](http://www.wakegov.com/tax/realestate/redatafile/pages/default.aspx)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"raw_real_estate = sc.textFile(\"all.txt\")\n",
"print(raw_real_estate.take(1)[0][:250])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"FISHER, GEORGE NORMAN 2720 BEDFORD AVE RALEIGH NC 27607-7114 000000101 01 001506 WAKE FOREST RD RA 01 0\n"
]
}
],
"prompt_number": 80
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Manipulating Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"wake_county_real_estate = raw_real_estate.map(lambda row: \n",
" dict(\n",
" owner = row[0:35].strip().title(),\n",
" last_name = row[0:35].strip().title().split(\",\")[0],\n",
" address = row[70:105].strip().title(), \n",
" sale_price = int(row[273:284].strip() or -1),\n",
" value = int(row[305:316].strip() or -1),\n",
" use = int(row[653:656].strip() or -1),\n",
" heated_area = int(row[471:482].strip() or -1),\n",
" year_built = int(row[455:459].strip() or -1),\n",
" height = row[509:510].strip(),\n",
" ))\n",
"\n",
"sqlContext = SQLContext(sc)\n",
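"# registerTempTable() returns None, so keep the inferred SchemaRDD in its own variable first.\n",
"schemaWake = sqlContext.inferSchema(wake_county_real_estate.map(lambda d: Row(**d)))\n",
"schemaWake.registerTempTable(\"wake\")"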
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"prompt_number": 29
},
{
"cell_type": "heading",
"level": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Manipulating Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Who owns the most expensive church buildings in Wake County?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.DataFrame.from_records(\n",
" sqlContext.sql(\"\"\"SELECT DISTINCT owner, address, \n",
" year_built, value\n",
" FROM wake \n",
" WHERE value > 4000000 AND \n",
" use = 66 AND \n",
" owner LIKE '%Church%'\n",
" \"\"\").collect(),\n",
"columns=[\"Name\",\"Street\",\"Year Built\",\"Value\"])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Name</th><th>Street</th><th>Year Built</th><th>Value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Crosspointe Church At Cary</td><td>6911 Carpenter Fire Station Rd</td><td>2003</td><td>6853333</td></tr>\n",
"    <tr><th>1</th><td>Edenton St Methodist Church</td><td>228 W Edenton St</td><td>2002</td><td>6207300</td></tr>\n",
"    <tr><th>2</th><td>Edenton St Methodist Church</td><td>228 W Edenton St</td><td>1959</td><td>6207300</td></tr>\n",
"    <tr><th>3</th><td>First Presbyterian Church</td><td>111 W Morgan St</td><td>2013</td><td>4617350</td></tr>\n",
"    <tr><th>4</th><td>Providence Baptist Church</td><td>6339 Glenwood Ave Ste 451</td><td>1972</td><td>5540832</td></tr>\n",
"    <tr><th>5</th><td>First Presbyterian Church</td><td>111 W Morgan St</td><td>1987</td><td>4617350</td></tr>\n",
"    <tr><th>6</th><td>Tabernacle Baptist Church Of Ral</td><td>8304 Leesville Rd</td><td>2001</td><td>4600500</td></tr>\n", "