{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2 - Spark SQL\n", "This lab will show you how to work with SparkSQL. It's meant to be self-guided, but don't hesitate to ask your presentor for help. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Getting started: Create a [SQL Context](https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.SQLContext)\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
SQLContext is not included by default. You need to import it from pyspark.sql
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
SQLContext() takes a single parameter which is the current Spark context.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", "\n", "from pyspark.sql import SQLContext
\n", "sqlContext = SQLContext(sc)
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Import the SparkSQL library and connect to the current Spark context\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Download a JSON Recordset to work with\n", "Let's download the data, we can run commands on the console of the server (or docker image) that the notebook environment is using. To do so we simply put a \"!\" in front of the command that we want to run. For example:\n", "\n", "!pwd\n", "\n", "To get the data we will download a file to the environment. Simple run these two commands, the first just ensures that the file is removed if it exists:\n", "\n", "!rm world_bank.json.gz -f
\n", "!wget https://raw.githubusercontent.com/bradenrc/sparksql_pot/master/world_bank.json.gz

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
Comment out the rm statement i.e. #!rm and re-run this section. What is the name of the downloaded file?
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Add !ls to see all the files currently in storage. Try running !mkdir testdir
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 3\n", "

\n", "
\n", "
\n", "
Clean up all added files/directories. Use !rmdir to remove a directory.
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Download file here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Create a Dataframe \n", "
\n", "Use the SQLContext you created earlier to read the World Bank json data - world_bank.json.gz and return it as a Dataframe

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the read variable in SQLContext to return a Dataframe reader
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the json() method in Dataframe to read the file. Note that the method handles a gzipped file format.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
To create the Dataframe type:
\n", "\n", "example1_df = sqlContext.read.json(\"world_bank.json.gz\")
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional\n", "

\n", "
\n", "
\n", "
Obtain the same result by using textFile() to read the file as RDD and then convert to a Dataframe
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Create the Dataframe here:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Step 3.1 - Show the Dataframe schema\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "

We can look at the schema with this command:

\n", "\n", "Type:
\n", "example1_df.printSchema()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
Get the dataframe columns. Try using command-completion (use TAB after the .) to obtain the list of possible methods/values
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Convert the dataframe back to JSON and print the first value
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Print out the schema\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3.2 - Using the Dataframe\n", "
\n", "Dataframes are a subset of RDDs and can be similarly transformed. You can map and filter them.\n", "
Take a look at the first two rows of data using the [take()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=take#pyspark.sql.DataFrame.take) function.
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
example1_df.take(2)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
take() returns data as an RDD list of Row objects. show() prints the objects to the console. What is the default number of rows displayed?
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Save the table as a parquet table. Use !ls to confirm it was saved. Use a DataFrameWriter. What did you see when you ran the ls command?
\n", "
\n", "
\n", "
\n", " \n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Use take on the DataFrame to pull out 2 rows\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Register a Temp Table\n", "
\n", "SQL works on tables. Currently we have data in a dataframe, but we have no table identifier for it. Thus, we want to create a temporary table reference that refers to this dataframe.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
The function is: DataframeObject.registerTempTable(\"name_of_table\")
\n", "Create a table named \"world_bank\"
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
example1_df.registerTempTable(\"world_bank\")
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
Use the tables() method in SQLContext to list all tables and their state. Extra Hint: show()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Try creating a second temporary table on the same dataframe. What does tables() return?
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 3\n", "

\n", "
\n", "
\n", "
Drop the additional temp table. What does tables() return?
\n", "
\n", "
\n", "
\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Create the table to be referenced via SparkSQL\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 - Writing SQL Statements\n", "
\n", "Write SQL statements to return two rows from the world_bank table.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the sql() method on your SQLContext
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use limit (i.e. limit 2) within your SQL statement to limit the number of rows returned. Use show() to display the values.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", " sqlContext.sql(\"select * from world_bank limit 2\").show()
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Use SQL to query the table and print the output\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5.1 - Writing SQL Statements\n", "
\n", "Try writing the next three sections yourself first. Each hint contains the solution for that section. We provide this here because this is more SQL than Spark and not everyone is familar with SQL. Nor is this an SQL class!\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
sqlContext.sql(\"select * from world_bank limit 2\").toPandas()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
sqlContext.sql(\"select regionname, count(*) as regioncount from world_bank group by regionname order by regioncount desc\").toPandas()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
sqlContext.sql(\"select sector.Name from world_bank limit 5\").toPandas()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Extra credit, take the DataFrame you created with the two records and convert it into a Pandas DataFrame\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now calculate a simple count based on a group, for example \"regionname\". Return the regionname and a count of the values for that regionname. \n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# With JSON data you can reference the nested data. \n", "# If you look at the Schema above you can see that sector.Name is a nested column.\n", "# Select that column and limit to a reasonable output (say five rows)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6 - Creating Simple Graphs\n", "
\n", "Create some simple graphs using the [matplotlib](http://matplotlib.org/1.5.3/index.html) and [numpy](http://www.numpy.org/) libraries\n", "
\n", "The \"%matplotlib inline\" statement is used to ensure that graphs are drawn within the notebook instead of popping up as separate windows.\n", "\n", "Make SURE you actually run this cell!" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load the libraries\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt, numpy as np" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### Step 6.1 - Create the SQL data\n", "Write the sql statement and look at the data, remember to add .toPandas() for a formatted display. An easier option is to create a variable and set it to the SQL statement.\n", "#### First create a SQL statement that is a reasonable number of items\n", "For example, you can count the number of projects (rows) by countryname\n", "
or in other words:\n",
"\n",
"    select count(*), countryname from world_bank group by countryname\n",
"\n",
"**Hint 1:** Type:\n",
"\n",
"    query = \"select count(*) as Count, countryname from world_bank group by countryname\"\n",
"\n",
"**Hint 2:** Type:\n",
"\n",
"    chart1_df = sqlContext.sql(query).toPandas()\n",
"    print chart1_df\n",
"\n",
"**Advanced Optional 1:** Printing the result isn't as nicely formatted. What command gives you a nicely formatted output? Use that instead (a sketch follows below).\n",
"
\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create the query to obtain the number of projects by countryname, save to a variable and print that variable\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6.2 - Create charts based on the SQL data\n", "
\n", "Here we wish to create a chart based on the SQL data we just obtained. Python is an excellent choice when you need to create charts because of the variety and power of the charting libraries available. The one we are using here is for Pandas. Specifically the plot() method. Documentation can be found here.\n", "\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Type:
\n", " chart1_df.plot(kind='bar', x='countryname', y='Count', figsize=(12, 5))
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
The table contains too much data. Change the SQL statement to return a smaller group of values like 30 or 40
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Looking at the Pandas plot() documentation try other styles of plotting. Look here for ideas.
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now take the variable (or same sql statement) and use the method:\n", "# .plot(kind='bar', x='countryname', y='Count', figsize=(12, 5)) to plot a graph\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7 - Creating a DataFrame\n", "
\n", "Not all data comes with a defined (or derivable) schema like JSON. Sometimes we have the data first and then need to create a schema for it.
\n", "Try adding a schema to an RDD to create a DataFrame.
\n", "First, you need to create an RDD. This can be done with a loop or as\n", "seen in the instructor's example, or more simply by assigning values to an array.\n", "

" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[[1, 1, 1], [2, 2, 2], [3, 3, 3], ['4a', '4a', '4a'], [5, 5, 5]]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Default array defined below. Feel free to change as desired.\n", "array=[[1,1,1],[2,2,2],[3,3,3],[\"4a\",\"4a\",\"4a\"],[5,5,5]]\n", "my_rdd = sc.parallelize(array)\n", "my_rdd.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7.1 - Second, you need to add a schema to the RDD you created in the previous code block.\n", "Use first the StructField method, following these steps:
\n", "1- Define your schema columns as a string
\n", "2- Build the schema object using StructField
\n", "3- Apply the schema object to the RDD
\n", "\n", "Note: The cell below is missing some code and will not run properly until you add in some missing parts.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
The schema string is simply the space-separated list of column names (i.e. \"var1 var2 var3\")
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
The type should be IntegerType or StringType. Note that because we are applying this is a loop *everything* will be an Integer or String
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Use the RDD you created above and apply the schema to it i.e.
\n", " schemaExample = sqlContext.createDataFrame(my_rdd, schema)\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
We really don't need to tell you a name to use for your temp table do we?
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pyspark.sql.types import *\n", "\n", "# The schema is encoded in a string. Complete the string below\n", "schemaString = \"var1 var2 var3\"\n", "\n", "# MissingType() should be either StringType() or IntegerType(). Please replace as required.\n", "fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]\n", "schema = StructType(fields)\n", "\n", "# Apply the schema to the RDD.\n", "schemaExample = sqlContext.createDataFrame(my_rdd, schema)\n", "\n", "# Register the DataFrame as a table. Add table name below as parameter to registerTempTable.\n", "schemaExample.registerTempTable(\"myRDDTempTable\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7.2 - Thirdly, write some SQL statements to verify that you successfully added a schema to your RDD\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
sqlContext.sql(\"select * from myRDDTempTable\").toPandas()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
What is the type is changed to IntegerType (or StringType). Any change in the results?
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 2\n", "

\n", "
\n", "
\n", "
Try to do some specific queries on the data (i.e.)
\n", " sqlContext.sql(\"select * from myRDDTempTable where var3 > 2\").toPandas()
\n", " Does this work regardless of the data type (i.e. IntegerType or StringType)?
\n", " What if you change some of the input values (i.e. change all 4s to \"4a\")
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Run some SQL statements on your newly created DataFrame and display the output\n", "#sqlContext.sql(\"select * from myRDDTempTable\").toPandas()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 8\n", "### Reading from an external data source\n", "If you have time, this is a good example to show you how to read from other datasources.

\n", "In a different browser tab, create a dashDB service, add credentials and come back to this notebook.
If you are using Data Science Experience, you need to log into Bluemix and create a dashDB instance. The login and password should be the same as for DSE.
\n", "Each dashDB instance in Bluemix is created with a \"GOSALES\" set of tables which we can reuse for the purpose of this example. (You can create your own tables if you wish...)

Replace the Xs in the cell below with proper credentials and verify access to dashDB tables.

\n", "You can read from any database that you can connect to through jdbc. Here is the [documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
To connect to a general dashDB instance:
\n", " url=\"\"
\n", "user=\"\"
\n", "password=\"\"
\n", "connection=\"jdbc:db2://\" + url + \":50000/BLUDB:user=\" + user + \";password=\" + password + \";\"
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional\n", "

\n", "
\n", "
\n", "
Create your own dashDB instance in Bluemix and connect to it
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "url=\"\"\n", "user=\"\"\n", "password=\"\"\n", "connection=\"jdbc:db2://\" + url + \":50000/BLUDB:user=\" + user + \";password=\" + password + \";\"\n", "\n", "salesDF = sqlContext.read.format('jdbc').\\\n", " options(url=connection,\\\n", " dbtable='GOSALES.BRANCH').load()\n", "salesDF.show()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2 with Spark 1.6", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }