{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://raw.githubusercontent.com/dxkikuchi/SparkSnips/master/dsxbanner.jpg' width='75%'></img>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IBM Data Science Experience is an interactive, collaborative, cloud-based environment where data scientists can use multiple tools to activate their insights.  Data scientists can use the best of open source, tap into IBM's unique features, grow their capabilities, and share their successess.  In addition to all the features in the current preview, many new capabilities are being added including the ability to ingest Object Storage data with a single click, an enhanced user interface for version control, a facility to comment or chat about a notebook with others, and many more!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## New York State Restaurant Inspections Notebook\n",
    "This notebook will provide insights from official restaurant inspections records for most of the state of New York and provide visualizations of that data.  The location of this data is defined below.  For additional details, please see <a href=\"https://health.data.ny.gov/Health/Food-Service-Establishment-Last-Inspection/cnih-y5dw\" target=\"_blank\">New York State Restaurant Ratings</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nyr = 'https://health.data.ny.gov/api/views/cnih-y5dw/rows.csv?accessType=DOWNLOAD'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The csv (comma separated values) data will be read into a Pandas dataframe (nyr) and the first 5 records are displayed using the 'head()' method.<br>\n",
    "Please attempt to write the code to read the csv data in the following cell without looking at the solution provided 2 cells from here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "# Please add your code here\n",
    "nyr.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Solution</b>: Please copy and paste the following code into the previous cell where specified.<br>\n",
    "nyr = pd.read_csv(nyr)<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ingesting data can be as simple as using one line of code.  Similarly, data can be ingested from Cloudant, DashDB, Object Storage, relational databases, and many others."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data will be transformed into a Spark dataframe 'nyrDF' and a table will be registered.  Spark dataframes are conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimizations under the hood.  A table that is registered can be used in subsequent SQL statements.<br>\n",
    "Please attempt to write the code in the following cell to register the table 'nvrDF' without looking at the solution provided 2 cells from here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nyrDF = sqlContext.createDataFrame(nyr)\n",
    "# Please add your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Solution</b>: Please copy and paste the following code into the previous cell where specified.<br>\n",
    "nyrDF.registerTempTable(\"nyrDF\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, a Spark dataframe 'nyvDF' will be created using SQL that will contain the restaurant name (FACILITY), latitude, longitude and violations.  Note that the latitude and longitude are combined in the final column (Location1) of the retrieved data.  They will be extracted separately using regular expressions in the SQL.  The results are ordered by number of violations in descending order and the top 10 are displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "select \n",
    "    FACILITY, \n",
    "    trim(regexp_extract(location1, '(\\\\\\()(.*),(.*)(\\\\\\))',2)) as lat, \n",
    "    trim(regexp_extract(location1, '(\\\\\\()(.*),(.*)(\\\\\\))',3)) as lon,\n",
    "    cast(`TOTAL # CRITICAL VIOLATIONS` as int) as Violations\n",
    "from nyrDF \n",
    "order by Violations desc\n",
    "limit 1000\n",
    "\"\"\"\n",
    "nyvDF = sqlContext.sql(query)\n",
    "nyvDF.show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Brunel visualization will be used to map the latitude and longitude to a New York state map.  Colors represent the number of violations as described in the key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import brunel\n",
    "nyvPan = nyvDF.toPandas()\n",
    "%brunel map ('NY') + data('nyvPan') x(lon) y(lat) color(Violations) tooltip(FACILITY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the many key strengths of Data Science Experience is the ability to easily search and quickly learn about various topics.  For example, to find articles, tutorials or notebooks on Brunel, click on the search button on the top right hand corner of this web page.  A side palette will appear where you can enter 'Brunel' or other topics of interest."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pixiedust provides charting and visualization.  It is an open source Python library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.  Please execute the next cell for a tabular view of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pixiedust.display import *\n",
    "display(nyvDF)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you hover over the lonely lighter colored dot in the middle of the New York State map, you can see that it is for 'CAMP KINGSLEY - CC'.  By starting to type the value 'camp' in the 'Search table' text field above, the record will be displayed.  Numerous visualization are available with support for maps in the future.  Please take a look at the histogram of this data for another insight.  In addition, the data can be downloaded as a file, or stashed to Cloudant or Object Storage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In just a few notebook cells, data was ingested, manipulated, visualized and yielded insights.  Much more capability, including machine learning, could be leveraged with IBM Data Science Experience.  This is just the tip of the iceberg!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}