{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PySpark Demo Notebook 4\n", "\n", "## Contents\n", "\n", "1. [Read CSV-Format File](#Read-CSV-Format-File)\n", "2. [Run PostgreSQL Script](#Run-PostgreSQL-Script)\n", "3. [Load PostgreSQL Data](#Run-PostgreSQL-Script)\n", "4. [Create a New Record](#Create-a-New-Record)\n", "5. [Append Record to Database Table](#Append-Record-to-Database-Table)\n", "6. [Overwrite Data to Database Table](#Overwrite-Data-to-Database-Table)\n", "7. [Analyze and Graph Data with BokehJS](#Analyze-and-Graph-Data-with-BokehJS)\n", "9. [Read and Write Data to Parquet](#Read-and-Write-Data-to-Parquet)\n", "\n", "## Background\n", "\n", "_Prepared by: [Gary A. Stafford](https://twitter.com/GaryStafford) \n", "Associated article: [Getting Started with Data Analytics using Jupyter Notebooks, PySpark, and Docker](https://wp.me/p1RD28-6Fj)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read CSV-Format File\n", "Read CSV-format data file into a Spark DataFrame." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:29.626016Z", "start_time": "2019-12-06T03:45:29.416677Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.types import StructType, StructField, StringType, IntegerType" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:34.343019Z", "start_time": "2019-12-06T03:45:29.628615Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "spark = SparkSession \\\n", " .builder \\\n", " .appName('04_notebook') \\\n", " .config('spark.driver.extraClassPath', 'postgresql-42.2.10.jar') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:34.354352Z", "start_time": "2019-12-06T03:45:34.346118Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "bakery_schema = StructType([\n", " StructField('date', StringType(), True),\n", " StructField('time', StringType(), True),\n", " StructField('transaction', IntegerType(), True),\n", " StructField('item', StringType(), True)\n", "])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:36.751894Z", "start_time": "2019-12-06T03:45:34.357930Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "df1 = spark.read \\\n", " .format('csv') \\\n", " .option('header', 'true') \\\n", " .load('BreadBasket_DMS.csv', schema=bakery_schema)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:39.950006Z", "start_time": "2019-12-06T03:45:36.753623Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 21293\n", "DataFrame schema: DataFrame[date: string, time: string, transaction: int, item: string]\n", "+----------+--------+-----------+-------------+\n", "|date |time |transaction|item |\n", "+----------+--------+-----------+-------------+\n", "|2016-10-30|09:58:11|1 |Bread |\n", "|2016-10-30|10:05:34|2 |Scandinavian |\n", "|2016-10-30|10:05:34|2 |Scandinavian |\n", "|2016-10-30|10:07:57|3 |Hot chocolate|\n", "|2016-10-30|10:07:57|3 |Jam |\n", "|2016-10-30|10:07:57|3 |Cookies |\n", "|2016-10-30|10:08:41|4 |Muffin |\n", "|2016-10-30|10:13:03|5 |Coffee |\n", "|2016-10-30|10:13:03|5 |Pastry |\n", "|2016-10-30|10:13:03|5 |Bread |\n", "+----------+--------+-----------+-------------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "print('DataFrame rows: %d' % df1.count())\n", "print('DataFrame schema: %s' % df1)\n", "df1.show(10, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run PostgreSQL Script\n", "Run the sql script to create the database schema and import data from CSV file." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:40.268164Z", "start_time": "2019-12-06T03:45:39.954014Z" }, "pycharm": { "is_executing": false }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DROP TABLE IF EXISTS \"transactions\"\n", "\n", "DROP SEQUENCE IF EXISTS transactions_id_seq\n", "\n", "CREATE SEQUENCE transactions_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1\n", "\n", "\n", "CREATE TABLE \"public\".\"transactions\"\n", "(\n", " \"id\" integer DEFAULT nextval('transactions_id_seq') NOT NULL,\n", " \"date\" character varying(10) NOT NULL,\n", " \"time\" character varying(8) NOT NULL,\n", " \"transaction\" integer NOT NULL,\n", " \"item\" character varying(50) NOT NULL\n", ") WITH (oids = false)\n", "\n", "Row count: 21293\n" ] } ], "source": [ "%run -i '03_load_sql.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load PostgreSQL Data\n", "Load the PostgreSQL 'transactions' table's contents into a Spark DataFrame." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:40.276617Z", "start_time": "2019-12-06T03:45:40.270872Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "properties = {\n", " 'driver': 'org.postgresql.Driver',\n", " 'url': 'jdbc:postgresql://postgres:5432/bakery',\n", " 'user': 'postgres',\n", " 'password': 'postgres1234',\n", " 'dbtable': 'transactions',\n", "}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:40.600010Z", "start_time": "2019-12-06T03:45:40.278813Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "df2 = spark.read \\\n", " .format('jdbc') \\\n", " .option('driver', properties['driver']) \\\n", " .option('url', properties['url']) \\\n", " .option('user', properties['user']) \\\n", " .option('password', properties['password']) \\\n", " .option('dbtable', properties['dbtable']) \\\n", " .load()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:40.979256Z", "start_time": "2019-12-06T03:45:40.602206Z" }, "pycharm": { "is_executing": false }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 21293\n", "DataFrame schema: DataFrame[date: string, time: string, transaction: int, item: string]\n", "+---+----------+--------+-----------+-------------+\n", "|id |date |time |transaction|item |\n", "+---+----------+--------+-----------+-------------+\n", "|1 |2016-10-30|09:58:11|1 |Bread |\n", "|2 |2016-10-30|10:05:34|2 |Scandinavian |\n", "|3 |2016-10-30|10:05:34|2 |Scandinavian |\n", "|4 |2016-10-30|10:07:57|3 |Hot chocolate|\n", "|5 |2016-10-30|10:07:57|3 |Jam |\n", "|6 |2016-10-30|10:07:57|3 |Cookies |\n", "|7 |2016-10-30|10:08:41|4 |Muffin |\n", "|8 |2016-10-30|10:13:03|5 |Coffee |\n", "|9 |2016-10-30|10:13:03|5 |Pastry |\n", "|10 |2016-10-30|10:13:03|5 |Bread |\n", "+---+----------+--------+-----------+-------------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "print('DataFrame rows: %d' % df1.count())\n", "print('DataFrame schema: %s' % df1)\n", "df2.show(10, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a New Record\n", "Create a new bakery record and load into a Spark DataFrame." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:41.359691Z", "start_time": "2019-12-06T03:45:40.980775Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "data = [('2016-10-30', '10:13:27', 2, 'Pastry')]\n", "df3 = spark.createDataFrame(data, bakery_schema)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:42.269428Z", "start_time": "2019-12-06T03:45:41.363867Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 1\n", "DataFrame schema: DataFrame[date: string, time: string, transaction: int, item: string]\n", "+----------+--------+-----------+------+\n", "|date |time |transaction|item |\n", "+----------+--------+-----------+------+\n", "|2016-10-30|10:13:27|2 |Pastry|\n", "+----------+--------+-----------+------+\n", "\n" ] } ], "source": [ "print('DataFrame rows: %d' % df3.count())\n", "print('DataFrame schema: %s' % df3)\n", "df3.show(10, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Append Record to Database Table\n", "Append the contents of the DataFrame to the bakery PostgreSQL database's 'transactions' table." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:42.600551Z", "start_time": "2019-12-06T03:45:42.270651Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "df3.write \\\n", " .format('jdbc') \\\n", " .option('driver', properties['driver']) \\\n", " .option('url', properties['url']) \\\n", " .option('user', properties['user']) \\\n", " .option('password', properties['password']) \\\n", " .option('dbtable', properties['dbtable']) \\\n", " .mode('append') \\\n", " .save()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:42.748246Z", "start_time": "2019-12-06T03:45:42.602245Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 21294\n" ] } ], "source": [ "# should now contain one additional row of data\n", "print('DataFrame rows: %d' % df2.count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overwrite Data to Database Table\n", "Overwrite the contents of the CSV file-based DataFrame to the 'transactions' table." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:44.441462Z", "start_time": "2019-12-06T03:45:42.750237Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "df1.write \\\n", " .format('jdbc') \\\n", " .option('driver', properties['driver']) \\\n", " .option('url', properties['url']) \\\n", " .option('user', properties['user']) \\\n", " .option('password', properties['password']) \\\n", " .option('dbtable', properties['dbtable']) \\\n", " .option('truncate', 'true') \\\n", " .mode('overwrite') \\\n", " .save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze and Graph Data with BokehJS\n", "Perform some simple analysis of the bakery data and plot the results with [BokehJS](https://docs.bokeh.org/en/latest/index.html).\n", "### Business Questions\n", "1. What are the busiest days of the week?\n", "2. What are the busiest times of the day?\n", "3. What are the top selling bakery items?\n", "4. How many items do customers usually buy?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:44.682664Z", "start_time": "2019-12-06T03:45:44.445670Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", " var JS_MIME_TYPE = 'application/javascript';\n", " var HTML_MIME_TYPE = 'text/html';\n", " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " var CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " var script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " var cell = handle.cell;\n", "\n", " var id = cell.output_area._bokeh_element_id;\n", " var server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd, {\n", " iopub: {\n", " output: function(msg) {\n", " var id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " var output_area = handle.output_area;\n", " var output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " var bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " var script_attrs = bk_div.children[0].attributes;\n", " for (var i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " var toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " var events = require('base/js/events');\n", " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", "\n", " \n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " var el = document.getElementById(\"1001\");\n", " if (el != null) {\n", " el.textContent = \"BokehJS is loading...\";\n", " }\n", " if (root.Bokeh !== undefined) {\n", " if (el != null) {\n", " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", " }\n", " } else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", "\n", " function run_callbacks() {\n", " try {\n", " root._bokeh_onload_callbacks.forEach(function(callback) {\n", " if (callback != null)\n", " callback();\n", " });\n", " } finally {\n", " delete root._bokeh_onload_callbacks\n", " }\n", " console.debug(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(css_urls, js_urls, callback) {\n", " if (css_urls == null) css_urls = [];\n", " if (js_urls == null) js_urls = [];\n", "\n", " root._bokeh_onload_callbacks.push(callback);\n", " if (root._bokeh_is_loading > 0) {\n", " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", "\n", " function on_load() {\n", " root._bokeh_is_loading--;\n", " if (root._bokeh_is_loading === 0) {\n", " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", " run_callbacks()\n", " }\n", " }\n", "\n", " function on_error() {\n", " console.error(\"failed to load \" + url);\n", " }\n", "\n", " for (var i = 0; i < css_urls.length; i++) {\n", " var url = css_urls[i];\n", " const element = document.createElement(\"link\");\n", " element.onload = on_load;\n", " element.onerror = on_error;\n", " element.rel = \"stylesheet\";\n", " element.type = \"text/css\";\n", " element.href = url;\n", " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", " document.body.appendChild(element);\n", " }\n", "\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var element = document.createElement('script');\n", " element.onload = on_load;\n", " element.onerror = on_error;\n", " element.async = false;\n", " element.src = url;\n", " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.head.appendChild(element);\n", " }\n", " };var element = document.getElementById(\"1001\");\n", " if (element == null) {\n", " console.error(\"Bokeh: ERROR: autoload.js configured with elementid '1001' but no matching script tag was found. \")\n", " return false;\n", " }\n", "\n", " function inject_raw_css(css) {\n", " const element = document.createElement(\"style\");\n", " element.appendChild(document.createTextNode(css));\n", " document.body.appendChild(element);\n", " }\n", "\n", " \n", " var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-1.4.0.min.js\"];\n", " var css_urls = [];\n", " \n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " function(Bokeh) {\n", " \n", " \n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if (root.Bokeh !== undefined || force === true) {\n", " \n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i].call(root, root.Bokeh);\n", " }\n", " if (force === true) {\n", " display_loaded();\n", " }} else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!root._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " root._bokeh_failed_load = true;\n", " } else if (force !== true) {\n", " var cell = $(document.getElementById(\"1001\")).parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (root._bokeh_is_loading === 0) {\n", " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(css_urls, js_urls, function() {\n", " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(window));" ], "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"1001\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };var element = document.getElementById(\"1001\");\n if (element == null) {\n console.error(\"Bokeh: ERROR: autoload.js configured with elementid '1001' but no matching script tag was found. \")\n return false;\n }\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n \n var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-1.4.0.min.js\"];\n var css_urls = [];\n \n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n function(Bokeh) {\n \n \n }\n ];\n\n function run_inline_js() {\n \n if (root.Bokeh !== undefined || force === true) {\n \n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"1001\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from math import pi\n", "from bokeh.io import output_notebook, show\n", "from bokeh.plotting import figure\n", "from bokeh.models import ColumnDataSource\n", "from bokeh.transform import factor_cmap, cumsum\n", "from bokeh.palettes import Paired12\n", "\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pie Chart\n", "What are the busiest days of the week?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:45.072814Z", "start_time": "2019-12-06T03:45:44.684274Z" } }, "outputs": [], "source": [ "df1.createOrReplaceTempView('tmp_bakery')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:48.126136Z", "start_time": "2019-12-06T03:45:45.073935Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+-----+\n", "|day |count|\n", "+---------+-----+\n", "|Wednesday|2320 |\n", "|Monday |2324 |\n", "|Tuesday |2392 |\n", "|Thursday |2646 |\n", "|Sunday |3095 |\n", "|Friday |3124 |\n", "|Saturday |4605 |\n", "+---------+-----+\n", "\n" ] } ], "source": [ "sql_query = \"SELECT date_format(date, 'EEEE') as day, count(*) as count \" \\\n", " \"FROM tmp_bakery \" \\\n", " \"WHERE item NOT LIKE 'NONE' AND item NOT LIKE 'Adjustment' \" \\\n", " \"GROUP BY day \" \\\n", " \"ORDER BY count ASC \" \\\n", " \"LIMIT 10\"\n", "df4 = spark.sql(sql_query)\n", "df4.show(10, False)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:49.952748Z", "start_time": "2019-12-06T03:45:48.133574Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"6f764be0-f527-4346-b57f-fb478ac469ea\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1014\",\"type\":\"LinearAxis\"}],\"center\":[{\"id\":\"1018\",\"type\":\"Grid\"},{\"id\":\"1023\",\"type\":\"Grid\"},{\"id\":\"1053\",\"type\":\"Legend\"}],\"left\":[{\"id\":\"1019\",\"type\":\"LinearAxis\"}],\"plot_height\":450,\"plot_width\":700,\"renderers\":[{\"id\":\"1045\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1004\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1031\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1006\",\"type\":\"Range1d\"},\"x_scale\":{\"id\":\"1010\",\"type\":\"LinearScale\"},\"y_range\":{\"id\":\"1008\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"1012\",\"type\":\"LinearScale\"}},\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{\"callback\":null},\"id\":\"1008\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1050\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"dimension\":1,\"grid_line_color\":null,\"ticker\":{\"id\":\"1020\",\"type\":\"BasicTicker\"}},\"id\":\"1023\",\"type\":\"Grid\"},{\"attributes\":{\"items\":[{\"id\":\"1054\",\"type\":\"LegendItem\"}]},\"id\":\"1053\",\"type\":\"Legend\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":\"auto\",\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1024\",\"type\":\"PanTool\"},{\"id\":\"1025\",\"type\":\"WheelZoomTool\"},{\"id\":\"1026\",\"type\":\"BoxZoomTool\"},{\"id\":\"1027\",\"type\":\"SaveTool\"},{\"id\":\"1028\",\"type\":\"ResetTool\"},{\"id\":\"1029\",\"type\":\"HelpTool\"},{\"id\":\"1030\",\"type\":\"HoverTool\"}]},\"id\":\"1031\",\"type\":\"Toolbar\"},{\"attributes\":{\"field\":\"angle\",\"include_zero\":true},\"id\":\"1039\",\"type\":\"CumSum\"},{\"attributes\":{\"label\":{\"field\":\"day\"},\"renderers\":[{\"id\":\"1045\",\"type\":\"GlyphRenderer\"}]},\"id\":\"1054\",\"type\":\"LegendItem\"},{\"attributes\":{\"field\":\"angle\"},\"id\":\"1040\",\"type\":\"CumSum\"},{\"attributes\":{\"callback\":null,\"data\":{\"angle\":{\"__ndarray__\":\"lS49KWe/5j8XEZ19ccnmP7kb+xchdOc/+srDBbHx6T993M6EuFjuP6sGhmiDoe4/nK8Z+HmT9j8=\",\"dtype\":\"float64\",\"shape\":[7]},\"count\":[2320,2324,2392,2646,3095,3124,4605],\"day\":[\"Wednesday\",\"Monday\",\"Tuesday\",\"Thursday\",\"Sunday\",\"Friday\",\"Saturday\"],\"index\":[0,1,2,3,4,5,6]},\"selected\":{\"id\":\"1061\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1062\",\"type\":\"UnionRenderers\"}},\"id\":\"1041\",\"type\":\"ColumnDataSource\"},{\"attributes\":{\"callback\":null,\"start\":-0.5},\"id\":\"1006\",\"type\":\"Range1d\"},{\"attributes\":{},\"id\":\"1061\",\"type\":\"Selection\"},{\"attributes\":{},\"id\":\"1062\",\"type\":\"UnionRenderers\"},{\"attributes\":{},\"id\":\"1024\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"1025\",\"type\":\"WheelZoomTool\"},{\"attributes\":{\"text\":\"Items Sold/Day\"},\"id\":\"1004\",\"type\":\"Title\"},{\"attributes\":{\"overlay\":{\"id\":\"1052\",\"type\":\"BoxAnnotation\"}},\"id\":\"1026\",\"type\":\"BoxZoomTool\"},{\"attributes\":{},\"id\":\"1020\",\"type\":\"BasicTicker\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\":2},\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1052\",\"type\":\"BoxAnnotation\"},{\"attributes\":{},\"id\":\"1027\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"1028\",\"type\":\"ResetTool\"},{\"attributes\":{\"axis_label\":null,\"formatter\":{\"id\":\"1048\",\"type\":\"BasicTickFormatter\"},\"ticker\":{\"id\":\"1020\",\"type\":\"BasicTicker\"},\"visible\":false},\"id\":\"1019\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1010\",\"type\":\"LinearScale\"},{\"attributes\":{},\"id\":\"1029\",\"type\":\"HelpTool\"},{\"attributes\":{\"callback\":null,\"tooltips\":[[\"day\",\"@day\"],[\"count\",\"@{count}{,}\"]]},\"id\":\"1030\",\"type\":\"HoverTool\"},{\"attributes\":{\"end_angle\":{\"expr\":{\"id\":\"1040\",\"type\":\"CumSum\"},\"units\":\"rad\"},\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"radius\":{\"units\":\"data\",\"value\":0.4},\"start_angle\":{\"expr\":{\"id\":\"1039\",\"type\":\"CumSum\"},\"units\":\"rad\"},\"x\":{\"value\":0},\"y\":{\"value\":1}},\"id\":\"1044\",\"type\":\"Wedge\"},{\"attributes\":{},\"id\":\"1012\",\"type\":\"LinearScale\"},{\"attributes\":{\"end_angle\":{\"expr\":{\"id\":\"1040\",\"type\":\"CumSum\"},\"units\":\"rad\"},\"fill_color\":{\"field\":\"day\",\"transform\":{\"id\":\"1002\",\"type\":\"CategoricalColorMapper\"}},\"line_color\":{\"value\":\"white\"},\"radius\":{\"units\":\"data\",\"value\":0.4},\"start_angle\":{\"expr\":{\"id\":\"1039\",\"type\":\"CumSum\"},\"units\":\"rad\"},\"x\":{\"value\":0},\"y\":{\"value\":1}},\"id\":\"1043\",\"type\":\"Wedge\"},{\"attributes\":{},\"id\":\"1015\",\"type\":\"BasicTicker\"},{\"attributes\":{\"factors\":[\"Wednesday\",\"Monday\",\"Tuesday\",\"Thursday\",\"Sunday\",\"Friday\",\"Saturday\"],\"palette\":[\"#a6cee3\",\"#1f78b4\",\"#b2df8a\",\"#33a02c\",\"#fb9a99\",\"#e31a1c\",\"#fdbf6f\",\"#ff7f00\",\"#cab2d6\",\"#6a3d9a\",\"#ffff99\",\"#b15928\"]},\"id\":\"1002\",\"type\":\"CategoricalColorMapper\"},{\"attributes\":{\"data_source\":{\"id\":\"1041\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1043\",\"type\":\"Wedge\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1044\",\"type\":\"Wedge\"},\"selection_glyph\":null,\"view\":{\"id\":\"1046\",\"type\":\"CDSView\"}},\"id\":\"1045\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"grid_line_color\":null,\"ticker\":{\"id\":\"1015\",\"type\":\"BasicTicker\"}},\"id\":\"1018\",\"type\":\"Grid\"},{\"attributes\":{\"source\":{\"id\":\"1041\",\"type\":\"ColumnDataSource\"}},\"id\":\"1046\",\"type\":\"CDSView\"},{\"attributes\":{},\"id\":\"1048\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"axis_label\":null,\"formatter\":{\"id\":\"1050\",\"type\":\"BasicTickFormatter\"},\"ticker\":{\"id\":\"1015\",\"type\":\"BasicTicker\"},\"visible\":false},\"id\":\"1014\",\"type\":\"LinearAxis\"}],\"root_ids\":[\"1003\"]},\"title\":\"Bokeh Application\",\"version\":\"1.4.0\"}};\n", " var render_items = [{\"docid\":\"6f764be0-f527-4346-b57f-fb478ac469ea\",\"roots\":{\"1003\":\"ef7bb50b-4ecb-4a86-951a-cb2c294ac99b\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " clearInterval(timer);\n", " embed_document(root);\n", " } else {\n", " attempts++;\n", " if (attempts > 100) {\n", " clearInterval(timer);\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " }\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1003" } }, "output_type": "display_data" } ], "source": [ "data = df4.toPandas()\n", "tooltips = [('day', '@day'), ('count', '@{count}{,}')]\n", "days = data['day'].tolist()\n", "color_map = factor_cmap(field_name='day', palette=Paired12, factors=days)\n", "\n", "data['angle'] = data['count'] / data['count'].sum() * 2 * pi\n", "plot = figure(plot_height=450,\n", " plot_width=700,\n", " title='Items Sold/Day',\n", " tooltips=tooltips,\n", " x_range=(-0.5, 1.0))\n", "plot.wedge(x=0,\n", " y=1,\n", " radius=0.4,\n", " start_angle=cumsum('angle', include_zero=True),\n", " end_angle=cumsum('angle'),\n", " line_color='white',\n", " fill_color=color_map,\n", " legend_field='day',\n", " source=data)\n", "plot.axis.axis_label = None\n", "plot.axis.visible = False\n", "plot.grid.grid_line_color = None\n", "\n", "show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vertical Bar Chart\n", "What are the busiest times of the day?" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:53.354117Z", "start_time": "2019-12-06T03:45:49.956857Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+-----+\n", "|period|count|\n", "+------+-----+\n", "|07:00 |1 |\n", "|07:30 |23 |\n", "|08:00 |209 |\n", "|08:30 |436 |\n", "|09:00 |960 |\n", "|09:30 |1006 |\n", "|10:00 |1238 |\n", "|10:30 |1428 |\n", "|11:00 |1628 |\n", "|11:30 |1474 |\n", "+------+-----+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "def time_increment(h, m):\n", " \"\"\"Calculates a 30-minute time increment\n", "\n", " Parameters:\n", " h (str): hours, '0' or '00' to '23'\n", " m (str): minutes, '0' or '00' to '59'\n", "\n", " Returns:\n", " str: 30-minute time increment, i.e. '07:30', '23:00', or '12:00'\n", "\n", " \"\"\"\n", "\n", " increment = (int(m) * (100 / 60)) / 100 # 0.0000 - 0.9833\n", " increment = round(increment, 0) # 0.0 or 1.0\n", " increment = int(increment) * 30 # 0 or 30\n", " increment = str(h).rjust(2, '0') + ':' + str(increment).rjust(2, '0')\n", "\n", " return increment # i.e. '07:30' or '23:00'\n", "\n", "\n", "spark.udf.register(\"udfTimeIncrement\", time_increment, StringType())\n", "\n", "\n", "sql_query = \"WITH tmp_table AS (\" \\\n", " \" SELECT udfTimeIncrement(date_format(time, 'HH'), date_format(time, 'mm')) as period, count(*) as count \" \\\n", " \" FROM tmp_bakery \" \\\n", " \" WHERE item NOT LIKE 'NONE' AND item NOT LIKE 'Adjustment' \" \\\n", " \" GROUP BY period \" \\\n", " \" ORDER BY period ASC\" \\\n", " \") \" \\\n", " \"SELECT period, count \" \\\n", " \"FROM tmp_table \" \\\n", " \"WHERE period BETWEEN '07:00' AND '19:00'\"\n", "\n", "df5 = spark.sql(sql_query)\n", "df5.show(10, False)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:57.039172Z", "start_time": "2019-12-06T03:45:53.355547Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"8383c3d3-ce27-4680-adaa-5c98f981259b\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1121\",\"type\":\"CategoricalAxis\"}],\"center\":[{\"id\":\"1124\",\"type\":\"Grid\"},{\"id\":\"1129\",\"type\":\"Grid\"}],\"left\":[{\"id\":\"1125\",\"type\":\"LinearAxis\"}],\"min_border\":0,\"plot_height\":450,\"plot_width\":900,\"renderers\":[{\"id\":\"1148\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1150\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1137\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1113\",\"type\":\"FactorRange\"},\"x_scale\":{\"id\":\"1117\",\"type\":\"CategoricalScale\"},\"y_range\":{\"id\":\"1115\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"1119\",\"type\":\"LinearScale\"}},\"id\":\"1112\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{\"axis_label\":\"Hour of the Day\",\"formatter\":{\"id\":\"1163\",\"type\":\"CategoricalTickFormatter\"},\"ticker\":{\"id\":\"1122\",\"type\":\"CategoricalTicker\"}},\"id\":\"1121\",\"type\":\"CategoricalAxis\"},{\"attributes\":{\"data_source\":{\"id\":\"1111\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1146\",\"type\":\"VBar\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1147\",\"type\":\"VBar\"},\"selection_glyph\":null,\"view\":{\"id\":\"1149\",\"type\":\"CDSView\"}},\"id\":\"1148\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"callback\":null,\"data\":{\"count\":[1,23,209,436,960,1006,1238,1428,1628,1474,1474,1380,1341,1276,1472,1168,1062,1053,824,519,300,68,42,40,32],\"index\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],\"period\":[\"07:00\",\"07:30\",\"08:00\",\"08:30\",\"09:00\",\"09:30\",\"10:00\",\"10:30\",\"11:00\",\"11:30\",\"12:00\",\"12:30\",\"13:00\",\"13:30\",\"14:00\",\"14:30\",\"15:00\",\"15:30\",\"16:00\",\"16:30\",\"17:00\",\"17:30\",\"18:00\",\"18:30\",\"19:00\"]},\"selected\":{\"id\":\"1166\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1167\",\"type\":\"UnionRenderers\"}},\"id\":\"1111\",\"type\":\"ColumnDataSource\"},{\"attributes\":{\"callback\":null,\"tooltips\":[[\"period\",\"@period\"],[\"count\",\"@{count}{,}\"]]},\"id\":\"1136\",\"type\":\"HoverTool\"},{\"attributes\":{},\"id\":\"1119\",\"type\":\"LinearScale\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":\"auto\",\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1130\",\"type\":\"PanTool\"},{\"id\":\"1131\",\"type\":\"WheelZoomTool\"},{\"id\":\"1132\",\"type\":\"BoxZoomTool\"},{\"id\":\"1133\",\"type\":\"SaveTool\"},{\"id\":\"1134\",\"type\":\"ResetTool\"},{\"id\":\"1135\",\"type\":\"HelpTool\"},{\"id\":\"1136\",\"type\":\"HoverTool\"}]},\"id\":\"1137\",\"type\":\"Toolbar\"},{\"attributes\":{},\"id\":\"1166\",\"type\":\"Selection\"},{\"attributes\":{\"callback\":null},\"id\":\"1115\",\"type\":\"DataRange1d\"},{\"attributes\":{\"overlay\":{\"id\":\"1165\",\"type\":\"BoxAnnotation\"}},\"id\":\"1132\",\"type\":\"BoxZoomTool\"},{\"attributes\":{\"source\":{\"id\":\"1111\",\"type\":\"ColumnDataSource\"}},\"id\":\"1149\",\"type\":\"CDSView\"},{\"attributes\":{\"dimension\":1,\"ticker\":{\"id\":\"1126\",\"type\":\"BasicTicker\"}},\"id\":\"1129\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"1163\",\"type\":\"CategoricalTickFormatter\"},{\"attributes\":{},\"id\":\"1161\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"ticker\":{\"id\":\"1122\",\"type\":\"CategoricalTicker\"}},\"id\":\"1124\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"1133\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"1117\",\"type\":\"CategoricalScale\"},{\"attributes\":{},\"id\":\"1134\",\"type\":\"ResetTool\"},{\"attributes\":{},\"id\":\"1122\",\"type\":\"CategoricalTicker\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\":2},\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1165\",\"type\":\"BoxAnnotation\"},{\"attributes\":{\"text\":\"Items Sold/Hour\"},\"id\":\"1150\",\"type\":\"Title\"},{\"attributes\":{},\"id\":\"1167\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"axis_label\":\"Total Items Sold\",\"formatter\":{\"id\":\"1161\",\"type\":\"BasicTickFormatter\"},\"ticker\":{\"id\":\"1126\",\"type\":\"BasicTicker\"}},\"id\":\"1125\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1135\",\"type\":\"HelpTool\"},{\"attributes\":{\"callback\":null,\"factors\":[\"07:00\",\"07:30\",\"08:00\",\"08:30\",\"09:00\",\"09:30\",\"10:00\",\"10:30\",\"11:00\",\"11:30\",\"12:00\",\"12:30\",\"13:00\",\"13:30\",\"14:00\",\"14:30\",\"15:00\",\"15:30\",\"16:00\",\"16:30\",\"17:00\",\"17:30\",\"18:00\",\"18:30\",\"19:00\"]},\"id\":\"1113\",\"type\":\"FactorRange\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"top\":{\"field\":\"count\"},\"width\":{\"value\":0.9},\"x\":{\"field\":\"period\"}},\"id\":\"1147\",\"type\":\"VBar\"},{\"attributes\":{},\"id\":\"1126\",\"type\":\"BasicTicker\"},{\"attributes\":{\"fill_color\":{\"value\":\"#1f77b4\"},\"line_color\":{\"value\":\"#1f77b4\"},\"top\":{\"field\":\"count\"},\"width\":{\"value\":0.9},\"x\":{\"field\":\"period\"}},\"id\":\"1146\",\"type\":\"VBar\"},{\"attributes\":{},\"id\":\"1131\",\"type\":\"WheelZoomTool\"},{\"attributes\":{},\"id\":\"1130\",\"type\":\"PanTool\"}],\"root_ids\":[\"1112\"]},\"title\":\"Bokeh Application\",\"version\":\"1.4.0\"}};\n", " var render_items = [{\"docid\":\"8383c3d3-ce27-4680-adaa-5c98f981259b\",\"roots\":{\"1112\":\"1bc6866d-29a0-4b79-87d3-d01eee51b3b7\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " clearInterval(timer);\n", " embed_document(root);\n", " } else {\n", " attempts++;\n", " if (attempts > 100) {\n", " clearInterval(timer);\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " }\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1112" } }, "output_type": "display_data" } ], "source": [ "source = ColumnDataSource(data=df5.toPandas())\n", "tooltips = [('period', '@period'), ('count', '@{count}{,}')]\n", "periods = source.data['period'].tolist()\n", "plot = figure(x_range=periods,\n", " plot_width=900,\n", " plot_height=450,\n", " min_border=0,\n", " tooltips=tooltips)\n", "plot.vbar(x='period', bottom=0, top='count', source=source, width=0.9)\n", "plot.title.text = 'Items Sold/Hour'\n", "plot.xaxis.axis_label = 'Hour of the Day'\n", "plot.yaxis.axis_label = 'Total Items Sold'\n", "\n", "show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Horizontal Bar Chart\n", "What are the top selling bakery items?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:58.169964Z", "start_time": "2019-12-06T03:45:57.042381Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------+-----+\n", "|item |count|\n", "+-------------+-----+\n", "|Coffee |5471 |\n", "|Bread |3325 |\n", "|Tea |1435 |\n", "|Cake |1025 |\n", "|Pastry |856 |\n", "|Sandwich |771 |\n", "|Medialuna |616 |\n", "|Hot chocolate|590 |\n", "|Cookies |540 |\n", "|Brownie |379 |\n", "+-------------+-----+\n", "\n" ] } ], "source": [ "sql_query = \"SELECT item, count(*) as count \" \\\n", " \"FROM tmp_bakery \" \\\n", " \"WHERE item NOT LIKE 'NONE' AND item NOT LIKE 'Adjustment' \" \\\n", " \"GROUP BY item \" \\\n", " \"ORDER BY count DESC \" \\\n", " \"LIMIT 10\"\n", "\n", "df6 = spark.sql(sql_query)\n", "df6.show(10, False)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:45:59.120361Z", "start_time": "2019-12-06T03:45:58.171342Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"d3112e97-deb3-4dea-90a8-8ab0648c6a45\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1232\",\"type\":\"LinearAxis\"}],\"center\":[{\"id\":\"1236\",\"type\":\"Grid\"},{\"id\":\"1240\",\"type\":\"Grid\"}],\"left\":[{\"id\":\"1237\",\"type\":\"CategoricalAxis\"}],\"min_border\":0,\"plot_height\":375,\"plot_width\":750,\"renderers\":[{\"id\":\"1259\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1261\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1248\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1224\",\"type\":\"DataRange1d\"},\"x_scale\":{\"id\":\"1228\",\"type\":\"LinearScale\"},\"y_range\":{\"id\":\"1226\",\"type\":\"FactorRange\"},\"y_scale\":{\"id\":\"1230\",\"type\":\"CategoricalScale\"}},\"id\":\"1223\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{\"text\":\"Top 10 Bakery Items\"},\"id\":\"1261\",\"type\":\"Title\"},{\"attributes\":{},\"id\":\"1228\",\"type\":\"LinearScale\"},{\"attributes\":{\"ticker\":{\"id\":\"1233\",\"type\":\"BasicTicker\"}},\"id\":\"1236\",\"type\":\"Grid\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"height\":{\"value\":0.9},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"right\":{\"field\":\"count\"},\"y\":{\"field\":\"item\"}},\"id\":\"1258\",\"type\":\"HBar\"},{\"attributes\":{\"dimension\":1,\"ticker\":{\"id\":\"1238\",\"type\":\"CategoricalTicker\"}},\"id\":\"1240\",\"type\":\"Grid\"},{\"attributes\":{\"source\":{\"id\":\"1222\",\"type\":\"ColumnDataSource\"}},\"id\":\"1260\",\"type\":\"CDSView\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":\"auto\",\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1241\",\"type\":\"PanTool\"},{\"id\":\"1242\",\"type\":\"WheelZoomTool\"},{\"id\":\"1243\",\"type\":\"BoxZoomTool\"},{\"id\":\"1244\",\"type\":\"SaveTool\"},{\"id\":\"1245\",\"type\":\"ResetTool\"},{\"id\":\"1246\",\"type\":\"HelpTool\"},{\"id\":\"1247\",\"type\":\"HoverTool\"}]},\"id\":\"1248\",\"type\":\"Toolbar\"},{\"attributes\":{\"data_source\":{\"id\":\"1222\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1257\",\"type\":\"HBar\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1258\",\"type\":\"HBar\"},\"selection_glyph\":null,\"view\":{\"id\":\"1260\",\"type\":\"CDSView\"}},\"id\":\"1259\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"axis_label\":\"Items\",\"formatter\":{\"id\":\"1281\",\"type\":\"CategoricalTickFormatter\"},\"ticker\":{\"id\":\"1238\",\"type\":\"CategoricalTicker\"}},\"id\":\"1237\",\"type\":\"CategoricalAxis\"},{\"attributes\":{\"callback\":null,\"tooltips\":[[\"item\",\"@item\"],[\"count\",\"@{count}{,}\"]]},\"id\":\"1247\",\"type\":\"HoverTool\"},{\"attributes\":{},\"id\":\"1238\",\"type\":\"CategoricalTicker\"},{\"attributes\":{},\"id\":\"1246\",\"type\":\"HelpTool\"},{\"attributes\":{\"callback\":null,\"data\":{\"count\":[5471,3325,1435,1025,856,771,616,590,540,379],\"index\":[0,1,2,3,4,5,6,7,8,9],\"item\":[\"Coffee\",\"Bread\",\"Tea\",\"Cake\",\"Pastry\",\"Sandwich\",\"Medialuna\",\"Hot chocolate\",\"Cookies\",\"Brownie\"]},\"selected\":{\"id\":\"1286\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1287\",\"type\":\"UnionRenderers\"}},\"id\":\"1222\",\"type\":\"ColumnDataSource\"},{\"attributes\":{},\"id\":\"1245\",\"type\":\"ResetTool\"},{\"attributes\":{\"callback\":null},\"id\":\"1224\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1244\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"1283\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"overlay\":{\"id\":\"1285\",\"type\":\"BoxAnnotation\"}},\"id\":\"1243\",\"type\":\"BoxZoomTool\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\":2},\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1285\",\"type\":\"BoxAnnotation\"},{\"attributes\":{},\"id\":\"1281\",\"type\":\"CategoricalTickFormatter\"},{\"attributes\":{},\"id\":\"1242\",\"type\":\"WheelZoomTool\"},{\"attributes\":{},\"id\":\"1286\",\"type\":\"Selection\"},{\"attributes\":{},\"id\":\"1230\",\"type\":\"CategoricalScale\"},{\"attributes\":{},\"id\":\"1287\",\"type\":\"UnionRenderers\"},{\"attributes\":{},\"id\":\"1241\",\"type\":\"PanTool\"},{\"attributes\":{\"callback\":null,\"factors\":[\"Brownie\",\"Cookies\",\"Hot chocolate\",\"Medialuna\",\"Sandwich\",\"Pastry\",\"Cake\",\"Tea\",\"Bread\",\"Coffee\"]},\"id\":\"1226\",\"type\":\"FactorRange\"},{\"attributes\":{},\"id\":\"1233\",\"type\":\"BasicTicker\"},{\"attributes\":{\"axis_label\":\"Total Items Sold\",\"formatter\":{\"id\":\"1283\",\"type\":\"BasicTickFormatter\"},\"ticker\":{\"id\":\"1233\",\"type\":\"BasicTicker\"}},\"id\":\"1232\",\"type\":\"LinearAxis\"},{\"attributes\":{\"fill_color\":{\"value\":\"#1f77b4\"},\"height\":{\"value\":0.9},\"line_color\":{\"value\":\"#1f77b4\"},\"right\":{\"field\":\"count\"},\"y\":{\"field\":\"item\"}},\"id\":\"1257\",\"type\":\"HBar\"}],\"root_ids\":[\"1223\"]},\"title\":\"Bokeh Application\",\"version\":\"1.4.0\"}};\n", " var render_items = [{\"docid\":\"d3112e97-deb3-4dea-90a8-8ab0648c6a45\",\"roots\":{\"1223\":\"53185ccd-0d4a-4acb-9a6b-81521f9da94d\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " clearInterval(timer);\n", " embed_document(root);\n", " } else {\n", " attempts++;\n", " if (attempts > 100) {\n", " clearInterval(timer);\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " }\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1223" } }, "output_type": "display_data" } ], "source": [ "source = ColumnDataSource(data=df6.toPandas())\n", "tooltips = [('item', '@item'), ('count', '@{count}{,}')]\n", "items = source.data['item'].tolist()\n", "items.reverse()\n", "plot = figure(y_range=items,\n", " plot_width=750,\n", " plot_height=375,\n", " min_border=0,\n", " tooltips=tooltips)\n", "plot.hbar(y='item', right='count', height=.9, source=source)\n", "plot.title.text = 'Top 10 Bakery Items'\n", "plot.yaxis.axis_label = 'Items'\n", "plot.xaxis.axis_label = 'Total Items Sold'\n", "\n", "show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vertical Bar Chart\n", "How many items do customers usually buy?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:01.838780Z", "start_time": "2019-12-06T03:45:59.122373Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+-----+\n", "|order_size|count|\n", "+----------+-----+\n", "|1 |3630 |\n", "|2 |2908 |\n", "|3 |1528 |\n", "|4 |850 |\n", "|5 |341 |\n", "|6 |135 |\n", "|7 |38 |\n", "|8 |21 |\n", "|9 |7 |\n", "|10 |2 |\n", "|11 |4 |\n", "+----------+-----+\n", "\n" ] } ], "source": [ "sql_query = \"WITH tmp_table AS (\" \\\n", " \" SELECT transaction, count(*) as order_size \" \\\n", " \" FROM tmp_bakery \" \\\n", " \" WHERE item NOT LIKE 'NONE' AND item NOT LIKE 'Adjustment' \" \\\n", " \" GROUP BY transaction \" \\\n", " \" ORDER BY order_size DESC\" \\\n", " \") \" \\\n", " \"SELECT order_size, count(*) as count \" \\\n", " \"FROM tmp_table \" \\\n", " \"GROUP BY order_size \" \\\n", " \"ORDER BY order_size ASC\" \\\n", "\n", "df7 = spark.sql(sql_query)\n", "df7.show(24, False)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:04.794879Z", "start_time": "2019-12-06T03:46:01.842464Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"e530c1b2-2392-434a-886f-b501dd29c1bd\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1352\",\"type\":\"CategoricalAxis\"}],\"center\":[{\"id\":\"1355\",\"type\":\"Grid\"},{\"id\":\"1360\",\"type\":\"Grid\"}],\"left\":[{\"id\":\"1356\",\"type\":\"LinearAxis\"}],\"min_border\":0,\"plot_height\":375,\"plot_width\":750,\"renderers\":[{\"id\":\"1379\",\"type\":\"GlyphRenderer\"},{\"id\":\"1384\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1386\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1368\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1344\",\"type\":\"FactorRange\"},\"x_scale\":{\"id\":\"1348\",\"type\":\"CategoricalScale\"},\"y_range\":{\"id\":\"1346\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"1350\",\"type\":\"LinearScale\"}},\"id\":\"1343\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{},\"id\":\"1417\",\"type\":\"CategoricalTickFormatter\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"top\":{\"field\":\"count\"},\"width\":{\"value\":0.9},\"x\":{\"field\":\"order_size\"}},\"id\":\"1378\",\"type\":\"VBar\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":\"auto\",\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1361\",\"type\":\"PanTool\"},{\"id\":\"1362\",\"type\":\"WheelZoomTool\"},{\"id\":\"1363\",\"type\":\"BoxZoomTool\"},{\"id\":\"1364\",\"type\":\"SaveTool\"},{\"id\":\"1365\",\"type\":\"ResetTool\"},{\"id\":\"1366\",\"type\":\"HelpTool\"},{\"id\":\"1367\",\"type\":\"HoverTool\"}]},\"id\":\"1368\",\"type\":\"Toolbar\"},{\"attributes\":{\"callback\":null},\"id\":\"1346\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1361\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"1350\",\"type\":\"LinearScale\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\":2},\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1419\",\"type\":\"BoxAnnotation\"},{\"attributes\":{},\"id\":\"1362\",\"type\":\"WheelZoomTool\"},{\"attributes\":{},\"id\":\"1420\",\"type\":\"Selection\"},{\"attributes\":{\"overlay\":{\"id\":\"1419\",\"type\":\"BoxAnnotation\"}},\"id\":\"1363\",\"type\":\"BoxZoomTool\"},{\"attributes\":{\"callback\":null,\"factors\":[\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\",\"11\"]},\"id\":\"1344\",\"type\":\"FactorRange\"},{\"attributes\":{},\"id\":\"1421\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"fill_color\":{\"value\":\"#1f77b4\"},\"line_color\":{\"value\":\"#1f77b4\"},\"top\":{\"field\":\"count\"},\"width\":{\"value\":0.9},\"x\":{\"field\":\"order_size\"}},\"id\":\"1377\",\"type\":\"VBar\"},{\"attributes\":{},\"id\":\"1364\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"1353\",\"type\":\"CategoricalTicker\"},{\"attributes\":{},\"id\":\"1365\",\"type\":\"ResetTool\"},{\"attributes\":{},\"id\":\"1366\",\"type\":\"HelpTool\"},{\"attributes\":{\"callback\":null,\"tooltips\":[[\"order_size\",\"@order_size\"],[\"count\",\"@count\"]]},\"id\":\"1367\",\"type\":\"HoverTool\"},{\"attributes\":{\"callback\":null,\"data\":{\"count\":[3630,2908,1528,850,341,135,38,21,7,2,4],\"index\":[0,1,2,3,4,5,6,7,8,9,10],\"order_size\":[1,2,3,4,5,6,7,8,9,10,11]},\"selected\":{\"id\":\"1420\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1421\",\"type\":\"UnionRenderers\"}},\"id\":\"1342\",\"type\":\"ColumnDataSource\"},{\"attributes\":{\"data_source\":{\"id\":\"1342\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1377\",\"type\":\"VBar\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1378\",\"type\":\"VBar\"},\"selection_glyph\":null,\"view\":{\"id\":\"1380\",\"type\":\"CDSView\"}},\"id\":\"1379\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"source\":{\"id\":\"1342\",\"type\":\"ColumnDataSource\"}},\"id\":\"1380\",\"type\":\"CDSView\"},{\"attributes\":{\"text\":\"Transaction Size\"},\"id\":\"1386\",\"type\":\"Title\"},{\"attributes\":{\"dimension\":1,\"ticker\":{\"id\":\"1357\",\"type\":\"BasicTicker\"}},\"id\":\"1360\",\"type\":\"Grid\"},{\"attributes\":{\"line_color\":\"red\",\"line_width\":2,\"x\":{\"field\":\"order_size\"},\"y\":{\"field\":\"count\"}},\"id\":\"1382\",\"type\":\"Line\"},{\"attributes\":{\"line_alpha\":0.1,\"line_color\":\"#1f77b4\",\"line_width\":2,\"x\":{\"field\":\"order_size\"},\"y\":{\"field\":\"count\"}},\"id\":\"1383\",\"type\":\"Line\"},{\"attributes\":{},\"id\":\"1357\",\"type\":\"BasicTicker\"},{\"attributes\":{},\"id\":\"1348\",\"type\":\"CategoricalScale\"},{\"attributes\":{\"data_source\":{\"id\":\"1342\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1382\",\"type\":\"Line\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1383\",\"type\":\"Line\"},\"selection_glyph\":null,\"view\":{\"id\":\"1385\",\"type\":\"CDSView\"}},\"id\":\"1384\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"axis_label\":\"Total Transactions\",\"formatter\":{\"id\":\"1415\",\"type\":\"BasicTickFormatter\"},\"ticker\":{\"id\":\"1357\",\"type\":\"BasicTicker\"}},\"id\":\"1356\",\"type\":\"LinearAxis\"},{\"attributes\":{\"ticker\":{\"id\":\"1353\",\"type\":\"CategoricalTicker\"}},\"id\":\"1355\",\"type\":\"Grid\"},{\"attributes\":{\"source\":{\"id\":\"1342\",\"type\":\"ColumnDataSource\"}},\"id\":\"1385\",\"type\":\"CDSView\"},{\"attributes\":{\"axis_label\":\"Items/Transaction\",\"formatter\":{\"id\":\"1417\",\"type\":\"CategoricalTickFormatter\"},\"ticker\":{\"id\":\"1353\",\"type\":\"CategoricalTicker\"}},\"id\":\"1352\",\"type\":\"CategoricalAxis\"},{\"attributes\":{},\"id\":\"1415\",\"type\":\"BasicTickFormatter\"}],\"root_ids\":[\"1343\"]},\"title\":\"Bokeh Application\",\"version\":\"1.4.0\"}};\n", " var render_items = [{\"docid\":\"e530c1b2-2392-434a-886f-b501dd29c1bd\",\"roots\":{\"1343\":\"2e6f7fd3-d357-4ab4-9c76-c15c1c682ccb\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " clearInterval(timer);\n", " embed_document(root);\n", " } else {\n", " attempts++;\n", " if (attempts > 100) {\n", " clearInterval(timer);\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " }\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1343" } }, "output_type": "display_data" } ], "source": [ "source = ColumnDataSource(data=df7.toPandas())\n", "tooltips = [('order_size', '@order_size'), ('count', '@count')]\n", "items = source.data['order_size'].tolist()\n", "items = list(map(str, items))\n", "plot = figure(x_range=items,\n", " plot_width=750,\n", " plot_height=375,\n", " min_border=0,\n", " tooltips=tooltips)\n", "plot.vbar(x='order_size', bottom=0, top='count', source=source, width=0.9)\n", "plot.line(x='order_size',\n", " y='count',\n", " source=source,\n", " line_color='red',\n", " line_width=2)\n", "plot.title.text = 'Transaction Size'\n", "plot.xaxis.axis_label = 'Items/Transaction'\n", "plot.yaxis.axis_label = 'Total Transactions'\n", "\n", "show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read and Write Data to Parquet\n", "Perform basic analysis of the bakery data using Spark SQL. Read and write resulting DataFrame contents to [Apache Parquet](https://parquet.apache.org/) format." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:07.263463Z", "start_time": "2019-12-06T03:46:04.800474Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 20506\n", "DataFrame schema: DataFrame[transaction: int, timestamp: timestamp, item: string]\n", "+-----------+-------------------+-------------+\n", "|transaction|timestamp |item |\n", "+-----------+-------------------+-------------+\n", "|1 |2016-10-30 09:58:11|Bread |\n", "|2 |2016-10-30 10:05:34|Scandinavian |\n", "|2 |2016-10-30 10:05:34|Scandinavian |\n", "|3 |2016-10-30 10:07:57|Cookies |\n", "|3 |2016-10-30 10:07:57|Hot chocolate|\n", "|3 |2016-10-30 10:07:57|Jam |\n", "|4 |2016-10-30 10:08:41|Muffin |\n", "|5 |2016-10-30 10:13:03|Bread |\n", "|5 |2016-10-30 10:13:03|Coffee |\n", "|5 |2016-10-30 10:13:03|Pastry |\n", "+-----------+-------------------+-------------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "sql_query = \"SELECT transaction, CAST(CONCAT(date,' ',time) as timestamp) as timestamp, item \" \\\n", " \"FROM tmp_bakery \" \\\n", " \"WHERE item NOT LIKE 'NONE' AND item NOT LIKE 'Adjustment' \" \\\n", " \"ORDER BY transaction ASC, item ASC\"\n", "\n", "df8 = spark.sql(sql_query)\n", "print('DataFrame rows: %d' % df8.count())\n", "print('DataFrame schema: %s' % df8)\n", "df8.show(10, False)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:20.183778Z", "start_time": "2019-12-06T03:46:07.264751Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "df8.write.parquet('output/bakery_parquet', mode='overwrite')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:21.330630Z", "start_time": "2019-12-06T03:46:20.184990Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 800K\n", "-rw-r--r-- 1 garystaf users 1.9K Dec 6 03:46 part-00000-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 2.0K Dec 6 03:46 part-00001-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 1.8K Dec 6 03:46 part-00002-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 2.0K Dec 6 03:46 part-00003-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 1.9K Dec 6 03:46 part-00004-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 1.9K Dec 6 03:46 part-00005-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 2.0K Dec 6 03:46 part-00006-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 1.9K Dec 6 03:46 part-00007-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "-rw-r--r-- 1 garystaf users 2.1K Dec 6 03:46 part-00008-50c8ea60-bdf4-4213-a6cd-78c9df626246-c000.snappy.parquet\n", "Parquet Files: 13\n" ] } ], "source": [ "! ls 2>&1 -lh output/bakery_parquet | head -10\n", "! echo 'Parquet Files:' $(ls | wc -l)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-12-06T03:46:30.259680Z", "start_time": "2019-12-06T03:46:21.339399Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame rows: 20506\n", "DataFrame schema: DataFrame[transaction: int, timestamp: timestamp, item: string]\n", "+-----------+-------------------+-------------+\n", "|transaction|timestamp |item |\n", "+-----------+-------------------+-------------+\n", "|1 |2016-10-30 09:58:11|Bread |\n", "|2 |2016-10-30 10:05:34|Scandinavian |\n", "|2 |2016-10-30 10:05:34|Scandinavian |\n", "|3 |2016-10-30 10:07:57|Cookies |\n", "|3 |2016-10-30 10:07:57|Hot chocolate|\n", "|3 |2016-10-30 10:07:57|Jam |\n", "|4 |2016-10-30 10:08:41|Muffin |\n", "|5 |2016-10-30 10:13:03|Bread |\n", "|5 |2016-10-30 10:13:03|Coffee |\n", "|5 |2016-10-30 10:13:03|Pastry |\n", "+-----------+-------------------+-------------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "df9 = spark.read.parquet('output/bakery_parquet')\n", "print('DataFrame rows: %d' % df9.count())\n", "print('DataFrame schema: %s' % df9)\n", "df9.select('transaction', 'timestamp', 'item') \\\n", " .sort('transaction', 'item') \\\n", " .show(10, False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [], "metadata": { "collapsed": false } } }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }