{"paragraphs":[{"text":"%md\n\n![sv-image](https://raw.githubusercontent.com/roberthryniewicz/images/master/silicon_valley_corporation.jpg)\n\n## Apache Spark in 5 Minutes \n#### Exploring Silicon Valley Show Episodes Dataset\n\n**Level**: Beginner\n**Language**: Scala\n**Requirements**: \n- [HDP 2.6](http://hortonworks.com/products/sandbox/) (or later) or [HDCloud](https://hortonworks.github.io/hdp-aws/)\n- Spark 2.x\n\n**Author**: Robert Hryniewicz\n**Follow** [@RobH8z](https://twitter.com/RobertH8z)","user":"admin","dateUpdated":"2017-06-13T18:56:56+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

\"sv-image\"

\n

Apache Spark in 5 Minutes

\n

Exploring Silicon Valley Show Episodes Dataset

\n

Level: Beginner
Language: Scala
Requirements:
- HDP 2.6 (or later) or HDCloud
- Spark 2.x

\n

Author: Robert Hryniewicz
Follow @RobH8z

\n
"}]},"apps":[],"jobName":"paragraph_1487287821074_-1765493669","id":"20161013-011142_1891215806","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-06-13T18:56:56+0000","dateFinished":"2017-06-13T18:56:56+0000","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:10940"},{"title":"Short Intro","text":"%md\n\nWelcome to a quick overview of Apache Spark with Sillicon Valley Episodes dataset. If you've never watched the Silicon Valley show you can learn more about it [here](https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)). \n\nIn this notebook we will download the dataset (in JSON format) from an external github repository, ingest it into a Spark Dataset and perform basic analysis, filtering, and word count.","user":"admin","dateUpdated":"2017-02-22T15:25:20+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Welcome to a quick overview of Apache Spark using the Silicon Valley Episodes dataset. If you’ve never watched the Silicon Valley show, you can learn more about it [here](https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)).

\n

In this notebook we will download the dataset (in JSON format) from an external GitHub repository, ingest it into a Spark Dataset, and perform basic analysis, filtering, and a word count.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821076_-1767802163","id":"20161013-011155_1645524279","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:20+0000","dateFinished":"2017-02-22T15:25:20+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10941"},{"title":"New to Scala?","text":"%md\n\nThroughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here's an excellent introductory [Tutorial](http://www.dhgarrette.com/nlpclass/scala/basics.html).","user":"admin","dateUpdated":"2017-02-22T15:25:22+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Throughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here’s an excellent introductory Tutorial.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821077_-1768186912","id":"20161013-173447_845128564","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:22+0000","dateFinished":"2017-02-22T15:25:22+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10942"},{"title":"New to Zeppelin?","text":"%md\n\nIf you haven't already, checkout the [Hortonworks Apache Zeppelin](https://hortonworks.com/apache/zeppelin/) page as well as the [Getting Started with Apache Zeppelin](http://hortonworks.com/hadoop-tutorial/getting-started-apache-zeppelin/) tutorial.\n\nYou will find the official Apache Zeppelin page [here](https://zeppelin.apache.org/).","user":"admin","dateUpdated":"2017-02-22T15:25:25+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

If you haven’t already, check out the Hortonworks Apache Zeppelin page as well as the Getting Started with Apache Zeppelin tutorial.

\n

You will find the official Apache Zeppelin page here.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821077_-1768186912","id":"20161014-155201_679736099","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:25+0000","dateFinished":"2017-02-22T15:25:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10943"},{"title":"New to Spark?","text":"%md\n\nApache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.\n\nIf you would like to learn more about Apache Spark visit:\n- [Official Apache Spark Page](http://spark.apache.org/)\n- [Hortonworks Apache Spark Page](http://hortonworks.com/apache/spark/)\n- [Hortonworks Apache Spark Docs](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_developing-spark-apps.html)","user":"admin","dateUpdated":"2017-02-22T15:25:27+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

\n

If you would like to learn more about Apache Spark visit:
- Official Apache Spark Page
- Hortonworks Apache Spark Page
- Hortonworks Apache Spark Docs

\n
"}]},"apps":[],"jobName":"paragraph_1487287821078_-1767032665","id":"20161014-121442_628671851","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:27+0000","dateFinished":"2017-02-22T15:25:27+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10944"},{"title":"How to run a paragraph?","text":"%md\nTo run a paragraph in a Zeppelin notebook you can either click the `play` button (blue triangle) on the right-hand side or simply press `Shift + Enter`.","user":"admin","dateUpdated":"2017-02-22T15:25:29+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

To run a paragraph in a Zeppelin notebook you can either click the play button (blue triangle) on the right-hand side or simply press Shift + Enter.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821078_-1767032665","id":"20161014-144044_1782842084","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:29+0000","dateFinished":"2017-02-22T15:25:29+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10945"},{"title":"What are Interpreters?","text":"%md\n\nIn the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run sql queries etc. Each paragraph will start with `%` followed by an interpreter name, e.g. `%spark2` for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, html etc.This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!\n\nThroughtout this notebook we will use the following interpreters:\n\n- `%spark2` - Spark interpreter to run Spark 2.x code written in Scala\n- `%spark2.sql` - Spark SQL interprter (to execute SQL queries against temporary tables in Spark)\n- `%sh` - Shell interpreter to run shell commands\n- `%angular` - Angular interpreter to run Angular and HTML code\n- `%md` - Markdown for displaying formatted text, links, and images\n\nTo learn more about Zeppelin interpreters check out this [link](https://zeppelin.apache.org/docs/0.5.6-incubating/manual/interpreters.html).","user":"admin","dateUpdated":"2017-02-22T15:25:31+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

In the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run SQL queries, etc. Each paragraph will start with % followed by an interpreter name, e.g. %spark2 for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, HTML, etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!

\n

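Throughout this notebook we will use the following interpreters:

\n
- %spark2 - Spark interpreter to run Spark 2.x code written in Scala
- %spark2.sql - Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
- %sh - Shell interpreter to run shell commands
- %angular - Angular interpreter to run Angular and HTML code
- %md - Markdown for displaying formatted text, links, and images
\n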

To learn more about Zeppelin interpreters check out this link.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821078_-1767032665","id":"20161014-145714_450762590","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:31+0000","dateFinished":"2017-02-22T15:25:31+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10946"},{"title":"Some initial delay to be expected...","text":"%md\n**Note**: The first time you run `spark.version` in the paragraph below, several services will initialize in the background. \nThis may take **1~2 min** so please **be patient**. Afterwards, each paragraph should run much more quickly since all the services will already be running.","user":"admin","dateUpdated":"2017-02-22T15:25:34+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Note: The first time you run spark.version in the paragraph below, several services will initialize in the background.
This may take 1~2 min so please be patient. Afterwards, each paragraph should run much more quickly since all the services will already be running.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821079_-1767417414","id":"20161014-144409_1067974024","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:25:34+0000","dateFinished":"2017-02-22T15:25:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10947"},{"title":"Verify Spark Version (should be 2.x)","text":"%spark2.spark\n\nspark.version","user":"admin","dateUpdated":"2017-03-09T12:32:33+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821079_-1767417414","id":"20161012-235330_1461856587","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-17T05:02:44+0000","dateFinished":"2017-02-17T05:03:05+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10948"},{"title":"Download JSON data file","text":"%sh \n\n# Remove old json file if already exists in local /tmp directory\nif [ -e /tmp/svepisodes.json ]\nthen\n rm -f /tmp/svepisodes.json\nfi\n\nwget https://raw.githubusercontent.com/roberthryniewicz/datasets/master/svepisodes.json -O 
/tmp/svepisodes.json","user":"admin","dateUpdated":"2017-02-18T09:41:46+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821080_-1769341158","id":"20161012-193914_1818868460","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T09:41:46+0000","dateFinished":"2017-02-18T09:41:47+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10949"},{"title":"Move file from Local Storage to HDFS (if available/supported)","text":"%sh\n\n# Remove existing (if any) copy of data from HDFS\nhdfs dfs -rm -f /tmp/svepisodes.json\n\n# Move downloaded JSON file from local storage to HDFS\nhdfs dfs -put /tmp/svepisodes.json /tmp\n","user":"admin","dateUpdated":"2017-02-22T15:26:30+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821080_-1769341158","id":"20161012-200245_1679004004","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T14:07:14+0000","dateFinished":"2017-02-22T14:07:18+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10950"},{"title":"Load data into a Spark DataFrame","text":"%spark2.spark\n\nval path = \"/tmp/svepisodes.json\"\nval svEpisodes = spark.read.json(path) // Create a DataFrame from JSON data (automatically infer schema and data 
types)","user":"admin","dateUpdated":"2017-03-09T12:32:47+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821080_-1769341158","id":"20161012-200853_1560821654","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T14:07:37+0000","dateFinished":"2017-02-22T14:07:38+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10951"},{"title":"What are Datasets and DataFrames?","text":"%md\n\n**Datasets** and **DataFrames** are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. The key difference between the two is that Datasets are strongly typed, whereas a DataFrame is a Dataset of generic Row objects.\n\nMore complex manipulations are possible on Datasets and DataFrames, but they are beyond the scope of this quick guide.\n\nTo learn more about Datasets and DataFrames check out this [link](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#datasets-and-dataframes).","user":"admin","dateUpdated":"2017-02-22T15:26:53+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. The key difference between the two is that Datasets are strongly typed, whereas a DataFrame is a Dataset of generic Row objects.

\n

More complex manipulations are possible on Datasets and DataFrames, but they are beyond the scope of this quick guide.

\n

To learn more about Datasets and DataFrames check out this link.
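
To make the typed versus untyped distinction concrete, here is a minimal sketch written for a `%spark2` paragraph or `spark-shell`. The `Episode` case class, the `session` value, and the two rows are assumptions made up for illustration; they are not the notebook's actual schema or data.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type for illustration only; the real dataset has more columns.
case class Episode(name: String, season: Long, number: Long)

// In Zeppelin, getOrCreate() returns the session the interpreter already started;
// master("local[*]") only matters when this runs outside the notebook.
val session = SparkSession.builder()
  .appName("dataset-vs-dataframe")
  .master("local[*]")
  .getOrCreate()
import session.implicits._

// Untyped: a DataFrame is a Dataset[Row]; columns are resolved by name at runtime.
val df: DataFrame = Seq(("Pilot A", 1L, 1L), ("Pilot B", 2L, 1L))
  .toDF("name", "season", "number")
val firstSeasonDf = df.filter($"season" === 1)

// Typed: converting to Dataset[Episode] lets the compiler check field access.
val ds: Dataset[Episode] = df.as[Episode]
val firstSeasonDs = ds.filter(_.season == 1L)

firstSeasonDs.show()
```

With the typed version, a misspelled field such as `_.seasn` fails at compile time, while a misspelled column name in the untyped version only fails when the query is analyzed.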

\n
"}]},"apps":[],"jobName":"paragraph_1487287821081_-1769725907","id":"20161014-131031_180366265","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:26:53+0000","dateFinished":"2017-02-22T15:26:53+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10952"},{"title":"Print DataFrame Schema","text":"%spark2.spark\n\nsvEpisodes.printSchema()","user":"admin","dateUpdated":"2017-03-09T12:33:04+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821081_-1769725907","id":"20161012-202011_596248668","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:03:24+0000","dateFinished":"2017-02-18T10:03:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10953"},{"title":"Data Description","text":"%angular\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n
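<table>
<tr><th></th><th>Column Name</th><th>Description</th></tr>
<tr><td>1</td><td>Airdate</td><td>Date when an episode was aired</td></tr>
<tr><td>2</td><td>Airstamp</td><td>Timestamp when an episode was aired</td></tr>
<tr><td>3</td><td>Airtime</td><td>Length of an actual episode airtime (no commercials)</td></tr>
<tr><td>4</td><td>Id</td><td>Unique show id</td></tr>
<tr><td>5</td><td>Name</td><td>Name of an episode</td></tr>
<tr><td>6</td><td>Number</td><td>Episode number</td></tr>
<tr><td>7</td><td>Runtime</td><td>Total length of an episode (including commercials)</td></tr>
<tr><td>8</td><td>Season</td><td>Show season</td></tr>
<tr><td>9</td><td>Summary</td><td>Brief summary of an episode</td></tr>
<tr><td>10</td><td>Url</td><td>URL where more information about an episode is available online</td></tr>
</table>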
\n\n\n\n","user":"admin","dateUpdated":"2017-02-22T15:26:59+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"ANGULAR","data":"\n\n\n\n\n\n\n\n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n \n \n \n\n
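<table>
<tr><th></th><th>Column Name</th><th>Description</th></tr>
<tr><td>1</td><td>Airdate</td><td>Date when an episode was aired</td></tr>
<tr><td>2</td><td>Airstamp</td><td>Timestamp when an episode was aired</td></tr>
<tr><td>3</td><td>Airtime</td><td>Length of an actual episode airtime (no commercials)</td></tr>
<tr><td>4</td><td>Id</td><td>Unique show id</td></tr>
<tr><td>5</td><td>Name</td><td>Name of an episode</td></tr>
<tr><td>6</td><td>Number</td><td>Episode number</td></tr>
<tr><td>7</td><td>Runtime</td><td>Total length of an episode (including commercials)</td></tr>
<tr><td>8</td><td>Season</td><td>Show season</td></tr>
<tr><td>9</td><td>Summary</td><td>Brief summary of an episode</td></tr>
<tr><td>10</td><td>Url</td><td>URL where more information about an episode is available online</td></tr>
</table>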
\n\n\n"}]},"apps":[],"jobName":"paragraph_1487287821081_-1769725907","id":"20161014-140056_345247395","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:26:59+0000","dateFinished":"2017-02-22T15:26:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10954"},{"title":"Show DataFrame Contents","text":"%spark2.spark\n\nsvEpisodes.show()","user":"admin","dateUpdated":"2017-03-09T12:33:41+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821082_-1768571661","id":"20161012-234401_1548074862","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:18:42+0000","dateFinished":"2017-02-18T10:18:44+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10955"},{"title":"Is there a more interactive way to display query results?","text":"%md\n\nShort answer, yes! The data displayed in the paragraph above isn't too interactive. To have a more dynamic experience, let's create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. 
The temporary view will allow us to run SQL queries to get back results.\n\nNote that the temporary view will reside in memory as long as the Spark session is alive.","user":"admin","dateUpdated":"2017-02-22T15:27:03+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Short answer, yes! The data displayed in the paragraph above isn’t too interactive. To have a more dynamic experience, let’s create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. The temporary view will allow us to run SQL queries to get back results.

\n

Note that the temporary view will reside in memory as long as the Spark session is alive.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821082_-1768571661","id":"20161013-005846_439497469","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:03+0000","dateFinished":"2017-02-22T15:27:03+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10956"},{"title":"Create a Temporary View","text":"%spark2.spark\n\n// Creates a temporary view\nsvEpisodes.createOrReplaceTempView(\"svepisodes\")","user":"admin","dateUpdated":"2017-03-09T12:33:41+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821083_-1768956409","id":"20161012-202125_3295223","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:22:30+0000","dateFinished":"2017-02-18T10:22:31+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10957"},{"title":"So now what?","text":"%md\n\nAt this point we can run queries using a familiar SQL syntax against our newly registered `svepisodes` table. \n\nNote that although we are using a SQL syntax in the following paragraph it is translated and executed using the Spark engine with all the expected optimizations.","user":"admin","dateUpdated":"2017-02-22T15:27:08+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

At this point we can run queries using familiar SQL syntax against our newly registered svepisodes view.

\n

Note that although we are using SQL syntax in the following paragraph, the query is translated and executed by the Spark engine with all the expected optimizations.
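
As a rough illustration of that point, the sketch below runs the same aggregation once as SQL text and once through the DataFrame API, then prints both physical plans. It is written for a `%spark2` paragraph or `spark-shell`; the `session` value, the `demo_episodes` view name, and the three rows are assumptions made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// getOrCreate() reuses the notebook's session when one already exists.
val session = SparkSession.builder()
  .appName("sql-vs-api")
  .master("local[*]")
  .getOrCreate()
import session.implicits._

// Made-up (season, number) rows standing in for the episodes data.
val demo = Seq((1L, 1L), (1L, 2L), (2L, 1L)).toDF("season", "number")
demo.createOrReplaceTempView("demo_episodes") // hypothetical view name

// The same aggregation, expressed as SQL text and as DataFrame API calls.
val viaSql = session.sql(
  "SELECT season, count(number) AS episodes FROM demo_episodes GROUP BY season")
val viaApi = demo.groupBy("season").agg(count("number").as("episodes"))

// Both queries go through the same optimizer; explain() prints each physical plan.
viaSql.explain()
viaApi.explain()
```

Comparing the two printed plans is a quick way to check that choosing SQL over the API is a matter of style rather than performance.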

\n
"}]},"apps":[],"jobName":"paragraph_1487287821083_-1768956409","id":"20161013-182547_1601163342","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:08+0000","dateFinished":"2017-02-22T15:27:08+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10958"},{"title":"View Data in an Interactive Table Format","text":"%spark2.sql\n\nSELECT * FROM svepisodes ORDER BY season, number","user":"admin","dateUpdated":"2017-02-18T10:22:33+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"airdate","index":0,"aggr":"sum"}],"values":[{"name":"airstamp","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"airdate","index":0,"aggr":"sum"},"yAxis":{"name":"airstamp","index":1,"aggr":"sum"}}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"},"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821083_-1768956409","id":"20161013-005646_1818766386","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:22:33+0000","dateFinished":"2017-02-18T10:22:33+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10959"},{"title":"Can we do something useful?","text":"%md\n\nOK, so now let's run a slightly more complex SQL query on the underlying table data.","user":"admin","dateUpdated":"2017-02-22T15:27:13+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

OK, so now let’s run a slightly more complex SQL query on the underlying table data.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821084_-1770880154","id":"20161013-182951_885833546","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:13+0000","dateFinished":"2017-02-22T15:27:13+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10960"},{"title":"Total Number of Episodes","text":"%spark2.sql\n\nSELECT count(1) AS TotalNumEpisodes FROM svepisodes","user":"admin","dateUpdated":"2017-02-18T10:22:46+0000","config":{"colWidth":4,"editorMode":"ace/mode/text","title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"TotalNumEpisodes","index":0,"aggr":"sum"}],"values":[],"groups":[],"scatter":{"xAxis":{"name":"TotalNumEpisodes","index":0,"aggr":"sum"}}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821084_-1770880154","id":"20161017-235756_1441150850","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:22:46+0000","dateFinished":"2017-02-18T10:22:46+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10961"},{"title":"Number of Episodes per Season","text":"%spark2.sql \n\nSELECT season, count(number) as episodes FROM svepisodes GROUP BY 
season","user":"admin","dateUpdated":"2017-02-18T10:22:50+0000","config":{"colWidth":8,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"season","index":0,"aggr":"sum"}],"values":[{"name":"episodes","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"season","index":0,"aggr":"sum"}}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821085_-1771264903","id":"20161012-202204_1707933023","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:22:50+0000","dateFinished":"2017-02-18T10:22:54+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10962"},{"title":"Word Count on Episode Summaries","text":"%md\n\nNow let's perform a basic word count on the summary column and find out which words occur most frequently. This should give us some indication of the popularity of certain characters and other relevant keywords in the context of the Silicon Valley show.","user":"admin","dateUpdated":"2017-02-22T15:27:18+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Now let’s perform a basic word count on the summary column and find out which words occur most frequently. This should give us some indication of the popularity of certain characters and other relevant keywords in the context of the Silicon Valley show.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821085_-1771264903","id":"20161013-010351_1570854534","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:19+0000","dateFinished":"2017-02-22T15:27:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10963"},{"title":"Raw Word Count","text":"%spark2.spark\n\nimport org.apache.spark.sql.functions._ // Import additional helper functions\n\nval svSummaries = svEpisodes.select(\"summary\").as[String] // Convert to String type (becomes a Dataset)\n\n// Extract individual words\nval words = svSummaries\n .flatMap(_.split(\"\\\\s+\")) // Split on whitespace\n .filter(_ != \"\") // Remove empty words\n .map(_.toLowerCase()) // Lowercase\n\n// Word count\nwords.groupByKey(value => value) // Group by word\n .count() // Count\n .orderBy($\"count(1)\" desc) // Order by most frequent\n .show() // Display results","user":"admin","dateUpdated":"2017-03-09T12:34:16+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821085_-1771264903","id":"20161013-000142_472015281","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:10:44+0000","dateFinished":"2017-02-18T10:10:50+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10964"},{"title":"Can we improve this?","text":"%md\n\nAs you can see there are plenty of stop words and punctuation marks that surface to the top. 
Let's clean this up a bit by creating a basic stop word list and a punctuation mark list that we'll use as basic filters before we aggregate and order the words again.","user":"admin","dateUpdated":"2017-02-22T15:27:23+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

As you can see, plenty of stop words and punctuation marks surface to the top. Let’s clean this up a bit by creating a basic stop word list and a punctuation mark list that we’ll use as basic filters before we aggregate and order the words again.
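
The filtering idea can be tried on an ordinary Scala collection before applying it to a Dataset. The sketch below uses no Spark at all; the sample sentence and the `stripPunctuation` helper are assumptions made up for illustration. Stripping punctuation attached to a word (such as `fox,`) is one possible refinement and goes a step further than filtering out stand-alone punctuation tokens.

```scala
// Plain Scala collections, no Spark needed; the sample sentence is made up.
val stopWords = Set("a", "an", "to", "and", "the", "of", "in", "for", "by", "at")

// Hypothetical helper: strip punctuation from both ends of a token.
def stripPunctuation(token: String): String =
  token.replaceAll("""^\p{Punct}+|\p{Punct}+$""", "")

val sample = "The quick fox, and the lazy fox. The end!"

val cleaned = sample
  .split("\\s+")                 // Split on whitespace
  .map(_.toLowerCase)            // Lowercase
  .map(stripPunctuation)         // Drop leading/trailing punctuation
  .filter(_.nonEmpty)            // Remove tokens that were only punctuation
  .filterNot(stopWords.contains) // Remove stop words
  .toList
// cleaned == List("quick", "fox", "lazy", "fox", "end")

// Count occurrences of each remaining word.
val counts = cleaned.groupBy(identity).map { case (word, hits) => (word, hits.size) }
```

A `Set` is used for the stop words here because membership checks are constant time; a short `List` behaves the same for a handful of entries.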

\n
"}]},"apps":[],"jobName":"paragraph_1487287821086_-1770110656","id":"20161013-010505_1972414834","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:23+0000","dateFinished":"2017-02-22T15:27:23+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10965"},{"title":"More Sophisticated Filtering","text":"%spark2.spark\n\nval stopWords = List(\"a\", \"an\", \"to\", \"and\", \"the\", \"of\", \"in\", \"for\", \"by\", \"at\") // Basic set of stop words\nval punctuationMarks = List(\"-\", \",\", \";\", \":\", \".\", \"?\", \"!\") // Basic set of punctuation marks\n\n// Filter out stop words and punctuation marks\nval wordsFiltered = words // Create a new Dataset\n .filter(!stopWords.contains(_)) // Remove stop words\n .filter(!punctuationMarks.contains(_)) // Remove punctuation marks","user":"admin","dateUpdated":"2017-03-09T12:34:16+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821086_-1770110656","id":"20161013-003539_1918843179","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:11:16+0000","dateFinished":"2017-02-18T10:11:17+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10966"},{"title":"Improved Word Count","text":"%spark2.spark\n\n// Word count\nwordsFiltered\n .groupBy($\"value\" as \"word\") // Group on values (default) column name\n .agg(count(\"*\") as \"occurences\") // Aggregate\n .orderBy($\"occurences\" desc) // Display most common words first\n .show() // Display 
results","user":"admin","dateUpdated":"2017-03-09T12:34:17+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821086_-1770110656","id":"20161013-004841_1248757887","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-18T10:12:16+0000","dateFinished":"2017-02-18T10:12:18+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10967"},{"title":"Note on the Results","text":"%md\n\nLooks like Richard, Pied Piper, Dinesh, Erlich, Gavin and Jared are the key words in the Silicon Valley show. Looks like a lot revolves around Richard!","user":"admin","dateUpdated":"2017-02-22T15:27:31+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Looks like Richard, Pied Piper, Dinesh, Erlich, Gavin and Jared are the key words in the Silicon Valley show. Looks like a lot revolves around Richard!

\n
"}]},"apps":[],"jobName":"paragraph_1487287821087_-1770495405","id":"20161014-142139_512800114","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:31+0000","dateFinished":"2017-02-22T15:27:31+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10968"},{"title":"Final Comments on Word Count","text":"%md\n\nAs you can see, there's more to do with our word list, e.g. Piper and Piper's should be counted as the same word. There's more, of course, however this is beyond the scope of this quick intro to Apache Spark.","user":"admin","dateUpdated":"2017-02-22T15:27:34+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

As you can see, there’s more to do with our word list, e.g. Piper and Piper’s should be counted as the same word. There’s more, of course, but that is beyond the scope of this quick intro to Apache Spark.

\n
"}]},"apps":[],"jobName":"paragraph_1487287821087_-1770495405","id":"20161013-010754_1315051750","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:34+0000","dateFinished":"2017-02-22T15:27:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10969"},{"title":"Additional Resources","text":"%md\n\nWe hope you've enjoyed this brief intro to Apache Spark. Below are additional resources that you should find useful:\n\n1. [Hortonworks Apache Spark Tutorials](http://hortonworks.com/tutorials/#tuts-developers) are your natural next step where you can explore Spark in more depth.\n2. [Hortonworks Community Connection (HCC)](https://community.hortonworks.com/spaces/85/data-science.html?type=question) is a great resource for questions and answers on Spark, Data Analytics/Science, and many more Big Data topics.\n3. [Hortonworks Apache Spark Docs](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_developing-spark-apps.html) - official Spark documentation.\n4. [Hortonworks Apache Zeppelin Docs](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_zeppelin-component-guide/content/ch_using_zeppelin.html) - official Zeppelin documentation.\n","user":"admin","dateUpdated":"2017-02-22T15:27:37+0000","config":{"colWidth":10,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

We hope you’ve enjoyed this brief intro to Apache Spark. Below are additional resources that you should find useful:

\n
  1. Hortonworks Apache Spark Tutorials are your natural next step where you can explore Spark in more depth.
  2. Hortonworks Community Connection (HCC) is a great resource for questions and answers on Spark, Data Analytics/Science, and many more Big Data topics.
  3. Hortonworks Apache Spark Docs - official Spark documentation.
  4. Hortonworks Apache Zeppelin Docs - official Zeppelin documentation.
\n
"}]},"apps":[],"jobName":"paragraph_1487287821087_-1770495405","id":"20160226-200649_425588199","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:37+0000","dateFinished":"2017-02-22T15:27:37+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10970"},{"text":"%angular\n
\n
\n\n \"HCC\"\n\n
","user":"admin","dateUpdated":"2017-02-22T15:27:39+0000","config":{"colWidth":2,"editorMode":"ace/mode/scala","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"ANGULAR","data":"
\n
\n\n \"HCC\"\n\n
"}]},"apps":[],"jobName":"paragraph_1487287821088_-1784731114","id":"20161013-185141_1487979052","dateCreated":"2017-02-17T05:00:21+0000","dateStarted":"2017-02-22T15:27:39+0000","dateFinished":"2017-02-22T15:27:39+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:10971"},{"text":"","dateUpdated":"2017-02-17T05:00:21+0000","config":{"colWidth":12,"editorMode":"ace/mode/text","graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"text"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487287821088_-1784731114","id":"20161018-143930_1545375880","dateCreated":"2017-02-17T05:00:21+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:10972"}],"name":"Getting Started / Apache Spark in 5 Minutes","id":"2CBTZPY14","angularObjects":{"2C9J4X9BB:shared_process":[],"2C97XTJFE:shared_process":[],"2C9BD8WCX:shared_process":[],"2CBT85YD7:shared_process":[],"2C8RGTKC3:shared_process":[],"2CBQNWPMD:shared_process":[],"2C8JDGPHH:shared_process":[],"2C9CSKWHY:shared_process":[],"2CBN9WPNN:shared_process":[],"2CB11VTD7:shared_process":[],"2C9Z4TVBW:shared_process":[],"2CB3RUCX8:shared_process":[],"2C9PSG7XP:shared_process":[],"2C8PPBWFC:shared_process":[],"2C95B7UJY:shared_process":[],"2CB91QEZG:shared_process":[],"2CAPDMDA1:shared_process":[],"2CACTG458:shared_process":[],"2CAD4U2BW:shared_process":[],"2CBTJTHZE:shared_process":[],"2C9VPGHR9:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}