{"paragraphs":[{"text":"%md\n\n![sv-image](https://raw.githubusercontent.com/roberthryniewicz/images/master/silicon_valley_corporation.jpg)\n\n## Apache Spark in 5 Minutes \n#### Exploring Silicon Valley Show Episodes Dataset\n\n**Level**: Beginner\n**Language**: Scala\n**Requirements**: \n- [HDP 2.6](http://hortonworks.com/products/sandbox/) (or later) or [HDCloud](https://hortonworks.github.io/hdp-aws/)\n- Spark 2.x\n\n**Author**: Robert Hryniewicz\n**Follow** [@RobH8z](https://twitter.com/RobertH8z)","user":"admin","dateUpdated":"2017-06-13T18:56:56+0000","config":{"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
Level: Beginner
Language: Scala
Requirements:
- HDP 2.6 (or later) or HDCloud
- Spark 2.x
Author: Robert Hryniewicz
Follow @RobH8z
Welcome to a quick overview of Apache Spark using the Silicon Valley episodes dataset. If you've never watched the Silicon Valley show, you can learn more about it [here](https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)).
In this notebook we will download the dataset (in JSON format) from an external GitHub repository, ingest it into a Spark Dataset, and perform basic analysis, filtering, and a word count.
Throughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here's an excellent introductory tutorial.
If you haven't already, check out the Hortonworks Apache Zeppelin page as well as the Getting Started with Apache Zeppelin tutorial.
You will find the official Apache Zeppelin page here.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets.
If you would like to learn more about Apache Spark, visit:
- Official Apache Spark Page
- Hortonworks Apache Spark Page
- Hortonworks Apache Spark Docs
To run a paragraph in a Zeppelin notebook, you can either click the play button (blue triangle) on the right-hand side or simply press Shift + Enter.
In the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run SQL queries, and so on. Each paragraph starts with % followed by an interpreter name, e.g. %spark2 for a Spark 2.x interpreter. The interpreter name indicates what will be executed: code, Markdown, HTML, etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing, and more, all in one place!
Throughout this notebook we will use the following interpreters:
- %spark2 - Spark interpreter to run Spark 2.x code written in Scala
- %spark2.sql - Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
- %sh - Shell interpreter to run shell commands
- %angular - Angular interpreter to run Angular and HTML code
- %md - Markdown for displaying formatted text, links, and images

To learn more about Zeppelin interpreters, check out this link.
Note: The first time you run spark.version in the paragraph below, several services will initialize in the background. This may take 1 to 2 minutes, so please be patient. Afterwards, each paragraph should run much more quickly, since all the services will already be running.
Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases, and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. The key difference between a Dataset and a DataFrame is that Datasets are strongly typed; in Scala, a DataFrame is simply a Dataset of generic Row objects.
Complex manipulations are possible on Datasets and DataFrames, but they are beyond the scope of this quick guide.
To learn more about Datasets and DataFrames, check out this link.
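The typed versus untyped distinction can be sketched in a few lines of Scala. This is a minimal illustration, not part of the original lab: the `Rating` case class, its two rows, and the `session` name are hypothetical placeholders. In Zeppelin a session named `spark` already exists, and `getOrCreate()` should simply reuse it.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type and values, used only for this illustration.
case class Rating(title: String, stars: Int)

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .appName("dataset-vs-dataframe")
  .master("local[*]")
  .getOrCreate()
import session.implicits._

// A strongly typed Dataset: each element is a Rating.
val typed: Dataset[Rating] = Seq(Rating("first", 4), Rating("second", 2)).toDS()

// The same data as a DataFrame: each element is a generic Row.
val untyped: DataFrame = typed.toDF()

// Typed API: field access is checked by the Scala compiler.
val goodTitles: Array[String] = typed.filter(_.stars >= 3).map(_.title).collect()

// Untyped API: columns are referenced by name and resolved at run time.
val goodCount: Long = untyped.filter($"stars" >= 3).count()
```

A misspelled field such as `_.star` fails at compile time in the typed version, whereas a misspelled column name such as `$"star"` only fails when Spark analyzes the query.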
| # | Column Name | Description |
|---|---|---|
| 1 | Airdate | Date when an episode was aired |
| 2 | Airstamp | Timestamp when an episode was aired |
| 3 | Airtime | Length of an actual episode airtime (no commercials) |
| 4 | Id | Unique show id |
| 5 | Name | Name of an episode |
| 6 | Number | Episode number |
| 7 | Runtime | Total length of an episode (including commercials) |
| 8 | Season | Show season |
| 9 | Summary | Brief summary of an episode |
| 10 | Url | URL where more information is available online about an episode |
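
One way to picture these columns is as the fields of a Scala case class, which is also the shape a typed Dataset would need. The sketch below is an assumption-laden illustration: the `Episode` class, its lowercase field names, and the chosen types are hypothetical and may not match the JSON keys or the types Spark infers from the actual file.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical record type: field names and types are assumed for illustration,
// not taken from the dataset itself.
case class Episode(
  airdate: String,
  airstamp: String,
  airtime: String,
  id: Long,
  name: String,
  number: Long,
  runtime: Long,
  season: Long,
  summary: String,
  url: String
)

// Derive the Spark schema that corresponds to the case class (no data is read).
val episodeSchema: StructType = Encoders.product[Episode].schema

// Print the schema as a tree, one line per column.
episodeSchema.printTreeString()
```

Comparing this tree with the output of `printSchema()` on the ingested data is a quick way to see whether the assumed names and types line up with what Spark actually infers.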