{"paragraphs":[{"text":"%md\n\n## Exploring Spark SQL Module\n#### with an Airline Dataset\n\n**Level**: Beginner\n**Language**: Scala\n**Requirements**: \n- [HDP 2.6](http://hortonworks.com/products/sandbox/) (or later) or [HDCloud](https://hortonworks.github.io/hdp-aws/)\n- Spark 2.x\n\n**Author**: Robert Hryniewicz\n**Follow** [@RobH8z](https://twitter.com/RobertH8z)","dateUpdated":"2017-06-13T19:04:13+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Exploring Spark SQL Module

\n

with an Airline Dataset

\n

Level: Beginner
Language: Scala
Requirements:
- HDP 2.6 (or later) or HDCloud
- Spark 2.x

\n

Author: Robert Hryniewicz
Follow @RobH8z

\n
"}]},"apps":[],"jobName":"paragraph_1497380635124_-2130072315","id":"20160410-003138_1880368561","dateCreated":"2017-06-13T19:03:55+0000","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21057","user":"admin","dateFinished":"2017-06-13T19:04:13+0000","dateStarted":"2017-06-13T19:04:13+0000"},{"title":"Introduction","text":"%md\n\nIn this lab you will use Spark SQL via DataFrames API in Part 1 of the lab and SQL API in Part 2 of the lab to explore an Airline Dataset. This is a very interesting dataset that is further explored in other demo notebooks.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":217,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

In this lab you will use the Spark SQL DataFrames API in Part 1 and the SQL API in Part 2 to explore an Airline Dataset. This is a very interesting dataset that is explored further in other demo notebooks.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635125_-2130457064","id":"20160410-003138_985055475","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21058"},{"title":"Datasets? DataFrames?","text":"%md\n\nA **Dataset** is a distributed collection of data. Dataset provides the benefits of strong typing, ability to use powerful lambda functions with the benefits of (Spark SQL’s) optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.\n\nA **DataFrame** is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)\n\nThroughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames. [[source](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#datasets-and-dataframes)]","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

A Dataset is a distributed collection of data. Datasets provide the benefits of strong typing and powerful lambda functions, combined with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.

\n

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)

\n

Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames. [source]
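For a concrete feel for the distinction, here is a minimal sketch you could run in a %spark2 paragraph (the Flight case class is hypothetical; spark is the session Zeppelin provides):

%spark2

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// A hypothetical record type, for illustration only
case class Flight(carrier: String, depDelay: Int)

// A strongly typed Dataset built from JVM objects
val ds: Dataset[Flight] = Seq(Flight("AA", 20), Flight("UA", 5)).toDS()

// A DataFrame is simply Dataset[Row]: the same data viewed as named columns
val df: DataFrame = ds.toDF()
df.printSchema()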

\n
"}]},"apps":[],"jobName":"paragraph_1497380635125_-2130457064","id":"20160410-003138_875933602","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21059"},{"title":"New to Scala?","text":"%md\n\nThroughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here's an excellent introductory [Tutorial](http://www.dhgarrette.com/nlpclass/scala/basics.html).","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Throughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here’s an excellent introductory Tutorial.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635125_-2130457064","id":"20160410-140356_736870357","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21060"},{"title":"How to run a paragraph","text":"%md\nTo run a paragraph in a Zeppelin notebook you can either click the `play` button (blue triangle) on the right-hand side or simply press `Shift + Enter`.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

To run a paragraph in a Zeppelin notebook you can either click the play button (blue triangle) on the right-hand side or simply press Shift + Enter.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635126_-2129302817","id":"20160410-003138_1218388802","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21061"},{"title":"What are Interpreters?","text":"%md\n\nIn the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run sql queries etc. Each paragraph will start with `%` followed by an interpreter name, e.g. `%spark2` for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, html etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!\n\nThroughtout this notebook we will use the following interpreters:\n\n- `%spark2` - Spark interpreter to run Spark code written in Scala\n- `%spark2.sql` - Spark SQL interprter (to execute SQL queries against temporary tables in Spark)\n- `%sh` - Shell interpreter to run shell commands\n- `%angular` - Angular interpreter to run Angular and HTML code\n- `%md` - Markdown for displaying formatted text, links, and images\n\nTo learn more about Zeppelin interpreters check out this [link](https://zeppelin.apache.org/docs/0.5.6-incubating/manual/interpreters.html).","dateUpdated":"2017-06-13T19:03:55+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

In the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run SQL queries, etc. Each paragraph will start with % followed by an interpreter name, e.g. %spark2 for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, HTML, etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!

\n

Throughout this notebook we will use the following interpreters:

- %spark2 - Spark interpreter to run Spark code written in Scala
- %spark2.sql - Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
- %sh - Shell interpreter to run shell commands
- %angular - Angular interpreter to run Angular and HTML code
- %md - Markdown for displaying formatted text, links, and images

To learn more about Zeppelin interpreters check out this link.
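For example, a minimal sketch of a Scala paragraph bound to the Spark interpreter by its first line:

%spark2

// The %spark2 binding on the first line routes this paragraph to the Spark interpreter
println("Running Spark " + spark.version)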

\n
"}]},"apps":[],"jobName":"paragraph_1497380635126_-2129302817","id":"20160410-003138_290903368","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21062"},{"title":"Verify Spark Version (should be 2.x)","text":"%spark2\n\nspark.version","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635126_-2129302817","id":"20160410-003138_631425785","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21063"},{"title":"Download CSV flight data file ","text":"%sh\n\n# You will now download a subset of 2008 flights (only 100k lines)\n# The full dataset may be found here: http://stat-computing.org/dataexpo/2009/the-data.html\n\nwget https://raw.githubusercontent.com/roberthryniewicz/datasets/master/airline-dataset/flights/flights.csv -O /tmp/flights.csv\necho \"Downloaded!\"","dateUpdated":"2017-06-13T19:03:55+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":false,"language":"sh"},"editorMode":"ace/mode/sh","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635126_-2129302817","id":"20160410-003138_1540125404","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21064"},{"title":"Preview Downloaded File","text":"%sh\n\ncat /tmp/flights.csv | head","dateUpdated":"2017-06-13T19:03:55+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":false,"language":"sh"},"editorMode":"ace/mode/sh","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635127_-2129687566","id":"20160410-003138_226044813","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21065"},{"title":"Move dataset to HDFS (if supported/available)","text":"%sh\n\n# remove existing copies of dataset from HDFS\nhdfs dfs -rm -r -f /tmp/flights.csv\n\n# put data into HDFS\nhdfs dfs -put /tmp/flights.csv /tmp/","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sh"},"editorMode":"ace/mode/sh","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635127_-2129687566","id":"20160410-003138_1267267737","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21066"},{"title":"Create a DataFrame from CSV file","text":"%spark2\n\n// Create a flights DataFrame from CSV file\nval flights = spark.read\n .option(\"header\", \"true\") // Use first line as header\n .option(\"inferSchema\", 
\"true\") // Infer schema\n .csv(\"/tmp/flights.csv\") // Read data","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635127_-2129687566","id":"20160410-003138_236600548","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21067"},{"title":"Print Schema","text":"%spark2\n\n// Print the schema in a tree format\nflights.printSchema()","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635127_-2129687566","id":"20160410-003138_1553179639","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21068"},{"title":"Dataset Description","text":"%angular\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n
Name  Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
\n\n\n","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"ANGULAR","data":"\n\n\n\n\n\n\n\n \n \n \n\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n
Name  Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
\n\n\n"}]},"apps":[],"jobName":"paragraph_1497380635128_-2131611310","id":"20160410-003138_1626463388","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21069"},{"text":"%md\n### Part 1: Using DataFrame/Dataset API to Analyze the Airline Data\n\nNote: in this lab DataFrame and Dataset API calls will be indistinguishable. Internally, however, *flights* are represented as DataFrames and *delayedFlights* as Datasets in the examples below.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Part 1: Using DataFrame/Dataset API to Analyze the Airline Data

\n

Note: in this lab DataFrame and Dataset API calls will be indistinguishable. Internally, however, flights are represented as DataFrames and delayedFlights as Datasets in the examples below.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635128_-2131611310","id":"20160410-003138_650819453","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21070"},{"title":"Show a subset of columns","text":"%spark2\n\n// Show a subset of columns with \"select\"\nflights.select(\"UniqueCarrier\", \"FlightNum\", \"DepDelay\", \"ArrDelay\", \"Distance\").show()","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635128_-2131611310","id":"20160410-003138_1188332400","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21071"},{"title":"Apply a filter to find flights delayed more than 15 min","text":"%spark2\n\n// Create a Dataset containing flights with delayed departure by more than 15 min using \"filter\"\nval delayedFlights = flights\n .select(\"UniqueCarrier\", \"DepDelay\")\n .filter($\"DepDelay\" > 15)\n \ndelayedFlights.show()","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635128_-2131611310","id":"20160410-003138_704729700","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21072"},{"title":"Display percentage of delayed flights","text":"%spark2\n\nval numTotalFlights = flights.count()\nval numDelayedFlights = delayedFlights.count()\n\n// Print total number of delayed flights\nprintln(\"Percentage of Delayed Flights: \" + (numDelayedFlights.toFloat/numTotalFlights*100) + \"%\")","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635129_-2131996059","id":"20160410-003138_1019754695","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21073"},{"text":"%md\n\nWe can also create a user defined function (UDF) to determine delays.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

We can also create a user defined function (UDF) to determine delays.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635129_-2131996059","id":"20161017-203635_1855560775","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21074"},{"title":" Create a UDF to determine delays","text":"%spark2\n\nimport org.apache.spark.sql.functions.udf\n\n// Define a UDF to find delayed flights\n\n// Assume:\n// if ArrDelay is not available then Delayed = False\n// if ArrDelay > 15 min then Delayed = True else False\n\nval isDelayedUDF = udf((time: String) => if (time == \"NA\") 0 else if (time.toInt > 15) 1 else 0)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635129_-2131996059","id":"20161017-203017_1781904338","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21075"},{"title":"Create a new DataFrame with IsDelayed column","text":"%spark2\n\nval flightsWithDelays = flights.select($\"Year\", $\"Month\", $\"DayofMonth\", $\"UniqueCarrier\", $\"FlightNum\", $\"DepDelay\", \n isDelayedUDF($\"DepDelay\").alias(\"IsDelayed\"))\n \nflightsWithDelays.show(5)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635129_-2131996059","id":"20161017-203358_1309594443","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21076"},{"text":"%md\n\n\nNote that now we have a new table with a column that indicates whether a flight is delayed or not. This will allow us to calculate percentage of delayed flights in one pass.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Note that now we have a new table with a column that indicates whether a flight is delayed or not. This will allow us to calculate percentage of delayed flights in one pass.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635130_-2130841812","id":"20161017-205652_1397194952","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21077"},{"title":"Calculate percentage of delayed flights using flightsWithDelays DataFrame","text":"%spark2\n\nflightsWithDelays.agg((sum(\"IsDelayed\") * 100 / count(\"DepDelay\")).alias(\"Percentage of Delayed Flights\")).show()","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635130_-2130841812","id":"20161017-205750_819957102","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21078"},{"text":"%md\n\nAs you can see above, this is a very clean way of displaying a percentage of delayed flights. UDFs are useful in creating additional functions that are commonly used.\n\nNow let's explore our flights a bit more and find some averages.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

As you can see above, this is a very clean way of computing and displaying the percentage of delayed flights. UDFs are useful for packaging logic that you want to reuse across many queries.
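As an aside, the same flag can also be expressed with Spark's built-in when/otherwise column functions, which the Catalyst optimizer can see into, unlike an opaque UDF. A sketch, assuming DepDelay was read as a string column (as the UDF above assumes):

%spark2

import org.apache.spark.sql.functions.when

// Built-in equivalent of isDelayedUDF: "NA" -> 0, delay > 15 min -> 1, otherwise 0
val isDelayedCol = when($"DepDelay" === "NA", 0)
  .when($"DepDelay".cast("int") > 15, 1)
  .otherwise(0)

flights.select($"UniqueCarrier", isDelayedCol.alias("IsDelayed")).show(5)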

\n

Now let’s explore our flights a bit more and find some averages.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635130_-2130841812","id":"20161017-205919_1405069576","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21079"},{"title":"Find Avg Taxi-in","text":"%spark2\n\nflights.select(\"Origin\", \"Dest\", \"TaxiIn\")\n .groupBy(\"Origin\", \"Dest\")\n .agg(avg(\"TaxiIn\")\n .alias(\"AvgTaxiIn\"))\n .orderBy(desc(\"AvgTaxiIn\"))\n .show(10)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":6,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635130_-2130841812","id":"20160410-003138_1488719873","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21080"},{"title":"Find Avg Taxi-out","text":"%spark2\n\nflights.select(\"Origin\", \"Dest\", \"TaxiOut\")\n .groupBy(\"Origin\", \"Dest\")\n .agg(avg(\"TaxiOut\")\n .alias(\"AvgTaxiOut\"))\n .orderBy(desc(\"AvgTaxiOut\"))\n .show(10)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":6,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635131_-2131226561","id":"20160410-003138_840324935","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21081"},{"text":"%md\n### Part 2: Using SQL API to Analyze the Airline Data","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Part 2: Using SQL API to Analyze the Airline Data

\n
"}]},"apps":[],"jobName":"paragraph_1497380635131_-2131226561","id":"20160410-003138_582934314","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21082"},{"title":"Is there a more interactive way to display query results?","text":"%md\n\nAs you can see, the data displayed in Part 1 of this notebook isn't too interactive. To have a more dynamic experience, let's create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. The temporary view will allow us to execute SQL queries against it.\n\nNote that the temporary view will reside in memory as long as the Spark session is alive.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

As you can see, the data displayed in Part 1 of this notebook isn’t too interactive. To have a more dynamic experience, let’s create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. The temporary view will allow us to execute SQL queries against it.

\n

Note that the temporary view will reside in memory as long as the Spark session is alive.
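A quick way to check what is registered in the current session is the catalog API; a sketch (the view itself is created in the next paragraph):

%spark2

// Temporary views live only as long as this Spark session
spark.catalog.listTables().show()             // flightsView will show isTemporary = true once registered
// spark.catalog.dropTempView("flightsView") // drops the view explicitly when you are done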

\n
"}]},"apps":[],"jobName":"paragraph_1497380635131_-2131226561","id":"20160410-003138_556617784","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21083"},{"title":"Register a Temporary View","text":"%spark2\n\n// Convert flights DataFrame to a temporary view\nflights.createOrReplaceTempView(\"flightsView\")","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635132_-2133150306","id":"20160410-003138_636329356","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21084"},{"title":"Preview Data in an interactive table format","text":"%spark2.sql\n\nSELECT * FROM flightsView LIMIT 20","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"Year","index":0,"aggr":"sum"}],"values":[{"name":"Month","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"Year","index":0,"aggr":"sum"},"yAxis":{"name":"Month","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635132_-2133150306","id":"20160410-003138_318924232","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21085"},{"title":"Register a User Defined Function (UDF)","text":"%spark2\n\n// Register a helper UDF to find delayed flights\n// Note that this is a UDF specific for use with the sparkSession\n\n// Assume:\n// if ArrDelay is not available then Delayed = False\n// if ArrDelay > 15 min then Delayed = True else False\n\nspark.udf.register(\"isDelayedUDF\", (time: String) => if (time == \"NA\") 0 else if (time.toInt > 15) 1 else 0)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635132_-2133150306","id":"20160410-003138_40384312","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21086"},{"title":"Compare Total Number of Delayed Flights by Carrier","text":"%spark2.sql\n--- Compare Total Number of Delayed Flights by Carrier\nSELECT UniqueCarrier, SUM(isDelayedUDF(DepDelay)) AS NumDelays FROM flightsView GROUP BY 
UniqueCarrier","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":6,"editorHide":false,"title":true,"results":[{"graph":{"mode":"pieChart","height":296,"optionOpen":false,"keys":[{"name":"UniqueCarrier","index":0,"aggr":"sum"}],"values":[{"name":"NumDelays","index":1,"aggr":"sum"}],"groups":[],"scatter":{"yAxis":{"name":"NumDelays","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635132_-2133150306","id":"20160410-003138_134299332","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21087"},{"title":"Compare Total Delayed Time (min) by Carrier","text":"%spark2.sql\n--- Compare Total Delayed Time (min) by Carrier\nSELECT UniqueCarrier, SUM(DepDelay) AS TotalTimeDelay FROM flightsView GROUP BY UniqueCarrier","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":6,"editorHide":false,"title":true,"results":[{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"UniqueCarrier","index":0,"aggr":"sum"}],"values":[{"name":"TotalTimeDelay","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"UniqueCarrier","index":0,"aggr":"sum"},"yAxis":{"name":"TotalTimeDelay","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635132_-2133150306","id":"20160410-003138_163559927","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21088"},{"title":"Find Average Distance Travelled by Carrier","text":"%spark2.sql\n--- Find Average Distance Travelled by Carrier\nSELECT UniqueCarrier, avg(Distance) AS AvgDistanceTraveled FROM flightsView GROUP BY UniqueCarrier ORDER BY AvgDistanceTraveled DESC","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"pieChart","height":300,"optionOpen":false,"keys":[{"name":"UniqueCarrier","index":0,"aggr":"sum"}],"values":[{"name":"AvgDistanceTraveled","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"UniqueCarrier","index":0,"aggr":"sum"},"yAxis":{"name":"AvgDistanceTraveled","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635133_-2133535055","id":"20160410-003138_172624929","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21089"},{"title":"Find Out When Most Flights Get Delayed by Day of Week","text":"%spark2.sql\n\nSELECT DayOfWeek, CASE WHEN isDelayedUDF(DepDelay) = 1 THEN 'delayed' ELSE 'ok' END AS Delay, COUNT(1) AS Count\nFROM flightsView\nGROUP BY DayOfWeek, CASE WHEN isDelayedUDF(DepDelay) = 1 THEN 'delayed' ELSE 'ok' END\nORDER BY 
DayOfWeek","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"DayOfWeek","index":0,"aggr":"sum"}],"values":[{"name":"Count","index":2,"aggr":"sum"}],"groups":[{"name":"Delay","index":1,"aggr":"sum"}],"scatter":{"xAxis":{"name":"DayOfWeek","index":0,"aggr":"sum"},"yAxis":{"name":"Delay","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635133_-2133535055","id":"20160410-003138_56774606","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21090"},{"title":"Find Out When Most Flights Get Delayed by Hour","text":"%spark2.sql\n\nSELECT CAST(CRSDepTime / 100 AS INT) AS Hour, CASE WHEN isDelayedUDF(DepDelay) = 1 THEN 'delayed' ELSE 'ok' END AS Delay, COUNT(1) AS Count\nFROM flightsView\nGROUP BY CAST(CRSDepTime / 100 AS INT), CASE WHEN isDelayedUDF(DepDelay) = 1 THEN 'delayed' ELSE 'ok' END\nORDER BY Hour","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"stackedAreaChart","height":300,"optionOpen":false,"keys":[{"name":"Hour","index":0,"aggr":"sum"}],"values":[{"name":"Count","index":2,"aggr":"sum"}],"groups":[{"name":"Delay","index":1,"aggr":"sum"}],"scatter":{"xAxis":{"name":"Hour","index":0,"aggr":"sum"},"yAxis":{"name":"Delay","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635133_-2133535055","id":"20160410-003138_728063774","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21091"},{"title":"Putting it all together","text":"%md\n\nNow, with all these basic analytics in Part 1 and 2 of this lab, you should have a fairly good idea which flights have the most delays, on which routes, from which airports, at which hour, on which days of the week and months of the year, and be able to start making meaningful predictions yourself. That's the power of using Spark with Zeppelin -- having one powerful environment to perform data munging, wrangling, visualization and more on large datasets.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Now, with all these basic analytics in Parts 1 and 2 of this lab, you should have a fairly good idea of which flights have the most delays, on which routes, from which airports, at which hour, on which days of the week and months of the year, and you should be able to start making meaningful predictions yourself. That’s the power of using Spark with Zeppelin – having one powerful environment to perform data munging, wrangling, visualization and more on large datasets.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635133_-2133535055","id":"20161017-210202_1567750763","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21092"},{"text":"%md\n\n## Persisting Results / Data\n\nFinally, let's persist some of our results by saving our DataFrames in an optimized file format called ORC.\n","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Persisting Results / Data

\n

Finally, let’s persist some of our results by saving our DataFrames in an optimized file format called ORC.

\n
"}]},"apps":[],"jobName":"paragraph_1497380635134_-2132380808","id":"20161017-212723_1255606607","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21093"},{"text":"%angular\n\n

Save Modes

\n\n\n\n\n \n \n \t\t\n \n \n \n \t\n \n \n \n \t\t\n \n \n \n \t\t\n \n \n \n \n \n
Mode (Scala/Java)  Meaning
SaveMode.ErrorIfExists (default)When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.AppendWhen saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
SaveMode.OverwriteOverwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.IgnoreIgnore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
\n\n
\nNote: Save operations can optionally take a SaveMode that specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"ANGULAR","data":"

Save Modes

\n\n\n\n\n \n \n \t\t\n \n \n \n \t\n \n \n \n \t\t\n \n \n \n \t\t\n \n \n \n \n \n
Mode (Scala/Java)  Meaning
SaveMode.ErrorIfExists (default)When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.AppendWhen saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
SaveMode.OverwriteOverwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.IgnoreIgnore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
\n\n
\nNote: Save operations can optionally take a SaveMode that specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data."}]},"apps":[],"jobName":"paragraph_1497380635134_-2132380808","id":"20160410-003138_206029012","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21094"},{"title":"Save to ORC file","text":"%spark2\n\nimport org.apache.spark.sql.SaveMode\n\n// Save and Overwrite our new DataFrame to an ORC file\nflightsWithDelays.write.format(\"orc\").mode(SaveMode.Overwrite).save(\"flightsWithDelays.orc\")","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635134_-2132380808","id":"20160410-003138_985965720","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21095"},{"title":"What is an ORC file format?","text":"%md\n\nORC (Optimized Row Columnar) is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. More information [here](https://orc.apache.org/).","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

ORC (Optimized Row Columnar) is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. More information here.
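To see the columnar layout pay off, read back just one column from the file saved above; the ORC reader only has to read and decode that column's data. A minimal sketch:

%spark2

// Column pruning: only the UniqueCarrier data needs to be read from disk
spark.read.orc("flightsWithDelays.orc")
  .select("UniqueCarrier")
  .show(5)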

\n
"}]},"apps":[],"jobName":"paragraph_1497380635135_-2132765557","id":"20161017-103614_1279292421","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21096"},{"title":"Load back from an ORC file","text":"%spark2\n\n// Load results back from ORC file\nval test = spark.read.format(\"orc\").load(\"flightsWithDelays.orc\")\n\n// Assert both DataFrames of the same size.\n// Note that if assertion succeeds no warning messages will be printed\nassert (test.count == flightsWithDelays.count, println(\"Assertion Fail: Files are of different sizes.\"))\n\ntest.show(10)","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"editorMode":"ace/mode/scala","colWidth":12,"editorHide":false,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635135_-2132765557","id":"20160410-003138_1142035788","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21097"},{"text":"%md\n\nWe can also create permanent tables, instead of temporary views, using `saveAsTable`. The resulting table will still exist even after your Spark program has restarted.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

We can also create permanent tables, instead of temporary views, using saveAsTable. The resulting table will still exist even after your Spark program has restarted.
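Once saved (the saveAsTable call follows in the next paragraph), a permanent table can be read straight back into a DataFrame by name; a sketch:

%spark2

// spark.table resolves permanent tables (and temporary views) by name
val persisted = spark.table("flightswithdelaystbl")
persisted.printSchema()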

\n
"}]},"apps":[],"jobName":"paragraph_1497380635135_-2132765557","id":"20161017-212315_1033823107","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21098"},{"title":"Save DataFrame as Permanent Table","text":"%spark2\n\nflightsWithDelays.write.format(\"orc\").mode(SaveMode.Overwrite).saveAsTable(\"flightswithdelaystbl\")","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":[],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635135_-2132765557","id":"20161017-212148_1432557096","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21099"},{"title":"Show Tables/Views","text":"%spark2.sql\n\nSHOW TABLES\n\n-- Note that flightsWithDelaysTbl is a permanent table instead of a temporary view!","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"colWidth":12,"editorMode":"ace/mode/sql","title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"tableName","index":0,"aggr":"sum"}],"values":[{"name":"isTemporary","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"tableName","index":0,"aggr":"sum"},"yAxis":{"name":"isTemporary","index":1,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635136_1852478874","id":"20161017-212228_2044087527","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21100"},{"title":"Querying a Permanent Table","text":"%spark2.sql\n\nSELECT COUNT(1) AS Total from flightswithdelaystbl -- As you can see, there's no difference in querying a temporary view vs a permanent table","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"sql"},"colWidth":12,"editorMode":"ace/mode/sql","title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"Total","index":0,"aggr":"sum"}],"values":[],"groups":[],"scatter":{"xAxis":{"name":"Total","index":0,"aggr":"sum"}}}}],"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635136_1852478874","id":"20161017-212847_790820933","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21101"},{"title":"Final Words","text":"%md\n\nThis should get you started working in Scala with DataFrame, Dataset and SQL Spark APIs that are part of the Spark SQL Module. You should now have the basic tools and code samples to start working on your own data sets: from brining in/downloading datasets, to moving them from local storage to HDFS, to transforming datasets into Spark DataFrames/Datasets/temporary views, querying the data, performing basic calcuations, visualizing, and finally persisiting your results. That's a great start!","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

This should get you started working in Scala with the DataFrame, Dataset and SQL Spark APIs that are part of the Spark SQL Module. You should now have the basic tools and code samples to start working on your own data sets: from bringing in/downloading datasets, to moving them from local storage to HDFS, to transforming datasets into Spark DataFrames/Datasets/temporary views, querying the data, performing basic calculations, visualizing, and finally persisting your results. That’s a great start!

\n
"}]},"apps":[],"jobName":"paragraph_1497380635136_1852478874","id":"20161017-214817_1787337666","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21102"},{"title":"Additional Resources","text":"%md\n\nWe hope you've enjoyed this introductory lab. Below are additional resources that you should find useful:\n\n1. [Hortonworks Apache Spark Tutorials](http://hortonworks.com/tutorials/#tuts-developers) are your natural next step where you can explore Spark in more depth.\n2. [Hortonworks Community Connection (HCC)](https://community.hortonworks.com/spaces/85/data-science.html?type=question) is a great resource for questions and answers on Spark, Data Analytics/Science, and many more Big Data topics.\n3. [Hortonworks Apache Spark Docs](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_developing-spark-apps.html) - official Spark documentation.\n4. [Hortonworks Apache Zeppelin Docs](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_zeppelin-component-guide/content/ch_using_zeppelin.html) - official Zeppelin documentation.","dateUpdated":"2017-06-13T19:03:55+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","colWidth":10,"editorHide":true,"title":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

We hope you’ve enjoyed this introductory lab. Below are additional resources that you should find useful:

\n
  1. Hortonworks Apache Spark Tutorials are your natural next step where you can explore Spark in more depth.
  2. Hortonworks Community Connection (HCC) is a great resource for questions and answers on Spark, Data Analytics/Science, and many more Big Data topics.
  3. Hortonworks Apache Spark Docs - official Spark documentation.
  4. Hortonworks Apache Zeppelin Docs - official Zeppelin documentation.
\n
"}]},"apps":[],"jobName":"paragraph_1497380635137_1852094126","id":"20160410-003138_2048237853","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21103"},{"text":"%angular\n
\n
\n\n \"HCC\"\n\n
","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{},"editorMode":"ace/mode/scala","colWidth":2,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"ANGULAR","data":"
\n
\n\n \"HCC\"\n\n
"}]},"apps":[],"jobName":"paragraph_1497380635137_1852094126","id":"20160410-003138_1663715025","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21104"},{"text":"","dateUpdated":"2017-06-13T19:03:55+0000","config":{"editorSetting":{"editOnDblClick":false,"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","results":{},"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1497380635137_1852094126","id":"20161018-143604_1206436852","dateCreated":"2017-06-13T19:03:55+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:21105"}],"name":"Labs / Spark 2.x / Data Worker / Scala / 101 - Intro to SparkSQL","id":"2CJW53M52","angularObjects":{"2CMWM7C19:shared_process":[],"2CKY5D33N:shared_process":[],"2CKMJA5KB:shared_process":[],"2CMWK94UA:shared_process":[],"2C4U48MY3_spark2:shared_process":[],"2CKYPQRG1:shared_process":[],"2CHKNVRZG:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}