{"paragraphs":[{"text":"%md\n\n## Exploring Spark SQL Module\n#### with an Airline Dataset\n\n**Level**: Beginner\n**Language**: Scala\n**Requirements**: \n- [HDP 2.6](http://hortonworks.com/products/sandbox/) (or later) or [HDCloud](https://hortonworks.github.io/hdp-aws/)\n- Spark 2.x\n\n**Author**: Robert Hryniewicz\n**Follow** [@RobH8z](https://twitter.com/RobertH8z)","dateUpdated":"2017-06-13T19:04:13+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","colWidth":12,"editorHide":true,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
Level: Beginner
Language: Scala
Requirements:
- HDP 2.6 (or later) or HDCloud
- Spark 2.x
Author: Robert Hryniewicz
Follow @RobH8z
In this lab you will explore an airline dataset with Spark SQL, using the DataFrames API in Part 1 and the SQL API in Part 2. The same dataset is explored further in other demo notebooks.
\nA Dataset is a distributed collection of data. The Dataset API combines the benefits of strong typing and the ability to use powerful lambda functions with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.
\nA DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)
\nThroughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames. [source]
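The relationship between the two types can be sketched in a few lines of Scala. This is a minimal illustration, not code from the lab: the `Leg` case class and its values are hypothetical, and the local master setting is an assumption for running it outside Zeppelin (inside a `%spark2` paragraph, `getOrCreate()` should simply return the existing session).

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Leg(year: Int, month: Int, distance: Int)

// Assumption: a local session is acceptable when no session exists yet.
val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

// A strongly typed Dataset built from JVM objects.
val legs: Dataset[Leg] = Seq(Leg(2008, 1, 810), Leg(2008, 2, 2475)).toDS()

// Functional transformation with a lambda; field access is checked at compile time.
val longLegs: Dataset[Leg] = legs.filter(leg => leg.distance > 1000)

// The same data as a DataFrame, i.e. a Dataset of Rows with named columns.
val df: DataFrame = legs.toDF()
val rows: Dataset[Row] = df // compiles because DataFrame is an alias of Dataset[Row]

// Column-based filter; the column name is only resolved at runtime.
val longRows: DataFrame = df.filter($"distance" > 1000)

df.printSchema()
```

The typed filter fails to compile if the field name is misspelled, while the column-based filter only fails when the query is analyzed, which is the practical difference between the two APIs.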
\nThroughout this lab we will use basic Scala syntax. If you would like to learn more about Scala, here’s an excellent introductory Tutorial.
\nTo run a paragraph in a Zeppelin notebook, either click the play button (blue triangle) on the right-hand side or press Shift + Enter.
In the following paragraphs we will execute Spark code, run shell commands to download and move files, run SQL queries, and so on. Each paragraph starts with % followed by an interpreter name, e.g. %spark2 for a Spark 2.x interpreter. The interpreter name determines what is executed: code, Markdown, HTML, etc. This lets you perform data ingestion, munging, wrangling, visualization, analysis, processing, and more, all in one place!
Throughout this notebook we will use the following interpreters:
\n- %spark2: Spark interpreter to run Spark code written in Scala
- %spark2.sql: Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
- %sh: Shell interpreter to run shell commands
- %angular: Angular interpreter to run Angular and HTML code
- %md: Markdown for displaying formatted text, links, and images

To learn more about Zeppelin interpreters check out this link.

| # | Name | Description |
|---|---|---|
| 1 | Year | 1987-2008 |
| 2 | Month | 1-12 |
| 3 | DayofMonth | 1-31 |
| 4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |
| 5 | DepTime | actual departure time (local, hhmm) |
| 6 | CRSDepTime | scheduled departure time (local, hhmm) |
| 7 | ArrTime | actual arrival time (local, hhmm) |
| 8 | CRSArrTime | scheduled arrival time (local, hhmm) |
| 9 | UniqueCarrier | unique carrier code |
| 10 | FlightNum | flight number |
| 11 | TailNum | plane tail number |
| 12 | ActualElapsedTime | in minutes |
| 13 | CRSElapsedTime | in minutes |
| 14 | AirTime | in minutes |
| 15 | ArrDelay | arrival delay, in minutes |
| 16 | DepDelay | departure delay, in minutes |
| 17 | Origin | origin IATA airport code |
| 18 | Dest | destination IATA airport code |
| 19 | Distance | in miles |
| 20 | TaxiIn | taxi in time, in minutes |
| 21 | TaxiOut | taxi out time, in minutes |
| 22 | Cancelled | was the flight cancelled? |
| 23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
| 24 | Diverted | 1 = yes, 0 = no |
| 25 | CarrierDelay | in minutes |
| 26 | WeatherDelay | in minutes |
| 27 | NASDelay | in minutes |
| 28 | SecurityDelay | in minutes |
| 29 | LateAircraftDelay | in minutes |
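
The time columns (DepTime, CRSDepTime, ArrTime, CRSArrTime) are encoded as hhmm, so subtracting them directly does not give a duration in minutes. Below is a minimal sketch of one way to convert such a value, assuming it has already been parsed to an integer; the helper name `hhmmToMinutes` and the sample values are hypothetical and not part of the lab.

```scala
// Convert an hhmm-encoded local time (e.g. 1430 for 14:30) to minutes after midnight.
def hhmmToMinutes(hhmm: Int): Int = {
  // Reject values whose hour or minute part is out of range.
  require(hhmm >= 0 && hhmm / 100 <= 23 && hhmm % 100 <= 59)
  (hhmm / 100) * 60 + hhmm % 100
}

// Hypothetical sample values: actual departure 14:30, scheduled departure 14:15.
val depMinutes = hhmmToMinutes(1430)
val crsDepMinutes = hhmmToMinutes(1415)

// Difference in minutes; this simple subtraction ignores flights that cross midnight.
val gap = depMinutes - crsDepMinutes
```

The dataset already provides DepDelay and ArrDelay in minutes, so a helper like this is mainly useful for derived features such as the hour of departure.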