{"paragraphs":[{"text":"%md\n#Analyzing network intrusion dataset with Python and Spark\n###By [Saptak Sen](http://saptak.in)\n---\n\n###Introduction\n\nIn this tutorial we are going to analyze a network intrusion events dataset with **Python** in a Zeppelin Notebook. The Zeppelin Notebook provides an easy and interactive surface for data scientists to explore data through data processing engines like **Hive** and **Spark**. Hive and Spark in turn benefit from robust managability, resource allocation policies, security and scalability running on a YARN managed infrasturcture.\n\n```\n ┏━━━━━━━━━━━━━━━━━━━━━┓\n ┃ ┃\n ┃ Executer ┃\n ┌─────────────────────────────────────────────────────▶┃ Cache, Task ┃\n │ ┃ ┃\n │ ┃ ┃\n │ ┗━━━━━━━━━━━━━━━━━━━━━┛\n │\n │\n │\n┏━━━━━━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━┓\n┃ ┃ ┃ ┃ ┃ ┃\n┃ Driver ┃ ┃ Cluster Manager ┃ ┃ Executer ┃\n┃ Spark Context ┃─────────▶┃ (YARN) ┃─────────▶┃ Cache, Task ┃\n┃ ┃ ┃ ┃ ┃ ┃\n┃ ┃ ┃ ┃ ┃ ┃\n┗━━━━━━━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━━━━━━━━┛\n │\n │\n │\n │ ┏━━━━━━━━━━━━━━━━━━━━━┓\n │ ┃ ┃\n │ ┃ Executer ┃\n └─────────────────────────────────────────────────────▶┃ Cache, Task ┃\n ┃ ┃\n ┃ ┃\n ┗━━━━━━━━━━━━━━━━━━━━━┛\n```\n\n###Resilient Distributed Dataset\nThe key concept in Spark are RDDs or Resilient Distributed Datasets.RDD is a fault-tolerant collection of elements that can be operated on in parallel.\n \n\n###Spark Application\nA typical Spark application has the following four phases:\n\n```\n┏━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ┃\n┃ Instantiate ┃\n┃ Input RDDs ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Transform RDDs ┃\n┃ ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Persist ┃\n┃ Intermediate RDDs ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Action on RDDs ┃\n┃ ┃\n┃ ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━┛\n```\n\n###Commenting with Markdown\n\nThis block of comments and 
instructions is formatted as [Markdown](https://help.github.com/articles/github-flavored-markdown/). Every block of code or comments in Zeppelin is referred to as a paragraph. \n\nA block is formatted as Markdown when we start it with `%md`. Any block is executed and rendered when we click the play button on the top-left-hand corner of a block.\n\n###Downloading the Data with Shell commands\n\nIn the next block of code we are going to download the data using shell commands. Shell commands in a Zeppelin notebook are invoked when we prepend the block with a line containing `%sh`.","dateUpdated":"2015-10-25T04:33:45+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445313251381_496191774","id":"20151020-035411_911943644","result":{"code":"SUCCESS","type":"HTML","msg":"
In this tutorial we are going to analyze a network intrusion events dataset with Python in a Zeppelin Notebook. The Zeppelin Notebook provides an easy and interactive interface for data scientists to explore data through data processing engines like Hive and Spark. Hive and Spark in turn benefit from robust manageability, resource allocation policies, security and scalability by running on a YARN-managed infrastructure.
\n ┏━━━━━━━━━━━━━━━━━━━━━┓\n ┃ ┃\n ┃ Executor ┃\n ┌─────────────────────────────────────────────────────▶┃ Cache, Task ┃\n │ ┃ ┃\n │ ┃ ┃\n │ ┗━━━━━━━━━━━━━━━━━━━━━┛\n │\n │\n │\n┏━━━━━━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━┓\n┃ ┃ ┃ ┃ ┃ ┃\n┃ Driver ┃ ┃ Cluster Manager ┃ ┃ Executor ┃\n┃ Spark Context ┃─────────▶┃ (YARN) ┃─────────▶┃ Cache, Task ┃\n┃ ┃ ┃ ┃ ┃ ┃\n┃ ┃ ┃ ┃ ┃ ┃\n┗━━━━━━━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━━━━━━━━┛\n │\n │\n │\n │ ┏━━━━━━━━━━━━━━━━━━━━━┓\n │ ┃ ┃\n │ ┃ Executor ┃\n └─────────────────────────────────────────────────────▶┃ Cache, Task ┃\n ┃ ┃\n ┃ ┃\n ┗━━━━━━━━━━━━━━━━━━━━━┛\n
\nThe key concept in Spark is the RDD, or Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
\nA typical Spark application has the following four phases:
\n┏━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ┃\n┃ Instantiate ┃\n┃ Input RDDs ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Transform RDDs ┃\n┃ ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Persist ┃\n┃ Intermediate RDDs ┃\n┃ ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n │\n │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃ ┃\n┃ Action on RDDs ┃\n┃ ┃\n┃ ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━┛\n
\nThis block of comments and instructions is formatted as Markdown. Every block of code or comments in Zeppelin is referred to as a paragraph.
\nA block formatted as markdown can be specified when we start the block with %md
. Any block is executed and rendered when we click the play button on the top-left-hand corner of a block.
In the next block of code we are going to download the data using shell commands. The shell command in a Zeppelin notebook can be invoked when we prepend the block of shell commands with a line containing the characters %sh
.
In the next code block we use function calls in Scala to fetch environment information. To specify a code block as Scala, it is enough to start the code block with %spark
Whether you are programming in Scala or Python, the important thing to notice is the sc
object, better known as SparkContext.\n
SparkContext is created by your driver program, in this case Zeppelin and PySpark. We will use the SparkContext to further instantiate RDDs in the next section.
Below we create an RDD by calling the textFile function on the SparkContext and passing the HDFS path to the raw dataset. You can also create RDDs from:
\nThe code block below starts with %pyspark
which indicates that we are going to use the Python programming language to interact with Spark. We will continue to use PySpark for the rest of the tutorial.
When we execute the next code block to create the RDD, we will notice that it completes almost instantly. The reason is that it does not actually touch the data yet. You can continue to apply various transformation operations on this RDD and it will still not touch the data; it only constructs a DAG, or Directed Acyclic Graph, of the operations.
\n"},"dateCreated":"2015-10-25T02:35:20+0000","dateStarted":"2015-10-25T04:34:17+0000","dateFinished":"2015-10-25T04:34:17+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6446"},{"text":"%pyspark\n\ninput_file = \"hdfs:///tmp/kddcup.data_10_percent.gz\"\n\nraw_rdd = sc.textFile(input_file)","dateUpdated":"2015-10-21T03:12:50+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445314258380_-1158392129","id":"20151020-041058_1023360958","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T04:10:58+0000","dateStarted":"2015-10-21T03:12:50+0000","dateFinished":"2015-10-21T03:12:51+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6447"},{"text":"%md\n###RDD Actions\n\nIn the next code block we will count the number of lines in the file by calling the count function on the RDD. Count function on a RDD is a Action operation, which means Spark and YARN will be forced to allocate resource and execute the DAG it has been creating to calculate the result. We will notice that the next code block takes a little longer to run than the previous one for that reason. ","dateUpdated":"2015-10-25T04:34:49+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445744281710_1925900435","id":"20151025-033801_974200319","result":{"code":"SUCCESS","type":"HTML","msg":"In the next code block we will count the number of lines in the file by calling the count function on the RDD. 
The count function on an RDD is an action, which means Spark and YARN will be forced to allocate resources and execute the DAG that has been built so far to calculate the result. We will notice that the next code block takes a little longer to run than the previous one for that reason.
\n"},"dateCreated":"2015-10-25T03:38:01+0000","dateStarted":"2015-10-25T04:34:47+0000","dateFinished":"2015-10-25T04:34:47+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6448"},{"text":"%pyspark\n\nprint raw_rdd.count()","dateUpdated":"2015-10-20T03:26:46+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445314597162_-710268574","id":"20151020-041637_1396716094","result":{"code":"SUCCESS","type":"TEXT","msg":"494021\n"},"dateCreated":"2015-10-20T04:16:37+0000","dateStarted":"2015-10-20T03:26:46+0000","dateFinished":"2015-10-20T03:26:49+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6449"},{"text":"%md\n###Inspect what the data looks like","dateUpdated":"2015-10-25T04:35:01+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445746674496_622572543","id":"20151025-041754_1010831945","result":{"code":"SUCCESS","type":"HTML","msg":"Filtering out the normal and attack events
\n"},"dateCreated":"2015-10-25T04:44:51+0000","dateStarted":"2015-10-25T04:45:20+0000","dateFinished":"2015-10-25T04:45:20+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6473"},{"text":"%pyspark\n\nnormal_csv_data = csv_rdd.filter(lambda x: x[41]==\"normal.\")\nattack_csv_data = csv_rdd.filter(lambda x: x[41]!=\"normal.\")","dateUpdated":"2015-10-20T04:52:45+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359598168_1932664998","id":"20151020-164638_2050317014","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T04:46:38+0000","dateStarted":"2015-10-20T04:52:45+0000","dateFinished":"2015-10-20T04:52:45+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6474"},{"text":"%md\n###Reduce","dateUpdated":"2015-10-25T04:46:18+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748353904_949197524","id":"20151025-044553_1060339044","result":{"code":"SUCCESS","type":"HTML","msg":"