{"paragraphs":[{"text":"%md\n#Analyzing network intrusion dataset with Python and Spark\n###By [Saptak Sen](http://saptak.in)\n---\n\n###Introduction\n\nIn this tutorial we are going to analyze a network intrusion events dataset with **Python** in a Zeppelin Notebook. The Zeppelin Notebook provides an easy and interactive surface for data scientists to explore data through data processing engines like **Hive** and **Spark**. Hive and Spark in turn benefit from robust manageability, resource allocation policies, security, and scalability when running on a YARN-managed infrastructure.\n\n```\n                                                                  ┏━━━━━━━━━━━━━━━━━━━━━┓\n                                                                  ┃                     ┃\n                                                                  ┃      Executor       ┃\n           ┌─────────────────────────────────────────────────────▶┃     Cache, Task     ┃\n           │                                                      ┃                     ┃\n           │                                                      ┃                     ┃\n           │                                                      ┗━━━━━━━━━━━━━━━━━━━━━┛\n           │\n           │\n           │\n┏━━━━━━━━━━━━━━━━━━━━━┓          ┏━━━━━━━━━━━━━━━━━━━━━┓          ┏━━━━━━━━━━━━━━━━━━━━━┓\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┃       Driver        ┃          ┃   Cluster Manager   ┃          ┃      Executor       ┃\n┃    Spark Context    ┃─────────▶┃       (YARN)        ┃─────────▶┃     Cache, Task     ┃\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┗━━━━━━━━━━━━━━━━━━━━━┛          ┗━━━━━━━━━━━━━━━━━━━━━┛          ┗━━━━━━━━━━━━━━━━━━━━━┛\n           │\n           │\n           │\n           │                                                      ┏━━━━━━━━━━━━━━━━━━━━━┓\n           │                                                      ┃                     ┃\n           │                                                      ┃      Executor       ┃\n           └─────────────────────────────────────────────────────▶┃     Cache, Task     ┃\n                                                                  ┃                     ┃\n                                                                  ┃                     ┃\n                                                                  ┗━━━━━━━━━━━━━━━━━━━━━┛\n```\n\n###Resilient Distributed Dataset\nThe key concept in Spark is the RDD, or Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.\n \n\n###Spark Application\nA typical Spark application has the following four phases:\n\n```\n┏━━━━━━━━━━━━━━━━━━━━━━━┓\n┃                       ┃\n┃      Instantiate      ┃\n┃      Input RDDs       ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃    Transform RDDs     ┃\n┃                       ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃        Persist        ┃\n┃   Intermediate RDDs   ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃    Action on RDDs     ┃\n┃                       ┃\n┃                       ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━┛\n```\n\n###Commenting with Markdown\n\nThis block of comments and 
instructions is formatted as [Markdown](https://help.github.com/articles/github-flavored-markdown/). In Zeppelin, each block of code or comments is referred to as a paragraph. \n\nA block is formatted as Markdown when it starts with `%md`. A block is executed and rendered when we click the play button in its top-left-hand corner.\n\n###Downloading the Data with Shell commands\n\nIn the next block of code we are going to download the data using shell commands. Shell commands can be run in a Zeppelin notebook by starting the block with a line containing the characters `%sh`.","dateUpdated":"2015-10-25T04:33:45+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445313251381_496191774","id":"20151020-035411_911943644","result":{"code":"SUCCESS","type":"HTML","msg":"

Analyzing network intrusion dataset with Python and Spark

\n

By Saptak Sen

\n
\n

Introduction

\n

In this tutorial we are going to analyze a network intrusion events dataset with Python in a Zeppelin Notebook. The Zeppelin Notebook provides an easy and interactive surface for data scientists to explore data through data processing engines like Hive and Spark. Hive and Spark in turn benefit from robust manageability, resource allocation policies, security, and scalability when running on a YARN-managed infrastructure.

\n
                                                                  ┏━━━━━━━━━━━━━━━━━━━━━┓\n                                                                  ┃                     ┃\n                                                                  ┃      Executor       ┃\n           ┌─────────────────────────────────────────────────────▶┃     Cache, Task     ┃\n           │                                                      ┃                     ┃\n           │                                                      ┃                     ┃\n           │                                                      ┗━━━━━━━━━━━━━━━━━━━━━┛\n           │\n           │\n           │\n┏━━━━━━━━━━━━━━━━━━━━━┓          ┏━━━━━━━━━━━━━━━━━━━━━┓          ┏━━━━━━━━━━━━━━━━━━━━━┓\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┃       Driver        ┃          ┃   Cluster Manager   ┃          ┃      Executor       ┃\n┃    Spark Context    ┃─────────▶┃       (YARN)        ┃─────────▶┃     Cache, Task     ┃\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┃                     ┃          ┃                     ┃          ┃                     ┃\n┗━━━━━━━━━━━━━━━━━━━━━┛          ┗━━━━━━━━━━━━━━━━━━━━━┛          ┗━━━━━━━━━━━━━━━━━━━━━┛\n           │\n           │\n           │\n           │                                                      ┏━━━━━━━━━━━━━━━━━━━━━┓\n           │                                                      ┃                     ┃\n           │                                                      ┃      Executor       ┃\n           └─────────────────────────────────────────────────────▶┃     Cache, Task     ┃\n                                                                  ┃                     ┃\n                                                                  ┃                     ┃\n                                                                  ┗━━━━━━━━━━━━━━━━━━━━━┛\n
\n

Resilient Distributed Dataset

\n

The key concept in Spark is the RDD, or Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

\n

Spark Application

\n

A typical Spark application has the following four phases:

\n
┏━━━━━━━━━━━━━━━━━━━━━━━┓\n┃                       ┃\n┃      Instantiate      ┃\n┃      Input RDDs       ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃    Transform RDDs     ┃\n┃                       ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃        Persist        ┃\n┃   Intermediate RDDs   ┃\n┃                       ┃\n┗━━━━━━━━━━━┳━━━━━━━━━━━┛\n            │\n            │\n┏━━━━━━━━━━━▼━━━━━━━━━━━┓\n┃                       ┃\n┃    Action on RDDs     ┃\n┃                       ┃\n┃                       ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━┛\n
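As a rough sketch, the four phases can be lined up with PySpark calls as follows. The helper name `run_four_phases` and its `path` argument are hypothetical and only serve as illustration; `textFile`, `map`, `cache`, and `count` are standard RDD operations, and `sc` is assumed to be the SparkContext that Zeppelin provides.

```python
def run_four_phases(sc, path):
    """Walk a text file through the four phases (illustrative helper)."""
    # 1. Instantiate an input RDD from a file.
    lines = sc.textFile(path)
    # 2. Transform it; nothing is computed at this point.
    fields = lines.map(lambda line: line.split(","))
    # 3. Persist the intermediate RDD so that later actions can reuse it.
    fields.cache()
    # 4. Run an action, which triggers the actual computation.
    return fields.count()
```

In a `%pyspark` paragraph the helper would be called with the notebook's own context, for example `run_four_phases(sc, "hdfs:///tmp/kddcup.data_10_percent.gz")`.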
\n

Commenting with Markdown

\n

This block of comments and instructions is formatted as Markdown. In Zeppelin, each block of code or comments is referred to as a paragraph.

\n

A block is formatted as Markdown when it starts with %md. A block is executed and rendered when we click the play button in its top-left-hand corner.

\n

Downloading the Data with Shell commands

\n

In the next block of code we are going to download the data using shell commands. Shell commands can be run in a Zeppelin notebook by starting the block with a line containing the characters %sh.

\n"},"dateCreated":"2015-10-20T03:54:11+0000","dateStarted":"2015-10-25T04:33:44+0000","dateFinished":"2015-10-25T04:33:44+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6442"},{"text":"%sh\n\n#Remove any existing copy of the dataset from HDFS\nhadoop fs -rm /tmp/kddcup.data_10_percent.gz\n\n#Download the data and place it in the /tmp folder of the Hortonworks Sandbox\nwget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz -O /tmp/kddcup.data_10_percent.gz\n\n#Copy the data to HDFS on Hortonworks Sandbox\nhadoop fs -put /tmp/kddcup.data_10_percent.gz /tmp\n\n#Verify the data has been copied to the /tmp folder on HDFS\nhadoop fs -ls -h /tmp/kddcup.data_10_percent.gz\n\n#Remove the dataset from the /tmp folder on the Sandbox local filesystem now that the data has been copied to HDFS\nrm /tmp/kddcup.data_10_percent.gz\n\n","dateUpdated":"2015-10-25T03:45:32+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorHide":false,"tableHide":false,"editorMode":"ace/mode/sh","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445313335253_1771957830","id":"20151020-035535_1468713571","result":{"code":"SUCCESS","type":"TEXT","msg":"15/10/25 00:17:07 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.\nMoved: 'hdfs://sandbox.hortonworks.com:8020/tmp/kddcup.data_10_percent.gz' to trash at: hdfs://sandbox.hortonworks.com:8020/user/zeppelin/.Trash/Current\n--2015-10-25 00:17:07-- http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz\nResolving kdd.ics.uci.edu... 128.195.1.95\nConnecting to kdd.ics.uci.edu|128.195.1.95|:80... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2144903 (2.0M) [application/x-gzip]\nSaving to: “/tmp/kddcup.data_10_percent.gz”\n\n 0K .......... .......... .......... .......... .......... 2% 397K 5s\n 50K .......... 
.......... .......... .......... .......... 4% 531K 4s\n 100K .......... .......... .......... .......... .......... 7% 687K 4s\n 150K .......... .......... .......... .......... .......... 9% 972K 3s\n 200K .......... .......... .......... .......... .......... 11% 1.02M 3s\n 250K .......... .......... .......... .......... .......... 14% 1.16M 3s\n 300K .......... .......... .......... .......... .......... 16% 1.21M 2s\n 350K .......... .......... .......... .......... .......... 19% 1.11M 2s\n 400K .......... .......... .......... .......... .......... 21% 1.52M 2s\n 450K .......... .......... .......... .......... .......... 23% 1.50M 2s\n 500K .......... .......... .......... .......... .......... 26% 1.38M 2s\n 550K .......... .......... .......... .......... .......... 28% 1.42M 2s\n 600K .......... .......... .......... .......... .......... 31% 2.55M 1s\n 650K .......... .......... .......... .......... .......... 33% 2.44M 1s\n 700K .......... .......... .......... .......... .......... 35% 1.83M 1s\n 750K .......... .......... .......... .......... .......... 38% 1.79M 1s\n 800K .......... .......... .......... .......... .......... 40% 2.82M 1s\n 850K .......... .......... .......... .......... .......... 42% 1.39M 1s\n 900K .......... .......... .......... .......... .......... 45% 3.81M 1s\n 950K .......... .......... .......... .......... .......... 47% 2.68M 1s\n 1000K .......... .......... .......... .......... .......... 50% 1.50M 1s\n 1050K .......... .......... .......... .......... .......... 52% 6.88M 1s\n 1100K .......... .......... .......... .......... .......... 54% 1.37M 1s\n 1150K .......... .......... .......... .......... .......... 57% 21.8M 1s\n 1200K .......... .......... .......... .......... .......... 59% 1.40M 1s\n 1250K .......... .......... .......... .......... .......... 62% 16.9M 1s\n 1300K .......... .......... .......... .......... .......... 64% 1.47M 1s\n 1350K .......... .......... .......... .......... .......... 
66% 4.43M 0s\n 1400K .......... .......... .......... .......... .......... 69% 10.0M 0s\n 1450K .......... .......... .......... .......... .......... 71% 1.36M 0s\n 1500K .......... .......... .......... .......... .......... 73% 5.31M 0s\n 1550K .......... .......... .......... .......... .......... 76% 2.42M 0s\n 1600K .......... .......... .......... .......... .......... 78% 2.69M 0s\n 1650K .......... .......... .......... .......... .......... 81% 11.1M 0s\n 1700K .......... .......... .......... .......... .......... 83% 2.37M 0s\n 1750K .......... .......... .......... .......... .......... 85% 1.87M 0s\n 1800K .......... .......... .......... .......... .......... 88% 2.80M 0s\n 1850K .......... .......... .......... .......... .......... 90% 1.56M 0s\n 1900K .......... .......... .......... .......... .......... 93% 9.33M 0s\n 1950K .......... .......... .......... .......... .......... 95% 3.72M 0s\n 2000K .......... .......... .......... .......... .......... 97% 1.76M 0s\n 2050K .......... .......... .......... .......... .... 100% 23.7M=1.2s\n\n2015-10-25 00:17:09 (1.67 MB/s) - “/tmp/kddcup.data_10_percent.gz” saved [2144903/2144903]\n\n-rw-r--r-- 1 zeppelin hdfs 2.0 M 2015-10-25 00:17 /tmp/kddcup.data_10_percent.gz\n"},"dateCreated":"2015-10-20T03:55:35+0000","dateStarted":"2015-10-25T12:17:03+0000","dateFinished":"2015-10-25T12:17:15+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6443"},{"text":"%md\n###Using Scala function calls to fetch environment variables\n\nIn the next code block we use Scala function calls to fetch environment information. To mark a code block as Scala, it is enough to start it with `%spark`. \n\nWhether you are programming in Scala or Python, the important thing to notice is the `sc` object, better known as the SparkContext.\nThe SparkContext is created by your driver program, in this case Zeppelin and PySpark. 
We will use the SparkContext to further instantiate RDDs in the next section.\n","dateUpdated":"2015-10-25T04:34:06+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445732721909_821451398","id":"20151025-002521_506859008","result":{"code":"SUCCESS","type":"HTML","msg":"

Using Scala function calls to fetch environment variables

\n

In the next code block we use Scala function calls to fetch environment information. To mark a code block as Scala, it is enough to start it with %spark.

\n

Whether you are programming in Scala or Python, the important thing to notice is the sc object, better known as the SparkContext.\n
The SparkContext is created by your driver program, in this case Zeppelin and PySpark. We will use the SparkContext to further instantiate RDDs in the next section.

\n"},"dateCreated":"2015-10-25T12:25:21+0000","dateStarted":"2015-10-25T04:34:05+0000","dateFinished":"2015-10-25T04:34:05+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6444"},{"text":"%spark\nsc.version\nsc.getConf.get(\"spark.home\")\nSystem.getenv().get(\"PYTHONPATH\")\nSystem.getenv().get(\"SPARK_HOME\")","dateUpdated":"2015-10-25T12:18:49+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445731751611_1947060880","id":"20151025-000911_681184475","result":{"code":"SUCCESS","type":"TEXT","msg":"res6: String = 1.3.1\nres7: String = /usr/hdp/current/spark-client/\nres8: String = /usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python\nres9: String = /usr/hdp/2.3.0.0-2557/spark\n"},"dateCreated":"2015-10-25T12:09:11+0000","dateStarted":"2015-10-25T12:18:49+0000","dateFinished":"2015-10-25T12:18:50+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6445"},{"text":"%md\n###Creating RDDs\n\nBelow we create an RDD by calling the textFile function on the SparkContext and passing the HDFS path to the raw dataset. You can also create RDDs from:\n\n * JDBC\n * Cassandra\n * HBase\n * Elasticsearch\n * JSON, CSV, sequence files, object files, ORC, Parquet, Avro\n\nThe code block below starts with `%pyspark`, which indicates that we are going to use the Python programming language to interact with Spark. We will continue to use PySpark for the rest of the tutorial.\n\nWhen we execute the next code block to create the RDD, we will notice that it runs almost instantly. The reason is that it does not actually touch the data yet. 
You can continue to apply transformation operations to this RDD and Spark will still not touch the data; it only constructs a DAG (Directed Acyclic Graph) of the operations. \n","dateUpdated":"2015-10-25T04:34:22+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445740520897_-1225878006","id":"20151025-023520_258833607","result":{"code":"SUCCESS","type":"HTML","msg":"

Creating RDDs

\n

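Below we create an RDD by calling the textFile function on the SparkContext and passing the HDFS path to the raw dataset. You can also create RDDs from:

* JDBC
* Cassandra
* HBase
* Elasticsearch
* JSON, CSV, sequence files, object files, ORC, Parquet, Avro

\n\n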

The code block below starts with %pyspark, which indicates that we are going to use the Python programming language to interact with Spark. We will continue to use PySpark for the rest of the tutorial.

\n

When we execute the next code block to create the RDD, we will notice that it runs almost instantly. The reason is that it does not actually touch the data yet. You can continue to apply transformation operations to this RDD and Spark will still not touch the data; it only constructs a DAG (Directed Acyclic Graph) of the operations.
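The deferred execution described above can be imitated in plain Python. The sketch below is a toy model, not Spark's implementation: the `LazyDataset` class and the `read_lines` function are hypothetical, the three records are made up, and the recorded chain is linear rather than a general DAG. It only shows that transformations add to a plan while an action runs it.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that yields the raw elements
        self._steps = tuple(steps)   # recorded chain of transformations

    def map(self, func):
        # Transformation: only extends the recorded chain.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: only extends the recorded chain.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def count(self):
        # Action: the source is read and the whole chain is applied now.
        total = 0
        for item in self._source():
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                total += 1
        return total


reads = []

def read_lines():
    reads.append("read")  # record that the data was actually touched
    return iter(["0,tcp,normal.", "0,icmp,smurf.", "0,tcp,normal."])

dataset = (LazyDataset(read_lines)
           .map(lambda line: line.split(","))
           .filter(lambda fields: fields[-1] == "normal."))
reads_before_action = len(reads)   # nothing has been read so far
normal_total = dataset.count()     # the action triggers the single read
```

Building `dataset` never calls `read_lines`; only `count()` does, which mirrors why creating and transforming an RDD returns immediately.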

\n"},"dateCreated":"2015-10-25T02:35:20+0000","dateStarted":"2015-10-25T04:34:17+0000","dateFinished":"2015-10-25T04:34:17+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6446"},{"text":"%pyspark\n\ninput_file = \"hdfs:///tmp/kddcup.data_10_percent.gz\"\n\nraw_rdd = sc.textFile(input_file)","dateUpdated":"2015-10-21T03:12:50+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445314258380_-1158392129","id":"20151020-041058_1023360958","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T04:10:58+0000","dateStarted":"2015-10-21T03:12:50+0000","dateFinished":"2015-10-21T03:12:51+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6447"},{"text":"%md\n###RDD Actions\n\nIn the next code block we count the number of lines in the file by calling the count function on the RDD. The count function is an action, which means Spark and YARN are forced to allocate resources and execute the DAG that has been built up in order to calculate the result. For that reason, the next code block takes a little longer to run than the previous one. ","dateUpdated":"2015-10-25T04:34:49+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445744281710_1925900435","id":"20151025-033801_974200319","result":{"code":"SUCCESS","type":"HTML","msg":"

RDD Actions

\n

In the next code block we count the number of lines in the file by calling the count function on the RDD. The count function is an action, which means Spark and YARN are forced to allocate resources and execute the DAG that has been built up in order to calculate the result. For that reason, the next code block takes a little longer to run than the previous one.

\n"},"dateCreated":"2015-10-25T03:38:01+0000","dateStarted":"2015-10-25T04:34:47+0000","dateFinished":"2015-10-25T04:34:47+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6448"},{"text":"%pyspark\n\nprint(raw_rdd.count())","dateUpdated":"2015-10-20T03:26:46+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445314597162_-710268574","id":"20151020-041637_1396716094","result":{"code":"SUCCESS","type":"TEXT","msg":"494021\n"},"dateCreated":"2015-10-20T04:16:37+0000","dateStarted":"2015-10-20T03:26:46+0000","dateFinished":"2015-10-20T03:26:49+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6449"},{"text":"%md\n###Inspect what the data looks like","dateUpdated":"2015-10-25T04:35:01+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445746674496_622572543","id":"20151025-041754_1010831945","result":{"code":"SUCCESS","type":"HTML","msg":"

Inspect what the data looks like

\n"},"dateCreated":"2015-10-25T04:17:54+0000","dateStarted":"2015-10-25T04:34:59+0000","dateFinished":"2015-10-25T04:34:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6450"},{"text":"%pyspark\nprint(raw_rdd.take(5))","dateUpdated":"2015-10-20T03:29:23+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445314653519_690510226","id":"20151020-041733_761077228","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']\n"},"dateCreated":"2015-10-20T04:17:33+0000","dateStarted":"2015-10-20T03:29:23+0000","dateFinished":"2015-10-20T03:29:24+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6451"},{"text":"%md\n###Filtering lines in the 
data","dateUpdated":"2015-10-25T04:35:18+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747262179_83921318","id":"20151025-042742_1412779040","result":{"code":"SUCCESS","type":"HTML","msg":"

Filtering lines in the data
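The filter in this section keeps every line that contains the substring 'normal.' anywhere. A stricter variant, sketched here as an assumption rather than as the notebook's method, compares only the last comma-separated field, which is where the label sits in the records shown earlier. The function name `is_normal` and the two shortened records are hypothetical.

```python
def is_normal(line):
    """Return True when the last comma-separated field is the 'normal.' label."""
    return line.rsplit(",", 1)[-1] == "normal."


sample_lines = [
    "0,tcp,http,SF,181,5450,normal.",   # shortened, hypothetical record
    "0,icmp,ecr_i,SF,1032,0,smurf.",    # shortened, hypothetical record
]

# Plain-Python equivalent of filtering an RDD with the same predicate.
normal_lines = [line for line in sample_lines if is_normal(line)]
```

On an RDD the same predicate could be passed directly, as in `raw_rdd.filter(is_normal)`.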

\n"},"dateCreated":"2015-10-25T04:27:42+0000","dateStarted":"2015-10-25T04:35:16+0000","dateFinished":"2015-10-25T04:35:16+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6452"},{"text":"%pyspark\nnormal_raw_rdd = raw_rdd.filter(lambda x: 'normal.' in x)","dateUpdated":"2015-10-20T03:34:09+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445355084334_-378282668","id":"20151020-153124_467610492","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T03:31:24+0000","dateStarted":"2015-10-20T03:34:09+0000","dateFinished":"2015-10-20T03:34:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6453"},{"text":"%md\n###Count the filtered RDD","dateUpdated":"2015-10-25T04:35:27+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747321066_-631818906","id":"20151025-042841_261479489","result":{"code":"SUCCESS","type":"HTML","msg":"

Count the filtered RDD

\n"},"dateCreated":"2015-10-25T04:28:41+0000","dateStarted":"2015-10-25T04:35:26+0000","dateFinished":"2015-10-25T04:35:26+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6454"},{"text":"%pyspark\n\nnormal_count = normal_raw_rdd.count()\n\nprint(normal_count)","dateUpdated":"2015-10-20T03:46:19+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445355249510_-1527665550","id":"20151020-153409_1001360617","result":{"code":"SUCCESS","type":"TEXT","msg":"97278\n"},"dateCreated":"2015-10-20T03:34:09+0000","dateStarted":"2015-10-20T03:42:30+0000","dateFinished":"2015-10-20T03:42:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6455"},{"text":"%md\n###Importing local libraries","dateUpdated":"2015-10-25T04:35:37+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747364593_277989650","id":"20151025-042924_1173734316","result":{"code":"SUCCESS","type":"HTML","msg":"

Importing local libraries

\n"},"dateCreated":"2015-10-25T04:29:24+0000","dateStarted":"2015-10-25T04:35:36+0000","dateFinished":"2015-10-25T04:35:36+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6456"},{"text":"%pyspark\n\nfrom pprint import pprint\n\ncsv_rdd = raw_rdd.map(lambda x: x.split(\",\"))\n\nhead_rows = csv_rdd.take(5)\n\npprint(head_rows[0])\n","dateUpdated":"2015-10-20T03:47:37+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445355570293_-2123365761","id":"20151020-153930_2114893315","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'0',\n u'tcp',\n u'http',\n u'SF',\n u'181',\n u'5450',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'1',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'8',\n u'8',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'1.00',\n u'0.00',\n u'0.00',\n u'9',\n u'9',\n u'1.00',\n u'0.00',\n u'0.11',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'normal.']\n"},"dateCreated":"2015-10-20T03:39:30+0000","dateStarted":"2015-10-20T03:47:37+0000","dateFinished":"2015-10-20T03:47:38+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6457"},{"text":"%md\n###Using map functions in transformation","dateUpdated":"2015-10-25T04:47:03+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747424512_361172033","id":"20151025-043024_234751730","result":{"code":"SUCCESS","type":"HTML","msg":"

Using map functions in transformation
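Keying each record by its label is what makes per-label aggregation possible. The plain-Python sketch below mirrors the idea on three shortened, hypothetical records and counts the labels with `collections.Counter`. It takes the label from the last field, which is index 41 in the full records shown in this notebook.

```python
from collections import Counter


def parse_interaction(line):
    """Split one CSV record and return a (label, fields) pair."""
    elems = line.split(",")
    return (elems[-1], elems)  # the label is the last field


records = [
    "0,tcp,http,SF,181,5450,normal.",   # shortened, hypothetical record
    "0,icmp,ecr_i,SF,1032,0,smurf.",    # shortened, hypothetical record
    "0,tcp,http,SF,239,486,normal.",    # shortened, hypothetical record
]

keyed = [parse_interaction(line) for line in records]
label_counts = Counter(tag for tag, _ in keyed)
```

On a pair RDD such as `key_csv_rdd`, the `countByKey()` action should give a comparable per-label tally without collecting the records themselves.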

\n"},"dateCreated":"2015-10-25T04:30:24+0000","dateStarted":"2015-10-25T04:47:01+0000","dateFinished":"2015-10-25T04:47:01+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6458"},{"text":"%pyspark\n\nfrom pprint import pprint\n\ndef parse_interaction(line):\n    # Split one CSV record and key it by its label, the 42nd field\n    elems = line.split(\",\")\n    tag = elems[41]\n    return (tag, elems)\n\n# Build an RDD of (label, fields) pairs\nkey_csv_rdd = raw_rdd.map(parse_interaction)\nhead_rows = key_csv_rdd.take(5)\npprint(head_rows[0])","dateUpdated":"2015-10-20T03:52:28+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445356057950_-1714872988","id":"20151020-154737_124701995","result":{"code":"SUCCESS","type":"TEXT","msg":"(u'normal.',\n [u'0',\n u'tcp',\n u'http',\n u'SF',\n u'181',\n u'5450',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'1',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'0',\n u'8',\n u'8',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'1.00',\n u'0.00',\n u'0.00',\n u'9',\n u'9',\n u'1.00',\n u'0.00',\n u'0.11',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'0.00',\n u'normal.'])\n"},"dateCreated":"2015-10-20T03:47:37+0000","dateStarted":"2015-10-20T03:52:28+0000","dateFinished":"2015-10-20T03:52:29+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6459"},{"text":"%md\n###Using collect() on the Spark driver","dateUpdated":"2015-10-25T04:36:46+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747452161_-1653511125","id":"20151025-043052_1223908016","result":{"code":"SUCCESS","type":"HTML","msg":"

Using collect() on the Spark driver

\n"},"dateCreated":"2015-10-25T04:30:52+0000","dateStarted":"2015-10-25T04:36:19+0000","dateFinished":"2015-10-25T04:36:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6460"},{"text":"%pyspark\n\n# collect() copies the whole RDD into the driver's memory, so it is left commented out; only run it on small RDDs\n#all_raw_rdd = raw_rdd.collect()\n\n","dateUpdated":"2015-10-20T03:56:21+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445356335049_1272528198","id":"20151020-155215_321328122","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T03:52:15+0000","dateStarted":"2015-10-20T03:56:21+0000","dateFinished":"2015-10-20T03:56:21+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6461"},{"text":"%md\n###Extracting a smaller sample of the RDD","dateUpdated":"2015-10-25T04:37:13+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747801288_-747955066","id":"20151025-043641_741109691","result":{"code":"SUCCESS","type":"HTML","msg":"

Extracting a smaller sample of the RDD
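The three arguments passed to `sample` in this section are `withReplacement`, `fraction`, and `seed`. A sample drawn with a fraction of 0.1 is close to, but not exactly, one tenth of the data. One way to picture why, assuming each element is kept independently with probability equal to the fraction, is the plain-Python sketch below. The function name `bernoulli_sample` and the population of 10000 integers are hypothetical, and this is not Spark's sampler.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [item for item in items if rng.random() < fraction]


population = range(10000)
sample = bernoulli_sample(population, 0.1, 1234)
sample_size = len(sample)  # close to 1000, but usually not exactly 1000
```

Re-running with the same seed returns the same sample, which is the reason for passing a seed at all.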

\n"},"dateCreated":"2015-10-25T04:36:41+0000","dateStarted":"2015-10-25T04:37:11+0000","dateFinished":"2015-10-25T04:37:11+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6462"},{"text":"%pyspark\n\nraw_rdd_sample = raw_rdd.sample(False, 0.1, 1234)\nprint(raw_rdd_sample.count())\nprint(raw_rdd.count())","dateUpdated":"2015-10-20T04:06:50+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445356581574_-2002902129","id":"20151020-155621_1541966917","result":{"code":"SUCCESS","type":"TEXT","msg":"49493\n494021\n"},"dateCreated":"2015-10-20T03:56:21+0000","dateStarted":"2015-10-20T04:06:50+0000","dateFinished":"2015-10-20T04:06:55+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6463"},{"text":"%md\n###Local sample for local processing","dateUpdated":"2015-10-25T04:39:09+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445747893654_-72550536","id":"20151025-043813_1909205882","result":{"code":"SUCCESS","type":"HTML","msg":"

Local sample for local processing

\n"},"dateCreated":"2015-10-25T04:38:13+0000","dateStarted":"2015-10-25T04:39:07+0000","dateFinished":"2015-10-25T04:39:07+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6464"},{"text":"%pyspark\n\nraw_data_local_sample = raw_rdd.takeSample(False, 1000, 1234)\n\nnormal_data_sample = [x.split(\",\") for x in raw_data_local_sample if \"normal.\" in x]\n\nnormal_data_sample_size = len(normal_data_sample)\n\nnormal_ratio = normal_data_sample_size/1000.0\n\nprint normal_ratio\n","dateUpdated":"2015-10-20T04:24:31+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445357210419_-350067719","id":"20151020-160650_1218451495","result":{"code":"SUCCESS","type":"TEXT","msg":"0.188\n"},"dateCreated":"2015-10-20T04:06:50+0000","dateStarted":"2015-10-20T04:24:31+0000","dateFinished":"2015-10-20T04:24:37+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6465"},{"text":"%md\n###Subtracting one RDD from another","dateUpdated":"2015-10-25T04:40:28+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748008000_1758454156","id":"20151025-044008_680146541","result":{"code":"SUCCESS","type":"HTML","msg":"

Subtracting one RDD from another

\n"},"dateCreated":"2015-10-25T04:40:08+0000","dateStarted":"2015-10-25T04:40:26+0000","dateFinished":"2015-10-25T04:40:26+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6466"},{"text":"%pyspark\n\nattack_raw_rdd = raw_rdd.subtract(normal_raw_rdd)\n\nprint attack_raw_rdd.count()","dateUpdated":"2015-10-20T04:35:31+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445358271919_1921988384","id":"20151020-162431_1736006903","result":{"code":"SUCCESS","type":"TEXT","msg":"396743\n"},"dateCreated":"2015-10-20T04:24:31+0000","dateStarted":"2015-10-20T04:35:31+0000","dateFinished":"2015-10-20T04:35:47+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6467"},{"text":"%md\n###Listing the distinct items","dateUpdated":"2015-10-25T04:42:23+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748123501_-1609132018","id":"20151025-044203_1596662643","result":{"code":"SUCCESS","type":"HTML","msg":"

Listing the distinct items

\n"},"dateCreated":"2015-10-25T04:42:03+0000","dateStarted":"2015-10-25T04:42:21+0000","dateFinished":"2015-10-25T04:42:21+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6468"},{"text":"%pyspark\n\nprotocols = csv_rdd.map(lambda x: x[1]).distinct()\n\nprint protocols.collect()\n","dateUpdated":"2015-10-20T04:40:28+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445358931528_-942099397","id":"20151020-163531_966055264","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'udp', u'icmp', u'tcp']\n"},"dateCreated":"2015-10-20T04:35:31+0000","dateStarted":"2015-10-20T04:40:28+0000","dateFinished":"2015-10-20T04:40:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6469"},{"text":"%pyspark\n\nservices = csv_rdd.map(lambda x: x[2]).distinct()\nprint services.collect()","dateUpdated":"2015-10-20T04:41:29+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359218712_1821565632","id":"20151020-164018_963420148","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'domain', u'http_443', u'Z39_50', u'smtp', u'urp_i', u'private', u'echo', u'shell', u'red_i', u'eco_i', u'sunrpc', u'ftp_data', u'urh_i', u'pm_dump', u'pop_3', u'pop_2', u'systat', u'ftp', u'uucp', u'whois', u'netbios_dgm', u'efs', u'remote_job', u'daytime', u'ntp_u', u'finger', u'ldap', u'netbios_ns', u'kshell', u'iso_tsap', u'ecr_i', u'nntp', u'printer', u'domain_u', u'uucp_path', u'courier', u'exec', u'time', u'netstat', u'telnet', u'gopher', u'rje', u'sql_net', u'link', u'auth', u'netbios_ssn', u'csnet_ns', u'X11', u'IRC', u'tftp_u', u'login', u'supdup', u'name', u'nnsp', u'mtp', 
u'http', u'bgp', u'ctf', u'hostnames', u'klogin', u'vmnet', u'tim_i', u'discard', u'imap4', u'other', u'ssh']\n"},"dateCreated":"2015-10-20T04:40:18+0000","dateStarted":"2015-10-20T04:41:29+0000","dateFinished":"2015-10-20T04:41:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6470"},{"text":"%md\n###Cartesian product","dateUpdated":"2015-10-25T04:43:55+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748207407_935535911","id":"20151025-044327_920393858","result":{"code":"SUCCESS","type":"HTML","msg":"

Cartesian product

\n"},"dateCreated":"2015-10-25T04:43:27+0000","dateStarted":"2015-10-25T04:43:53+0000","dateFinished":"2015-10-25T04:43:53+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6471"},{"text":"%pyspark\n\nproduct = protocols.cartesian(services).collect()\n\nprint len(product)","dateUpdated":"2015-10-20T04:42:53+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359280466_489611436","id":"20151020-164120_1905949484","result":{"code":"SUCCESS","type":"TEXT","msg":"198\n"},"dateCreated":"2015-10-20T04:41:20+0000","dateStarted":"2015-10-20T04:42:53+0000","dateFinished":"2015-10-20T04:42:54+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6472"},{"text":"%md\n###Filtering\nFiltering out the normal and attack events","dateUpdated":"2015-10-25T04:45:22+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748291734_-1031028444","id":"20151025-044451_1279874784","result":{"code":"SUCCESS","type":"HTML","msg":"

Filtering

\n

Filtering out the normal and attack events

\n"},"dateCreated":"2015-10-25T04:44:51+0000","dateStarted":"2015-10-25T04:45:20+0000","dateFinished":"2015-10-25T04:45:20+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6473"},{"text":"%pyspark\n\nnormal_csv_data = csv_rdd.filter(lambda x: x[41]==\"normal.\")\nattack_csv_data = csv_rdd.filter(lambda x: x[41]!=\"normal.\")","dateUpdated":"2015-10-20T04:52:45+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359598168_1932664998","id":"20151020-164638_2050317014","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T04:46:38+0000","dateStarted":"2015-10-20T04:52:45+0000","dateFinished":"2015-10-20T04:52:45+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6474"},{"text":"%md\n###Reduce","dateUpdated":"2015-10-25T04:46:18+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445748353904_949197524","id":"20151025-044553_1060339044","result":{"code":"SUCCESS","type":"HTML","msg":"

Reduce

\n"},"dateCreated":"2015-10-25T04:45:53+0000","dateStarted":"2015-10-25T04:46:16+0000","dateFinished":"2015-10-25T04:46:16+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6475"},{"text":"%pyspark\n\nnormal_duration_data = normal_csv_data.map(lambda x: int(x[0]))\ntotal_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)\nprint total_normal_duration\n\nattack_duration_data = attack_csv_data.map(lambda x: int(x[0]))\ntotal_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)\nprint total_attack_duration","dateUpdated":"2015-10-20T04:54:51+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359373966_-769348262","id":"20151020-164253_1147743871","result":{"code":"SUCCESS","type":"TEXT","msg":"21075991\n2626792\n"},"dateCreated":"2015-10-20T04:42:53+0000","dateStarted":"2015-10-20T04:54:51+0000","dateFinished":"2015-10-20T04:55:03+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6476"},{"text":"%md\n\n###Deriving Mean","dateUpdated":"2015-10-25T04:16:16+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789462921_1718547695","id":"20151025-161102_1843984349","result":{"code":"SUCCESS","type":"HTML","msg":"

Deriving Mean

\n"},"dateCreated":"2015-10-25T04:11:02+0000","dateStarted":"2015-10-25T04:16:15+0000","dateFinished":"2015-10-25T04:16:15+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6477"},{"text":"%pyspark\n\nnormal_count = normal_duration_data.count()\nattack_count = attack_duration_data.count()\n\nprint round(total_normal_duration/float(normal_count),3)\nprint round(total_attack_duration/float(attack_count),3)","dateUpdated":"2015-10-20T04:58:10+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445359968953_-435512099","id":"20151020-165248_1533675296","result":{"code":"SUCCESS","type":"TEXT","msg":"216.657\n6.621\n"},"dateCreated":"2015-10-20T04:52:48+0000","dateStarted":"2015-10-20T04:58:10+0000","dateFinished":"2015-10-20T04:58:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6478"},{"text":"%md\n\n###Using Aggregate functions to calculate mean in one pass","dateUpdated":"2015-10-25T04:16:12+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789563316_-2070136021","id":"20151025-161243_1508443786","result":{"code":"SUCCESS","type":"HTML","msg":"

Using Aggregate functions to calculate mean in one pass

\n"},"dateCreated":"2015-10-25T04:12:43+0000","dateStarted":"2015-10-25T04:16:08+0000","dateFinished":"2015-10-25T04:16:08+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6479"},{"text":"%pyspark\n\nnormal_sum_count = normal_duration_data.aggregate(\n (0,0), # the initial value\n (lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine val/acc\n (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))\n)\n\nprint round(normal_sum_count[0]/float(normal_sum_count[1]),3)\n\nattack_sum_count = attack_duration_data.aggregate(\n (0,0), # the initial value\n (lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine value with acc\n (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])) # combine accumulators\n)\n\nprint round(attack_sum_count[0]/float(attack_sum_count[1]),3)","dateUpdated":"2015-10-20T05:03:35+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445360227366_1966058242","id":"20151020-165707_2016593579","result":{"code":"SUCCESS","type":"TEXT","msg":"216.657\n6.621\n"},"dateCreated":"2015-10-20T04:57:07+0000","dateStarted":"2015-10-20T05:03:35+0000","dateFinished":"2015-10-20T05:03:45+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6480"},{"text":"%md\n\n###Constructing a Key-Value RDD","dateUpdated":"2015-10-25T04:16:01+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789629759_1224007441","id":"20151025-161349_962088361","result":{"code":"SUCCESS","type":"HTML","msg":"

Constructing a Key-Value RDD

\n"},"dateCreated":"2015-10-25T04:13:49+0000","dateStarted":"2015-10-25T04:15:59+0000","dateFinished":"2015-10-25T04:15:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6481"},{"text":"%pyspark\n\nkey_value_rdd = csv_rdd.map(lambda x: (x[41], x))\n\nprint key_value_rdd.take(1)","dateUpdated":"2015-10-20T05:06:39+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445360513956_1977242796","id":"20151020-170153_1603800777","result":{"code":"SUCCESS","type":"TEXT","msg":"[(u'normal.', [u'0', u'tcp', u'http', u'SF', u'181', u'5450', u'0', u'0', u'0', u'0', u'0', u'1', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'8', u'8', u'0.00', u'0.00', u'0.00', u'0.00', u'1.00', u'0.00', u'0.00', u'9', u'9', u'1.00', u'0.00', u'0.11', u'0.00', u'0.00', u'0.00', u'0.00', u'0.00', u'normal.'])]\n"},"dateCreated":"2015-10-20T05:01:53+0000","dateStarted":"2015-10-20T05:06:39+0000","dateFinished":"2015-10-20T05:06:39+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6482"},{"text":"%md\n###Using reduceByKey to group and aggregate","dateUpdated":"2015-10-25T04:16:05+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789675820_-2017624636","id":"20151025-161435_1491781947","result":{"code":"SUCCESS","type":"HTML","msg":"

Using reduceByKey to group and aggregate

\n"},"dateCreated":"2015-10-25T04:14:35+0000","dateStarted":"2015-10-25T04:16:03+0000","dateFinished":"2015-10-25T04:16:03+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6483"},{"text":"%pyspark\n\nkey_value_duration = csv_rdd.map(lambda x: (x[41], float(x[0]))) \ndurations_by_key = key_value_duration.reduceByKey(lambda x, y: x + y)\n\nprint durations_by_key.collect()","dateUpdated":"2015-10-20T05:08:17+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445360785953_1686355659","id":"20151020-170625_2051205977","result":{"code":"SUCCESS","type":"TEXT","msg":"[(u'guess_passwd.', 144.0), (u'nmap.', 0.0), (u'warezmaster.', 301.0), (u'rootkit.', 1008.0), (u'warezclient.', 627563.0), (u'smurf.', 0.0), (u'pod.', 0.0), (u'neptune.', 0.0), (u'normal.', 21075991.0), (u'spy.', 636.0), (u'ftp_write.', 259.0), (u'phf.', 18.0), (u'portsweep.', 1991911.0), (u'teardrop.', 0.0), (u'buffer_overflow.', 2751.0), (u'land.', 0.0), (u'imap.', 72.0), (u'loadmodule.', 326.0), (u'perl.', 124.0), (u'multihop.', 1288.0), (u'back.', 284.0), (u'ipsweep.', 43.0), (u'satan.', 64.0)]\n"},"dateCreated":"2015-10-20T05:06:25+0000","dateStarted":"2015-10-20T05:08:17+0000","dateFinished":"2015-10-20T05:08:23+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6484"},{"text":"%md\n###Using countByKey on a key-value RDD","dateUpdated":"2015-10-25T04:18:35+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789793036_-371147819","id":"20151025-161633_1074893320","result":{"code":"SUCCESS","type":"HTML","msg":"

Using countByKey on a key-value RDD

\n"},"dateCreated":"2015-10-25T04:16:33+0000","dateStarted":"2015-10-25T04:18:33+0000","dateFinished":"2015-10-25T04:18:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6485"},{"text":"%pyspark\n\ncounts_by_key = key_value_rdd.countByKey()\nprint counts_by_key","dateUpdated":"2015-10-20T05:09:56+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445360897800_867746273","id":"20151020-170817_439456736","result":{"code":"SUCCESS","type":"TEXT","msg":"defaultdict(<type 'int'>, {u'guess_passwd.': 53, u'nmap.': 231, u'warezmaster.': 20, u'rootkit.': 10, u'warezclient.': 1020, u'smurf.': 280790, u'pod.': 264, u'neptune.': 107201, u'normal.': 97278, u'spy.': 2, u'ftp_write.': 8, u'phf.': 4, u'portsweep.': 1040, u'teardrop.': 979, u'buffer_overflow.': 30, u'land.': 21, u'imap.': 12, u'loadmodule.': 9, u'perl.': 3, u'multihop.': 7, u'back.': 2203, u'ipsweep.': 1247, u'satan.': 1589})\n"},"dateCreated":"2015-10-20T05:08:17+0000","dateStarted":"2015-10-20T05:09:56+0000","dateFinished":"2015-10-20T05:10:04+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6486"},{"text":"%md\n\n###Using combineByKey to get duration and attempts by type in a single pass from a key-value RDD","dateUpdated":"2015-10-25T04:20:24+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445789964058_2095521707","id":"20151025-161924_2005284664","result":{"code":"SUCCESS","type":"HTML","msg":"

Using combineByKey to get duration and attempts by type in a single pass from a key-value RDD

\n"},"dateCreated":"2015-10-25T04:19:24+0000","dateStarted":"2015-10-25T04:20:22+0000","dateFinished":"2015-10-25T04:20:22+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6487"},{"text":"%pyspark\n\nsum_counts = key_value_duration.combineByKey(\n (lambda x: (x, 1)), # the initial value, with value x and count 1\n (lambda acc, value: (acc[0]+value, acc[1]+1)), # how to combine a pair value with the accumulator: sum value, and increment count\n (lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])) # combine accumulators\n)\n\nprint sum_counts.collectAsMap()","dateUpdated":"2015-10-20T05:11:27+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445360996725_1939610736","id":"20151020-170956_1204906051","result":{"code":"SUCCESS","type":"TEXT","msg":"{u'guess_passwd.': (144.0, 53), u'nmap.': (0.0, 231), u'loadmodule.': (326.0, 9), u'rootkit.': (1008.0, 10), u'warezclient.': (627563.0, 1020), u'smurf.': (0.0, 280790), u'pod.': (0.0, 264), u'neptune.': (0.0, 107201), u'normal.': (21075991.0, 97278), u'spy.': (636.0, 2), u'ftp_write.': (259.0, 8), u'phf.': (18.0, 4), u'portsweep.': (1991911.0, 1040), u'teardrop.': (0.0, 979), u'buffer_overflow.': (2751.0, 30), u'land.': (0.0, 21), u'imap.': (72.0, 12), u'warezmaster.': (301.0, 20), u'perl.': (124.0, 3), u'multihop.': (1288.0, 7), u'back.': (284.0, 2203), u'ipsweep.': (43.0, 1247), u'satan.': (64.0, 1589)}\n"},"dateCreated":"2015-10-20T05:09:56+0000","dateStarted":"2015-10-20T05:11:27+0000","dateFinished":"2015-10-20T05:11:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6488"},{"text":"%md\n###Sorting a key-value 
RDD","dateUpdated":"2015-10-25T04:22:26+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790043973_-1286497746","id":"20151025-162043_805316413","result":{"code":"SUCCESS","type":"HTML","msg":"

Sorting a key-value RDD

\n"},"dateCreated":"2015-10-25T04:20:43+0000","dateStarted":"2015-10-25T04:22:25+0000","dateFinished":"2015-10-25T04:22:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6489"},{"text":"%pyspark\n\nduration_means_by_type = sum_counts.map(lambda (key,value): (key, round(value[0]/value[1],3))).collectAsMap()\n\n# Print them sorted\nfor tag in sorted(duration_means_by_type, key=duration_means_by_type.get, reverse=True):\n print tag, duration_means_by_type[tag]","dateUpdated":"2015-10-20T05:12:50+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445361087520_46969599","id":"20151020-171127_166126716","result":{"code":"SUCCESS","type":"TEXT","msg":"portsweep. 1915.299\nwarezclient. 615.258\nspy. 318.0\nnormal. 216.657\nmultihop. 184.0\nrootkit. 100.8\nbuffer_overflow. 91.7\nperl. 41.333\nloadmodule. 36.222\nftp_write. 32.375\nwarezmaster. 15.05\nimap. 6.0\nphf. 4.5\nguess_passwd. 2.717\nback. 0.129\nsatan. 0.04\nipsweep. 0.034\nnmap. 0.0\nsmurf. 0.0\npod. 0.0\nneptune. 0.0\nteardrop. 0.0\nland. 0.0\n"},"dateCreated":"2015-10-20T05:11:27+0000","dateStarted":"2015-10-20T05:12:50+0000","dateFinished":"2015-10-20T05:12:51+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6490"},{"text":"%md\n\n###Creating a DataFrame","dateUpdated":"2015-10-25T04:22:22+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790122018_164329060","id":"20151025-162202_1511009897","result":{"code":"SUCCESS","type":"HTML","msg":"

Creating a DataFrame

\n"},"dateCreated":"2015-10-25T04:22:02+0000","dateStarted":"2015-10-25T04:22:21+0000","dateFinished":"2015-10-25T04:22:21+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6491"},{"text":"%pyspark\n\nrow_data = csv_rdd.map(lambda p: Row(\n duration=int(p[0]), \n protocol_type=p[1],\n service=p[2],\n flag=p[3],\n src_bytes=int(p[4]),\n dst_bytes=int(p[5])\n )\n)\n\ninteractions_df = sqlContext.createDataFrame(row_data)\ninteractions_df.registerTempTable(\"interactions\")","dateUpdated":"2015-10-20T07:43:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445361170900_2134707432","id":"20151020-171250_677927139","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T05:12:50+0000","dateStarted":"2015-10-20T07:43:40+0000","dateFinished":"2015-10-20T07:43:42+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6492"},{"text":"%md\n###Querying with SQL statements over a DataFrame","dateUpdated":"2015-10-25T04:23:25+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790160947_1701647747","id":"20151025-162240_154982058","result":{"code":"SUCCESS","type":"HTML","msg":"

Querying with SQL statements over a DataFrame

\n"},"dateCreated":"2015-10-25T04:22:40+0000","dateStarted":"2015-10-25T04:23:24+0000","dateFinished":"2015-10-25T04:23:24+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6493"},{"text":"%sql\n\nSELECT duration, dst_bytes FROM interactions WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0","dateUpdated":"2015-10-20T07:44:17+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/sql","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445370220648_-53808448","id":"20151020-194340_1845988549","result":{"code":"SUCCESS","type":"TABLE","msg":"duration\tdst_bytes\n5057\t0\n5059\t0\n5051\t0\n5056\t0\n5051\t0\n5039\t0\n5062\t0\n5041\t0\n5056\t0\n5064\t0\n5043\t0\n5061\t0\n5049\t0\n5061\t0\n5048\t0\n5047\t0\n5044\t0\n5063\t0\n5068\t0\n5062\t0\n5046\t0\n5052\t0\n5044\t0\n5054\t0\n5039\t0\n5058\t0\n5051\t0\n5032\t0\n5063\t0\n5040\t0\n5051\t0\n5066\t0\n5044\t0\n5051\t0\n5036\t0\n5055\t0\n2426\t0\n5047\t0\n5057\t0\n5037\t0\n5057\t0\n5062\t0\n5051\t0\n5051\t0\n5053\t0\n5064\t0\n5044\t0\n5051\t0\n5033\t0\n5066\t0\n5063\t0\n5056\t0\n5042\t0\n5063\t0\n5060\t0\n5056\t0\n5049\t0\n5043\t0\n5039\t0\n5041\t0\n42448\t0\n42088\t0\n41065\t0\n40929\t0\n40806\t0\n40682\t0\n40571\t0\n40448\t0\n40339\t0\n40232\t0\n40121\t0\n36783\t0\n36674\t0\n36570\t0\n36467\t0\n36323\t0\n36204\t0\n32038\t0\n31925\t0\n31809\t0\n31709\t0\n31601\t0\n31501\t0\n31401\t0\n31301\t0\n31194\t0\n31061\t0\n30935\t0\n30835\t0\n30735\t0\n30619\t0\n30518\t0\n30418\t0\n30317\t0\n30217\t0\n30077\t0\n25420\t0\n22921\t0\n22821\t0\n22721\t0\n22616\t0\n22516\t0\n22416\t0\n22316\t0\n22216\t0\n21987\t0\n21887\t0\n21767\t0\n21661\t0\n21561\t0\n21455\t0\n21334\t0\n21223\t0\n21123\t0\n20983\t0\n14682\t0\n14420\t0\n14319\t0\n14198\t0\n14098\t0\n13998\t0\n13898\t0\n13796\t0\n13678\t0\n13578\t0\n13448\t0\n13348\t0\n13241\t0\n13141\t0\n13033\t0\n12933\t0\n12833\t0\n12733
\t0\n12001\t0\n5678\t0\n5010\t0\n1298\t0\n1031\t0\n36438\t0\n","comment":"","msgTable":[[{"key":"dst_bytes","value":"5057"},{"key":"dst_bytes","value":"0"}],[{"value":"5059"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5056"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5039"},{"value":"0"}],[{"value":"5062"},{"value":"0"}],[{"value":"5041"},{"value":"0"}],[{"value":"5056"},{"value":"0"}],[{"value":"5064"},{"value":"0"}],[{"value":"5043"},{"value":"0"}],[{"value":"5061"},{"value":"0"}],[{"value":"5049"},{"value":"0"}],[{"value":"5061"},{"value":"0"}],[{"value":"5048"},{"value":"0"}],[{"value":"5047"},{"value":"0"}],[{"value":"5044"},{"value":"0"}],[{"value":"5063"},{"value":"0"}],[{"value":"5068"},{"value":"0"}],[{"value":"5062"},{"value":"0"}],[{"value":"5046"},{"value":"0"}],[{"value":"5052"},{"value":"0"}],[{"value":"5044"},{"value":"0"}],[{"value":"5054"},{"value":"0"}],[{"value":"5039"},{"value":"0"}],[{"value":"5058"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5032"},{"value":"0"}],[{"value":"5063"},{"value":"0"}],[{"value":"5040"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5066"},{"value":"0"}],[{"value":"5044"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5036"},{"value":"0"}],[{"value":"5055"},{"value":"0"}],[{"value":"2426"},{"value":"0"}],[{"value":"5047"},{"value":"0"}],[{"value":"5057"},{"value":"0"}],[{"value":"5037"},{"value":"0"}],[{"value":"5057"},{"value":"0"}],[{"value":"5062"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5053"},{"value":"0"}],[{"value":"5064"},{"value":"0"}],[{"value":"5044"},{"value":"0"}],[{"value":"5051"},{"value":"0"}],[{"value":"5033"},{"value":"0"}],[{"value":"5066"},{"value":"0"}],[{"value":"5063"},{"value":"0"}],[{"value":"5056"},{"value":"0"}],[{"value":"5042"},{"value":"0"}],[{"value":"5063"},{"value":"0"}],[{"value":"5060"},{"value":"0"}],[{"value":"5056"},{"value":"0"}],[{"value":"5049"},{"value":"
0"}],[{"value":"5043"},{"value":"0"}],[{"value":"5039"},{"value":"0"}],[{"value":"5041"},{"value":"0"}],[{"value":"42448"},{"value":"0"}],[{"value":"42088"},{"value":"0"}],[{"value":"41065"},{"value":"0"}],[{"value":"40929"},{"value":"0"}],[{"value":"40806"},{"value":"0"}],[{"value":"40682"},{"value":"0"}],[{"value":"40571"},{"value":"0"}],[{"value":"40448"},{"value":"0"}],[{"value":"40339"},{"value":"0"}],[{"value":"40232"},{"value":"0"}],[{"value":"40121"},{"value":"0"}],[{"value":"36783"},{"value":"0"}],[{"value":"36674"},{"value":"0"}],[{"value":"36570"},{"value":"0"}],[{"value":"36467"},{"value":"0"}],[{"value":"36323"},{"value":"0"}],[{"value":"36204"},{"value":"0"}],[{"value":"32038"},{"value":"0"}],[{"value":"31925"},{"value":"0"}],[{"value":"31809"},{"value":"0"}],[{"value":"31709"},{"value":"0"}],[{"value":"31601"},{"value":"0"}],[{"value":"31501"},{"value":"0"}],[{"value":"31401"},{"value":"0"}],[{"value":"31301"},{"value":"0"}],[{"value":"31194"},{"value":"0"}],[{"value":"31061"},{"value":"0"}],[{"value":"30935"},{"value":"0"}],[{"value":"30835"},{"value":"0"}],[{"value":"30735"},{"value":"0"}],[{"value":"30619"},{"value":"0"}],[{"value":"30518"},{"value":"0"}],[{"value":"30418"},{"value":"0"}],[{"value":"30317"},{"value":"0"}],[{"value":"30217"},{"value":"0"}],[{"value":"30077"},{"value":"0"}],[{"value":"25420"},{"value":"0"}],[{"value":"22921"},{"value":"0"}],[{"value":"22821"},{"value":"0"}],[{"value":"22721"},{"value":"0"}],[{"value":"22616"},{"value":"0"}],[{"value":"22516"},{"value":"0"}],[{"value":"22416"},{"value":"0"}],[{"value":"22316"},{"value":"0"}],[{"value":"22216"},{"value":"0"}],[{"value":"21987"},{"value":"0"}],[{"value":"21887"},{"value":"0"}],[{"value":"21767"},{"value":"0"}],[{"value":"21661"},{"value":"0"}],[{"value":"21561"},{"value":"0"}],[{"value":"21455"},{"value":"0"}],[{"value":"21334"},{"value":"0"}],[{"value":"21223"},{"value":"0"}],[{"value":"21123"},{"value":"0"}],[{"value":"20983"},{"value":"0"}],[{"value":"14682"},{"value
":"0"}],[{"value":"14420"},{"value":"0"}],[{"value":"14319"},{"value":"0"}],[{"value":"14198"},{"value":"0"}],[{"value":"14098"},{"value":"0"}],[{"value":"13998"},{"value":"0"}],[{"value":"13898"},{"value":"0"}],[{"value":"13796"},{"value":"0"}],[{"value":"13678"},{"value":"0"}],[{"value":"13578"},{"value":"0"}],[{"value":"13448"},{"value":"0"}],[{"value":"13348"},{"value":"0"}],[{"value":"13241"},{"value":"0"}],[{"value":"13141"},{"value":"0"}],[{"value":"13033"},{"value":"0"}],[{"value":"12933"},{"value":"0"}],[{"value":"12833"},{"value":"0"}],[{"value":"12733"},{"value":"0"}],[{"value":"12001"},{"value":"0"}],[{"value":"5678"},{"value":"0"}],[{"value":"5010"},{"value":"0"}],[{"value":"1298"},{"value":"0"}],[{"value":"1031"},{"value":"0"}],[{"value":"36438"},{"value":"0"}]],"columnNames":[{"name":"duration","index":0,"aggr":"sum"},{"name":"dst_bytes","index":1,"aggr":"sum"}],"rows":[["5057","0"],["5059","0"],["5051","0"],["5056","0"],["5051","0"],["5039","0"],["5062","0"],["5041","0"],["5056","0"],["5064","0"],["5043","0"],["5061","0"],["5049","0"],["5061","0"],["5048","0"],["5047","0"],["5044","0"],["5063","0"],["5068","0"],["5062","0"],["5046","0"],["5052","0"],["5044","0"],["5054","0"],["5039","0"],["5058","0"],["5051","0"],["5032","0"],["5063","0"],["5040","0"],["5051","0"],["5066","0"],["5044","0"],["5051","0"],["5036","0"],["5055","0"],["2426","0"],["5047","0"],["5057","0"],["5037","0"],["5057","0"],["5062","0"],["5051","0"],["5051","0"],["5053","0"],["5064","0"],["5044","0"],["5051","0"],["5033","0"],["5066","0"],["5063","0"],["5056","0"],["5042","0"],["5063","0"],["5060","0"],["5056","0"],["5049","0"],["5043","0"],["5039","0"],["5041","0"],["42448","0"],["42088","0"],["41065","0"],["40929","0"],["40806","0"],["40682","0"],["40571","0"],["40448","0"],["40339","0"],["40232","0"],["40121","0"],["36783","0"],["36674","0"],["36570","0"],["36467","0"],["36323","0"],["36204","0"],["32038","0"],["31925","0"],["31809","0"],["31709","0"],["31601","0"],["31501","0"],
["31401","0"],["31301","0"],["31194","0"],["31061","0"],["30935","0"],["30835","0"],["30735","0"],["30619","0"],["30518","0"],["30418","0"],["30317","0"],["30217","0"],["30077","0"],["25420","0"],["22921","0"],["22821","0"],["22721","0"],["22616","0"],["22516","0"],["22416","0"],["22316","0"],["22216","0"],["21987","0"],["21887","0"],["21767","0"],["21661","0"],["21561","0"],["21455","0"],["21334","0"],["21223","0"],["21123","0"],["20983","0"],["14682","0"],["14420","0"],["14319","0"],["14198","0"],["14098","0"],["13998","0"],["13898","0"],["13796","0"],["13678","0"],["13578","0"],["13448","0"],["13348","0"],["13241","0"],["13141","0"],["13033","0"],["12933","0"],["12833","0"],["12733","0"],["12001","0"],["5678","0"],["5010","0"],["1298","0"],["1031","0"],["36438","0"]]},"dateCreated":"2015-10-20T07:43:40+0000","dateStarted":"2015-10-20T07:44:17+0000","dateFinished":"2015-10-20T07:44:30+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6494"},{"text":"%md\n###Printing the schema of the DataFrame","dateUpdated":"2015-10-25T04:24:09+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790221990_2096086534","id":"20151025-162341_1003254343","result":{"code":"SUCCESS","type":"HTML","msg":"

Printing the schema of the DataFrame

\n"},"dateCreated":"2015-10-25T04:23:41+0000","dateStarted":"2015-10-25T04:24:07+0000","dateFinished":"2015-10-25T04:24:08+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6495"},{"text":"%pyspark\n\ninteractions_df.printSchema()","dateUpdated":"2015-10-20T07:48:19+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445370257266_529132109","id":"20151020-194417_1802294435","result":{"code":"SUCCESS","type":"TEXT","msg":"root\n |-- dst_bytes: long (nullable = true)\n |-- duration: long (nullable = true)\n |-- flag: string (nullable = true)\n |-- protocol_type: string (nullable = true)\n |-- service: string (nullable = true)\n |-- src_bytes: long (nullable = true)\n\n"},"dateCreated":"2015-10-20T07:44:17+0000","dateStarted":"2015-10-20T07:48:19+0000","dateFinished":"2015-10-20T07:48:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6496"},{"text":"%md \n###Querying a DataFrame","dateUpdated":"2015-10-25T04:25:45+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790215151_-547132953","id":"20151025-162335_648508524","result":{"code":"SUCCESS","type":"HTML","msg":"

Querying a DataFrame

\n"},"dateCreated":"2015-10-25T04:23:35+0000","dateStarted":"2015-10-25T04:25:44+0000","dateFinished":"2015-10-25T04:25:44+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6497"},{"text":"%pyspark\n\ninteractions_df.select(\"protocol_type\", \"duration\", \"dst_bytes\").groupBy(\"protocol_type\").count().show()","dateUpdated":"2015-10-20T07:54:05+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445370429303_-127605260","id":"20151020-194709_695055854","result":{"code":"SUCCESS","type":"TEXT","msg":"protocol_type count \nudp 20354 \ntcp 190065\nicmp 283602\n"},"dateCreated":"2015-10-20T07:47:09+0000","dateStarted":"2015-10-20T07:54:05+0000","dateFinished":"2015-10-20T07:54:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6498"},{"text":"%pyspark\n\ninteractions_df.select(\"protocol_type\", \"duration\", \"dst_bytes\").filter(interactions_df.duration>1000).filter(interactions_df.dst_bytes==0).groupBy(\"protocol_type\").count().show()","dateUpdated":"2015-10-20T07:56:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445370845618_350615197","id":"20151020-195405_164097486","result":{"code":"SUCCESS","type":"TEXT","msg":"protocol_type count\ntcp 139 \n"},"dateCreated":"2015-10-20T07:54:05+0000","dateStarted":"2015-10-20T07:56:41+0000","dateFinished":"2015-10-20T07:56:56+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6499"},{"text":"%md\n\n###Calculating a derived column on a 
DataFrame","dateUpdated":"2015-10-25T04:25:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790308872_-2043903577","id":"20151025-162508_1142356987","result":{"code":"SUCCESS","type":"HTML","msg":"

Calculating a derived column on a DataFrame

\n"},"dateCreated":"2015-10-25T04:25:08+0000","dateStarted":"2015-10-25T04:25:40+0000","dateFinished":"2015-10-25T04:25:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6500"},{"text":"%pyspark\n\ndef get_label_type(label):\n    # Collapse every label other than \"normal.\" into a single \"attack\" class\n    if label != \"normal.\":\n        return \"attack\"\n    else:\n        return \"normal\"\n\nrow_labeled_data = csv_rdd.map(lambda p: Row(\n    duration=int(p[0]),\n    protocol_type=p[1],\n    service=p[2],\n    flag=p[3],\n    src_bytes=int(p[4]),\n    dst_bytes=int(p[5]),\n    label=get_label_type(p[41])\n    )\n)\ninteractions_labeled_df = sqlContext.createDataFrame(row_labeled_data)","dateUpdated":"2015-10-20T07:59:09+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445371001131_1501462098","id":"20151020-195641_1610602237","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2015-10-20T07:56:41+0000","dateStarted":"2015-10-20T07:59:09+0000","dateFinished":"2015-10-20T07:59:10+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6501"},{"text":"%pyspark\n\ninteractions_labeled_df.select(\"label\").groupBy(\"label\").count().show()","dateUpdated":"2015-10-20T07:59:57+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445371149738_-1817214992","id":"20151020-195909_100412061","result":{"code":"SUCCESS","type":"TEXT","msg":"label count \nattack 396743\nnormal 97278 \n"},"dateCreated":"2015-10-20T07:59:09+0000","dateStarted":"2015-10-20T07:59:57+0000","dateFinished":"2015-10-20T08:00:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6502"},{"text":"%md\n\n###groupBy on a DataFrame 
","dateUpdated":"2015-10-25T04:26:56+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445790377976_198351717","id":"20151025-162617_120349085","result":{"code":"SUCCESS","type":"HTML","msg":"

groupBy on a DataFrame

\n"},"dateCreated":"2015-10-25T04:26:17+0000","dateStarted":"2015-10-25T04:26:55+0000","dateFinished":"2015-10-25T04:26:55+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6503"},{"text":"%pyspark\n\ninteractions_labeled_df.select(\"label\", \"protocol_type\").groupBy(\"label\", \"protocol_type\").count().show()","dateUpdated":"2015-10-20T08:01:16+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445371197970_-292562371","id":"20151020-195957_179265569","result":{"code":"SUCCESS","type":"TEXT","msg":"label protocol_type count \nattack udp 1177 \nattack tcp 113252\nattack icmp 282314\nnormal udp 19177 \nnormal tcp 76813 \nnormal icmp 1288 \n"},"dateCreated":"2015-10-20T07:59:57+0000","dateStarted":"2015-10-20T08:01:16+0000","dateFinished":"2015-10-20T08:01:31+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6504"},{"text":"%pyspark\n\ninteractions_labeled_df.select(\"label\", \"protocol_type\", \"dst_bytes\").groupBy(\"label\", \"protocol_type\", interactions_labeled_df.dst_bytes==0).count().show()","dateUpdated":"2015-10-20T08:03:26+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445371276497_-310706866","id":"20151020-200116_1068349697","result":{"code":"SUCCESS","type":"TEXT","msg":"label protocol_type (dst_bytes = 0) count \nnormal icmp true 1288 \nattack udp true 1166 \nattack udp false 11 \nnormal udp true 3594 \nnormal udp false 15583 \nattack tcp true 110583\nattack tcp false 2669 \nnormal tcp true 9313 \nnormal tcp false 67500 \nattack icmp true 
282314\n"},"dateCreated":"2015-10-20T08:01:16+0000","dateStarted":"2015-10-20T08:03:26+0000","dateFinished":"2015-10-20T08:03:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:6505"},{"config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1445371351603_1370519946","id":"20151020-200231_820336316","dateCreated":"2015-10-20T08:02:31+0000","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:6506"}],"name":"Getting Started / Hello World","id":"2B48PF7SN","angularObjects":{},"config":{"looknfeel":"default"},"info":{}}