{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scalable Web Server Log Analytics with Apache Spark\n",
"\n",
"Apache Spark is an excellent and ideal framework for wrangling, analyzing and modeling on structured and unstructured data - at scale! In this tutorial, we will be focusing on one of the most popular case studies in the industry - log analytics.\n",
"\n",
"Typically, server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information. Log data comes from many sources in an enterprise, such as the web, client and compute servers, applications, user-generated content, flat files. They can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more.\n",
"\n",
"Spark allows you to dump and store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale. This hands-on will show you how to use Apache Spark on real-world production logs from NASA and learn data wrangling and basic yet powerful techniques in exploratory data analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1 - Setting up Dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sqlContext"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.context import SparkContext\n",
"from pyspark.sql.context import SQLContext\n",
"from pyspark.sql.session import SparkSession\n",
" \n",
"sc = SparkContext()\n",
"sqlContext = SQLContext(sc)\n",
"spark = SparkSession(sc)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Regular Expressions"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<_sre.SRE_Match object; span=(0, 25), match=\"I'm searching for a spark\"> 0 25\n",
"<_sre.SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36\n"
]
}
],
"source": [
"m = re.finditer(r'.*?(spark).*?', \"I'm searching for a spark in PySpark\", re.I)\n",
"for match in m:\n",
" print(match, match.start(), match.end())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case study, we will analyze log datasets from NASA Kennedy Space Center web server in Florida. The full data set is freely available for download [__here__](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html).\n",
"\n",
"These two datasets contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. You can head over to the [__website__](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and download the following files as needed.\n",
"\n",
"- Jul 01 to Jul 31, ASCII format, 20.7 MB gzip compressed, 205.2 MB uncompressed: [ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz](ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz)\n",
"- Aug 04 to Aug 31, ASCII format, 21.8 MB gzip compressed, 167.8 MB uncompressed: [ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz](ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz)\n",
"\n",
"Make sure both the files are in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2 - Loading and Viewing the NASA Log Dataset\n",
"\n",
"Given that our data is stored in the following mentioned path, let's load it into a DataFrame. We'll do this in steps. First, we'll use `sqlContext.read.text()` or `spark.read.text()` to read the text file. This will produce a DataFrame with a single string column called `value`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NASA_access_log_Jul95.gz', 'NASA_access_log_Aug95.gz']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import glob\n",
"\n",
"raw_data_files = glob.glob('*.gz')\n",
"raw_data_files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Taking a look at the metadata of our dataframe"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- value: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"base_df = spark.read.text(raw_data_files)\n",
"base_df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pyspark.sql.dataframe.DataFrame"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(base_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also convert a dataframe to an RDD if needed"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pyspark.rdd.RDD"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_df_rdd = base_df.rdd\n",
"type(base_df_rdd)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Viewing sample data in our dataframe\n",
"Looks like it needs to be wrangled and parsed!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------------------------------------------------------------------------------------------------------+\n",
"|value |\n",
"+-----------------------------------------------------------------------------------------------------------------------+\n",
"|199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245 |\n",
"|unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985 |\n",
"|199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0\" 200 4085 |\n",
"|burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdown/liftoff.html HTTP/1.0\" 304 0 |\n",
"|199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0\" 200 4179|\n",
"|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0 |\n",
"|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/video/livevideo.gif HTTP/1.0\" 200 0 |\n",
"|205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985 |\n",
"|d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985 |\n",
"|129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074 |\n",
"+-----------------------------------------------------------------------------------------------------------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"base_df.show(10, truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting data from an RDD is slightly different. You can see how the data representation is different in the following RDD"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Row(value='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245'),\n",
" Row(value='unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985'),\n",
" Row(value='199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0\" 200 4085'),\n",
" Row(value='burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdown/liftoff.html HTTP/1.0\" 304 0'),\n",
" Row(value='199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0\" 200 4179'),\n",
" Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0'),\n",
" Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/video/livevideo.gif HTTP/1.0\" 200 0'),\n",
" Row(value='205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985'),\n",
" Row(value='d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985'),\n",
" Row(value='129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074')]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_df_rdd.take(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 3 - Data Wrangling\n",
"\n",
"In this section, we will try and clean and parse our log dataset to really extract structured attributes with meaningful information from each log message.\n",
"\n",
"### Data understanding\n",
"If you're familiar with web server logs, you'll recognize that the above displayed data is in [Common Log Format](https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format). \n",
"\n",
"The fields are:\n",
"__`remotehost rfc931 authuser [date] \"request\" status bytes`__\n",
"\n",
"\n",
"| field | meaning |\n",
"| ------------- | ---------------------------------------------------------------------- |\n",
"| _remotehost_ | Remote hostname (or IP number if DNS hostname is not available or if [DNSLookup](https://www.w3.org/Daemon/User/Config/General.html#DNSLookup) is off). |\n",
"| _rfc931_ | The remote logname of the user if at all it is present. |\n",
"| _authuser_ | The username of the remote user after authentication by the HTTP server. |\n",
"| _[date]_ | Date and time of the request. |\n",
"| _\"request\"_ | The request, exactly as it came from the browser or client. |\n",
"| _status_ | The [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) the server sent back to the client. |\n",
"| _bytes_ | The number of bytes (`Content-Length`) transferred to the client. |\n",
"\n",
"We will need to use some specific techniques to parse, match and extract these attributes from the log data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Parsing and Extraction with Regular Expressions\n",
"\n",
"Next, we have to parse it into individual columns. We'll use the special built-in [regexp\\_extract()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract)\n",
"function to do the parsing. This function matches a column against a regular expression with one or more [capture groups](http://regexone.com/lesson/capturing_groups) and allows you to extract one of the matched groups. We'll use one regular expression for each field we wish to extract.\n",
"\n",
"You must have heard or used a fair bit of regular expressions by now. If you find regular expressions confusing (and they certainly _can_ be), and you want to learn more about them, we recommend checking out the\n",
"[RegexOne web site](http://regexone.com/). You might also find [_Regular Expressions Cookbook_](http://shop.oreilly.com/product/0636920023630.do), by Goyvaerts and Levithan, to be useful as a reference."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's take a look at our dataset dimensions"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3461613, 1)\n"
]
}
],
"source": [
"print((base_df.count(), len(base_df.columns)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's extract and take a look at some sample log messages"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245',\n",
" 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985',\n",
" '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0\" 200 4085',\n",
" 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdown/liftoff.html HTTP/1.0\" 304 0',\n",
" '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0\" 200 4179',\n",
" 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0',\n",
" 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/video/livevideo.gif HTTP/1.0\" 200 0',\n",
" '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985',\n",
" 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP/1.0\" 200 3985',\n",
" '129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074',\n",
" 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] \"GET /shuttle/countdown/count.gif HTTP/1.0\" 200 40310',\n",
" 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 200 786',\n",
" 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] \"GET /images/KSC-logosmall.gif HTTP/1.0\" 200 1204',\n",
" 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] \"GET /shuttle/countdown/count.gif HTTP/1.0\" 200 40310',\n",
" 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 200 786']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_logs = [item['value'] for item in base_df.take(15)]\n",
"sample_logs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting host names\n",
"\n",
"Let's try and write some regular expressions to extract the host name from the logs"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['199.72.81.55',\n",
" 'unicomp6.unicomp.net',\n",
" '199.120.110.21',\n",
" 'burger.letters.com',\n",
" '199.120.110.21',\n",
" 'burger.letters.com',\n",
" 'burger.letters.com',\n",
" '205.212.115.106',\n",
" 'd104.aa.net',\n",
" '129.94.144.152',\n",
" 'unicomp6.unicomp.net',\n",
" 'unicomp6.unicomp.net',\n",
" 'unicomp6.unicomp.net',\n",
" 'd104.aa.net',\n",
" 'd104.aa.net']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"host_pattern = r'(^\\S+\\.[\\S+\\.]+\\S+)\\s'\n",
"hosts = [re.search(host_pattern, item).group(1)\n",
" if re.search(host_pattern, item)\n",
" else 'no match'\n",
" for item in sample_logs]\n",
"hosts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting timestamps \n",
"\n",
"Let's now try and use regular expressions to extract the timestamp fields from the logs"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['01/Jul/1995:00:00:01 -0400',\n",
" '01/Jul/1995:00:00:06 -0400',\n",
" '01/Jul/1995:00:00:09 -0400',\n",
" '01/Jul/1995:00:00:11 -0400',\n",
" '01/Jul/1995:00:00:11 -0400',\n",
" '01/Jul/1995:00:00:12 -0400',\n",
" '01/Jul/1995:00:00:12 -0400',\n",
" '01/Jul/1995:00:00:12 -0400',\n",
" '01/Jul/1995:00:00:13 -0400',\n",
" '01/Jul/1995:00:00:13 -0400',\n",
" '01/Jul/1995:00:00:14 -0400',\n",
" '01/Jul/1995:00:00:14 -0400',\n",
" '01/Jul/1995:00:00:14 -0400',\n",
" '01/Jul/1995:00:00:15 -0400',\n",
" '01/Jul/1995:00:00:15 -0400']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts_pattern = r'\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} -\\d{4})]'\n",
"timestamps = [re.search(ts_pattern, item).group(1) for item in sample_logs]\n",
"timestamps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting HTTP Request Method, URIs and Protocol \n",
"\n",
"Let's now try and use regular expressions to extract the HTTP request methods, URIs and Protocol patterns fields from the logs"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('GET', '/history/apollo/', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/missions/sts-73/mission-sts-73.html', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/liftoff.html', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0'),\n",
" ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/video/livevideo.gif', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/countdown.html', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/', 'HTTP/1.0'),\n",
" ('GET', '/', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),\n",
" ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),\n",
" ('GET', '/images/KSC-logosmall.gif', 'HTTP/1.0'),\n",
" ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),\n",
" ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"method_uri_protocol_pattern = r'\\\"(\\S+)\\s(\\S+)\\s*(\\S*)\\\"'\n",
"method_uri_protocol = [re.search(method_uri_protocol_pattern, item).groups()\n",
" if re.search(method_uri_protocol_pattern, item)\n",
" else 'no match'\n",
" for item in sample_logs]\n",
"method_uri_protocol"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting HTTP Status Codes\n",
"\n",
"Let's now try and use regular expressions to extract the HTTP status codes from the logs"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['200', '200', '200', '304', '200', '304', '200', '200', '200', '200', '200', '200', '200', '200', '200']\n"
]
}
],
"source": [
"status_pattern = r'\\s(\\d{3})\\s'\n",
"status = [re.search(status_pattern, item).group(1) for item in sample_logs]\n",
"print(status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting HTTP Response Content Size\n",
"\n",
"Let's now try and use regular expressions to extract the HTTP response content size from the logs"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['6245', '3985', '4085', '0', '4179', '0', '0', '3985', '3985', '7074', '40310', '786', '1204', '40310', '786']\n"
]
}
],
"source": [
"content_size_pattern = r'\\s(\\d+)$'\n",
"content_size = [re.search(content_size_pattern, item).group(1) for item in sample_logs]\n",
"print(content_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Putting it all together \n",
"\n",
"Let's now try and leverage all the regular expression patterns we previously built and use the `regexp_extract(...)` method to build our dataframe with all the log attributes neatly extracted in their own separate columns."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+------+--------------------+--------+------+------------+\n",
"| host| timestamp|method| endpoint|protocol|status|content_size|\n",
"+--------------------+--------------------+------+--------------------+--------+------+------------+\n",
"| 199.72.81.55|01/Jul/1995:00:00...| GET| /history/apollo/|HTTP/1.0| 200| 6245|\n",
"|unicomp6.unicomp.net|01/Jul/1995:00:00...| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|\n",
"| 199.120.110.21|01/Jul/1995:00:00...| GET|/shuttle/missions...|HTTP/1.0| 200| 4085|\n",
"| burger.letters.com|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 304| 0|\n",
"| 199.120.110.21|01/Jul/1995:00:00...| GET|/shuttle/missions...|HTTP/1.0| 200| 4179|\n",
"| burger.letters.com|01/Jul/1995:00:00...| GET|/images/NASA-logo...|HTTP/1.0| 304| 0|\n",
"| burger.letters.com|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 200| 0|\n",
"| 205.212.115.106|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 200| 3985|\n",
"| d104.aa.net|01/Jul/1995:00:00...| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|\n",
"| 129.94.144.152|01/Jul/1995:00:00...| GET| /|HTTP/1.0| 200| 7074|\n",
"+--------------------+--------------------+------+--------------------+--------+------+------------+\n",
"only showing top 10 rows\n",
"\n",
"(3461613, 7)\n"
]
}
],
"source": [
"from pyspark.sql.functions import regexp_extract\n",
"\n",
"logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),\n",
" regexp_extract('value', ts_pattern, 1).alias('timestamp'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),\n",
" regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),\n",
" regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))\n",
"logs_df.show(10, truncate=True)\n",
"print((logs_df.count(), len(logs_df.columns)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding Missing Values\n",
"\n",
"Missing and null values are the bane of data analysis and machine learning. Let's see how well our data parsing and extraction logic worked. First, let's verify that there are no null rows in the original dataframe."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(base_df\n",
" .filter(base_df['value']\n",
" .isNull())\n",
" .count())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If our data parsing and extraction worked properly, we should not have any rows with potential null values. Let's try and put that to test!"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33905"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bad_rows_df = logs_df.filter(logs_df['host'].isNull()| \n",
" logs_df['timestamp'].isNull() | \n",
" logs_df['method'].isNull() |\n",
" logs_df['endpoint'].isNull() |\n",
" logs_df['status'].isNull() |\n",
" logs_df['content_size'].isNull()|\n",
" logs_df['protocol'].isNull())\n",
"bad_rows_df.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ouch! Looks like we have over 33K missing values in our data! Can we handle this?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do remember, this is not a regular pandas dataframe which you can directly query and get which columns have null. Our so-called _big dataset_ is residing on disk which can potentially be present in multiple nodes in a spark cluster. So how do we find out which columns have potential nulls? \n",
"\n",
"### Finding Null Counts\n",
"\n",
"We can typically use the following technique to find out which columns have null values. \n",
"\n",
"(__Note:__ This approach is adapted from an [excellent answer](http://stackoverflow.com/a/33901312) on StackOverflow.)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['host',\n",
" 'timestamp',\n",
" 'method',\n",
" 'endpoint',\n",
" 'protocol',\n",
" 'status',\n",
" 'content_size']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logs_df.columns"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+---------+------+--------+--------+------+------------+\n",
"|host|timestamp|method|endpoint|protocol|status|content_size|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"| 0| 0| 0| 0| 0| 1| 33905|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import col\n",
"from pyspark.sql.functions import sum as spark_sum\n",
"\n",
"def count_null(col_name):\n",
" return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)\n",
"\n",
"# Build up a list of column expressions, one per column.\n",
"exprs = [count_null(col_name) for col_name in logs_df.columns]\n",
"\n",
"# Run the aggregation. The *exprs converts the list of expressions into\n",
"# variable function arguments.\n",
"logs_df.agg(*exprs).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, looks like we have one missing value in the `status` column and everything else is in the `content_size` column. \n",
"Let's see if we can figure out what's wrong!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling nulls in HTTP status\n",
"\n",
"Our original parsing regular expression for the `status` column was:\n",
"\n",
"__```\n",
"regexp_extract('value', r'\\s(\\d{3})\\s', 1).cast('integer').alias('status')\n",
"```__ \n",
"\n",
"Could it be that there are more digits making our regular expression wrong? or is the data point itself bad? Let's try and find out!\n",
"\n",
"**Note**: In the expression below, `~` means \"not\"."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_status_df = base_df.filter(~base_df['value'].rlike(r'\\s(\\d{3})\\s'))\n",
"null_status_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"|value |\n",
"+--------+\n",
"|alyssa.p|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"null_status_df.show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+---------+------+--------+--------+------+------------+\n",
"|host|timestamp|method|endpoint|protocol|status|content_size|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"| | | | | |null |null |\n",
"+----+---------+------+--------+--------+------+------------+\n",
"\n"
]
}
],
"source": [
"bad_status_df = null_status_df.select(regexp_extract('value', host_pattern, 1).alias('host'),\n",
" regexp_extract('value', ts_pattern, 1).alias('timestamp'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),\n",
" regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),\n",
" regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),\n",
" regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))\n",
"bad_status_df.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like the record itself is an incomplete record with no useful information, the best option would be to drop this record as follows!"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3461613"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logs_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3461612"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logs_df = logs_df[logs_df['status'].isNotNull()] \n",
"logs_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+---------+------+--------+--------+------+------------+\n",
"|host|timestamp|method|endpoint|protocol|status|content_size|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"| 0| 0| 0| 0| 0| 0| 33904|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"\n"
]
}
],
"source": [
"exprs = [count_null(col_name) for col_name in logs_df.columns]\n",
"logs_df.agg(*exprs).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling nulls in HTTP content size\n",
"\n",
"Based on our previous regular expression, our original parsing regular expression for the `content_size` column was:\n",
"\n",
"__```\n",
"regexp_extract('value', r'\\s(\\d+)$', 1).cast('integer').alias('content_size')\n",
"```__ \n",
"\n",
"Could there be missing data in our original dataset itself? Let's try and find out!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Find out the records in our base data frame with potential missing content sizes"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33905"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_content_size_df = base_df.filter(~base_df['value'].rlike(r'\\s\\d+$'))\n",
"null_content_size_df.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display the top ten records of your data frame having missing content sizes"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Row(value='dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] \"GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0\" 404 -'),\n",
" Row(value='dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] \"GET /software HTTP/1.0\" 302 -'),\n",
" Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] \"GET /software/winvn HTTP/1.0\" 302 -'),\n",
" Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] \"GET /software HTTP/1.0\" 302 -'),\n",
" Row(value='link097.txdirect.net - - [01/Jul/1995:00:05:06 -0400] \"GET /shuttle HTTP/1.0\" 302 -'),\n",
" Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:13 -0400] \"GET /shuttle/missions/sts-78/news HTTP/1.0\" 302 -'),\n",
" Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:58 -0400] \"GET /shuttle/missions/sts-72/news HTTP/1.0\" 302 -'),\n",
" Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:19 -0400] \"GET /pub/winvn/readme.txt HTTP/1.0\" 404 -'),\n",
" Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:28 -0400] \"GET /pub/winvn/readme.txt HTTP/1.0\" 404 -'),\n",
" Row(value='dynip38.efn.org - - [01/Jul/1995:00:10:50 -0400] \"GET /software HTTP/1.0\" 302 -')]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_content_size_df.take(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a \"`-`\" for the `content_size` field. \n",
"\n",
"Since we don't want to discard those rows from our analysis, let's impute or fill them to 0."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fix the rows with null content\\_size\n",
"\n",
"The easiest solution is to replace the null values in `logs_df` with 0 like we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:\n",
"\n",
"* [fillna()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna), which fills null values with specified non-null values.\n",
"* [na](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.na), which returns a [DataFrameNaFunctions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions) object with many functions for operating on null columns.\n",
"\n",
"There are several ways to invoke this function. The easiest is just to replace _all_ null columns with known values. But, for safety, it's better to pass a Python dictionary containing (column\\_name, value) mappings. That's what we'll do. A sample example from the documentation is depicted below\n",
"\n",
"```\n",
">>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()\n",
"+---+------+-------+\n",
"|age|height| name|\n",
"+---+------+-------+\n",
"| 10| 80| Alice|\n",
"| 5| null| Bob|\n",
"| 50| null| Tom|\n",
"| 50| null|unknown|\n",
"+---+------+-------+\n",
"```\n",
"\n",
"Now we use this function and fill all the missing values in the `content_size` field with 0!"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"logs_df = logs_df.na.fill({'content_size': 0})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now assuming everything we have done so far worked, we should have no missing values \\ nulls in our dataset. Let's verify this!"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+---------+------+--------+--------+------+------------+\n",
"|host|timestamp|method|endpoint|protocol|status|content_size|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"| 0| 0| 0| 0| 0| 0| 0|\n",
"+----+---------+------+--------+--------+------+------------+\n",
"\n"
]
}
],
"source": [
"exprs = [count_null(col_name) for col_name in logs_df.columns]\n",
"logs_df.agg(*exprs).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at that, no missing values! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling Temporal Fields (Timestamp)\n",
"\n",
"Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql.functions import udf\n",
"\n",
"month_map = {\n",
" 'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,\n",
" 'Aug':8, 'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12\n",
"}\n",
"\n",
"def parse_clf_time(text):\n",
" \"\"\" Convert Common Log time format into a Python datetime object\n",
" Args:\n",
" text (str): date and time in Apache time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]\n",
" Returns:\n",
" a string suitable for passing to CAST('timestamp')\n",
" \"\"\"\n",
" # NOTE: We're ignoring the time zones here, might need to be handled depending on the problem you are solving\n",
" return \"{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}\".format(\n",
" int(text[7:11]),\n",
" month_map[text[3:6]],\n",
" int(text[0:2]),\n",
" int(text[12:14]),\n",
" int(text[15:17]),\n",
" int(text[18:20])\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['01/Jul/1995:00:00:01 -0400',\n",
" '01/Jul/1995:00:00:06 -0400',\n",
" '01/Jul/1995:00:00:09 -0400',\n",
" '01/Jul/1995:00:00:11 -0400',\n",
" '01/Jul/1995:00:00:11 -0400']"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_ts = [item['timestamp'] for item in logs_df.select('timestamp').take(5)]\n",
"sample_ts"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['1995-07-01 00:00:01',\n",
" '1995-07-01 00:00:06',\n",
" '1995-07-01 00:00:09',\n",
" '1995-07-01 00:00:11',\n",
" '1995-07-01 00:00:11']"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[parse_clf_time(item) for item in sample_ts]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+------+--------------------+--------+------+------------+-------------------+\n",
"| host|method| endpoint|protocol|status|content_size| time|\n",
"+--------------------+------+--------------------+--------+------+------------+-------------------+\n",
"| 199.72.81.55| GET| /history/apollo/|HTTP/1.0| 200| 6245|1995-07-01 00:00:01|\n",
"|unicomp6.unicomp.net| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|1995-07-01 00:00:06|\n",
"| 199.120.110.21| GET|/shuttle/missions...|HTTP/1.0| 200| 4085|1995-07-01 00:00:09|\n",
"| burger.letters.com| GET|/shuttle/countdow...|HTTP/1.0| 304| 0|1995-07-01 00:00:11|\n",
"| 199.120.110.21| GET|/shuttle/missions...|HTTP/1.0| 200| 4179|1995-07-01 00:00:11|\n",
"| burger.letters.com| GET|/images/NASA-logo...|HTTP/1.0| 304| 0|1995-07-01 00:00:12|\n",
"| burger.letters.com| GET|/shuttle/countdow...|HTTP/1.0| 200| 0|1995-07-01 00:00:12|\n",
"| 205.212.115.106| GET|/shuttle/countdow...|HTTP/1.0| 200| 3985|1995-07-01 00:00:12|\n",
"| d104.aa.net| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|1995-07-01 00:00:13|\n",
"| 129.94.144.152| GET| /|HTTP/1.0| 200| 7074|1995-07-01 00:00:13|\n",
"+--------------------+------+--------------------+--------+------+------------+-------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"udf_parse_time = udf(parse_clf_time)\n",
"\n",
"logs_df = logs_df.select('*', udf_parse_time(logs_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')\n",
"logs_df.show(10, truncate=True)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- host: string (nullable = true)\n",
" |-- method: string (nullable = true)\n",
" |-- endpoint: string (nullable = true)\n",
" |-- protocol: string (nullable = true)\n",
" |-- status: integer (nullable = true)\n",
" |-- content_size: integer (nullable = false)\n",
" |-- time: timestamp (nullable = true)\n",
"\n"
]
}
],
"source": [
"logs_df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
host
\n",
"
method
\n",
"
endpoint
\n",
"
protocol
\n",
"
status
\n",
"
content_size
\n",
"
time
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
199.72.81.55
\n",
"
GET
\n",
"
/history/apollo/
\n",
"
HTTP/1.0
\n",
"
200
\n",
"
6245
\n",
"
1995-07-01 00:00:01
\n",
"
\n",
"
\n",
"
1
\n",
"
unicomp6.unicomp.net
\n",
"
GET
\n",
"
/shuttle/countdown/
\n",
"
HTTP/1.0
\n",
"
200
\n",
"
3985
\n",
"
1995-07-01 00:00:06
\n",
"
\n",
"
\n",
"
2
\n",
"
199.120.110.21
\n",
"
GET
\n",
"
/shuttle/missions/sts-73/mission-sts-73.html
\n",
"
HTTP/1.0
\n",
"
200
\n",
"
4085
\n",
"
1995-07-01 00:00:09
\n",
"
\n",
"
\n",
"
3
\n",
"
burger.letters.com
\n",
"
GET
\n",
"
/shuttle/countdown/liftoff.html
\n",
"
HTTP/1.0
\n",
"
304
\n",
"
0
\n",
"
1995-07-01 00:00:11
\n",
"
\n",
"
\n",
"
4
\n",
"
199.120.110.21
\n",
"
GET
\n",
"
/shuttle/missions/sts-73/sts-73-patch-small.gif
\n",
"
HTTP/1.0
\n",
"
200
\n",
"
4179
\n",
"
1995-07-01 00:00:11
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" host method \\\n",
"0 199.72.81.55 GET \n",
"1 unicomp6.unicomp.net GET \n",
"2 199.120.110.21 GET \n",
"3 burger.letters.com GET \n",
"4 199.120.110.21 GET \n",
"\n",
" endpoint protocol status \\\n",
"0 /history/apollo/ HTTP/1.0 200 \n",
"1 /shuttle/countdown/ HTTP/1.0 200 \n",
"2 /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0 200 \n",
"3 /shuttle/countdown/liftoff.html HTTP/1.0 304 \n",
"4 /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0 200 \n",
"\n",
" content_size time \n",
"0 6245 1995-07-01 00:00:01 \n",
"1 3985 1995-07-01 00:00:06 \n",
"2 4085 1995-07-01 00:00:09 \n",
"3 0 1995-07-01 00:00:11 \n",
"4 4179 1995-07-01 00:00:11 "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logs_df.limit(5).toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now cache `logs_df` since we will be using it extensively for our data analysis section in the next part!"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DataFrame[host: string, method: string, endpoint: string, protocol: string, status: int, content_size: int, time: timestamp]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logs_df.cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 4 - Data Analysis on our Web Logs\n",
"\n",
"Now that we have a DataFrame containing the parsed log file as a data frame, we can perform some interesting exploratory data analysis (EDA)\n",
"\n",
"## Content Size Statistics\n",
"\n",
"Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know what are the average, minimum, and maximum content sizes.\n",
"\n",
"We can compute the statistics by calling `.describe()` on the `content_size` column of `logs_df`. The `.describe()` function returns the count, mean, stddev, min, and max of a given column."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
summary
\n",
"
content_size
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
count
\n",
"
3461612
\n",
"
\n",
"
\n",
"
1
\n",
"
mean
\n",
"
18928.844398216785
\n",
"
\n",
"
\n",
"
2
\n",
"
stddev
\n",
"
73031.47260949228
\n",
"
\n",
"
\n",
"
3
\n",
"
min
\n",
"
0
\n",
"
\n",
"
\n",
"
4
\n",
"
max
\n",
"
6823936
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" summary content_size\n",
"0 count 3461612\n",
"1 mean 18928.844398216785\n",
"2 stddev 73031.47260949228\n",
"3 min 0\n",
"4 max 6823936"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"content_size_summary_df = logs_df.describe(['content_size'])\n",
"content_size_summary_df.toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can use SQL to directly calculate these statistics. You can explore many useful functions within the `pyspark.sql.functions` module in the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions).\n",
"\n",
"After we apply the `.agg()` function, we call `toPandas()` to extract and convert the result into a `pandas` dataframe which has better formatting on Jupyter notebooks"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
min_content_size
\n",
"
max_content_size
\n",
"
mean_content_size
\n",
"
std_content_size
\n",
"
count_content_size
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0
\n",
"
6823936
\n",
"
18928.844398
\n",
"
73031.472609
\n",
"
3461612
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" min_content_size max_content_size mean_content_size std_content_size \\\n",
"0 0 6823936 18928.844398 73031.472609 \n",
"\n",
" count_content_size \n",
"0 3461612 "
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"(logs_df.agg(F.min(logs_df['content_size']).alias('min_content_size'),\n",
" F.max(logs_df['content_size']).alias('max_content_size'),\n",
" F.mean(logs_df['content_size']).alias('mean_content_size'),\n",
" F.stddev(logs_df['content_size']).alias('std_content_size'),\n",
" F.count(logs_df['content_size']).alias('count_content_size'))\n",
" .toPandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTTP Status Code Analysis\n",
"\n",
"Next, let's look at the status code values that appear in the log. We want to know which status code values appear in the data and how many times. \n",
"\n",
"We again start with `logs_df`, then group by the `status` column, apply the `.count()` aggregation function, and sort by the `status` column."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"status_freq_df = (logs_df\n",
" .groupBy('status')\n",
" .count()\n",
" .sort('status')\n",
" .cache())"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total distinct HTTP Status Codes: 8\n"
]
}
],
"source": [
"print('Total distinct HTTP Status Codes:', status_freq_df.count())"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
status
\n",
"
count
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
200
\n",
"
3100524
\n",
"
\n",
"
\n",
"
2
\n",
"
304
\n",
"
266773
\n",
"
\n",
"
\n",
"
1
\n",
"
302
\n",
"
73070
\n",
"
\n",
"
\n",
"
5
\n",
"
404
\n",
"
20899
\n",
"
\n",
"
\n",
"
4
\n",
"
403
\n",
"
225
\n",
"
\n",
"
\n",
"
6
\n",
"
500
\n",
"
65
\n",
"
\n",
"
\n",
"
7
\n",
"
501
\n",
"
41
\n",
"
\n",
"
\n",
"
3
\n",
"
400
\n",
"
15
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" status count\n",
"0 200 3100524\n",
"2 304 266773\n",
"1 302 73070\n",
"5 404 20899\n",
"4 403 225\n",
"6 500 65\n",
"7 501 41\n",
"3 400 15"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"status_freq_pd_df = (status_freq_df\n",
" .toPandas()\n",
" .sort_values(by=['count'],\n",
" ascending=False))\n",
"status_freq_pd_df"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already up-to-date: seaborn in /usr/local/anaconda/lib/python3.6/site-packages (0.9.0)\n",
"Requirement not upgraded as not directly required: matplotlib>=1.4.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (2.2.2)\n",
"Requirement not upgraded as not directly required: numpy>=1.9.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.13.3)\n",
"Requirement not upgraded as not directly required: scipy>=0.14.0 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.1.0)\n",
"Requirement not upgraded as not directly required: pandas>=0.15.2 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (0.20.3)\n",
"Requirement not upgraded as not directly required: cycler>=0.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)\n",
"Requirement not upgraded as not directly required: pytz in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2018.4)\n",
"Requirement not upgraded as not directly required: six>=1.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.11.0)\n",
"Requirement not upgraded as not directly required: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.2.0)\n",
"Requirement not upgraded as not directly required: kiwisolver>=1.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.0.1)\n",
"Requirement not upgraded as not directly required: python-dateutil>=2.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.7.3)\n",
"Requirement not upgraded as not directly required: setuptools in /usr/local/anaconda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (39.2.0)\n",
"\u001b[31mpyspark 2.4.0 requires py4j==0.10.7, which is not installed.\u001b[0m\n",
"\u001b[33mYou are using pip version 10.0.1, however version 19.0.3 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install -U seaborn"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAFgCAYAAACFYaNMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAGYpJREFUeJzt3X2QZXWd3/H3xxnxGUGZRWRIQJ2K\nwSeEDo7RrA9EGDARNKyFFWViiONGWLVWjbimxKdUaYyyYpQKW8wCxl1kUcOEVdnJiFprLQ+NIDCw\nhFYxzCzCyAzgrhUM+M0f99d46Z0euod77w/p96vqVp/7Pb9zft++3fOZ0+eee2+qCknS5D2mdwOS\ntFQZwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0s793AI92aNWvqm9/8Zu82JP1m\nyUIGeQT8EH72s5/1bkHSo5QBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmd\nGMCS1IkBLEmdGMCS1InvhrZIh73vvInNddWnTpzYXJImzyNgSerEAJakTgxgSerEAJakTgxgSerE\nAJakTsYWwEken+SKJD9IsjnJR1r9oCSXJ5lJ8uUke7T649r9mbb+wKF9faDVb0py1FB9TavNJDl1\nqL7oOSRp0sZ5BHwv8OqqehFwCLAmyWrgk8DpVfUcYAdwUht/ErCj1U9v40hyMHAC8DxgDfCFJMuS\nLAM+DxwNHAy8qY1lsXNIUg9jC+Aa+Nt297HtVsCrgQtb/VzguLZ8bLtPW39EkrT6+VV1b1X9GJgB\nDm+3mar6UVX9EjgfOLZts9g5JGnixnoOuB2pXgPcAWwEfgjcVVX3tSFbgP3b8v7ArQBt/d3A04fr\nc7aZr/703Zhjbt/rkkwnmd62bdvuffOS9BDGGsBVdX9VHQKsZHDE+txxzjcqVXVWVU1V1dSKFSt6\ntyPpUWoiV0FU1V3ApcBLgb2SzL4HxUpga1veChwA0NY/FbhzuD5nm/nqd+7GHJI0ceO8CmJFkr3a\n8hOA1wA3Mgji49uwtcBFbXlDu09b/62qqlY/oV3BcBCwCrgCuBJY1a542IPBE3Ub2jaLnUOSJm6c\n74a2H3Buu1rhMcAFVXVxkhuA85N8HLgaOLuNPxv4YpIZYDuDQKWqNie5ALgBuA84uaruB0hyCnAJ\nsAxYX1Wb277ev5g5JKmHeAC4a1NTUzU9Pf3Afd+OUtICLOjqKl8JJ0mdGMCS1IkBLEmdGMCS1IkB\nLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmd\nGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS\n1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdjC2AkxyQ5NIkNyTZnORdrf7hJFuTXNNu\nxwxt84EkM0luSnLUUH1Nq80kOXWoflCSy1v9y0n2aPXHtfszbf2BDzWHJE3aOI+A7wPeU1UHA6uB\nk5Mc3NadXlWHtNvXAdq6E4DnAWuALyRZlmQZ8HngaOBg4E1D+/lk29dzgB3ASa1+ErCj1U9v4+ad\nY3wPgSTNb2wBXFW3VdX32/LPgRuB/XexybHA+VV1b1X9GJgBDm+3mar6UVX9EjgfODZJgFcDF7bt\nzwWOG9rXuW35QuCINn6+OSRp4iZyDridAngxcHkrnZLk2iTrk+zdavsDtw5ttqXV5qs/Hbirqu6b\nU3/Qvtr6u9v4+fY1t991SaaTTG/btm3R368kLcTYAzjJk4GvAO+uqnuAM4FnA4cAtwGfHncPi1VV\nZ1XVVFVNrVixonc7kh6lxhrASR7LIHy/VFVfBaiq26vq/qr6FfBH/PoUwFbggKHNV7bafPU7gb2S\nLJ9Tf9C+2vqntvHz7UuSJm6cV0EEOBu4sao+M1Tfb2jY64Hr2/IG4IR2BcNBwCrgCuBKYFW74mEP\nBk+ibaiqAi4Fjm/brwUuGtrX2rZ8PPCtNn6+OSRp4pY/9JDd9jLgLcB1Sa5ptT9gcBXDIUABtwBv\nB6iqzUkuAG5gcAXFyVV1P0CSU4BLgGXA+qra3Pb3fuD8JB8HrmYQ+LSvX0wyA2xnENq7nEOSJi2D\nA0PNZ2pqqqanpx+4f9j7zpvY3Fd96sSJzSVppLKQQb4STpI6MYAlqRMDWJI6MYAlqRMDWJI6MYAl\nqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMD\nWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6\nMYAlqRMDWJI6MYAlqRMDWJI6MYAlqRMDWJI6MYAlqZOxBXCSA5JcmuSGJJuTvKvVn5ZkY5Kb29e9\nWz1Jzkgyk+TaJIcO7WttG39zkrVD9cOSXNe2OSNJdncOSZq0cR4B3we8p6oOBlYDJyc5GDgV2FRV\nq4BN7T7A0cCqdlsHnAmDMAVOA14CHA6cNhuobczbhrZb0+qLmkOSehhbAFfVbVX1/bb8c+BGYH/g\nWODcNuxc4Li2fCxwXg1cBuyVZD/gKGBjVW2vqh3ARmBNW7dnVV1WVQWcN2dfi5lDkiZuIueAkxwI\nvBi4HNi3qm5rq34K7NuW9wduHdpsS6vtqr5lJ3V2Y465/a5LMp1ketu2bQv7JiVpkcYewEmeDHwF\neHdV3TO8rh251jjn3505quqsqpqqqqkVK1aMqTNJS91YAzjJYxmE75eq6qutfPvsn/3t6x2tvhU4\nYGjzla22q/rKndR3Zw5JmrhxXgUR4Gzgxqr6zNCqDcDslQxrgYuG6ie2KxVWA3e30wiXAEcm2bs9\n+XYkcElbd0+S1W2uE+fsazFzSNLELR/jvl8GvAW4Lsk1rfYHwCeAC5KcBPwEeGNb93XgGGAG+AXw\nVoCq2p7kY8CVbdxHq2p7W34HcA7wBOAb7cZi55CkHsYWwFX1l0DmWX3ETsYXcPI8+1oPrN9JfRp4\n/k7qdy52DkmaNF8JJ0mdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkB\nLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1MmCAjjJpoXUJEkLt8vPhEvyeOCJwD7t\nE4lnP+NtT2D/MfcmSY9qD/WhnG8H3g08E7iKXwfwPcB/HWNfkvSot8sArqrPAp9N8ntV9bkJ9SRJ\nS8KCPpa+qj6X5J8CBw5vU1XnjakvSXrUW1AAJ/ki8GzgGuD+Vi7AAJak3bSgAAamgIOrqsbZjCQt\nJQu9Dvh64BnjbESSlpqFHgHvA9yQ5Arg3tliVb1uLF1J0hKw0AD+8DibkKSlaKFXQXxn3I1I0lKz\n0Ksgfs7gqgeAPYDHAn9XVXuOqzFJerRb6BHwU2aXkwQ4Flg9rqYkaSlY9Luh1cD/AI4aQz+StGQs\n9BTEG4buPobBdcH/dywdSdISsdCrIP7l0PJ9wC0MTkNIknbTQs8Bv3XcjUjSUrPQN2RfmeRrSe5o\nt68kWTnu5iTp0WyhT8L9MbCBwfsCPxP4n60mSdpNCw3gFVX1x1V1X7udA6wYY1+S9Ki30AC+M8mb\nkyxrtzcDd+5qgyTr2+mK64dqH06yNck17XbM0LoPJJlJclOSo4bqa1ptJsmpQ/WDklze6l9Osker\nP67dn2nrD3yoOSSph4UG8L8F3gj8FLgNOB74Nw+xzTnAmp3UT6+qQ9rt6wBJDgZOAJ7XtvnCbNgD\nnweOBg4G3tTGAnyy7es5wA7gpFY/CdjR6qe3cfPOscDvX5JGbqEB/FFgbVWtqKrfYhDIH9nVBlX1\nXWD7Avd/LHB+Vd1bVT8GZoDD222mqn5UVb8EzgeOba/GezVwYdv+XOC4oX2d25YvBI4YevXezuaQ\npC4WGsAvrKods3eqajvw4t2c85Qk17ZTFHu32v7ArUNjtrTafPWnA3dV1X1z6g/aV1t/dxs/377+\nniTrkkwnmd62bdvufZeS9BAWGsCPGQpLkjyNhb+IY9iZDD7a6BAGpzI+vRv7GLuqOquqpqpqasUK\nn2uUNB4LDdFPA3+V5M/a/d8B/tNiJ6uq22eXk/wRcHG7uxU4YGjoylZjnvqdwF5Jlrej3OHxs/va\nkmQ58NQ2fldzSNLELegIuH368RuA29vtDVX1xcVOlmS/obuvZ/BRRzC4xviEdgXDQcAq4ArgSmBV\nu+JhDwZPom1on013KYMnAwHWAhcN7WttWz4e+FYbP98cktTFgk8jVNUNwA0LHZ/kT4FXAvsk2QKc\nBrwyySEM3lv4FuDtbd+bk1zQ9n8fcHJV3d/2cwpwCbAMWF9Vm9sU7wfOT/Jx4Grg7FY/G/hikhkG\nTwKe8FBzSFIP8YOOd21qaqqmp6cfuH/Y+86b2NxXferEic0laaSykEGLfj9gSdJoGMCS1IkBLEmd\nGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS\n1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkB\nLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdGMCS1IkBLEmdjC2Ak6xPckeS64dqT0uyMcnN\n7everZ4kZySZSXJtkkOHtlnbxt+cZO1Q/bAk17VtzkiS3Z1DknoY5xHwOcCaObVTgU1VtQrY1O4D\nHA2sard1wJkwCFPgNOAlwOHAabOB2sa8bWi7NbszhyT1MrYArqrvAtvnlI8Fzm3L5wLHDdXPq4HL\ngL2S7AccBWysqu1VtQPYCKxp6/asqsuqqoDz5uxrMXNIUheTPge8b1Xd1pZ/CuzblvcHbh0at6XV\ndlXfspP67szx9yRZl2Q6yfS2bdsW+K1J0uJ0exKuHbnWI3GOqjqrqqaqamrFihVj6EySJh/At8/+\n2d++3tHqW4EDhsatbLVd1VfupL47c0hSF5MO4A3A7JUMa4GLhuontisVVgN3t9MIlwBHJtm7Pfl2\nJHBJW3dPktXt6ocT5+xrMXNIUhfLx7XjJH8KvBLYJ8kWBlczfAK4IMlJwE+AN7bhXweOAWaAXwBv\nBaiq7Uk+BlzZxn20qmaf2HsHgystngB8o91Y7ByS1MvYAriq3jTPqiN2MraAk+fZz3pg/U7q08Dz\nd1K/c7FzSFIPvhJOkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNY\nkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjox\ngCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWpEwNYkjoxgCWp\nEwNYkjoxgCWpky4BnOSWJNcluSbJdKs9LcnGJDe3r3u3epKckWQmybVJDh3az9o2/uYka4fqh7X9\nz7Rts6s5JKmHnkfAr6qqQ6pqqt0/FdhUVauATe0+wNHAqnZbB5wJgzAFTgNeAhwOnDYUqGcCbxva\nbs1DzCFJE/dIOgVxLHBuWz4XOG6ofl4NXAbslWQ/4ChgY1Vtr6odwEZgTVu3Z1VdVlUFnDdnXzub\nQ5ImrlcAF/AXSa5Ksq7V9q2q29ryT4F92/L+wK1D225ptV3Vt+ykvqs5HiTJuiTTSaa3bdu26G9O\nkhZiead5X15VW5P8FrAxyV8Pr6yqSlLjbGBXc1TVWcBZAFNTU2PtQ9LS1eUIuKq2tq93AF9jcA73\n9nb6gPb1jjZ8K3DA0OYrW21X9ZU7qbOLOSRp4iYewEmelOQps8vAkcD1wAZg9kqGtcBFbXkDcGK7\nGmI1cHc7jXAJcGSSvduTb0cCl7R19yRZ3a5+OHHOvnY2hyRNXI9TEPsCX2tXhi0H/qSqvpnkSuCC\nJCcBPwHe2MZ/HTgGmAF+AbwVoKq2J/kYcGUb99Gq2t6W3wGcAzwB+Ea7AXxinjkkaeImHsBV9SPg\nRTup3wkcsZN6ASfPs6/1wPqd1KeB5y90Dknq4ZF0GZokLSkGsCR1YgBLUicGsCR1YgBLUicGsCR1\nYgBLUicGsCR1YgBLUicGsCR1YgBLUicGsCR1YgBLUicGsCR1YgBLUicGsCR10utDOfUw/J+PvmBi\nc/2DD103sbmkpcYjYEnqxACWpE4MYEnqxACWpE4MYEnqxACWpE4MYEnqxACWpE4MYEnqxACWpE4M\nYEnqxACWpE4MYEnqxACWpE4MYEnqxACWpE4MYEnqxE/E0G572edeNrG5vvd735vYXNKkeAQsSZ0Y\nwJLUiQEsSZ0syQBOsibJTUlmkpzaux9JS9OSexIuyTLg88BrgC3AlUk2VNUNfTvT7vrOb79iYnO9\n4rvfmdhcevRbikfAhwMzVfWjqvolcD5wbOeeJC1BqarePUxUkuOBNVX179r9twAvqapThsasA9a1\nu/8IuOlhTrsP8LOHuY9RsI9HVg9gH4+0HmA0ffysqtY81KAldwpiIarqLOCsUe0vyXRVTY1qf/bx\n6OjBPh55PUy6j6V4CmIrcMDQ/ZWtJkkTtRQD+EpgVZKDkuwBnABs6NyTpCVoyZ2CqKr7kpwCXAIs\nA9ZX1eYxTzuy0xkPk3382iOhB7CPYY+EHmCCfSy5J+Ek6ZFiKZ6CkKRHBANYkjoxgB+mJAckuTTJ\nDUk2J3lXqz8tycYkN7eve7d6kpzRXgZ9bZJDR9TH45NckeQHrY+PtPpBSS5v8325PfE4vN2/SlJJ\nRnLZzWL7SPL77bG7NsmmJP9wFH20fS9LcnWSi3fVw9D4kT4Wi+0jye8muS7JNUn+MsnBI+zhlqF9\nT7faRH9Hd6OP5yb5qyT3JnnvqHpo+17oz+Rx7f5MW3/gKPswgB+++4D3VNXBwGrg5PYP51RgU1Wt\nAja1+wBHA6vabR1w5oj6uBd4dVW9CDgEWJNkNfBJ4PSqeg6wAzhpdoMkTwHeBVw+oh52p4+rgamq\neiFwIfCfR9jLu4Abh+5P+rFYbB9/UlUvqKpDGDwOnxlxH6+qqkOGrnGd9O/oYvvYDrwT+C8jnh8W\n/jM5CdjR6qe3caNTVd5GeAMuYvA+EzcB+7XafsBNbfm/AW8aGv/AuBH28ETg+8BLGLyiZ3mrvxS4\nZGjcHwKvBb7NIARH/VgsqI+h8S8GvjeiuVcy+Mf8auBiID0ei8X2MbTdm4BvjLCPW4B95tQm/ju6\nmD6G1n8YeG+PnwmDq6Ve2paXt3EZVS8eAY9Q+/PkxQyOovatqtvaqp8C+7bl/YFbhzbb0mqjmH9Z\nkmuAO4CNwA+Bu6rqvrlztT8rD6iqPx/F3LvbxxwnAd8YURt/CPwH4Fft/tPn62Gcj8Vi+mi9nJzk\nhwyOgN85wj4K+IskV2XwUnvo8Du6yD7GZTE/kwcei7b+7jZ+JAzgEUnyZOArwLur6p7hdTX473Ps\n1/tV1f01+PN1JYM3HXruPL0+hsGft+/p2cecnt4MTAGferjzJ/kXwB1VddUCxo7tsVhMH7Oq6vNV\n9Wzg/cB/HGE7L6+qQxmcXjg5yW/PmXciv6O9+9idn8k4LbkXYoxDkscyCN8vVdVXW/n2JPtV1W1J\n9mNwNAgTeCl0Vd2V5FIGf0rtlWR5+997dq6nAM8Hvp0E4BnAhiSvq6rpCfYBQJJ/DnwQeEVV3TuC\nqV8GvC7JMcDjgT2Bz87Twzgfi8X0Mdf5jPDca1VtbV/vSPI1Bv8xTvx3dJF9jMNifyazj8WWJMuB\npwJ3jqoZj4Afpgz+1Z4N3FhVw0+abADWtuW1DM4Nz9ZPbM80rwbuHvrz6+H0sSLJXm35CQzOQ98I\nXAocP9xHVd1dVftU1YFVdSBwGTCS8F1MH23Mixmcc3xdVY3kH15VfaCqVrbv7QTgW1X1r3fWwzgf\ni8X0AZBk1dDmrwVufrg9tP0+qT3JSJInAUcC1zP539HF9jFyi/2ZzOnt+DZ+dEfoozqZvFRvwMsZ\n/Ml0LXBNux3D4DzRJgb/iP4X8LQ2PgzeEP6HwHWM6Akf4IUMrii4lsEv9Yda/VnAFcAM8GfA43ay\n7bd79dEem9uHHrsNI/75vBK4uMdjsdg+GByJbW6Pw6XA80Y097OAH7TbZuCDrT7p39HF9vEMBudj\n7wHuast7Tvhn8vh2f6atf9Yofy98KbIkdeIpCEnqxACWpE4MYEnqxACWpE4MYEnqxACW5kjy7iRP\nHNU4aT5ehibNkeQWBte+7vKjyRc6TpqPR8Ba0tqrs/48g/cvvj7JacAzgUvby6hJcmaS6Tz4/Y3f\nuZNxfzu03+OTnNOWf6ft+wdJvjvhb1GPYL4XhJa6NcDfVNVrAZI8FXgrg/esnT2y/WBVbU+yDNiU\n5IVVdUaS358zbj4fAo6qqq2zL9OWwCNg6TrgNUk+meSfVdXdOxnzxiTfZ/AS6+cBi/2kiu8B5yR5\nG4NP4pYAj4C1xFXV/27vB3wM8PEkm4bXJzkIeC/wT6pqRzut8Pj5dje0/MCYqvrdJC9h8AY7VyU5\nrKpG9o5a+s3lEbCWtCTPBH5RVf+dwXsRHwr8nMHbVMLg7Qr/Drg7yb4M3sd21vA4GLyt4j9u7zH8\n+qE5nl1Vl1fVh4BtPPitHrWEeQSspe4FwKeS/Ar4f8C/Z/D+xd9M8jdV9aokVwN/zeCTEb43tO1Z\nw+MYfJbZxQxCdhp4chv3qfZWk2Hwrl8/mMD3pd8AXoYmSZ14CkKSOjGAJakTA1iSOjGAJakTA1iS\nOjGAJakTA1iSOvn/yqE9LQkw+W4AAAAASUVORK5CYII=\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_freq_pd_df = (log_freq_df\n",
" .toPandas()\n",
" .sort_values(by=['log(count)'],\n",
" ascending=False))\n",
"sns.catplot(x='status', y='log(count)', data=log_freq_pd_df, \n",
" kind='bar', order=status_freq_pd_df['status'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzing Frequent Hosts\n",
"\n",
"Let's look at hosts that have accessed the server frequently. We will try to get the count of total accesses by each `host` and then sort by the counts and display only the top ten most frequent hosts."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-----+\n",
"|host |count|\n",
"+--------------------+-----+\n",
"|piweba3y.prodigy.com|21988|\n",
"|piweba4y.prodigy.com|16437|\n",
"|piweba1y.prodigy.com|12825|\n",
"|edams.ksc.nasa.gov |11964|\n",
"|163.206.89.4 |9697 |\n",
"|news.ti.com |8161 |\n",
"|www-d1.proxy.aol.com|8047 |\n",
"|alyssa.prodigy.com |8037 |\n",
"| |7660 |\n",
"|siltb10.orl.mmc.com |7573 |\n",
"+--------------------+-----+\n",
"\n"
]
}
],
"source": [
"host_sum_df =(logs_df\n",
" .groupBy('host')\n",
" .count()\n",
" .sort('count', ascending=False).limit(10))\n",
"\n",
"host_sum_df.show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"''"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"host_sum_pd_df = host_sum_df.toPandas()\n",
"host_sum_pd_df.iloc[8]['host']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like we have some empty strings as one of the top host names! This teaches us a valuable lesson to not just check for nulls but also potentially empty strings when data wrangling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Display the Top 20 Frequent EndPoints\n",
"\n",
"Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with our `logs_df` and group by the `endpoint` column, aggregate by count, and sort in descending order like the previous question."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"paths_df = (logs_df\n",
" .groupBy('endpoint')\n",
" .count()\n",
" .sort('count', ascending=False).limit(20))"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
endpoint
\n",
"
count
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
/images/NASA-logosmall.gif
\n",
"
208714
\n",
"
\n",
"
\n",
"
1
\n",
"
/images/KSC-logosmall.gif
\n",
"
164970
\n",
"
\n",
"
\n",
"
2
\n",
"
/images/MOSAIC-logosmall.gif
\n",
"
127908
\n",
"
\n",
"
\n",
"
3
\n",
"
/images/USA-logosmall.gif
\n",
"
127074
\n",
"
\n",
"
\n",
"
4
\n",
"
/images/WORLD-logosmall.gif
\n",
"
125925
\n",
"
\n",
"
\n",
"
5
\n",
"
/images/ksclogo-medium.gif
\n",
"
121572
\n",
"
\n",
"
\n",
"
6
\n",
"
/ksc.html
\n",
"
83909
\n",
"
\n",
"
\n",
"
7
\n",
"
/images/launch-logo.gif
\n",
"
76006
\n",
"
\n",
"
\n",
"
8
\n",
"
/history/apollo/images/apollo-logo1.gif
\n",
"
68896
\n",
"
\n",
"
\n",
"
9
\n",
"
/shuttle/countdown/
\n",
"
64736
\n",
"
\n",
"
\n",
"
10
\n",
"
/
\n",
"
63171
\n",
"
\n",
"
\n",
"
11
\n",
"
/images/ksclogosmall.gif
\n",
"
61393
\n",
"
\n",
"
\n",
"
12
\n",
"
/shuttle/missions/missions.html
\n",
"
47315
\n",
"
\n",
"
\n",
"
13
\n",
"
/images/launchmedium.gif
\n",
"
40687
\n",
"
\n",
"
\n",
"
14
\n",
"
/htbin/cdt_main.pl
\n",
"
39871
\n",
"
\n",
"
\n",
"
15
\n",
"
/shuttle/missions/sts-69/mission-sts-69.html
\n",
"
31574
\n",
"
\n",
"
\n",
"
16
\n",
"
/shuttle/countdown/liftoff.html
\n",
"
29865
\n",
"
\n",
"
\n",
"
17
\n",
"
/icons/menu.xbm
\n",
"
29190
\n",
"
\n",
"
\n",
"
18
\n",
"
/shuttle/missions/sts-69/sts-69-patch-small.gif
\n",
"
29118
\n",
"
\n",
"
\n",
"
19
\n",
"
/icons/blank.xbm
\n",
"
28852
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" endpoint count\n",
"0 /images/NASA-logosmall.gif 208714\n",
"1 /images/KSC-logosmall.gif 164970\n",
"2 /images/MOSAIC-logosmall.gif 127908\n",
"3 /images/USA-logosmall.gif 127074\n",
"4 /images/WORLD-logosmall.gif 125925\n",
"5 /images/ksclogo-medium.gif 121572\n",
"6 /ksc.html 83909\n",
"7 /images/launch-logo.gif 76006\n",
"8 /history/apollo/images/apollo-logo1.gif 68896\n",
"9 /shuttle/countdown/ 64736\n",
"10 / 63171\n",
"11 /images/ksclogosmall.gif 61393\n",
"12 /shuttle/missions/missions.html 47315\n",
"13 /images/launchmedium.gif 40687\n",
"14 /htbin/cdt_main.pl 39871\n",
"15 /shuttle/missions/sts-69/mission-sts-69.html 31574\n",
"16 /shuttle/countdown/liftoff.html 29865\n",
"17 /icons/menu.xbm 29190\n",
"18 /shuttle/missions/sts-69/sts-69-patch-small.gif 29118\n",
"19 /icons/blank.xbm 28852"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paths_pd_df = paths_df.toPandas()\n",
"paths_pd_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Top Ten Error Endpoints\n",
"\n",
"What are the top ten endpoints requested which did not have return code 200 (HTTP Status OK)? \n",
"\n",
"We create a sorted list containing the endpoints and the number of times that they were accessed with a non-200 return code and show the top ten."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"not200_df = (logs_df\n",
" .filter(logs_df['status'] != 200))\n",
"\n",
"error_endpoints_freq_df = (not200_df\n",
" .groupBy('endpoint')\n",
" .count()\n",
" .sort('count', ascending=False)\n",
" .limit(10)\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------------------------------+-----+\n",
"|endpoint |count|\n",
"+---------------------------------------+-----+\n",
"|/images/NASA-logosmall.gif |40082|\n",
"|/images/KSC-logosmall.gif |23763|\n",
"|/images/MOSAIC-logosmall.gif |15245|\n",
"|/images/USA-logosmall.gif |15142|\n",
"|/images/WORLD-logosmall.gif |14773|\n",
"|/images/ksclogo-medium.gif |13559|\n",
"|/images/launch-logo.gif |8806 |\n",
"|/history/apollo/images/apollo-logo1.gif|7489 |\n",
"|/ |6296 |\n",
"|/images/ksclogosmall.gif |5669 |\n",
"+---------------------------------------+-----+\n",
"\n"
]
}
],
"source": [
"error_endpoints_freq_df.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Total number of Unique Hosts\n",
"\n",
"What were the total number of unique hosts who visited the NASA website in these two months? We can find this out with a few transformations."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"137933"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unique_host_count = (logs_df\n",
" .select('host')\n",
" .distinct()\n",
" .count())\n",
"unique_host_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of Unique Daily Hosts\n",
"\n",
"For an advanced example, let's look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts. \n",
"\n",
"We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day. \n",
"\n",
"Think about the steps that you need to perform to count the number of different hosts that make requests *each* day.\n",
"*Since the log only covers a single month, you can ignore the month.* You may want to use the [`dayofmonth` function](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.dayofmonth) in the `pyspark.sql.functions` module (which we have already imported as __`F`__.\n",
"\n",
"\n",
"**`host_day_df`**\n",
"\n",
"A DataFrame with two columns\n",
"\n",
"| column | explanation |\n",
"| ------ | -------------------- |\n",
"| `host` | the host name |\n",
"| `day` | the day of the month |\n",
"\n",
"There will be one row in this DataFrame for each row in `logs_df`. Essentially, we are just transforming each row of `logs_df`. For example, for this row in `logs_df`:\n",
"\n",
"```\n",
"unicomp6.unicomp.net - - [01/Aug/1995:00:35:41 -0400] \"GET /shuttle/missions/sts-73/news HTTP/1.0\" 302 -\n",
"```\n",
"\n",
"your `host_day_df` should have:\n",
"\n",
"```\n",
"unicomp6.unicomp.net 1\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+---+\n",
"|host |day|\n",
"+--------------------+---+\n",
"|199.72.81.55 |1 |\n",
"|unicomp6.unicomp.net|1 |\n",
"|199.120.110.21 |1 |\n",
"|burger.letters.com |1 |\n",
"|199.120.110.21 |1 |\n",
"+--------------------+---+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"host_day_df = logs_df.select(logs_df.host, \n",
" F.dayofmonth('time').alias('day'))\n",
"host_day_df.show(5, truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`host_day_distinct_df`**\n",
"\n",
"This DataFrame has the same columns as `host_day_df`, but with duplicate (`day`, `host`) rows removed."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------+---+\n",
"|host |day|\n",
"+-----------------------+---+\n",
"|129.94.144.152 |1 |\n",
"|slip1.yab.com |1 |\n",
"|205.184.190.47 |1 |\n",
"|204.120.34.71 |1 |\n",
"|ppp3_130.bekkoame.or.jp|1 |\n",
"+-----------------------+---+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"host_day_distinct_df = (host_day_df\n",
" .dropDuplicates())\n",
"host_day_distinct_df.show(5, truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`daily_unique_hosts_df`**\n",
"\n",
"A DataFrame with two columns:\n",
"\n",
"| column | explanation |\n",
"| ------- | -------------------------------------------------- |\n",
"| `day` | the day of the month |\n",
"| `count` | the number of unique requesting hosts for that day |"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"c = sns.catplot(x='day', y='count', \n",
" data=daily_hosts_df, \n",
" kind='point', height=5, \n",
" aspect=1.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Average Number of Daily Requests per Host\n",
"\n",
"In the previous example, we looked at a way to determine the number of unique hosts in the entire log on a day-by-day basis. Let's now try and find the average number of requests being made per Host to the NASA website per day based on our logs. \n",
"\n",
"We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of average requests made for that day per Host. "
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"