{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Nginx log analysis with pandas and matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By Jess Johnson [http://grokcode.com](http://grokcode.com)\n", "\n", "This notebook analyzes Nginx access logs in order to estimate the capacity needed to survive a traffic spike. I'm looking at access logs for [Author Alcove](http://authoralcove.com) which was hit by a big traffic spike when it spent around 24 hours at the top of [/r/books](http://reddit.com/r/books/). The site was hugged to death by reddit. Visitors experienced very slow load times, and many people couldn't access the site at all due to 50x errors. So let's estimate how much extra capacity would be needed to survive this spike.\n", "\n", "The source for this notebook is located on [github](https://github.com/grokcode/ipython-notebooks). See a mistake? Pull requests welcome.\n", "\n", "Thanks to Nikolay Koldunov for his [notebook on Apache log analysis](http://nbviewer.ipython.org/github/koldunovn/nk_public_notebooks/blob/master/Apache_log.ipynb), and thanks to my bro Aaron for the much needed optimism and server optimization tips while everything was on fire.\n", "\n", "OK let's get started." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Setup" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import the usual suspects." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sys\n", "import matplotlib.pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also use [apachelog](https://code.google.com/p/apachelog/), a module for parsing Apache logs that works fine with Nginx logs as long as we give it the right format string. You can install it with `pip install apachelog`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import apachelog" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Parsing the log" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I started out by doing some command-line preprocessing on the log to remove bots. I used `egrep -v` to filter out the bots that were hitting the site most often: Googlebot, Bingbot, the New Relic uptime checker, the Baidu spider, and a few others. A more careful approach would filter out everything on one of the known bot lists ([like this one](http://www.robotstxt.org/db.html)), but I'm going to play it a bit fast and loose.\n", "\n", "First, let's get a sample line out of `access.log` and try to parse it. Here is a description of the codes in the log format we are working with:\n", "\n", " %h - remote host (i.e. the client IP)\n", " %l - identity of the user determined by identd (not usually used since not reliable)\n", " %u - user name determined by HTTP authentication\n", " %t - time the server finished processing the request.\n", " %r - request line from the client. 
(\"GET / HTTP/1.0\")\n", " %>s - status code sent from the server to the client (200, 404 etc.)\n", " %b - size of the response to the client (in bytes)\n", " %i - Referer is the page that linked to this URL.\n", " User-agent - the browser identification string\n", " %V - the server name according to the UseCanonicalName setting" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample_string = '178.137.91.215 - - [21/Feb/2014:06:44:53 +0000] \"GET /work/homepages-maths-year-6/ HTTP/1.0\" \\\n", "200 10427 \"http://authoralcove.com/work/homepages-maths-year-6/\" \"Opera/9.80 (Windows NT 6.1; WOW64; U; ru) \\\n", "Presto/2.10.289 Version/12.00\" \"-\"'\n", "nformat = r'%h %l %u %t \\\"%r\\\" %>s %b \\\"%i\\\" \\\"%{User-Agent}i\\\" \\\"%V\\\"'\n", "p = apachelog.parser(nformat)\n", "data = p.parse(sample_string)\n", "data" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "{'%>s': '200',\n", " '%V': '-',\n", " '%b': '10427',\n", " '%h': '178.137.91.215',\n", " '%i': 'http://authoralcove.com/work/homepages-maths-year-6/',\n", " '%l': '-',\n", " '%r': 'GET /work/homepages-maths-year-6/ HTTP/1.0',\n", " '%t': '[21/Feb/2014:06:44:53 +0000]',\n", " '%u': '-',\n", " '%{User-Agent}i': 'Opera/9.80 (Windows NT 6.1; WOW64; U; ru) Presto/2.10.289 Version/12.00'}" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's parse each line while preparing the access time so that pandas will be able to handle it." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from apachelog import ApacheLogParserError\n", "log_list = []\n", "with open('./private-data/access.log') as f:\n", "    for line in f:\n", "        try:\n", "            data = p.parse(line)\n", "        except ApacheLogParserError:\n", "            sys.stderr.write(\"Unable to parse %s\" % line)\n", "            # Skip this line so the previous line's data is not appended again\n", "            continue\n", "        # Turn '[21/Feb/2014:06:44:53 +0000]' into '21/Feb/2014 06:44:53 +0000'\n", "        data['%t'] = data['%t'][1:12]+' '+data['%t'][13:21]+' '+data['%t'][22:27]\n", "        log_list.append(data)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading into pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas import Series, DataFrame, Panel\n", "df = DataFrame(log_list)\n", "df[0:2]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | %>s | %V | %b | %h | %i | %l | %r | %t | %u | %{User-Agent}i |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200 | - | 10427 | 178.137.91.215 | http://authoralcove.com/work/homepages-maths-y... | - | GET /work/homepages-maths-year-6/ HTTP/1.0 | 21/Feb/2014 06:44:53 +0000 | - | Opera/9.80 (Windows NT 6.1; WOW64; U; ru) Pres... |
| 1 | 200 | - | 7507 | 202.46.54.40 | - | - | GET / HTTP/1.1 | 21/Feb/2014 06:53:47 +0000 | - | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... |

2 rows × 10 columns

| | %>s | %b | %r | %t |
|---|---|---|---|---|
| 0 | 200 | 10427 | GET /work/homepages-maths-year-6/ HTTP/1.0 | 21/Feb/2014 06:44:53 +0000 |
| 1 | 200 | 7507 | GET / HTTP/1.1 | 21/Feb/2014 06:53:47 +0000 |

2 rows × 4 columns

| | Status | b | Request | Time |
|---|---|---|---|---|
| 0 | 200 | 10427 | GET /work/homepages-maths-year-6/ HTTP/1.0 | 21/Feb/2014 06:44:53 +0000 |
| 1 | 200 | 7507 | GET / HTTP/1.1 | 21/Feb/2014 06:53:47 +0000 |

2 rows × 4 columns

| Time | Bad Gateway | Client Closed | Found | Gateway Timeout | Moved Permanently | Not Found | Not Modified | OK |
|---|---|---|---|---|---|---|---|---|
| 2014-02-21 06:00:00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 2014-02-21 07:00:00 | 0 | 0 | 3 | 0 | 0 | 4 | 1 | 29 |
| 2014-02-21 08:00:00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2014-02-21 09:00:00 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 3 |
| 2014-02-21 10:00:00 | 0 | 6 | 0 | 3 | 0 | 0 | 0 | 2 |

5 rows × 8 columns

| Time | Status | b | Request |
|---|---|---|---|
| 2014-02-21 06:44:53 | 200 | 0.009944 | GET /work/homepages-maths-year-6/ HTTP/1.0 |
| 2014-02-21 06:53:47 | 200 | 0.007159 | GET / HTTP/1.1 |
| 2014-02-21 06:54:22 | 200 | 0.007159 | GET / HTTP/1.1 |
| 2014-02-21 07:01:28 | 200 | 0.002501 | GET / HTTP/1.1 |
| 2014-02-21 07:02:11 | 200 | 0.004643 | GET /accounts/create-account/ HTTP/1.1 |
| 2014-02-21 07:03:06 | 302 | 0.000000 | POST /accounts/create-account/ HTTP/1.1 |
| 2014-02-21 07:03:07 | 200 | 0.003963 | GET /work/lists/most-popular/ HTTP/1.1 |
| 2014-02-21 07:03:41 | 200 | 0.001887 | GET /work/lists/most-popular/?page=2&querystri... |
| 2014-02-21 07:03:48 | 200 | 0.002122 | GET /work/lists/most-popular/?page=3&querystri... |
| 2014-02-21 07:03:55 | 200 | 0.002145 | GET /work/lists/most-popular/?page=4&querystri... |

10 rows × 3 columns