{ "metadata": { "name": "", "signature": "sha256:8a9a6d9cc2559bad7354c8681c31812d79f65d24cbedb905ce007ffa6e7c9c15" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HTTP/1.1 and HTTP/2: A Performance Comparison for Python\n", "\n", "If you don't pay any attention to my Twitter feed, you might have missed the fact that I have spent the last few months working on a client-side HTTP/2 stack for Python, called [hyper](http://hyper.rtfd.org/en/latest/). This project has been a lot of fun, and a gigantic amount of work, but has finally begun to reach a stage where some of the more crass bugs have been worked out.\n", "\n", "For this reason, I think it's time to begin analysing the relative performance of HTTP/1.1 and HTTP/2 in some example use-cases, to get an idea of where things stand.\n", "\n", "Like any good scientist, I don't want to just dive in and explore: I first want to establish what I expect to see. These expectations come from two places: familiarity with `hyper`, and familiarity with HTTP in general.\n", "\n", "My expectation is that `hyper` is, in its current form, going to compare to the standard Python HTTP stack as follows:\n", "\n", "- `hyper` will be more CPU intensive\n", "- `hyper` will be slower\n", "- `hyper` will increase the amount of data sent on the network for workloads involving a _small_ number of HTTP requests\n", "- `hyper` will decrease the amount of data sent on the network for workloads involving a _large_ number of HTTP requests\n", "\n", "This is for the following reasons. Firstly, `hyper` will consume more CPU because it has substantially more work to do than a standard HTTP stack. `hyper` needs to process each HTTP/2 frame (of which there will be at least 4 per request-response cycle), burning CPU all the while to do so. Conversely, the standard HTTP/1.1 stack in Python can do relatively little work, reading headers line-by-line and then the body in one go, requiring almost no transformation between wire format and in-memory representation.\n", "\n", "Secondly, `hyper` will be slower because it has to cross from user-space to kernel-space and back again twice per frame read. This is because `hyper` needs to read 8 bytes from the wire (to find out the frame length), followed by the data for the frame itself. This context-switching is expensive, and not something that needs to be done in quite the same way for HTTP.\n", "\n", "For workloads involving a small number of requests, HTTP/2 does not provide particular bandwidth savings or improve network efficiency. The bandwidth savings provided by HTTP/2 come from header compression, which is at its most effective when sending and receiving multiple requests/responses with very similar headers. For small numbers of requests, this provides little saving. The network efficiency savings come from having long-lived TCP connections resize their connection window appropriately, but this benefit will be lost when sending relatively small numbers of requests. As the cherry on top of this cake, there's some additional HTTP/2 overhead in the form of framing and window management which will lead to HTTP/2 needing to send more bytes than HTTP/1.1 did.\n", "\n", "HTTP/2's major win _should_ be in the area of workloads with large numbers of requests. Here, HTTP/2's header compression and long-lived connections should be expected to provide savings in network usage.\n", "\n", "These are my expectations. 
"Let's dive in and see what we can see.\n", "\n", "## The Set Up\n", "\n", "First, I need to install `hyper`. Because of some ongoing issues regarding upstream dependencies, I will be running this test in Python 3.4 using the `h2-10` branch of `hyper` (which, despite its name, implements the h2-12 implementation draft of HTTP/2). As such, I went away and installed that branch using `pip`.\n", "\n", "Let's confirm that `hyper` is installed and functioning by importing it and sending a test query to Twitter, who have an HTTP/2 implementation running on their servers." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import hyper\n", "c = hyper.HTTP20Connection('twitter.com')\n", "c.request('GET', '/')\n", "r = c.getresponse()\n", "print(r.status)\n", "r.close()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "200\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If all's gone well, we should print a `200` status code. My machine is correctly installed, so that works out just fine for me. Those of you who haven't seen `hyper` before might be confused by the bizarre API. This API is, weirdly, _intentionally_ bad: it's effectively a drop-in replacement for the standard library's venerable [httplib/http.client](https://docs.python.org/3/library/http.client.html) module. That design decision makes it possible for people to implement abstraction layers that correctly use HTTP/2 or HTTP/1.1 as appropriate. `hyper` is expected to grow such an abstraction layer at some point, when I find more time to work on it.\n", "\n", "Alright, now that we know `hyper` is working, let's just confirm that we can do some of the same nonsense using `http.client`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import http.client as http\n", "c = http.HTTPSConnection('twitter.com')\n", "c.request('GET', '/')\n", "r = c.getresponse()\n", "print(r.status)\n", "r.close()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "200\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we should see the same `200` status code. This means we're set up and ready to start comparing.\n", "\n", "## Part 1: Comparing `hyper` to `http.client`\n", "\n", "Let's begin by doing some simple timing of a single request/response cycle. To try to be fair, we'll force both libraries to read the entire response from the network. Our plan is simply to see which one is faster.\n", "\n", "First, let's whip up a quick utility for timing stuff." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import time\n", "\n", "# A small context manager that records the wall-clock time of its block.\n", "class Timer(object):\n", "    def __init__(self):\n", "        self.start = None\n", "        self.end = None\n", "        self.interval = None\n", "\n", "    def __enter__(self):\n", "        self.start = time.time()\n", "        return self\n", "\n", "    def __exit__(self, *args):\n", "        self.end = time.time()\n", "        self.interval = self.end - self.start" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get started. Fastest to read Twitter's homepage wins."
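, "\n", "\n", "(One quick aside before the race: here is a minimal usage sketch of the `Timer` helper above, just to show how it is meant to be driven. It isn't part of the benchmark itself, and the exact number it prints will vary from run to run, since `time.time` is not a high-precision clock.)\n", "\n", "```python\n", "import time\n", "\n", "# Time a known delay to check that Timer measures roughly what we expect.\n", "with Timer() as t:\n", "    time.sleep(0.25)\n", "\n", "print('Measured {:.3f}s for a 0.25s sleep'.format(t.interval))\n", "```"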
] }, { "cell_type": "code", "collapsed": false, "input": [ "c1 = http.HTTPSConnection('twitter.com')\n", "c2 = hyper.HTTP20Connection('twitter.com')\n", "\n", "with Timer() as t1:\n", " c1.request('GET', '/')\n", " r1 = c1.getresponse()\n", " d1 = r1.read()\n", " \n", "with Timer() as t2:\n", " c2.request('GET', '/')\n", " r2 = c2.getresponse()\n", " d2 = r2.read()\n", " \n", "c1.close()\n", "c2.close()\n", "\n", "print(\"HTTP/1.1 total time: {:.3f}\".format(t1.interval))\n", "print(\"HTTP/2 total time: {:.3f}\".format(t2.interval))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "HTTP/1.1 total time: 0.681\n", "HTTP/2 total time: 0.796\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, this matches roughly what I was expecting: at the scope of a single request, HTTP/2 is slower. This isn't really a representative HTTP request though, because it contains almost no headers. Let's put those in as well, using the ones that Requests will normally send." ] }, { "cell_type": "code", "collapsed": false, "input": [ "headers = {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.2.1 CPython/3.4.1 Windows/7'}\n", "\n", "c1 = http.HTTPSConnection('twitter.com')\n", "c2 = hyper.HTTP20Connection('twitter.com')\n", "\n", "with Timer() as t1:\n", " c1.request('GET', '/', headers=headers)\n", " r1 = c1.getresponse()\n", " d1 = r1.read()\n", " \n", "with Timer() as t2:\n", " c2.request('GET', '/', headers=headers)\n", " r2 = c2.getresponse()\n", " d2 = r2.read()\n", " \n", "c1.close()\n", "c2.close()\n", "\n", "print(\"HTTP/1.1 total time: {:.3f}\".format(t1.interval))\n", "print(\"HTTP/2 total time: {:.3f}\".format(t2.interval))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "HTTP/1.1 total time: 0.554\n", "HTTP/2 total time: 0.828\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "No huge difference, but now we're a bit closer to something approaching reality.\n", "\n", "Let's now look at something approaching a real workload: spidering. Suppose you were interested in spidering the entirety of the [nghttp2](https://nghttp2.org/) website. A simple spider might work by opening the home page and downloading it, then looking for anything that looks like another nghttp2.org URL. To avoid infinite loops, a small set of visited pages will be kept.\n", "\n", "Let's do this in HTTP/1.1 first. Naively, we might use a single HTTP connection. This limits us to serially scraping the pages: each URL needs to be accessed one at a time. Below is a sample implementation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import collections\n", "import re\n", "import itertools\n", "\n", "ABSOLUTE_URL_RE = re.compile(b'