{ "metadata": { "name": "", "signature": "sha256:69aa86ddfeccd1eaf92568cbe7fbcb689cb5edebd005c952253268251f5e9a91" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Blaze - Fast on-disk queries with Pandas + BColz \n", "\n", "\n", "\n", "Yesterday I showed how Blaze uses Pandas to query collections of large CSV files. ([notebook here](http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb)). If you're able to drive Pandas intelligently then we can query a 16GB dataset in around 3 minutes. This was about as fast as Postgres *after* the CSV files had been loaded in (a process that takes about 20 minutes if done right.)\n", "\n", "The fact that Pandas+CSV is competitive with Postgres is startling because CSV is a very low-tech and expensive storage solution. If we replace CSV with a faster solution then Pandas is able to really shine. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BColz\n", "\n", "[BColz](http://bcolz.blosc.org/) is an on-disk binary format (like HDF5) that stores columns separately for faster access. It has a straightforward data model and interacts well with NumPy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Times\n", "\n", "We compare the performance of our query between a stack of CSV files and a BColz array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV - 3 minutes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from blaze import *" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "d = Data('trip_data_*.csv')\n", "\n", "expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(), \n", " count=d.passenger_count.count())\n", "\n", "%time _ = compute(expr)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 2min 12s, sys: 14 s, total: 2min 26s\n", "Wall time: 3min 10s\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BColz - 30 seconds" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d = Data('trip_data.bcolz')\n", "\n", "expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(), \n", " count=d.passenger_count.count())\n", "\n", "%time _ = compute(expr)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 16.7 s, sys: 3.91 s, total: 20.6 s\n", "Wall time: 20.7 s\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disclaimer: This may have benefitted from OS-level caching. These times range from 20s to 40s." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BColz + Multiprocessing - 10 seconds" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import multiprocessing\n", "pool = multiprocessing.Pool(4)\n", "\n", "%time _ = compute(expr, map=pool.map)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 169 ms, sys: 13.2 ms, total: 182 ms\n", "Wall time: 7.26 s\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How did we migrate data from CSV to BColz?\n", "\n", "The time difference above strongly motivates the use of efficient binary storage systems. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, migrating data between formats is usually hard enough that everyday users don't bother.\n", "\n", "The [`into`](https://github.com/ContinuumIO/blaze) project eases this process. In principle `into` lets us write something as naive as the following:\n", "\n", "    into('trip.bcolz', 'trip_data_*.csv')\n", "\n", "`into` works for a variety of formats, not just CSV and BColz.\n", "\n", "## Better performance with fixed-length strings\n", "\n", "We can often improve performance by specifying more information. In our case the text columns in the dataset happen to have fixed lengths (they are just hash values). By specifying these lengths explicitly we help BColz store and retrieve the data more effectively. We provide this extra information with a `datashape` that spells out the column types exactly.\n", "\n", "In just a moment we'll talk about how to automatically detect datashapes." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ds = dshape(\"\"\"var * {\n", "    medallion: string[32, 'ascii'],\n", "    hack_license: string[32, 'ascii'],\n", "    vendor_id: string[3, 'ascii'],\n", "    rate_code: int32,\n", "    store_and_fwd_flag: string[1, 'ascii'],\n", "    pickup_datetime: datetime,\n", "    dropoff_datetime: datetime,\n", "    passenger_count: int32,\n", "    trip_time_in_secs: int32,\n", "    trip_distance: float64,\n", "    pickup_longitude: float64,\n", "    pickup_latitude: float64,\n", "    dropoff_longitude: float64,\n", "    dropoff_latitude: float64\n", "    }\"\"\")\n", "\n", "# Drop the old version of the BColz dataset\n", "drop('trip_data.bcolz')\n", "\n", "# Migrate the CSV files into a fresh BColz dataset with the explicit datashape\n", "%time into('trip_data.bcolz', 'trip_data_*.csv', dshape=ds)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 16min 16s, sys: 1min 54s, total: 18min 10s\n", "Wall time: 19min 15s\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "ctable((173179759,), [('medallion', 'S32'), ('hack_license', 'S32'), ('vendor_id', 'S3'), ('rate_code', '