{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Comparing speedups when using compression with blz barrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we show that, in some cases, it is faster to work with compressed data, because disk speed is the limiting factor: reading fewer bytes from disk and decompressing them can take less time than reading the uncompressed data.\n", "\n", "To show this, we generate a 30000x20000 Mandelbrot fractal (~12 minutes), store it in a blz barray without compression, and then copy that barray to new ones using different compressors and compression levels. We then downsample each compressed copy and measure how long it takes (~30 minutes for the whole benchmark)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numba\n", "import numpy as np\n", "import pylab as plt\n", "from time import time\n", "import blz\n", "from shutil import rmtree\n", "import csv\n", "from math import sqrt\n", "from collections import defaultdict\n", "import pandas" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code used here is discussed in the \"Generating huge Mandelbrot's fractals\" notebook." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "@numba.njit\n", "def mandel(x, y, max_iters):\n", "    \"\"\"\n", "    Given the real and imaginary parts of a complex number,\n", "    determine if it is a candidate for membership in the Mandelbrot\n", "    set given a fixed number of iterations.\n", "    \"\"\"\n", "    c = complex(x, y)\n", "    z = 0.0j\n", "    for i in xrange(max_iters):\n", "        z = z*z + c\n", "        if (z.real*z.real + z.imag*z.imag) >= 4:\n", "            return i\n", "\n", "    return max_iters\n", "\n", "\n", "def create_fractal(height, width, min_x, max_x, min_y, max_y, image, row, iters):\n", "\n", "    pixel_size_x = (max_x - min_x) / width\n", "    pixel_size_y = (max_y - min_y) / height\n", "\n", "    for x in xrange(height):\n", "\n", "        imag = min_y + x * pixel_size_y\n", "\n", "        for y in xrange(width):\n", "\n", "            real = min_x + y * pixel_size_x\n", "            color = mandel(real, imag, iters)\n", "            row[y] = color\n", "\n", "        # Append the finished row to the barray\n", "        image.append(row)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "height = 20000\n", "width = 30000\n", "\n", "# If the blz already exists, remove it\n", "rmtree('images/Mandelbrot.blz', ignore_errors=True)\n", "\n", "image = blz.zeros((0, width), rootdir='images/Mandelbrot.blz', dtype=np.uint8,\n", "                  expectedlen=height*width,\n", "                  bparams=blz.bparams(clevel=0))\n", "row = np.zeros((width), dtype=np.uint8)\n", "\n", "t1 = time()\n", "create_fractal(height, width, -2.0, 1.0, -1.0, 1.0, image, row, 20)\n", "t2 = time()\n", "\n", "image.flush()\n", "\n", "print t2-t1" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "697.405315161\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the big fractal in images/Mandelbrot.blz; it takes ~577 MiB of disk space. Let's run some benchmarks." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function creates a new blz barray from an existing one, using new compression parameters." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def copy(src, dest, clevel=5, shuffle=True, cname=\"blosclz\"):\n", "\n", "    \"\"\"\n", "    Copy the barray stored at `src` to `dest` with new compression parameters.\n", "\n", "    Parameters\n", "    ----------\n", "    src : string\n", "        Root directory of the existing barray.\n", "    dest : string\n", "        Root directory of the new barray.\n", "    clevel : int (0 <= clevel < 10)\n", "        The compression level.\n", "    shuffle : bool\n", "        Whether the shuffle filter is active or not.\n", "    cname : string ('blosclz', 'lz4hc', 'snappy', 'zlib', others?)\n", "        Select the compressor to use inside Blosc.\n", "\n", "    \"\"\"\n", "\n", "    src = blz.barray(rootdir=src)\n", "\n", "    img_copied = src.copy(rootdir=dest, bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname), expectedlen=src.size)\n", "    img_copied.flush()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code is used to downsample all the images; it is discussed in the \"Numba and blz\" notebook." ] }, { "cell_type": "code", "collapsed": false, "input": [ "@numba.njit\n", "def mymean(src, p0, p1, y):\n", "\n", "    factor = 1/(1. 
* p0 * p1)\n", "\n", "    for i in range(y.shape[0]):\n", "        for j in range(y.shape[1]):\n", "            s = 0.\n", "            for k in range(p0):\n", "                for l in range(p1):\n", "                    s += src[(p0*i)+k, (p1*j)+l] * factor\n", "\n", "            y[i, j] = s\n", "\n", "\n", "def downsample(orig, down_cell, cache_size=2**21):\n", "\n", "    c0, c1 = down_cell\n", "\n", "    # Side length n of a square block of pixels that fits in cache_size bytes\n", "    pixel_size = orig[0, 0].nbytes\n", "    n = int(round(sqrt(cache_size/pixel_size), 0))\n", "\n", "    # How many complete blocks fit horizontally and vertically?\n", "    hor = int(orig.shape[1]) // n\n", "    ver = int(orig.shape[0]) // n\n", "\n", "    # Complete matrix dimensions (integers, so they can be used as shapes and indices)\n", "    submatrix_n = int(round(n/float(c0)))\n", "    submatrix_center_shape = (submatrix_n, submatrix_n)\n", "    submatrix_center = np.empty(submatrix_center_shape, dtype=orig.dtype)\n", "\n", "    # Bottom border matrix dimensions\n", "    ver_px = (int(orig.shape[0]) % n) // c0\n", "    submatrix_bottom_shape = (ver_px, submatrix_n)\n", "    submatrix_bottom = np.empty(submatrix_bottom_shape, dtype=orig.dtype)\n", "\n", "    # Right border matrix dimensions\n", "    hor_px = (int(orig.shape[1]) % n) // c1\n", "    submatrix_right_shape = (submatrix_n, hor_px)\n", "    submatrix_right = np.empty(submatrix_right_shape, dtype=orig.dtype)\n", "\n", "    # Corner matrix dimensions\n", "    submatrix_corner_shape = (ver_px, hor_px)\n", "    submatrix_corner = np.empty(submatrix_corner_shape, dtype=orig.dtype)\n", "\n", "    # We build the final container\n", "    final_shape = (submatrix_n * ver + ver_px,\n", "                   submatrix_n * hor + hor_px)\n", "\n", "    final = np.empty(final_shape, orig.dtype)\n", "\n", "    # Downsample the middle of the image\n", "    for i in xrange(ver):\n", "        for j in xrange(hor):\n", "\n", "            # Get the optimal matrix\n", "            submatrix = orig[i*n:(i+1)*n, j*n:(j+1)*n]\n", "            mymean(submatrix, c0, c1, submatrix_center)\n", "            final[i*submatrix_n:(i+1)*submatrix_n,\n", "                  j*submatrix_n:(j+1)*submatrix_n] = submatrix_center\n", "\n", "    # Downsample the right border\n", "    for i in range(ver):\n", "\n", "        
submatrix = orig[i*n:(i+1)*n, hor*n:]\n", "        mymean(submatrix, c0, c1, submatrix_right)\n", "        final[i * submatrix_n:(i+1)*submatrix_n,\n", "              submatrix_n*hor:] = submatrix_right\n", "\n", "    # Downsample the bottom border\n", "    for j in range(hor):\n", "\n", "        submatrix = orig[ver*n:, j*n:(j+1)*n]\n", "        mymean(submatrix, c0, c1, submatrix_bottom)\n", "        final[submatrix_n*ver:,\n", "              j * submatrix_n:(j+1)*submatrix_n] = submatrix_bottom\n", "\n", "    # Downsample the corner\n", "    submatrix = orig[n*ver:, n*hor:]\n", "    mymean(submatrix, c0, c1, submatrix_corner)\n", "    final[submatrix_n*ver:, submatrix_n*hor:] = submatrix_corner\n", "\n", "    return final" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have all the ingredients, let's build the benchmark code. (Note: this code saves the benchmark results to .csv files.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def benchmark(src, cmethods):\n", "\n", "    for method in cmethods:\n", "\n", "        myfile = open('csv/' + method + '.csv', 'wb')\n", "        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)\n", "        wr.writerow(['Compression level', 'Compressed size', 'Compression ratio', 'Compression time', 'Downsampling time'])\n", "\n", "        for compression_level in xrange(0,10):\n", "\n", "            # Copy the original image with the given compressor and level (shuffle disabled)\n", "            tc1 = time()\n", "            copy(src, 'images/temp.blz', compression_level, False, method)\n", "            tc2 = time()\n", "\n", "            # Get the compressed size in bytes\n", "            img = blz.barray(rootdir='images/temp.blz')\n", "            disk_size = img.cbytes\n", "\n", "            # Downsample the compressed copy and measure the time\n", "            t1 = time()\n", "            downsample(img, (4,4))\n", "            t2 = time()\n", "\n", "            # Store the compression level, compressed size, compression ratio and both timings\n", "            row = [compression_level, disk_size, round(img.nbytes/float(disk_size),3), str(tc2 - tc1), str(t2 - t1)]\n", "\n", "            # Add it to the csv\n", "            wr.writerow(row)\n", "            
rmtree('images/temp.blz', ignore_errors=True)\n", "\n", "        myfile.close()\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's see how well the downsample function behaves when working with compressed blz barrays." ] }, { "cell_type": "code", "collapsed": false, "input": [ "src = 'images/Mandelbrot.blz'\n", "cmethods = ['blosclz', 'lz4hc', 'snappy', 'zlib']\n", "\n", "t1 = time()\n", "benchmark(src, cmethods)\n", "t2 = time()\n", "\n", "print t2-t1" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1625.42109704\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we should have four CSV files with the relevant data. We first load them into Python dictionaries so that we can plot them easily." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_dict(filename):\n", "\n", "    columns = defaultdict(list)\n", "    with open('csv/' + filename) as f:\n", "        reader = csv.reader(f)\n", "        reader.next()  # skip the header row\n", "        for row in reader:\n", "            for (i,v) in enumerate(row):\n", "                if i == 1:\n", "                    # 131072 = 2**17: converts the size from bytes to mebibits\n", "                    columns[i].append(float(v)/131072)\n", "                    continue\n", "                columns[i].append(v)\n", "    return columns\n", "\n", "blosclz = get_dict('blosclz.csv')\n", "lz4hc = get_dict('lz4hc.csv')\n", "snappy = get_dict('snappy.csv')\n", "zlib = get_dict('zlib.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the raw data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'blosclz data'\n", "df = pandas.read_csv('csv/blosclz.csv')\n", "df\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "blosclz data\n" ] }, { "html": [ "
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Compression level</th><th>Compressed size</th><th>Compression ratio</th><th>Compression time</th><th>Downsampling time</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>600864672</td><td>0.999</td><td>2.782878</td><td>28.031901</td></tr>\n",
"<tr><th>1</th><td>1</td><td>12165281</td><td>49.321</td><td>1.763561</td><td>22.019383</td></tr>\n",
"<tr><th>2</th><td>2</td><td>11202146</td><td>53.561</td><td>1.769053</td><td>23.470012</td></tr>\n",
"<tr><th>3</th><td>3</td><td>11202146</td><td>53.561</td><td>1.822704</td><td>23.755093</td></tr>\n",
"<tr><th>4</th><td>4</td><td>10853311</td><td>55.283</td><td>1.826486</td><td>23.889582</td></tr>\n",
"<tr><th>5</th><td>5</td><td>10853311</td><td>55.283</td><td>1.836225</td><td>23.851175</td></tr>\n",
"<tr><th>6</th><td>6</td><td>10765857</td><td>55.732</td><td>1.852662</td><td>23.982295</td></tr>\n",
"<tr><th>7</th><td>7</td><td>10805864</td><td>55.525</td><td>1.803914</td><td>24.268449</td></tr>\n",
"<tr><th>8</th><td>8</td><td>10829057</td><td>55.406</td><td>1.844621</td><td>24.067227</td></tr>\n",
"<tr><th>9</th><td>9</td><td>10879027</td><td>55.152</td><td>1.889094</td><td>23.935759</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10 rows \u00d7 5 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Compression level</th><th>Compressed size</th><th>Compression ratio</th><th>Compression time</th><th>Downsampling time</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>600864672</td><td>0.999</td><td>2.607177</td><td>26.384232</td></tr>\n",
"<tr><th>1</th><td>1</td><td>7051804</td><td>85.085</td><td>22.025753</td><td>24.304401</td></tr>\n",
"<tr><th>2</th><td>2</td><td>7051804</td><td>85.085</td><td>21.674532</td><td>25.063633</td></tr>\n",
"<tr><th>3</th><td>3</td><td>7051804</td><td>85.085</td><td>21.675338</td><td>24.217175</td></tr>\n",
"<tr><th>4</th><td>4</td><td>6629752</td><td>90.501</td><td>25.386786</td><td>23.898448</td></tr>\n",
"<tr><th>5</th><td>5</td><td>6629752</td><td>90.501</td><td>25.408783</td><td>23.106808</td></tr>\n",
"<tr><th>6</th><td>6</td><td>6416552</td><td>93.508</td><td>37.387300</td><td>17.693755</td></tr>\n",
"<tr><th>7</th><td>7</td><td>6312160</td><td>95.055</td><td>38.977684</td><td>17.679116</td></tr>\n",
"<tr><th>8</th><td>8</td><td>6312160</td><td>95.055</td><td>38.699354</td><td>17.719355</td></tr>\n",
"<tr><th>9</th><td>9</td><td>6312160</td><td>95.055</td><td>38.842180</td><td>17.736349</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10 rows \u00d7 5 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Compression level</th><th>Compressed size</th><th>Compression ratio</th><th>Compression time</th><th>Downsampling time</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>600864672</td><td>0.999</td><td>2.801109</td><td>29.384442</td></tr>\n",
"<tr><th>1</th><td>1</td><td>598817657</td><td>1.002</td><td>2.807479</td><td>31.367154</td></tr>\n",
"<tr><th>2</th><td>2</td><td>598817657</td><td>1.002</td><td>4.694462</td><td>34.749116</td></tr>\n",
"<tr><th>3</th><td>3</td><td>598817657</td><td>1.002</td><td>2.703463</td><td>31.206180</td></tr>\n",
"<tr><th>4</th><td>4</td><td>598671833</td><td>1.002</td><td>2.538312</td><td>35.297246</td></tr>\n",
"<tr><th>5</th><td>5</td><td>598671833</td><td>1.002</td><td>2.621801</td><td>31.044807</td></tr>\n",
"<tr><th>6</th><td>6</td><td>580402650</td><td>1.034</td><td>2.621331</td><td>32.259069</td></tr>\n",
"<tr><th>7</th><td>7</td><td>543933158</td><td>1.103</td><td>2.541712</td><td>35.558219</td></tr>\n",
"<tr><th>8</th><td>8</td><td>543933158</td><td>1.103</td><td>2.808962</td><td>35.409513</td></tr>\n",
"<tr><th>9</th><td>9</td><td>600864672</td><td>0.999</td><td>3.597196</td><td>33.235315</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10 rows \u00d7 5 columns</p>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Compression level</th><th>Compressed size</th><th>Compression ratio</th><th>Compression time</th><th>Downsampling time</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>600864672</td><td>0.999</td><td>2.696286</td><td>29.421912</td></tr>\n",
"<tr><th>1</th><td>1</td><td>9521032</td><td>63.018</td><td>3.106901</td><td>34.339349</td></tr>\n",
"<tr><th>2</th><td>2</td><td>9179054</td><td>65.366</td><td>3.149718</td><td>34.597896</td></tr>\n",
"<tr><th>3</th><td>3</td><td>8742380</td><td>68.631</td><td>3.139577</td><td>34.406265</td></tr>\n",
"<tr><th>4</th><td>4</td><td>6430329</td><td>93.308</td><td>6.326290</td><td>40.333203</td></tr>\n",
"<tr><th>5</th><td>5</td><td>6150632</td><td>97.551</td><td>6.540398</td><td>40.457701</td></tr>\n",
"<tr><th>6</th><td>6</td><td>5518118</td><td>108.733</td><td>12.883182</td><td>53.341339</td></tr>\n",
"<tr><th>7</th><td>7</td><td>5166095</td><td>116.142</td><td>12.944861</td><td>53.405852</td></tr>\n",
"<tr><th>8</th><td>8</td><td>4539219</td><td>132.181</td><td>17.993135</td><td>52.008409</td></tr>\n",
"<tr><th>9</th><td>9</td><td>4351807</td><td>137.874</td><td>16.323670</td><td>51.841789</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10 rows \u00d7 5 columns</p>\n", "