{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PyTables" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTables is an high-performance, on-disk data container, query engine and computation kernel, with an easy-to-use interface, developed by Francesc Alted since 2002." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from addutils import css_notebook\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**PyTables - What is it?**\n", "\n", "PyTables is a Python package wich allows dealing with HDF5 tables and:\n", "\n", "* a binary data container for on-disk, structured data\n", "* with support for data compression: Zlib, bzip2, LZO and Blosc\n", "* with powerful query and indexing capabilities\n", "* can perform out-of-core (data on-disk) operations very efficiently\n", "* based on the standard de-facto [HDF5](http://www.hdfgroup.org/HDF5/) format\n", "* free software ([BSD license](http://opensource.org/licenses/BSD-2-Clause))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**PyTables - What is not**\n", "\n", "* NOT a relational database replacement\n", "* not a distributed database\n", "* not extremely secure of safe (it's more about speed)\n", "* not a mere HDF5 wrapper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 The Array Object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TODO**: The immage above has a wrong command. The right command is file.create_array()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import tables as tb" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_array1 (Array(5, 10)) ''\n", " atom := Int64Atom(shape=(), dflt=0)\n", " maindim := 0\n", " flavor := 'numpy'\n", " byteorder := 'little'\n", " chunkshape := None" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = tb.open_file('temp/atest.h5', 'w') # Create a new file in \"write\" mode\n", "a = np.arange(50).reshape(5,10) # Create a NumPy array\n", "f.create_array(f.root, 'my_array1', a) # Save the array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `create_array` method returns a handler of the array **on disk**. This handler reports that we are working with an array called `my_array1` made of 64 bit integers. 'Flavor' indicates that this array has been created by numpy.\n", "\n", "Now we can retrieve the data from disk by using the indexing notation:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", " [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],\n", " [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],\n", " [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],\n", " [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.root.my_array1[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also select sub-arrays. 
In this case, only the data belonging to the sub-array is actually read from disk:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[10, 11, 12, 13, 14],\n", " [30, 31, 32, 33, 34]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.root.my_array1[1:5:2,0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `np.allclose` we can check that the data read from disk is equal to the corresponding data in RAM:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(f.root.my_array1[1:5:2,0:5], a[1:5:2,0:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HDF5 files have a hierarchical structure. We now have an `atest.h5` file that contains an Array named `'my_array1'`. We can create a second array in the same file called `'my_array2'`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/ (RootGroup) ''\n", " children := ['my_array1' (Array), 'my_array2' (Array)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.create_array(f.root, 'my_array2', np.arange(10))\n", "f.root" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also attach additional information to these arrays in the form of attributes (metadata):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_array1._v_attrs (AttributeSet), 4 attributes:\n", " [CLASS := 'ARRAY',\n", " FLAVOR := 'numpy',\n", " TITLE := '',\n", " VERSION := '2.4']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.root.my_array1.attrs" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_array1._v_attrs (AttributeSet), 5 attributes:\n", " [CLASS := 'ARRAY',\n", " FLAVOR := 'numpy',\n", " MY_ATTRIBUTE := 'This is a metadata I can attach to any array',\n", " TITLE := '',\n", " VERSION := '2.4']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.root.my_array1.attrs.MY_ATTRIBUTE = \"This is a metadata I can attach to any array\"\n", "f.root.my_array1.attrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now check the `temp/atest.h5` file size: it is zero! This is because PyTables is highly optimized and buffers I/O: data is kept in RAM until the file is closed, until RAM runs out, or until you explicitly call the `flush()` method." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Flush data to the file (very important to keep all your data safe!)\n", "f.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the `atest.h5` file size again: now that the data has been flushed, the file size is no longer zero." ] }
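, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick way to verify this from Python (a minimal sketch, assuming the `temp/atest.h5` path used above) is to ask the operating system for the file size with `os.path.getsize`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "# Size of the HDF5 file on disk (bytes): after flush() it should no longer be zero\n", "print('temp/atest.h5 size: {} bytes'.format(os.path.getsize('temp/atest.h5')))" ] }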
, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "f.close() # close access to file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 The CArray Object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When creating a new CArray (Compressible Array), type and shape must be passed to the constructor:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_carray1 (CArray(10000, 1000)) ''\n", " atom := Float64Atom(shape=(), dflt=0.0)\n", " maindim := 0\n", " flavor := 'numpy'\n", " byteorder := 'little'\n", " chunkshape := (16, 1000)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = tb.open_file('temp/ctest.h5', 'w')\n", "f.create_carray(f.root, 'my_carray1', tb.Float64Atom(), (10000,1000))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "f.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now check the `temp/ctest.h5` file size: it is about 1KB even though the array is a 10000x1000 Float64 array. This is because PyTables has only stored the CArray metadata so far. Now we'll push some data into the CArray container. For simplicity we define a new name for the CArray handle: `ca = f.root.my_carray1`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16 ms, sys: 32 ms, total: 48 ms\n", "Wall time: 47.3 ms\n" ] } ], "source": [ "ca = f.root.my_carray1\n", "na = np.linspace(0, 1, int(1e7)).reshape(10000,1000)\n", "%time ca[:] = na" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now check the `temp/ctest.h5` file size: it is 76MB, which is exactly the uncompressed space required by a 64-bit 10000x1000 matrix.\n", "\n", "CArray allows for data compression: let's try the `blosc` compressor with `complevel=5`: the file size should drop to about 8.7MB!" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 216 ms, sys: 20 ms, total: 236 ms\n", "Wall time: 233 ms\n" ] } ], "source": [ "f = tb.open_file('temp/ctest.h5', 'w')\n", "f.create_carray(f.root, 'my_carray1', tb.Float64Atom(), (10000,1000),\n", " filters=tb.Filters(complevel=5, complib='blosc'))\n", "%time f.root.my_carray1[:] = na\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Try for yourself*** the following compression options (a sketch is given below):\n", " \n", " (complevel=9, complib='blosc')\n", " (complevel=5, complib='zlib')\n", " (complevel=9, complib='lzo')\n", " (complevel=5, complib='bzip2')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
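, { "cell_type": "markdown", "metadata": {}, "source": [ "One possible way to run that comparison (a minimal sketch, not part of the original session) is to rewrite the same array with each filter setting into a scratch file, assumed here to be `temp/ctest_cmp.h5`, and print the resulting file size. `lzo` and `bzip2` require the corresponding compression libraries to be installed; if one is missing PyTables may warn, fall back to zlib, or raise, so the loop is defensive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "# Rewrite the 10000x1000 array with different filters and compare the file sizes\n", "for complevel, complib in [(9, 'blosc'), (5, 'zlib'), (9, 'lzo'), (5, 'bzip2')]:\n", "    try:\n", "        with tb.open_file('temp/ctest_cmp.h5', 'w') as fc:\n", "            c = fc.create_carray(fc.root, 'my_carray1', tb.Float64Atom(), (10000, 1000),\n", "                                 filters=tb.Filters(complevel=complevel, complib=complib))\n", "            c[:] = na\n", "        size_mb = os.path.getsize('temp/ctest_cmp.h5') / 2**20\n", "        print('{:5s} complevel={}: {:5.1f} MB'.format(complib, complevel, size_mb))\n", "    except Exception as exc:\n", "        print('{:5s} complevel={}: not available ({})'.format(complib, complevel, exc))" ] }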
, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now check how long it takes to read a small, non-contiguous sub-array by using the `%timeit` magic function: the read operation is run multiple times to measure the execution time more reliably:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "45.2 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "f = tb.open_file('temp/ctest.h5', 'r')\n", "%timeit f.root.my_carray1[:4,::100]\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 The Table Object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a new Table Object we must first describe its fields:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "class TabularData(tb.IsDescription):\n", " col1 = tb.StringCol(200)\n", " col2 = tb.IntCol()\n", " col3 = tb.Float32Col(10)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Open a file and create the Table container\n", "f = tb.open_file('temp/atable.h5', 'w')\n", "t = f.create_table(f.root, 'my_table1', TabularData, 'Table Title',\n", " filters=tb.Filters(complevel=5, complib='blosc'))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_table1 (Table(0,), shuffle, blosc(5)) 'Table Title'\n", " description := {\n", " \"col1\": StringCol(itemsize=200, shape=(), dflt=b'', pos=0),\n", " \"col2\": Int32Col(shape=(), dflt=0, pos=1),\n", " \"col3\": Float32Col(shape=(10,), dflt=0.0, pos=2)}\n", " byteorder := 'little'\n", " chunkshape := (268,)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Insert time: 3.518s\n" ] } ], "source": [ "# Fill the table with one million rows\n", "from time import time\n", "t0 = time()\n", "r = t.row\n", "for i in range(1000*1000):\n", " r['col1'] = str(i)\n", " r['col2'] = i+1\n", " r['col3'] = np.arange(10, dtype=np.float32)+i\n", " r.append()\n", "t.flush()\n", "print(\"Insert time: {:.3f}s\".format(time()-t0,))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/my_table1 (Table(1000000,), shuffle, blosc(5)) 'Table Title'\n", " description := {\n", " \"col1\": StringCol(itemsize=200, shape=(), dflt=b'', pos=0),\n", " \"col2\": Int32Col(shape=(), dflt=0, pos=1),\n", " \"col3\": Float32Col(shape=(10,), dflt=0.0, pos=2)}\n", " byteorder := 'little'\n", " chunkshape := (268,)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`chunkshape := (268,)` means that a data chunk is created, compressed and saved to disk every 268 rows.\n", "\n", "The uncompressed size of the dataset can be calculated by multiplying the number of rows `t.shape[0]` by the size of each row `t.dtype.itemsize`. In this example the size of each row is 244 bytes: 200 bytes for the string field, 4 for the Int field and 40 for the ten Float32 values. If you check the file size on disk you will see that, since we used the blosc compression algorithm, we managed to reduce the **file size from 232MB to 3.6MB**!" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "232.696533203125" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.shape[0]*t.dtype.itemsize/2**20." ] }
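, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTables can also report these numbers directly: assuming a recent PyTables (3.x), every leaf exposes the `size_in_memory` and `size_on_disk` properties, so the same comparison can be made without manual arithmetic (a minimal sketch):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncompressed (in-memory) size vs. actual compressed size on disk, in MB\n", "print('in memory: {:.1f} MB'.format(t.size_in_memory / 2**20))\n", "print('on disk:   {:.1f} MB'.format(t.size_on_disk / 2**20))" ] }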
, { "cell_type": "markdown", "metadata": {}, "source": [ "With Tables we can run queries. For example, here we extract the values of `col1` where `col2 < 10`: in **less than one second we query 1.000.000 records**." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 168 ms, sys: 0 ns, total: 168 ms\n", "Wall time: 165 ms\n" ] }, { "data": { "text/plain": [ "[b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time [r['col1'] for r in t if r['col2'] < 10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can be even faster by using **in-kernel queries** instead of a plain Python condition: conditions passed as strings to the `where` method are evaluated by [numexpr](http://code.google.com/p/numexpr/), a package that accepts an expression as a string, analyzes it, rewrites it more efficiently, and compiles it on the fly into code for its internal virtual machine (VM). Thanks to its integrated just-in-time (JIT) compiler, it does not require a compiler at runtime:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 160 ms, sys: 20 ms, total: 180 ms\n", "Wall time: 158 ms\n" ] }, { "data": { "text/plain": [ "[b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Repeat the query, but using the in-kernel method\n", "%time [r['col1'] for r in t.where('col2 < 10')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, to reach even greater performance, the Table Object supports indexing on every column. Let's index `col2`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 396 ms, sys: 12 ms, total: 408 ms\n", "Wall time: 407 ms\n" ] }, { "data": { "text/plain": [ "1000000" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time t.cols.col2.create_csindex()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From now on, any query involving `col2` will be sped up many times. In this case we query all 1.000.000 records in less than 200 µs (0.0002 s):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "71.9 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%timeit [r['col1'] for r in t.where('col2 < 10')]" ] }
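, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of iterating row by row, the full result of an in-kernel query can also be fetched in a single call: `Table.read_where` returns a NumPy structured array. A minimal sketch using the same condition as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch all matching rows at once as a NumPy structured array\n", "result = t.read_where('col2 < 10')\n", "print(result['col1'])\n", "print(result.dtype)" ] }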
, { "cell_type": "markdown", "metadata": {}, "source": [ "Queries can involve both indexed and non-indexed columns; in this case the speed-up will be less noticeable. Here the query takes a couple of seconds because `col3` is not indexed:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.93 s, sys: 12 ms, total: 1.94 s\n", "Wall time: 1.93 s\n" ] }, { "data": { "text/plain": [ "[b'10', b'11', b'12', b'13', b'14', b'15', b'16', b'17', b'18', b'19']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Performing complex conditions (regular query)\n", "%time [r['col1'] for r in t if r['col2'] > 10 and r['col3'][0] < 20]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 The EArray Object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will illustrate how to use the PyTables EArray (Extensible Array) Object by performing an out-of-core calculation. PyTables leverages [numexpr](http://code.google.com/p/numexpr/) to perform computations with arrays that are on disk and not in memory (out-of-core). `Numexpr` performs the computation on large arrays by splitting them into smaller blocks of data; those blocks are then loaded into the CPU cache and the computations are done without making data copies in RAM." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import tables as tb" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "f = tb.open_file('temp/poly1.h5', 'w')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create an empty EArray, then populate it with 10 chunks of 1.000.000 values each: we'll end up with an array of 1 column and 10.000.000 rows. In the `create_earray` method we define the dimension along which the EArray can be expanded: in this case, by passing the shape `(0,)`, we state that the array will be expanded along the row dimension. Similarly, to make a two-dimensional array expandable along the columns we would pass a shape like `(1000, 0)`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "ea = f.create_earray(f.root, 'my_earray1', tb.Float64Atom(), (0,),\n", " filters=tb.Filters(complevel=5, complib='blosc'))\n", "for s in range(10):\n", " ea.append(np.linspace(s, s+1, int(1e6)))\n", "ea.flush()" ] }
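, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a small sketch, not part of the original session), we can confirm that the ten appends have grown the EArray to 10.000.000 rows and read a short slice back from disk:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The extendable dimension has grown with every append()\n", "print(ea.shape, ea.nrows)\n", "# Reading a small slice only pulls the corresponding chunks from disk\n", "print(ea[999998:1000002])" ] }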
, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the expression to compute: the expression must be defined as a string. Basically, the polynomial will be evaluated for each row:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "expr = tb.Expr('0.25*ea**3 + 0.75*ea**2 + 1.5*ea - 2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have to create an array to store the resulting values; in this case we use a Compressed Array (CArray) with the same length as the EArray: 10.000.000 rows." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "if hasattr(f.root, 'output_values'):\n", " f.remove_node(f.root.output_values)\n", "y = f.create_carray(f.root, 'output_values', tb.Float64Atom(), (len(ea),),\n", " filters=tb.Filters(complevel=5, complib='blosc'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify that the output of the expression has to go to **y** on disk:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "expr.set_output(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On a standard PC this will take less than a second. This is remarkable if you consider that the data is read from and written to disk while the calculation is running:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 260 ms, sys: 36 ms, total: 296 ms\n", "Wall time: 197 ms\n" ] }, { "data": { "text/plain": [ "/output_values (CArray(10000000,), shuffle, blosc(5)) ''\n", " atom := Float64Atom(shape=(), dflt=0.0)\n", " maindim := 0\n", " flavor := 'numpy'\n", " byteorder := 'little'\n", " chunkshape := (16384,)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time expr.eval()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "f.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:addfor_tutorials]", "language": "python", "name": "conda-env-addfor_tutorials-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }