{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "Numpy is great for exploratory data analysis because it encourages the analyst to calculate one operation at a time, rather than one datum at a time. To compute an expression like\n", "\n", "$$m = \\sqrt{(E_1 + E_2)^2 - (p_{x1} + p_{x2})^2 - (p_{y1} + p_{y2})^2 - (p_{z1} + p_{z2})^2}$$\n", "\n", "you might first compute $\\sqrt{(p_{x1} + p_{x2})^2 + (p_{y1} + p_{y2})^2}$ for all data (which is a meaningful quantity: $p_T$), then compute $\\sqrt{{p_T}^2 + (p_{z1} + p_{z2})^2}$ for all data (another meaningful quantity: $|p|$), then compute the whole expression as $\\sqrt{(E_1 + E_2)^2 - |p|^2}$. Performing each step separately on all data lets you plot and cross-check distributions of partial computations, to discover surprises as early as possible.\n", "\n", "This order of data processing is called \"columnar\" in the sense that a dataset may be visualized as a table in which rows are repeated measurements and columns are the different measurable quantities (same layout as [Pandas DataFrames](https://pandas.pydata.org)). It is also called \"vectorized\" in that a Single (virtual) Instruction is applied to Multiple Data (virtual SIMD). Numpy can be hundreds to thousands of times faster than pure Python because it avoids the overhead of handling Python instructions in the loop over numbers. Most data processing languages (R, MATLAB, IDL, all the way back to APL) work this way: an interactive interpreter controlling fast, array-at-a-time math.\n", "\n", "However, it's difficult to apply this methodology to non-rectangular data. 
If your dataset has nested structure, a different number of values per row, different data types in the same column, or cross-references or even circular references, Numpy can't help you.\n", "\n", "If you try to make an array with non-trivial types:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3},\n", " {'x': 4, 'y': 4.4}, {'x': 5, 'y': 5.5}], dtype=object)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy\n", "nested = numpy.array([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}, {\"x\": 4, \"y\": 4.4}, {\"x\": 5, \"y\": 5.5}])\n", "nested" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy gives up and returns a `dtype=object` array, which means Python objects and pure Python processing. You don't get the columnar operations or the performance boost.\n", "\n", "For instance, you might want to say" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " unsupported operand type(s) for +: 'dict' and 'int'\n" ] } ], "source": [ "try:\n", " nested + 100\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but there is no vectorized addition for an array of dicts because there is no addition for dicts defined in pure Python. Numpy is not using its vectorized routines—it's calling Python code on each element.\n", "\n", "The same applies to variable-length data, such as lists of lists, where the inner lists have different lengths. 
This is a more serious shortcoming than the above because the list of dicts (Python's equivalent of an \"[array of structs](https://en.wikipedia.org/wiki/AOS_and_SOA)\") could be manually reorganized into two numerical arrays, `\"x\"` and `\"y\"` (a \"[struct of arrays](https://en.wikipedia.org/wiki/AOS_and_SOA)\"). Not so with a list of variable-length lists." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([list([1.1, 2.2, 3.3]), list([]), list([4.4, 5.5]), list([6.6]),\n", " list([7.7, 8.8, 9.9])], dtype=object)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen = numpy.array([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]])\n", "varlen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we get a `dtype=object` array without vectorized operations." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " can only concatenate list (not \"int\") to list\n" ] } ], "source": [ "try:\n", " varlen + 100\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's worse, this array looks purely numerical and could have been made by a process that was *supposed* to create equal-length inner lists.\n", "\n", "Awkward Array provides a way of talking about these data structures as arrays."
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ] at 0x7bc6e4d337b8>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import awkward0\n", "nested = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}, {\"x\": 4, \"y\": 4.4}, {\"x\": 5, \"y\": 5.5}])\n", "nested" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `Table` is a columnar data structure with the same meaning as the Python data we built it with. To undo `awkward0.fromiter`, call `.tolist()`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1},\n", " {'x': 2, 'y': 2.2},\n", " {'x': 3, 'y': 3.3},\n", " {'x': 4, 'y': 4.4},\n", " {'x': 5, 'y': 5.5}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nested.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values at the same position of the tree structure are contiguous in memory: this is a struct of arrays." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 4, 5])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nested.contents[\"x\"]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3, 4.4, 5.5])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nested.contents[\"y\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having a structure like this means that we can perform vectorized operations on the whole structure with relatively few Python instructions (number of Python instructions scales with the complexity of the data type, not with the number of values in the dataset)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 101, 'y': 101.1},\n", " {'x': 102, 'y': 102.2},\n", " {'x': 103, 'y': 103.3},\n", " {'x': 104, 'y': 104.4},\n", " {'x': 105, 'y': 105.5}]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(nested + 100).tolist()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 101, 'y': 101.1},\n", " {'x': 202, 'y': 202.2},\n", " {'x': 303, 'y': 303.3},\n", " {'x': 404, 'y': 404.4},\n", " {'x': 505, 'y': 505.5}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(nested + numpy.arange(100, 600, 100)).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's less obvious that variable-length data can be represented in a columnar format, but it can." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]])\n", "varlen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike Numpy's `dtype=object` array, the inner lists are *not* Python lists and the numerical values *are* contiguous in memory. This is made possible by representing the structure (where each inner list starts and stops) in one array and the values in another." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(array([3, 0, 2, 1, 3]), array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen.counts, varlen.content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(For fast random access, the more basic representation is `varlen.offsets`, which is in turn a special case of a `varlen.starts, varlen.stops` pair. These details are discussed below.)\n", "\n", "A structure like this can be broadcast like Numpy with a small number of Python instructions (scales with the complexity of the data type, not the number of values)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen + 100" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen + numpy.arange(100, 600, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even slice this object as though it were multidimensional (each element is a tensor of the same rank, but with a different length)."
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Skip the first two inner lists; skip the last value in each inner list that remains.\n", "varlen[2:, :-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are not rectangular, so some inner lists might not have as many elements as your selection requires. Don't worry: you'll get an error message." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index 1 is out of bounds for jagged min size 0\n" ] } ], "source": [ "try:\n", " varlen[:, 1]\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Masking with the `.counts` is handy because all the Numpy advanced indexing rules apply (in an extended sense) to jagged arrays." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2.2, 5.5, 8.8])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varlen[varlen.counts > 1, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've only presented the two most important Awkward classes, `Table` and `JaggedArray` (and not how they combine). Each class is presented in more detail below. 
For now, I'd just like to point out that you can make crazy complicated data structures" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "crazy = awkward0.fromiter([[1.21, 4.84, None, 10.89, None],\n", " [19.36, [30.25]],\n", " [{\"x\": 36, \"y\": {\"z\": 49}}, None, {\"x\": 64, \"y\": {\"z\": 81}}]\n", " ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and they vectorize and slice as expected." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[1.1, 2.2, None, 3.3000000000000003, None],\n", " [4.4, [5.5]],\n", " [{'x': 6.0, 'y': {'z': 7.0}}, None, {'x': 8.0, 'y': {'z': 9.0}}]]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.sqrt(crazy).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because any Awkward Array can be the content of any other Awkward Array. Like Numpy, the features of Awkward Array are simple, yet compose nicely to let you build what you need." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview with sample datasets\n", "\n", "Many of the examples in this tutorial use `awkward0.fromiter` to make Awkward Arrays from lists and `array.tolist()` to turn them back into lists (or dicts for `Table`, tuples for `Table` with anonymous fields, Python objects for `ObjectArrays`, etc.). These should be considered slow methods, since Python instructions are executed in the loop, but that's a necessary part of examining or building Python objects.\n", "\n", "Ideally, you'd want to get your data from a binary, columnar source and produce binary, columnar output, or convert only once and reuse the converted data. 
[Parquet](https://parquet.apache.org) is a popular columnar format for storing data on disk and [Arrow](https://arrow.apache.org) is a popular columnar format for sharing data in memory (between functions or applications). [ROOT](https://root.cern) is a popular columnar format for particle physicists, and [uproot](https://github.com/scikit-hep/uproot) natively produces Awkward Arrays from ROOT files.\n", "\n", "[HDF5](https://www.hdfgroup.org) and its Python library [h5py](https://www.h5py.org/) are columnar, but only for rectangular arrays, unlike the others mentioned here. Awkward Array can *wrap* HDF5 with an interpretation layer to store columnar data structures, but then the Awkward Array library would be needed to read the data back in a meaningful way. Awkward also has a native file format, `.awkd` files, which are simply ZIP archives of columns as binary blobs and metadata (just as Numpy's `.npz` is a ZIP of arrays with metadata). The HDF5, awkd, and pickle serialization procedures use the same protocol, which has backward and forward compatibility features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NASA exoplanets from a Parquet file\n", "\n", "Let's start by opening a Parquet file. Awkward reads Parquet through the [pyarrow](https://arrow.apache.org/docs/python) module, which is an optional dependency, so be sure you have it installed before trying the next line." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ... 
] at 0x7bc6e4ca14e0>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars = awkward0.fromparquet(\"tests/samples/exoplanets.parquet\")\n", "stars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(There is also an `awkward0.toparquet` that takes the file name and array as arguments.)\n", "\n", "Columns are accessible with square brackets and strings" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[\"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or by dot-attribute (if the name doesn't have weird characters and doesn't conflict with a method or property name)." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.ra, stars.dec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file contains data about extrasolar planets and their host stars. As such, it's a `Table` full of Numpy arrays and `JaggedArrays`. 
The star attributes (`\"name\"`, `\"ra\"` or right ascension in degrees, `\"dec\"` or declination in degrees, `\"dist\"` or distance in parsecs, `\"mass\"` in multiples of the sun's mass, and `\"radius\"` in multiples of the sun's radius) are plain Numpy arrays and the planet attributes (`\"name\"`, `\"orbit\"` or orbital distance in AU, `\"eccen\"` or eccentricity, `\"period\"` or orbital period in days, `\"mass\"` in multiples of Jupiter's mass, and `\"radius\"` in multiples of Jupiter's radius) are jagged because each star may have a different number of planets." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.planet_name" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.planet_period, stars.planet_orbit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For large arrays, only the first and last values are printed: the second-to-last star has three planets; all the other stars shown here have one planet.\n", "\n", "These arrays are called `ChunkedArrays` because the Parquet file is lazily read in chunks (Parquet's row group structure). The `ChunkedArray` (subdivides the file) contains `VirtualArrays` (read one chunk on demand), which generate the `JaggedArrays`. 
This is an illustration of how each Awkward class provides one feature, and you get desired behavior by combining them.\n", "\n", "The `ChunkedArrays` and `VirtualArrays` support the same Numpy-like access as `JaggedArray`, so we can compute with them just as we would any other array." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# distance in parsecs → distance in light years\n", "stars.dist * 3.26156" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# for all stars, drop the first planet\n", "stars.planet_mass[:, 1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NASA exoplanets from an Arrow buffer\n", "\n", "The pyarrow implementation of Arrow is more complete than its implementation of Parquet, so we can use more features in the Arrow format, such as nested tables.\n", "\n", "Unlike Parquet, which is intended as a file format, Arrow is a memory format. You might get an Arrow buffer as the output of another function, through interprocess communication, from a network RPC call, a message bus, etc. Arrow can be saved as files, though this isn't common. In this case, we'll get it from a file." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "
... ] at 0x7bc6e4a87940>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyarrow\n", "arrow_buffer = pyarrow.ipc.open_file(open(\"tests/samples/exoplanets.arrow\", \"rb\")).get_batch(0)\n", "stars = awkward0.fromarrow(arrow_buffer)\n", "stars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(There is also an `awkward0.toarrow` that takes an Awkward Array as its only argument, returning the relevant Arrow structure.)\n", "\n", "This file is structured differently. Instead of jagged arrays of numbers like `\"planet_mass\"`, `\"planet_period\"`, and `\"planet_orbit\"`, this file has a jagged table of `\"planets\"`. A jagged table is a `JaggedArray` of `Table`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "] [] [] ... [] [ ] []] at 0x7bc6e4a7bba8>" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[\"planets\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the square brackets are nested, but the contents are `Row` objects (rows of the `Table`). The second-to-last star has three planets, as before.\n", "\n", "We can find the non-jagged `Table` in the `JaggedArray.content`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "
... ] at 0x7bc6e4a7bb70>" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[\"planets\"].content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When viewed as Python lists and dicts, the `'planets'` field is a list of planet dicts, each with its own fields." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'dec': 17.792868,\n", " 'dist': 93.37,\n", " 'mass': 2.7,\n", " 'name': '11 Com',\n", " 'planets': [{'eccen': 0.231,\n", " 'mass': 19.4,\n", " 'name': 'b',\n", " 'orbit': 1.29,\n", " 'period': 326.03,\n", " 'radius': nan}],\n", " 'ra': 185.179276,\n", " 'radius': 19.0},\n", " {'dec': 71.823898,\n", " 'dist': 125.72,\n", " 'mass': 2.78,\n", " 'name': '11 UMi',\n", " 'planets': [{'eccen': 0.08,\n", " 'mass': 14.74,\n", " 'name': 'b',\n", " 'orbit': 1.53,\n", " 'period': 516.21997,\n", " 'radius': nan}],\n", " 'ra': 229.27453599999998,\n", " 'radius': 29.79}]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[:2].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the data are now packaged in an arguably more intuitive way, we can still get jagged arrays of numbers by requesting `\"planets\"` and a planet attribute (two column selections) without specifying which star or which planet."
] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.planets.name" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.planets.mass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the `Table` is hidden inside the `JaggedArray`, its `columns` pass through to the top." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['dec', 'dist', 'mass', 'name', 'planets', 'ra', 'radius']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.columns" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['eccen', 'mass', 'name', 'orbit', 'period', 'radius']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars.planets.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a more global view of the structures contained within one of these arrays, print out its high-level type. 
(\"High-level\" because it presents logical distinctions, like jaggedness and tables, but not physical distinctions, like chunking and virtualness.)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 2935) -> 'dec' -> float64\n", " 'dist' -> float64\n", " 'mass' -> float64\n", " 'name' -> \n", " 'planets' -> [0, inf) -> 'eccen' -> float64\n", " 'mass' -> float64\n", " 'name' -> \n", " 'orbit' -> float64\n", " 'period' -> float64\n", " 'radius' -> float64\n", " 'ra' -> float64\n", " 'radius' -> float64\n" ] } ], "source": [ "print(stars.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above should be read like a function's data type: `argument type -> return type` for the function that takes an index in square brackets and returns something else. For example, the first `[0, 2935)` means that you could put any non-negative integer less than `2935` in square brackets after `stars`, like this:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and get an object that would take `'dec'`, `'dist'`, `'mass'`, `'name'`, `'planets'`, `'ra'`, or `'radius'` in its square brackets. The return type depends on which of those strings you provide." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "0.54" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"mass\"] # type is float64" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "'Kepler-186'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"name\"] # type is " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "
] at 0x7bc6e4a98400>" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The planets have their own table structure:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 5) -> 'eccen' -> float64\n", " 'mass' -> float64\n", " 'name' -> \n", " 'orbit' -> float64\n", " 'period' -> float64\n", " 'radius' -> float64\n" ] } ], "source": [ "print(stars[1734][\"planets\"].type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that within the context of `stars`, the `planets` could take any non-negative integer `[0, inf)`, but for a particular star, the allowed domain is known with more precision: `[0, 5)`. This is because `stars[\"planets\"]` is a jagged array (a different number of planets for each star), but a single `stars[1734][\"planets\"]` is a simple array (five planets for *this* star).\n", "\n", "Passing a non-negative integer less than 5 to this array, we get an object that takes one of six strings: `'eccen'`, `'mass'`, `'name'`, `'orbit'`, `'period'`, and `'radius'`." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"][4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the return type of these depends on which string you provide."
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "129.9441" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"][4][\"period\"] # type is float" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "'f'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"][4][\"name\"] # type is " ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'eccen': 0.04,\n", " 'mass': nan,\n", " 'name': 'f',\n", " 'orbit': 0.432,\n", " 'period': 129.9441,\n", " 'radius': 0.10400000000000001}" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"][4].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Incidentally, this is a [potentially habitable exoplanet](https://www.nasa.gov/ames/kepler/kepler-186f-the-first-earth-size-planet-in-the-habitable-zone), the first ever discovered.)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "('Kepler-186', 'f')" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"name\"], stars[1734][\"planets\"][4][\"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these arguments \"commute\" and others don't. 
Dimensional axes have a particular order, so you can't request a planet by its row number before selecting a star, but you can swap a column-selection (string) and a row-selection (integer). For a rectangular table, it's easy to see how you can slice column-first or row-first, but it even works when the table is jagged." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "'f'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[\"planets\"][\"name\"][1734][4]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "'f'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stars[1734][\"planets\"][4][\"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of these intermediate slices actually process data, so you can slice in any order that is logically correct without worrying about performance. 
Projections, even multi-column projections" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'name': 'b', 'eccen': nan, 'orbit': 0.0343, 'period': 3.8867907},\n", " {'name': 'c', 'eccen': nan, 'orbit': 0.0451, 'period': 7.267302},\n", " {'name': 'd', 'eccen': nan, 'orbit': 0.0781, 'period': 13.342996},\n", " {'name': 'e', 'eccen': nan, 'orbit': 0.11, 'period': 22.407704},\n", " {'name': 'f', 'eccen': 0.04, 'orbit': 0.432, 'period': 129.9441}]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orbits = stars[\"planets\"][[\"name\", \"eccen\", \"orbit\", \"period\"]]\n", "orbits[1734].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "are a useful way to restructure data without incurring a runtime cost." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relationship to Pandas\n", "\n", "Arguably, this kind of dataset could be manipulated as a [Pandas DataFrame](https://pandas.pydata.org) instead of Awkward Arrays. Despite the variable number of planets per star, the exoplanets dataset could be flattened into a rectangular DataFrame, in which the distinction between solar systems is represented by a two-component index (leftmost pair of columns below), a [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
\n", "" ], "text/plain": [ " dec dist mass name planets \\\n", " eccen mass name orbit \n", "2931 0 -15.937480 3.60 0.78 49 0.1800 0.01237 101 0.538000 \n", " 1 -15.937480 3.60 0.78 49 0.1600 0.01237 102 1.334000 \n", " 2 -15.937480 3.60 0.78 49 0.0600 0.00551 103 0.133000 \n", " 3 -15.937480 3.60 0.78 49 0.2300 0.00576 104 0.243000 \n", "2932 0 30.245163 112.64 2.30 53 0.0310 20.60000 98 1.170000 \n", "2933 0 41.405460 13.41 1.30 48 0.0215 0.68760 98 0.059222 \n", " 1 41.405460 13.41 1.30 48 0.2596 1.98100 99 0.827774 \n", " 2 41.405460 13.41 1.30 48 0.2987 4.13200 100 2.513290 \n", "2934 0 8.461452 56.27 2.20 55 0.0000 2.80000 98 0.680000 \n", "\n", " ra radius \n", " period radius \n", "2931 0 162.870000 NaN 26.017012 NaN \n", " 1 636.130000 NaN 26.017012 NaN \n", " 2 20.000000 NaN 26.017012 NaN \n", " 3 49.410000 NaN 26.017012 NaN \n", "2932 0 305.500000 NaN 107.784882 26.80 \n", "2933 0 4.617033 NaN 24.199345 1.56 \n", " 1 241.258000 NaN 24.199345 1.56 \n", " 2 1276.460000 NaN 24.199345 1.56 \n", "2934 0 136.750000 NaN 298.562012 12.00 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.topandas(stars, flatten=True)[-9:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this representation, each star's attributes must be duplicated for all of its planets, and it is not possible to show stars that have no planets (not present in this dataset), but the information is preserved in a way that Pandas can recognize and operate on. (For instance, `.unstack()` would widen each planet attribute into a separate column per planet and simplify the index to strictly one row per star.)\n", "\n", "The limitation is that only a single jagged structure can be represented by a DataFrame. 
The structure can be arbitrarily deep in `Tables` (which add depth to the column names)," ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a e\n", " b c \n", " d \n", "0 0 1 2.0 3\n", "1 0 4 5.0 6\n", " 1 4 5.1 6\n", "2 0 7 8.0 9\n", " 1 7 8.1 9\n", " 2 7 8.2 9" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = awkward0.fromiter([{\"a\": {\"b\": 1, \"c\": {\"d\": [2]}}, \"e\": 3},\n", " {\"a\": {\"b\": 4, \"c\": {\"d\": [5, 5.1]}}, \"e\": 6},\n", " {\"a\": {\"b\": 7, \"c\": {\"d\": [8, 8.1, 8.2]}}, \"e\": 9}])\n", "awkward0.topandas(array, flatten=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and arbitrarily deep in `JaggedArrays` (which add depth to the row names)," ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 0 0 1 2.2\n", " 1 1 3.3\n", " 2 1 4.4\n", " 2 0 1 5.5\n", " 1 1 6.6\n", "1 0 0 10 1.1\n", " 1 0 10 2.2\n", " 1 10 3.3\n", " 3 0 10 4.4\n", "2 1 0 100 9.9" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = awkward0.fromiter([{\"a\": 1, \"b\": [[2.2, 3.3, 4.4], [], [5.5, 6.6]]},\n", " {\"a\": 10, \"b\": [[1.1], [2.2, 3.3], [], [4.4]]},\n", " {\"a\": 100, \"b\": [[], [9.9]]}])\n", "awkward0.topandas(array, flatten=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and they can even have two `JaggedArrays` at the same level if their number of elements is the same (at all levels of depth)." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " a b\n", "0 0 0 0 1.1 1\n", " 1 1 2.2 2\n", " 2 2 3.3 3\n", " 2 0 0 4.4 4\n", " 1 1 5.5 5\n", "1 0 0 0 1.1 1\n", " 1 0 0 2.2 2\n", " 1 1 3.3 3\n", " 3 0 0 4.4 4\n", "2 1 0 0 9.9 9" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = awkward0.fromiter([{\"a\": [[1.1, 2.2, 3.3], [], [4.4, 5.5]], \"b\": [[1, 2, 3], [], [4, 5]]},\n", " {\"a\": [[1.1], [2.2, 3.3], [], [4.4]], \"b\": [[1], [2, 3], [], [4]]},\n", " {\"a\": [[], [9.9]], \"b\": [[], [9]]}])\n", "awkward0.topandas(array, flatten=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if there are two `JaggedArrays` with *different* structure at the same level, a single DataFrame cannot represent them." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " this array has more than one jagged array structure\n" ] } ], "source": [ "array = awkward0.fromiter([{\"a\": [1, 2, 3], \"b\": [1.1, 2.2]},\n", " {\"a\": [1], \"b\": [1.1, 2.2, 3.3]},\n", " {\"a\": [1, 2], \"b\": []}])\n", "try:\n", " awkward0.topandas(array, flatten=True)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To describe data like these, you'd need two DataFrames, and any calculations involving both `\"a\"` and `\"b\"` would have to include a join on those DataFrames. Awkward Arrays are not limited in this way: the last `array` above is a valid Awkward Array and is useful for calculations that mix `\"a\"` and `\"b\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LHC data from a ROOT file\n", "\n", "Particle physicists need structures like these; in fact, they have been a staple of particle physics analyses for decades. 
The [ROOT](https://root.cern) file format was developed in the mid-90's to serialize arbitrary C++ data structures in a columnar way (replacing ZEBRA and similar Fortran projects that date back to the 70's). The [PyROOT](https://root.cern.ch/pyroot) library dynamically wraps these objects to present them in Python, though with a performance penalty. The [uproot](https://github.com/scikit-hep/uproot) library reads columnar data directly from ROOT files in Python without intermediary C++." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ... ] at 0x7bc6cf212d30>" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import uproot\n", "events = uproot.open(\"http://scikit-hep.org/uproot/examples/HZZ-objects.root\")[\"events\"].lazyarrays()\n", "events" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['jetp4',\n", " 'jetbtag',\n", " 'jetid',\n", " 'muonp4',\n", " 'muonq',\n", " 'muoniso',\n", " 'electronp4',\n", " 'electronq',\n", " 'electroniso',\n", " 'photonp4',\n", " 'photoniso',\n", " 'MET',\n", " 'MC_bquarkhadronic',\n", " 'MC_bquarkleptonic',\n", " 'MC_wdecayb',\n", " 'MC_wdecaybbar',\n", " 'MC_lepton',\n", " 'MC_leptonpdgid',\n", " 'MC_neutrino',\n", " 'num_primaryvertex',\n", " 'trigger_isomu24',\n", " 'eventweight']" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a typical particle physics dataset (though small!) 
in that it represents the momentum and energy (`\"p4\"` for [Lorentz 4-momentum](https://en.wikipedia.org/wiki/Four-vector)) of several different species of particles: `\"jet\"`, `\"muon\"`, `\"electron\"`, and `\"photon\"`. Each collision can produce a different number of particles in each species. Other variables, such as missing transverse energy or `\"MET\"`, have one value per collision event. Events with zero particles in a species are valuable for the event-level data." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first event has two muons.\n", "events.muonp4" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first event has zero jets.\n", "events.jetp4" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Every event has exactly one MET.\n", "events.MET" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the exoplanet data, these events cannot be represented as a DataFrame because of the different numbers of particles in each species and because zero-particle events have value. Even with just `\"muonp4\"`, `\"jetp4\"`, and `\"MET\"`, there is no translation." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " at least one array needed to concatenate\n" ] } ], "source": [ "try:\n", " awkward0.topandas(events[[\"muonp4\", \"jetp4\", \"MET\"]], flatten=True)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It could be described as a collection of DataFrames, in which every operation relating particles in the same event would require a join. But that would make analysis harder, not easier. An event has meaning on its own." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'jetp4': [],\n", " 'jetbtag': [],\n", " 'jetid': [],\n", " 'muonp4': [TLorentzVector(-52.899, -11.655, -8.1608, 54.779),\n", " TLorentzVector(37.738, 0.69347, -11.308, 39.402)],\n", " 'muonq': [1, -1],\n", " 'muoniso': [4.200153350830078, 2.1510612964630127],\n", " 'electronp4': [],\n", " 'electronq': [],\n", " 'electroniso': [],\n", " 'photonp4': [],\n", " 'photoniso': [],\n", " 'MET': TVector2(5.9128, 2.5636),\n", " 'MC_bquarkhadronic': TVector3(0, 0, 0),\n", " 'MC_bquarkleptonic': TVector3(0, 0, 0),\n", " 'MC_wdecayb': TVector3(0, 0, 0),\n", " 'MC_wdecaybbar': TVector3(0, 0, 0),\n", " 'MC_lepton': TVector3(0, 0, 0),\n", " 'MC_leptonpdgid': 0,\n", " 'MC_neutrino': TVector3(0, 0, 0),\n", " 'num_primaryvertex': 6,\n", " 'trigger_isomu24': True,\n", " 'eventweight': 0.009271008893847466}" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events[0].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Particle physics isn't alone in this: analyzing JSON-formatted log files 
in production systems and analyzing allele likelihoods in genomics are two other applications where variable-length, nested structures can help. Arbitrary data structures are useful, and working with them in columns provides a new way to do exploratory data analysis: one array at a time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Awkward Array data model\n", "\n", "Awkward Array features are provided by a suite of classes that each extend Numpy arrays in one small way. These classes may then be composed to combine features.\n", "\n", "In this sense, Numpy arrays are Awkward Array's most basic array class. A Numpy array is a small Python object that points to a large, contiguous region of memory, and, as much as possible, operations replace or change the small Python object, not the big data buffer. Therefore, many Numpy operations are *views*, rather than *in-place operations* or *copies*, leaving the original value intact but returning a new value that is linked to the original. Assigning to arrays and in-place operations are allowed, but they are more complicated to use because one must be aware of which arrays are views and which are copies.\n", "\n", "Awkward Array's model is to treat all arrays as though they were immutable, favoring views over copies, and not providing any high-level in-place operations on low-level memory buffers (i.e. no in-place assignment).\n", "\n", "Numpy provides complete control over the interpretation of an `N`-dimensional array. A Numpy array has a [dtype](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html) to interpret bytes as signed and unsigned integers of various bit-widths, floating-point numbers, booleans, little endian and big endian, fixed-width bytestrings (for applications such as 6-byte MAC addresses or human-readable strings with padding), or [record arrays](https://docs.scipy.org/doc/numpy/user/basics.rec.html) for contiguous structures. 
A Numpy array has a [pointer](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.ctypes.html) to the first element of its data buffer (`array.ctypes.data`) and a [shape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html) to describe its `N` dimensions as a rank-`N` tensor. Only `shape[0]` is the length as returned by the Python function `len`. Furthermore, an [order](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html) flag determines if rank > 1 arrays are laid out in \"C\" order or \"Fortran\" order. A Numpy array also has a [stride](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.strides.html) to determine how many bytes separate one element from the next. (Data in a Numpy array need not be strictly contiguous, but they must be regular: the number of bytes separating them is a constant.) This stride may even be negative to describe a reversed view of an array, which allows any `slice` of an array, even those with `skip != 1`, to be a view, rather than a copy. Numpy arrays also have flags to determine whether they [own](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html) their data buffer (and should therefore delete it when the Python object goes out of scope) and whether the data buffer is [writable](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The biggest restriction on this data model is that Numpy arrays are strictly rectangular. The `shape` and `stride` are constants, enforcing a regular layout. Awkward's `JaggedArray` is a generalization of Numpy's rank-2 arrays (that is, arrays of arrays) in that the inner arrays of a `JaggedArray` may all have different lengths. For higher ranks, such as arrays of arrays of arrays, put a `JaggedArray` inside another as its `content`. 
An important special case of `JaggedArray` is `StringArray`, whose `content` is interpreted as characters (with or without encoding), which represents an array of strings without the padding that Numpy's fixed-width strings require.\n", "\n", "Although Numpy's [record arrays](https://docs.scipy.org/doc/numpy/user/basics.rec.html) present a buffer as a table, with differently typed, named columns, that table must be contiguous or interleaved (with non-trivial `strides`) in memory: an [array of structs](https://en.wikipedia.org/wiki/AOS_and_SOA). Awkward's `Table` provides the same interface, except that each column may be anywhere in memory, stored in a `contents` dict mapping field names to arrays. This is a true generalization: a `Table` may be a wrapped view of a Numpy record array, but not vice versa. Use a `Table` anywhere you'd have a record/class/struct in non-columnar data structures. A `Table` with anonymous (integer-valued, rather than string-valued) fields is like an array of strongly typed tuples.\n", "\n", "Numpy has a [masked array](https://docs.scipy.org/doc/numpy/reference/maskedarray.html) module for nullable data: values that may be \"missing\" (like Python's `None`). Naturally, the only kinds of arrays Numpy can mask are subclasses of its own `ndarray`, and we need to be able to mask any Awkward Array, so the Awkward library defines its own `MaskedArray`. Additionally, we sometimes want to mask with bits, rather than bytes (e.g. for Arrow compatibility), so there's a `BitMaskedArray`, and sometimes we want to mask large structures without using memory for the masked-out values, so there's an `IndexedMaskedArray` (fusing the functionality of a `MaskedArray` with an `IndexedArray`).\n", "\n", "Numpy has no provision for an array containing different data types (\"heterogeneous\"), but Awkward Array has a `UnionArray`. 
The `UnionArray` stores data for each type as separate `contents` and identifies the types and positions of each element in the `contents` using `tags` and `index` arrays (equivalent to Arrow's [dense union type](https://arrow.apache.org/docs/memory_layout.html#dense-union-type) with `types` and `offsets` buffers). As a data type, unions are a counterpart to records or tuples (making `UnionArray` a counterpart to `Table`): each record/tuple contains *all* of its `contents` but a union contains *any* of its `contents`. (Note that a `UnionArray` may be the best way to interleave two arrays, even if they have the same type. Heterogeneity is not a necessary feature of a `UnionArray`.)\n", "\n", "Numpy has a `dtype=object` for arrays of Python objects, but Awkward's `ObjectArray` creates Python objects on demand from array data. A large dataset of some `Point` class, containing floating-point members `x` and `y`, can be stored as an `ObjectArray` of a `Table` of `x` and `y` with much less memory than a Numpy array of `Point` objects. The `ObjectArray` has a `generator` function that produces Python objects from array elements. `StringArray` is also a special case of `ObjectArray`, which instantiates variable-length character contents as Python strings.\n", "\n", "Although an `ObjectArray` can save memory, creating Python objects in a loop may still use more computation time than is necessary. Therefore, Awkward Arrays can also have vectorized `Methods`—bound functions that operate on the array data, rather than instantiating every Python object in an `ObjectArray`. Although an `ObjectArray` is a good use-case for `Methods`, any Awkward Array can have them. (The second most common case being a `JaggedArray` of `ObjectArrays`.)\n", "\n", "The nesting of Awkward Arrays within Awkward Arrays need not be tree-like: they can have cross-references and cyclic references (using ordinary Python assignment). 
`IndexedArray` can aid in building complex structures: it is simply an integer `index` that would be applied to its `content` with [integer array indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer-array-indexing) to get any element. `IndexedArray` is the equivalent of a pointer in non-columnar data structures.\n", "\n", "The counterpart of an `IndexedArray` is a `SparseArray`: whereas an `IndexedArray` consists of pointers *to* elements of its `content`, a `SparseArray` consists of pointers *from* elements of its content, representing a very large array in terms of its non-zero (or non-`default`) elements. Awkward's `SparseArray` is a [coordinate format (COO)](https://scipy-lectures.org/advanced/scipy_sparse/coo_matrix.html), one-dimensional array.\n", "\n", "Another limitation of Numpy is that arrays cannot span multiple memory buffers. Awkward's `ChunkedArray` represents a single logical array made of physical `chunks` that may be anywhere in memory. A `ChunkedArray`'s `chunksizes` may be known or unknown. One application of `ChunkedArray` is to append data to an array without allocating on every call: `AppendableArray` allocates memory in equal-sized chunks.\n", "\n", "Another application of `ChunkedArray` is to lazily load data in chunks. Awkward's `VirtualArray` calls its `generator` function to materialize an array when needed, and a `ChunkedArray` of `VirtualArrays` is a classic lazy-loading array, used to gradually read Parquet and ROOT files. In most libraries, lazy-loading is not a part of the data but a feature of the reading interface. 
Nesting virtualness makes it possible to load `Tables` within `Tables`, where even the columns of the inner `Tables` are on-demand.\n", "\n", "For more details, see [array classes](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc).\n", "\n", " * [Jaggedness](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#jaggedness)\n", " * [JaggedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#jaggedarray)\n", " * [Helper functions](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#helper-functions)\n", " * [Product types](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#product-types)\n", " * [Table](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#table)\n", " * [Sum types](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#sum-types)\n", " * [UnionArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#unionarray)\n", " * [Option types](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#option-types)\n", " * [MaskedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#maskedarray)\n", " * [BitMaskedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#bitmaskedarray)\n", " * [IndexedMaskedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#indexedmaskedarray)\n", " * [Indirection](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#indirection)\n", " * [IndexedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#indexedarray)\n", " * [SparseArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#sparsearray)\n", " * [Helper functions](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#helper-functions-1)\n", " * [Opaque 
objects](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#opaque-objects)\n", " * [Mix-in Methods](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#mix-in-methods)\n", " * [ObjectArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#objectarray)\n", " * [StringArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#stringarray)\n", " * [Non-contiguousness](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#non-contiguousness)\n", " * [ChunkedArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#chunkedarray)\n", " * [AppendableArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#appendablearray)\n", " * [Laziness](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#laziness)\n", " * [VirtualArray](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/classes.adoc#virtualarray)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mutability\n", "\n", "Awkward Arrays are considered immutable in the sense that elements of the data cannot be modified in-place. That is, assignment with square brackets at an integer index raises an error. Awkward does not prevent the underlying Numpy arrays from being modified in-place, though that can lead to confusing results—the behavior is left undefined. The reason for this omission in functionality is that the internal representation of columnar data structures is more constrained than their non-columnar counterparts: some in-place modification can't be defined, and others have surprising side-effects.\n", "\n", "However, the Python objects representing Awkward Arrays can be changed in-place. Each class has properties defining its structure, such as `content`, and these may be replaced at any time. (Replacing properties does not change values in any Numpy arrays.) 
In fact, this is the only way to build cyclic references: an object in Python must be assigned to a name before that name can be used as a reference.\n", "\n", "Awkward Arrays are appendable, but only through `AppendableArray`, and `Table` columns may be added, changed, or removed. The only use of square-bracket assignment (i.e. `__setitem__`) is to modify `Table` columns.\n", "\n", "Awkward Arrays produced by an external program may grow continuously, as long as more deeply nested arrays are filled first. That is, the `content` of a `JaggedArray` must be updated before updating its structure arrays (`starts` and `stops`). The definitions of Awkward Array validity allow for nested elements with no references pointing at them (\"unreachable\" elements), but not for references pointing to a nested element that doesn't exist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relationship to Arrow\n", "\n", "[Apache Arrow](https://arrow.apache.org) is a cross-language, columnar memory format for complex data structures. There is intentionally a high degree of overlap between Awkward Array and Arrow. But whereas Arrow's focus is data portability, Awkward's focus is computation: it would not be unusual to get data from Arrow, compute something with Awkward Array, then return it to another Arrow buffer. For this reason, `awkward0.fromarrow` is a zero-copy view. Awkward's data representation is broader than Arrow's, so `awkward0.toarrow` does, in general, perform a copy.\n", "\n", "The main difference between Awkward Array and Arrow is that Awkward Array does not require all arrays to be included within a contiguous memory buffer, though libraries like [pyarrow](https://arrow.apache.org/docs/python) relax this criterion while building a compliant Arrow buffer. 
This restriction does imply that Arrow cannot encode cross-references or cyclic dependencies.\n", "\n", "Arrow also doesn't have the luxury of relying on Numpy to define its [primitive arrays](https://arrow.apache.org/docs/memory_layout.html#primitive-value-arrays), so it has a fixed endianness, cannot express regular tensors except as jagged arrays, and requires 32-bit integers for indexing, instead of taking whatever integer type a user provides.\n", "\n", "[Nullability](https://arrow.apache.org/docs/memory_layout.html#null-bitmaps) is an optional property of every data type in Arrow, but it's a structure element in Awkward. Similarly, [dictionary encoding](https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding) is built into Arrow as a fundamental property, but it would be built from an `IndexedArray` in Awkward. Chunking and lazy-loading are supported by readers such as [pyarrow](https://arrow.apache.org/docs/python), but they're not part of the Arrow data model.\n", "\n", "The following list translates Awkward Array classes and features to their Arrow counterparts, if possible.\n", "\n", "* `JaggedArray`: Arrow's [list type](https://arrow.apache.org/docs/memory_layout.html#list-type).\n", "* `Table`: Arrow's [struct type](https://arrow.apache.org/docs/memory_layout.html#struct-type), though columns can be added to or removed from Awkward `Tables` whereas Arrow is strictly immutable.\n", "* `BitMaskedArray`: every data type in Arrow potentially has a [null bitmap](https://arrow.apache.org/docs/memory_layout.html#null-bitmaps), though it's an explicit array structure in Awkward. (Arrow has no counterpart for Awkward's `MaskedArray` or `IndexedMaskedArray`.)\n", "* `UnionArray`: directly equivalent to Arrow's [dense union](https://arrow.apache.org/docs/memory_layout.html#dense-union-type). 
Arrow also has a [sparse union](https://arrow.apache.org/docs/memory_layout.html#sparse-union-type), which Awkward Array only has as a `UnionArray.fromtags` constructor that builds the dense union on the fly from a sparse union.\n", "* `ObjectArray` and `Methods`: no counterpart because Arrow must be usable in any language.\n", "* `StringArray`: \"string\" is a logical type built on top of Arrow's [list type](https://arrow.apache.org/docs/memory_layout.html#list-type).\n", "* `IndexedArray`: no counterpart (though its role in building [dictionary encoding](https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding) is built into Arrow as a fundamental property).\n", "* `SparseArray`: no counterpart.\n", "* `ChunkedArray`: no counterpart (though a reader may deal with non-contiguous data).\n", "* `AppendableArray`: no counterpart; Arrow is strictly immutable.\n", "* `VirtualArray`: no counterpart (though a reader may lazily load data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# High-level operations: common to all classes\n", "\n", "There are three levels of abstraction in Awkward Array: high-level operations for data analysis, low-level operations for engineering the structure of the data, and implementation details. Implementation details are handled in the usual way for Python: if exposed at all, class, method, and function names begin with underscores and are not guaranteed to be stable from one release to the next.\n", "\n", "The distinction between high-level operations and low-level operations is more subtle and developed as Awkward Array was put to use. Data analysts care about the logical structure of the data—whether it is jagged, what the column names are, whether certain values could be `None`, etc. Data engineers (or an analyst in \"engineering mode\") care about contiguousness, how much data are in memory at a given time, whether strings are dictionary-encoded, whether arrays have unreachable elements, etc. 
The dividing line is between high-level types and low-level array layout (both of which are defined in their own sections below). The following Awkward classes have the same high-level type as their content:\n", "\n", "* `IndexedArray` because indirection to type `T` has type `T`,\n", "* `SparseArray` because a lookup of elements with type `T` has type `T`,\n", "* `ChunkedArray` because the chunks, which must have the same type as each other, collectively have that type when logically concatenated,\n", "* `AppendableArray` because it's a special case of `ChunkedArray`,\n", "* `VirtualArray` because it produces an array of a given type on demand,\n", "* `UnionArray` has the same type as its `contents` *only if* all `contents` have the same type as each other.\n", "\n", "All other classes, such as `JaggedArray`, have a logically distinct type from their contents.\n", "\n", "This section describes a suite of operations that are common to all Awkward classes. For some high-level types, the operation is meaningless or results in an error, such as the jagged `counts` of an array that is not jagged at any level, or the `columns` of an array that contains no tables, but the operation has a well-defined action on every array class. To use these operations, you do need to understand the high-level type of your data, but not whether it is wrapped in an `IndexedArray`, a `SparseArray`, a `ChunkedArray`, an `AppendableArray`, or a `VirtualArray`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slicing with square brackets\n", "\n", "The primary operation for all classes is slicing with square brackets. This is the operation defined by Python's `__getitem__` method. 
It is so basic that high-level types are defined in terms of what they return when a scalar argument is passed in square brackets.\n", "\n", "Just as Numpy's slicing reproduces but generalizes Python sequence behavior, Awkward Array reproduces (most of) [Numpy's slicing behavior](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) and generalizes it in certain cases. An integer argument, a single slice argument, a single Numpy array-like of booleans or integers, and a tuple of any of the above are all handled just as in Numpy. Awkward Array does not handle ellipsis (because the depth of an Awkward Array can be different on different branches of a `Table` or `UnionArray`) or `None` (because it's not always possible to insert a `newaxis`). Numpy [record arrays](https://docs.scipy.org/doc/numpy/user/basics.rec.html) accept a string or sequence of strings as a column argument if it is the only argument, not in a tuple with other types. Awkward Array accepts a string or sequence of strings if it contains a `Table` at some level.\n", "\n", "An integer argument selects one element from the top-level array (starting at zero), changing the type by decreasing rank or jaggedness by one level." 
] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]])\n", "a[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Negative indexes count backward from the last element," ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([9.9])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the index (after translating negative indexes) must be at least zero and less than the length of the top-level array." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index -6 is out of bounds for axis 0 with size 5\n" ] } ], "source": [ "try:\n", " a[-6]\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A slice selects a range of elements from the top-level array, maintaining the array's type. The first index is the inclusive starting point (starting at zero) and the second index is the exclusive endpoint." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python's slice syntax (above) or literal `slice` objects may be used." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[slice(2, 4)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Negative indexes count backward from the last element and endpoints may be omitted." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[-2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start and endpoints beyond the array are not errors: they are truncated." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A skip value (third index of the slice) sets the stride for indexing, allowing you to skip elements, and this skip can be negative. It cannot, however, be zero." 
 ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Numpy array-like of booleans with the same length as the array may be used to filter elements. Numpy has a specialized [numpy.compress](https://docs.scipy.org/doc/numpy/reference/generated/numpy.compress.html) function for this operation, but the only way to get it in Awkward Array is through square brackets." ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[True, True, False, True, False]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Numpy array-like of integers, of any length, may be used to select a collection of indexes. Numpy has a specialized [numpy.take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html) function for this operation, but the only way to get it in Awkward Array is through square brackets. Negative indexes and repeated elements are handled in the same way as in Numpy." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[-1, 0, 1, 2, 2, 2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A tuple of length `N` applies selections to the first `N` levels of rank or jaggedness. 
Our example array has only two levels, so we can apply two kinds of indexes." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([4.4, 6.6, 9.9])" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2:, 0]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[True, False, True, True, False], ::-1]" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[0, 3, 0], 1::]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As described in Numpy's [advanced indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing), advanced indexes (boolean or integer arrays) are broadcast and iterated as one:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 8.8])" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[0, 3], [True, False, True]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awkward Array has two extensions beyond Numpy, both of which affect only jagged data. 
If an array is jagged and a jagged array of booleans with the same structure (same length at all levels) is passed in square brackets, only the inner arrays are filtered." ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[ 1.1, 2.2, 3.3], [], [ 4.4, 5.5], [ 6.6, 7.7, 8.8], [ 9.9]])\n", "mask = awkward0.fromiter([[False, False, True], [], [True, True], [True, True, False], [False]])\n", "a[mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, if an array is jagged and a jagged array of integers with the same outer length is passed in square brackets, the elements of each inner array are filtered, duplicated, or rearranged. The inner lengths of the index need not match those of the array." ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]])\n", "index = awkward0.fromiter([[2, 2, 2, 2], [], [1, 0], [2, 1, 0], []])\n", "a[index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although all of the above use a `JaggedArray` as an example, the principles are general: you should get analogous results with jagged tables, masked jagged arrays, etc. Non-jagged arrays only support Numpy-like slicing.\n", "\n", "If an array contains a `Table`, its columns can be selected with a string or a sequence of strings, just like Numpy [record arrays](https://docs.scipy.org/doc/numpy/user/basics.rec.html)." 
] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "
] at 0x7bc6ded8cd30>" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1, \"z\": \"one\"}, {\"x\": 2, \"y\": 2.2, \"z\": \"two\"}, {\"x\": 3, \"y\": 3.3, \"z\": \"three\"}])\n", "a" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3])" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"x\"]" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'z': 'one', 'y': 1.1}, {'z': 'two', 'y': 2.2}, {'z': 'three', 'y': 3.3}]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[\"z\", \"y\"]].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like Numpy, integer indexes and string indexes commute if the integer index corresponds to a structure outside the `Table` (this condition is always met for Numpy record arrays)." 
] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "2.2" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"y\"][1]" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "2.2" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[1][\"y\"]" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ] [] []] at 0x7bc6cc15a748>" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[{\"x\": 1, \"y\": 1.1, \"z\": \"one\"}, {\"x\": 2, \"y\": 2.2, \"z\": \"two\"}], [], [{\"x\": 3, \"y\": 3.3, \"z\": \"three\"}]])\n", "a" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "2.2" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"y\"][0][1]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "2.2" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[0][\"y\"][1]" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { 
"data": { "text/plain": [ "2.2" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[0][1][\"y\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but not" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "
] at 0x7bc6e4a7b2b0>" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": [1.1]}, {\"x\": 2, \"y\": [2.1, 2.2]}, {\"x\": 3, \"y\": [3.1, 3.2, 3.3]}])\n", "a" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "3.2" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"y\"][2][1]" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "3.2" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2][\"y\"][1]" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " no column named '_util_isstringslice'\n" ] } ], "source": [ "try:\n", " a[2][1][\"y\"]\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "because" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'x': 3, 'y': [3.1, 3.2, 3.3]}" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cannot take a `1` argument before `\"y\"`.\n", "\n", "Just as integer indexes can be alternated with string/sequence of string indexes, so can slices, arrays, and tuples of slices and arrays." 
 ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.1, 3.1])" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"y\"][:, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally speaking, string and sequence of string indexes are *column* indexes, while all other types are *row* indexes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning with square brackets\n", "\n", "As discussed above, Awkward Arrays are generally immutable, with few exceptions. Row assignment is only possible via appending to an `AppendableArray`. Column assignment, reassignment, and deletion are in general allowed. Columns are assigned or reassigned by assigning to a square-bracket expression, which invokes Python's `__setitem__` method. Columns are deleted by applying the `del` statement to a square-bracket expression, which invokes Python's `__delitem__` method.\n", "\n", "Since only columns can be changed, only strings and sequences of strings are allowed as indexes." 
] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ] [] []] at 0x7bc6cc1a5390>" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[{\"x\": 1, \"y\": 1.1, \"z\": \"one\"}, {\"x\": 2, \"y\": 2.2, \"z\": \"two\"}], [], [{\"x\": 3, \"y\": 3.3, \"z\": \"three\"}]])\n", "a" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100},\n", " {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200}],\n", " [],\n", " [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300}]]" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"a\"] = awkward0.fromiter([[100, 200], [], [300]])\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'one'}, {'x': 2, 'y': 2.2, 'z': 'two'}],\n", " [],\n", " [{'x': 3, 'y': 3.3, 'z': 'three'}]]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del a[\"a\"]\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111},\n", " {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222}],\n", " [],\n", " [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333}]]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[\"a\", 
\"b\"]] = awkward0.fromiter([[{\"first\": 100, \"second\": 111}, {\"first\": 200, \"second\": 222}], [], [{\"first\": 300, \"second\": 333}]])\n", "a.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the names of the columns on the right-hand side of the assignment are irrelevant; we're setting two columns, there needs to be two columns on the right. Columns can be anonymous:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111},\n", " {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222}],\n", " [],\n", " [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333}]]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[[\"a\", \"b\"]] = awkward0.Table(awkward0.fromiter([[100, 200], [], [300]]), awkward0.fromiter([[111, 222], [], [333]]))\n", "a.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another thing to note is that the structure (lengths at all levels of jaggedness) must match if the depth is the same." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " cannot broadcast JaggedArray to match JaggedArray with a different counts\n" ] } ], "source": [ "try:\n", " a[\"c\"] = awkward0.fromiter([[100, 200, 300], [400], [500, 600]])\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if the right-hand side is shallower and can be *broadcasted* to the left-hand side, it will be. 
(See below for broadcasting.)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111, 'c': 100},\n", " {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222, 'c': 100}],\n", " [],\n", " [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333, 'c': 300}]]" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"c\"] = awkward0.fromiter([100, 200, 300])\n", "a.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numpy-like broadcasting\n", "\n", "In assignments and mathematical operations between higher-rank and lower-rank arrays, Numpy repeats values in the lower-rank array to \"fit,\" if possible, before applying the operation. This is called [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html). For example," ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([[101.1, 102.2, 103.3],\n", " [104.4, 105.5, 106.6]])" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Singletons are also expanded to fit." 
 ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([[101.1, 102.2, 103.3],\n", " [204.4, 205.5, 206.6]])" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + numpy.array([[100], [200]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awkward Arrays have the same feature, but this has particularly useful effects for jagged arrays. In an operation involving two arrays of different depths of jaggedness, the shallower one expands to fit the deeper one." ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]) + awkward0.fromiter([100, 200, 300])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `100` was broadcast to all three elements of the first inner array, `200` was broadcast to no elements in the second inner array (because the second inner array is empty), and `300` was broadcast to both elements of the third inner array.\n", "\n", "This is the columnar equivalent of accessing a variable defined outside of an inner loop." 
 ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0 101.1\n", "0 1 102.2\n", "0 2 103.3\n", "2 0 304.4\n", "2 1 305.5\n" ] } ], "source": [ "jagged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]\n", "flat = [100, 200, 300]\n", "for i in range(3):\n", "    for j in range(len(jagged[i])):\n", "        # j varies in this loop, but i is constant\n", "        print(i, j, jagged[i][j] + flat[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many translations of non-columnar code to columnar code have this form. It's often surprising to users that they don't have to do anything special to get this feature (e.g. `cross`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Support for Numpy universal functions (ufuncs)\n", "\n", "Numpy's key feature of array-at-a-time programming is mainly provided by \"universal functions\" or \"ufuncs.\" This is a special class of function that applies a scalars → scalar kernel independently to aligned elements of input arrays to return a same-shape output array. That is, for a scalars → scalar function `f(x1, ..., xN) → y`, the ufunc takes `N` input arrays of the same `shape` and returns one output array with that `shape` in which `output[i] = f(input1[i], ..., inputN[i])` for all `i`." 
 ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1., 2., 3., 4., 5.])" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# N = 1\n", "numpy.sqrt(numpy.array([1, 4, 9, 16, 25]))" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([[101.1, 202.2],\n", " [303.3, 404.4]])" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# N = 2\n", "numpy.add(numpy.array([[1.1, 2.2], [3.3, 4.4]]), numpy.array([[100, 200], [300, 400]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that a ufunc is not simply a function that has this property, but a specially named class, deriving from a type in the Numpy library." ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(<ufunc 'sqrt'>, <ufunc 'add'>)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.sqrt, numpy.add" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(True, True)" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isinstance(numpy.sqrt, numpy.ufunc), isinstance(numpy.add, numpy.ufunc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This class of functions can be overridden, and Awkward Array overrides them to recognize and properly handle Awkward Arrays." 
] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.sqrt(awkward0.fromiter([[1, 4, 9], [], [16, 25]]))" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.add(awkward0.fromiter([[[1.1], 2.2], [], [3.3, None]]), awkward0.fromiter([[[100], 200], [], [None, 300]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the primary action of the ufunc (`ufunc.__call__`) has been overridden; methods like `ufunc.at`, `ufunc.reduce`, and `ufunc.reduceat` are not supported. Also, the in-place `out` parameter is not supported because Awkward Array data cannot be changed in-place.\n", "\n", "For Awkward Arrays, the input arguments to a ufunc must all have the same structure or, if shallower, be broadcastable to the deepest structure. (See above for \"broadcasting.\") The scalar function is applied to elements at the same positions within this structure from different input arrays. The output array has this structure, populated by return values of the scalar function.\n", "\n", "* Rectangular arrays must have the same shape, just as in Numpy. A scalar can be broadcasted (expanded) to have the same shape as the arrays.\n", "* Jagged arrays must have the same number of elements in all inner arrays. A rectangular array with the same outer shape (i.e. containing scalars instead of inner arrays) can be broadcasted to inner arrays with the same lengths.\n", "* Tables must have the same sets of columns (though not necessarily in the same order). 
There is no broadcasting of missing columns.\n", "* Missing values (`None` from `MaskedArrays`) transform to missing values in every ufunc. That is, `None + 5` is `None`, `None + None` is `None`, etc.\n", "* Different data types (through a `UnionArray`) must be compatible at every site where values are included in the calculation. For instance, input arrays may contain tables with different sets of columns, but all inputs at index `i` must have the same sets of columns as each other:" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 4, 'y': 4.4}, {'y': 4.4, 'z': 400}]" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.add(awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"y\": 1.1, \"z\": 100}]),\n", " awkward0.fromiter([{\"x\": 3, \"y\": 3.3}, {\"y\": 3.3, \"z\": 300}])).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unary and binary operations on Awkward Arrays, such as `-x`, `x + y`, and `x**2`, are actually Numpy ufuncs, so all of the above applies to them as well (such as broadcasting the scalar `2` in `x**2`).\n", "\n", "Remember that only ufuncs have been overridden by Awkward Array: other Numpy functions such as `numpy.concatenate` are ignorant of Awkward Arrays and will attempt to convert them to Numpy first. In some cases, that may be what you want, but in many, especially any cases involving jagged arrays, it will be a major performance loss and a loss of functionality: jagged arrays turn into Numpy `dtype=object` arrays containing Numpy arrays, which can be a very large number of Python objects and doesn't behave as a multidimensional array.\n", "\n", "You can check to see if a function from Numpy is a ufunc with `isinstance`." 
] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isinstance(numpy.concatenate, numpy.ufunc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also prevent accidental conversions to Numpy by setting `allow_tonumpy` to `False`, either on one array or globally on a whole class of Awkward Arrays. (See \"global switches\" below.)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([array([1.1, 2.2, 3.3]), array([], dtype=float64),\n", " array([4.4, 5.5]), array([6.6, 7.7, 8.8]), array([9.9])],\n", " dtype=object)" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "y = awkward0.fromiter([[6.6, 7.7, 8.8], [9.9]])\n", "numpy.concatenate([x, y])" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " awkward0.array.base.AwkwardArray.allow_tonumpy is False; refusing to convert to Numpy\n" ] } ], "source": [ "x.allow_tonumpy = False\n", "try:\n", "    numpy.concatenate([x, y])\n", "except Exception as err:\n", "    print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Global switches\n", "\n", "The `AwkwardArray` abstract base class has the following switches to turn off sometimes-undesirable behavior. 
These switches could be set on the `AwkwardArray` class itself, affecting all Awkward Arrays, or they could be set on a particular class like `JaggedArray` to only affect `JaggedArray` instances, or they could be set on a particular instance, to affect only that instance.\n", "\n", "* `allow_tonumpy` (default is `True`); if `False`, forbid any action that would convert an Awkward Array into a Numpy array (with a likely loss of performance and functionality).\n", "* `allow_iter` (default is `True`); if `False`, forbid any action that would iterate over an Awkward Array in Python (except printing a few elements as part of its string representation).\n", "* `check_prop_valid` (default is `True`); if `False`, skip the single-property validity checks in array constructors and when setting properties.\n", "* `check_whole_valid` (default is `True`); if `False`, skip the whole-array validity checks that are typically called before methods that need them." ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.AwkwardArray.check_prop_valid" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.JaggedArray.check_whole_valid" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([array([1.1, 2.2, 3.3]), array([], dtype=float64),\n", " array([4.4, 5.5])], dtype=object)" ] }, 
"execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "numpy.array(a)" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " awkward0.array.base.AwkwardArray.allow_tonumpy is False; refusing to convert to Numpy\n" ] } ], "source": [ "a.allow_tonumpy = False\n", "try:\n", " numpy.array(a)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[array([1.1, 2.2, 3.3]), array([], dtype=float64), array([4.4, 5.5])]" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(a)" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " awkward0.array.base.AwkwardArray.allow_iter is False; refusing to iterate\n" ] } ], "source": [ "a.allow_iter = False\n", "try:\n", " list(a)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generic properties and methods\n", "\n", "All Awkward Arrays have the following properties and methods." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `type`: the high-level type of the array. (See below for a detailed description of high-level types.)" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "ArrayType(3, inf, dtype('float64'))" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.type" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> [0, inf) -> float64\n" ] } ], "source": [ "print(a.type)" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "ArrayType(3, inf, OptionType(UnionType(dtype('float64'), ArrayType(inf, dtype('float64')), TableType(x=dtype('int64'), y=TableType(z=dtype('int64'))))))" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.type" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> [0, inf) -> ?((float64 
|\n", " [0, inf) -> float64 |\n", " 'x' -> int64\n", " 'y' -> 'z' -> int64 ))\n" ] } ], "source": [ "print(b.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " * `layout`: the low-level layout of the array. (See below for a detailed description of low-level layouts.)" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])\n", "[ 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2] ndarray(shape=5, dtype=dtype('float64'))" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.layout" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])\n", "[ 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2] IndexedMaskedArray(mask=layout[2, 0], content=layout[2, 1], maskedwhen=-1)\n", "[ 2, 0] ndarray(shape=10, dtype=dtype('int64'))\n", "[ 2, 1] UnionArray(tags=layout[2, 1, 0], index=layout[2, 1, 1], contents=[layout[2, 1, 2], layout[2, 1, 3], layout[2, 1, 4]])\n", "[ 2, 1, 0] ndarray(shape=7, dtype=dtype('uint8'))\n", "[ 2, 1, 1] ndarray(shape=7, dtype=dtype('int64'))\n", "[ 2, 1, 2] ndarray(shape=4, dtype=dtype('float64'))\n", "[ 2, 1, 3] JaggedArray(starts=layout[2, 1, 3, 0], stops=layout[2, 1, 3, 1], content=layout[2, 1, 3, 2])\n", "[ 2, 1, 3, 0] ndarray(shape=1, dtype=dtype('int64'))\n", "[ 2, 1, 3, 1] ndarray(shape=1, dtype=dtype('int64'))\n", "[ 2, 1, 3, 2] ndarray(shape=1, dtype=dtype('float64'))\n", "[ 2, 1, 4] Table(x=layout[2, 1, 4, 0], y=layout[2, 1, 4, 1])\n", "[ 2, 1, 4, 0] ndarray(shape=2, dtype=dtype('int64'))\n", "[ 2, 1, 4, 
1] Table(z=layout[2, 1, 4, 1, 0])\n", "[2, 1, 4, 1, 0] ndarray(shape=2, dtype=dtype('int64'))" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.layout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `dtype`: the [Numpy dtype](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html) that this array would have if cast as a Numpy array. Numpy dtypes cannot fully specify Awkward Arrays: use the `type` for an analyst-friendly description of the data type or `layout` for details about how the arrays are represented." ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "dtype('O')" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a.dtype # the closest Numpy dtype to a jagged array is dtype=object ('O')" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([array([1.1, 2.2, 3.3]), array([], dtype=float64),\n", " array([4.4, 5.5])], dtype=object)" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy.array(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `shape`: the [Numpy shape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html) that this array would have if cast as a Numpy array. This only specifies the first regular dimensions, not any jagged dimensions or regular dimensions nested within Awkward structures. The Python length (`__len__`) of the array is the first element of this `shape`." 
] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(3,)" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a.shape" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following `JaggedArray` has two fixed-size dimensions at the top, followed by a jagged dimension inside of that. The shape only represents the first few dimensions." ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.JaggedArray.fromcounts([[3, 0], [2, 4]], [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])\n", "a" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(2, 2)" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.shape" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"len(a)" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 2) -> [0, 2) -> [0, inf) -> float64\n" ] } ], "source": [ "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, a dimension can effectively be fixed-size, but represented by a `JaggedArray`. The `shape` does not encompass any dimensions represented by a `JaggedArray`." ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same structure, but it's JaggedArrays all the way down.\n", "b = a.structure1d()\n", "b" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(2,)" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `size`: the product of `shape`, as in Numpy." 
] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(2, 2)" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.shape" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `nbytes`: the total number of bytes in all memory buffers referenced by the array, not including bytes in Python objects (which are Python-implementation dependent, not even available in PyPy). Same as the Numpy property of the same name." ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "72" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a.nbytes" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "72" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.offsets.nbytes + a.content.nbytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `tolist()`: converts the array into Python objects: `lists` for arrays, `dicts` for table rows, `tuples` for table rows with anonymous fields and a `rowname` of `\"tuple\"`, `None` for missing data, and Python objects from `ObjectArrays`. 
This is an approximate inverse of `awkward0.fromiter`." ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[1.1, 2.2, 3.3], [], [4.4, 5.5]]" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]).tolist()" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3}]" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}]).tolist()" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1.1), (2, 2.2), (3, 3.3)]" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.Table.named(\"tuple\", [1, 2, 3], [1.1, 2.2, 3.3]).tolist()" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[1.1, 2.2, None], [], [None, 3.3]]" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromiter([[1.1, 2.2, None], [], [None, 3.3]]).tolist()" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, 
"execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class Point:\n", "    def __init__(self, x, y):\n", "        self.x, self.y = x, y\n", "    def __repr__(self):\n", "        return f\"Point({self.x}, {self.y})\"\n", "\n", "a = awkward0.fromiter([[Point(1, 1.1), Point(2, 2.2), Point(3, 3.3)], [], [Point(4, 4.4), Point(5, 5.5)]])\n", "a" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[Point(1, 1.1), Point(2, 2.2), Point(3, 3.3)],\n", " [],\n", " [Point(4, 4.4), Point(5, 5.5)]]" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `valid(exception=False, message=False)`: manually invoke the whole-array validity checks on the top-level array (not recursively). With the default options, this function returns `True` if valid and `False` if not. If `exception=True`, it returns nothing on success and raises the appropriate exception on failure. If `message=True`, it returns `None` on success and the error string on failure. 
(TODO: `recursive=True`?)" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.JaggedArray.fromcounts([3, 0, 2], [1.1, 2.2, 3.3, 4.4]) # content array is too short\n", "a.valid()" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " maximum offset 5 is beyond the length of the content (4)\n" ] } ], "source": [ "try:\n", " a.valid(exception=True)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "\": maximum offset 5 is beyond the length of the content (4)\"" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.valid(message=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `astype(dtype)`: convert *nested Numpy arrays* into the given type while maintaining Awkward structure." 
] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a.astype(numpy.int32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `regular()`: convert the Awkward Array into a Numpy array and (unlike `numpy.array(awkward_array)`) raise an error if it cannot be faithfully represented." ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This JaggedArray happens to have equal-sized inner arrays.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6], [7.7, 8.8, 9.9]])\n", "a" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([[1.1, 2.2, 3.3],\n", " [4.4, 5.5, 6.6],\n", " [7.7, 8.8, 9.9]])" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.regular()" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This one does not.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { 
"outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " jagged array is not regular: different elements have different counts\n" ] } ], "source": [ "try:\n", "    a.regular()\n", "except Exception as err:\n", "    print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `copy(optional constructor arguments...)`: copy an Awkward Array object, non-recursively and without copying memory buffers, possibly replacing some of its parameters. If the class is an Awkward subclass or has mix-in methods, they are propagated to the copy." ] }, { "cell_type": "code", "execution_count": 155, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "class Special:\n", "    def get(self, index):\n", "        try:\n", "            return self[index]\n", "        except IndexError:\n", "            return None\n", "\n", "JaggedArrayMethods = awkward0.Methods.mixin(Special, awkward0.JaggedArray)" ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "a.__class__ = JaggedArrayMethods\n", "a" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([4.4, 5.5])" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"a.get(3)" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = a.copy(content=[100, 200, 300, 400, 500])\n", "b" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([400, 500])" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.get(2)" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "b.get(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, all the methods that return views of the array (like slicing) use `copy` to retain the special methods." 
] }, { "cell_type": "code", "execution_count": 162, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = a[1:]\n", "c" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([4.4, 5.5])" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c.get(1)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "c.get(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `deepcopy(optional constructor arguments...)`: like `copy`, except that it recursively copies all internal structure, including memory buffers associated with Numpy arrays." 
] }, { "cell_type": "code", "execution_count": 165, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = a.deepcopy(content=[100, 200, 300, 400, 500])\n", "b" ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Modify the structure of a (not recommended; this is a demo).\n", "a.starts[0] = 1\n", "a" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# But b is not modified. (If it were, it would start with 200.)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `empty_like(optional constructor arguments...)`\n", "* `zeros_like(optional constructor arguments...)`\n", "* `ones_like(optional constructor arguments...)`: recursively copies structure, replacing contents with new uninitialized buffers, new buffers full of zeros, or new buffers full of ones. Not usually used in analysis, but needed for implementation." 
] }, { "cell_type": "code", "execution_count": 168, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = a.zeros_like()\n", "d" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e = a.ones_like()\n", "e" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reducers\n", "\n", "All Awkward Arrays also have a complete set of reducer methods. Reducers can be found in Numpy as well (as array methods and as free-standing functions), but they're not called out as a special class the way that universal functions (\"ufuncs\") are. Reducers decrease the rank or jaggedness of an array by one dimension, replacing subarrays with scalars. Examples include `sum`, `min`, and `max`, but any monoid (associative operation with an identity) can be a reducer.\n", "\n", "In Awkward Array, reducers are only array methods (not free-standing functions) and unlike Numpy, they do not take an `axis` parameter. When a reducer is called at any level, it reduces the innermost dimension. 
(Since outer dimensions can be jagged, this is the only dimension that can be meaningfully reduced.)" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[[[1, 2], [3]], [[4, 5]]], [[[], [6, 7, 8, 9]]]])\n", "a" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum()" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum().sum()" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([15, 30])" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum().sum().sum()" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "45" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum().sum().sum().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following example, \"the deepest axis\" of different fields in the table are at different depths: 
singly jagged in `\"x\"` and doubly jagged in `\"y\"`. The `sum` reduces each depth by one, producing a flat array in `\"x\"` and a singly jagged array in `\"y\"`." ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': [], 'y': [[0.1, 0.2], [], [0.3]]},\n", " {'x': [1, 2, 3], 'y': [[0.4], [], [0.5, 0.6]]}]" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": [], \"y\": [[0.1, 0.2], [], [0.3]]}, {\"x\": [1, 2, 3], \"y\": [[0.4], [], [0.5, 0.6]]}])\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 0, 'y': [0.30000000000000004, 0.0, 0.3]},\n", " {'x': 6, 'y': [0.4, 0.0, 1.1]}]" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum().tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This sum cannot be reduced again because `\"x\"` is not jagged (would reduce to a scalar) and `\"y\"` is (would reduce to an array). The result cannot be a scalar in one field (a single row, not a collection) and an array in another field (a collection)."
] }, { "cell_type": "code", "execution_count": 177, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " some Table columns are jagged and others are not\n" ] } ], "source": [ "try:\n", " a.sum().sum()\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A table can be reduced if all of its fields are jagged or if all of its fields are not jagged; here's an example of the latter." ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3}]" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}])\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting object is a scalar row—for your convenience, it has been labeled with the reducer that produced it." 
] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isinstance(a.sum(), awkward0.Table.Row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`UnionArrays` are even more constrained: they can only be reduced if they have primitive (Numpy) type." ] }, { "cell_type": "code", "execution_count": 181, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " ] at 0x7bc6cc0d5470>" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([1, 2, 3, {\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}])\n", "a" ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " cannot reduce a UnionArray of non-primitive type\n" ] } ], "source": [ "try:\n", " a.sum()\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],\n", " [numpy.array([1, 2, 3], dtype=numpy.int32),\n", " numpy.array([4, 5], dtype=numpy.float64)])\n", "a" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { 
"data": { "text/plain": [ "15.0" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In all reducers, `NaN` in floating-point arrays and `None` in `MaskedArrays` are skipped, so these reducers are more like `numpy.nansum`, `numpy.nanmax`, and `numpy.nanmin`, but generalized to all nullable types." ] }, { "cell_type": "code", "execution_count": 185, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[[[1.1, numpy.nan], [2.2]], [[None, 3.3]]], [[[], [None, numpy.nan, None]]]])\n", "a" ] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.sum()" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1}, None, {'x': 3, 'y': 3.3}], [], [{'x': 4, 'y': nan}]]" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[{\"x\": 1, \"y\": 1.1}, None, {\"x\": 3, \"y\": 3.3}], [], [{\"x\": 4, \"y\": numpy.nan}]])\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 4, 'y': 4.4}, {'x': 0, 'y': 0.0}, {'x': 4, 'y': 0.0}]" ] }, "execution_count": 188, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "a.sum().tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following reducers are defined as methods on all Awkward Arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `reduce(ufunc, identity)`: generic reducer, calls `ufunc.reduceat` and returns `identity` for empty arrays." ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "# numba.vectorize makes new ufuncs (requires type signatures and a kernel function)\n", "import numba\n", "@numba.vectorize([numba.int64(numba.int64, numba.int64)])\n", "def sum_mod_10(x, y):\n", " return (x + y) % 10" ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 6, 0, 15, 34])" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1, 2, 3], [], [4, 5, 6], [7, 8, 9, 10]])\n", "a.sum()" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([6, 0, 5, 4])" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.reduce(sum_mod_10, 0)" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([6, 0, 0, 4])" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Missing (None) values are ignored.\n", "a = awkward0.fromiter([[1, 2, None, 3], [], [None, None, None], 
[7, 8, 9, 10]])\n", "a.reduce(sum_mod_10, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `any()`: boolean reducer, returns `True` if any (logical or) of the elements of an array are `True`, returns `False` for empty arrays." ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, True, False])" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[False, False], [True, True], [True, False], []])\n", "a.any()" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, False])" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Missing (None) values are ignored.\n", "a = awkward0.fromiter([[False, None], [True, None], [None]])\n", "a.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `all()`: boolean reducer, returns `True` if all (logical and) of the elements of an array are `True`, returns `True` for empty arrays." 
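 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The empty-array results of `any()` and `all()` are the identities of logical or and logical and. Python's built-in `any` and `all` and Numpy's `logical_or.reduce` and `logical_and.reduce` follow the same convention. The cell below is a minimal sketch in plain Python and Numpy (it does not call Awkward Array) that checks this convention." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "empty = numpy.array([], dtype=bool)\n", "\n", "# Logical or over nothing is False; logical and over nothing is True.\n", "assert any([]) is False\n", "assert all([]) is True\n", "assert not numpy.logical_or.reduce(empty)\n", "assert numpy.logical_and.reduce(empty)"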
] }, { "cell_type": "code", "execution_count": 195, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, False, True])" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[False, False], [True, True], [True, False], []])\n", "a.all()" ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, True])" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Missing (None) values are ignored.\n", "a = awkward0.fromiter([[False, None], [True, None], [None]])\n", "a.all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `count()`: returns the (integer) number of elements in an array, skipping `None` and `NaN`." ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1])" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])\n", "a.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `count_nonzero()`: returns the (integer) number of non-zero elements in an array, skipping `None` and `NaN`." 
] }, { "cell_type": "code", "execution_count": 198, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1])" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None, 0], [], [3.3, numpy.nan, 0]])\n", "a.count_nonzero()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `sum()`: returns the sum of each array, skipping `None` and `NaN`, returning 0 for empty arrays." ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([3.3, 0. , 3.3])" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])\n", "a.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `prod()`: returns the product (multiplication) of each array, skipping `None` and `NaN`, returning 1 for empty arrays." ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2.42, 1. , 3.3 ])" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])\n", "a.prod()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `min()`: returns the minimum number in each array, skipping `None` and `NaN`, returning infinity or the largest possible integer for empty arrays. 
(Note that Numpy raises errors for empty arrays.)" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, inf, 3.3])" ] }, "execution_count": 201, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])\n", "a.min()" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1, 9223372036854775807, 3])" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1, 2, None], [], [3]])\n", "a.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The identity of minimization is `inf` for floating-point values and `9223372036854775807` for `int64` because minimization with any other value would return the other value. This is more convenient for data analysts than raising an error because empty inner arrays are common." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `max()`: returns the maximum number in each array, skipping `None` and `NaN`, returning negative infinity or the smallest possible integer for empty arrays. 
(Note that Numpy raises errors for empty arrays.)" ] }, { "cell_type": "code", "execution_count": 203, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2.2, -inf, 3.3])" ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])\n", "a.max()" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2, -9223372036854775808, 3])" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1, 2, None], [], [3]])\n", "a.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The identity of maximization is `-inf` for floating-point values and `-9223372036854775808` for `int64` because maximization with any other value would return the other value. This is more convenient for data analysts than raising an error because empty inner arrays are common.\n", "\n", "Note that the maximization-identity for unsigned types is `0`." 
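 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to see where these identities come from is to ask Numpy for the extreme value of each dtype. The cell below is a minimal sketch, not the library's implementation: `max_identity` is a hypothetical helper, and the `reduceat` shortcut assumes the inner arrays are stored back to back in order. It fills the output with the identity and then overwrites the non-empty inner arrays, so the empty one keeps `0` for `uint16`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "def max_identity(dtype):\n", "    # Hypothetical helper (not part of awkward0 or Numpy): the value that\n", "    # leaves a running maximum unchanged for the given dtype.\n", "    dtype = numpy.dtype(dtype)\n", "    if numpy.issubdtype(dtype, numpy.floating):\n", "        return -numpy.inf\n", "    return numpy.iinfo(dtype).min  # 0 for unsigned integer types\n", "\n", "content = numpy.array([1, 2, 3, 4, 5], dtype=numpy.uint16)\n", "counts = numpy.array([3, 0, 2])\n", "starts = numpy.cumsum(counts) - counts\n", "\n", "# Fill with the identity, then overwrite the non-empty inner arrays.\n", "# reduceat runs from each start to the next one, so this shortcut assumes\n", "# the inner arrays are stored back to back in order.\n", "out = numpy.full(len(counts), max_identity(content.dtype), dtype=content.dtype)\n", "nonempty = counts > 0\n", "out[nonempty] = numpy.maximum.reduceat(content, starts[nonempty])\n", "print(out.tolist())  # [3, 0, 5]"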
] }, { "cell_type": "code", "execution_count": 205, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 205, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.JaggedArray.fromcounts([3, 0, 2], numpy.array([1.1, 2.2, 3.3, 4.4, 5.5], dtype=numpy.uint16))\n", "a" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([3, 0, 5], dtype=uint16)" ] }, "execution_count": 206, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions like mean and standard deviation aren't true reducers because they're not associative (`mean(mean(x1, x2, x3), mean(x4, x5))` is not equal to `mean(mean(x1, x2), mean(x3, x4, x5))`). However, they're useful methods that exist on all Awkward Arrays, defined in terms of reducers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `moment(n, weight=None)`: returns the `n`th moment of each array (a floating-point value), skipping `None` and `NaN`, returning `NaN` for empty arrays. If `weight` is given, it is taken as an array of weights, which may have the same structure as the `array` or be broadcastable to it, though any broadcasted weights would have no effect on the moment." 
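 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Judging from the outputs shown in this section, `moment(n)` behaves like the raw (non-central) weighted moment of each inner array. As a sketch under that assumption, with values $x_i$ and weights $w_i$ (all equal to 1 when `weight` is omitted):\n", "\n", "$$\\mathrm{moment}_n = \\frac{\\sum_i w_i \\, x_i^n}{\\sum_i w_i}$$\n", "\n", "For the inner array `[1, 2, 3]` with unit weights, the second moment is $(1^2 + 2^2 + 3^2)/3 = 14/3 \\approx 4.667$. With weights `[1, 10, 100]`, the first moment is $(1 \\cdot 1 + 10 \\cdot 2 + 100 \\cdot 3)/(1 + 10 + 100) = 321/111 \\approx 2.892$. A constant weight $c$ cancels between numerator and denominator, which is why a weight that is constant across an inner array has no effect. An empty inner array gives $0/0$, hence `NaN`.\n", "\n", "Under the same assumption, the variance with `ddof=0` is $\\mathrm{moment}_2 - \\mathrm{moment}_1^2$, for example $14/3 - 2^2 = 2/3$ for `[1, 2, 3]`."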
] }, { "cell_type": "code", "execution_count": 207, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])" ] }, { "cell_type": "code", "execution_count": 208, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2. , nan, 4.5])" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(1)" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 4.66666667, nan, 20.5 ])" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the first moment (mean) with a weight broadcasted from a scalar and from a non-jagged array, to show how it doesn't affect the result. The moment is calculated over an inner array, so if a constant value is broadcasted to all elements of that inner array, they all get the same weight." ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2. , nan, 4.5])" ] }, "execution_count": 210, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(1)" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2. 
, nan, 4.5])" ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(1, 100)" ] }, { "cell_type": "code", "execution_count": 212, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2. , nan, 4.5])" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(1, numpy.array([100, 200, 300]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only when the weight varies across an inner array does it have an effect." ] }, { "cell_type": "code", "execution_count": 213, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2.89189189, nan, 5. ])" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.moment(1, awkward0.fromiter([[1, 10, 100], [], [0, 100]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `mean(weight=None)`: returns the mean of each array (a floating-point value), skipping `None` and `NaN`, returning `NaN` for empty arrays, using optional `weight` as above." ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([2. , nan, 4.5])" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])\n", "a.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `var(weight=None, ddof=0)`: returns the variance of each array (a floating-point value), skipping `None` and `NaN`, returning `NaN` for empty arrays, using optional `weight` as above. 
The `ddof` or \"Delta Degrees of Freedom\" replaces a divisor of `N` (count or sum of weights) with a divisor of `N - ddof`, following [numpy.var](https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html)." ] }, { "cell_type": "code", "execution_count": 215, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([0.66666667, nan, 0.25 ])" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])\n", "a.var()" ] }, { "cell_type": "code", "execution_count": 216, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1. , nan, 0.5])" ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.var(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `std(weight=None, ddof=0)`: returns the standard deviation of each array, the square root of the variance described above." ] }, { "cell_type": "code", "execution_count": 217, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([0.81649658, nan, 0.5 ])" ] }, "execution_count": 217, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.std()" ] }, { "cell_type": "code", "execution_count": 218, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1. 
, nan, 0.70710678])" ] }, "execution_count": 218, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.std(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Properties and methods for jaggedness\n", "\n", "All Awkward Arrays have these methods, but they provide information about the first nested `JaggedArray` within a structure. If, for instance, the `JaggedArray` is within some structure that doesn't affect high-level type (e.g. `IndexedArray`, `ChunkedArray`, `VirtualArray`), then the methods are passed through to the `JaggedArray`. If it's nested within something that does change type, but can meaningfully pass on the call, such as `MaskedArray`, then that's what they do. If, however, it reaches a `Table`, which may have some jagged columns and some non-jagged columns, the propagation stops.\n", "\n", "* `counts`: Numpy array of the number of elements in each inner array of the shallowest `JaggedArray`. The `counts` may have rank > 1 if there are any fixed-size dimensions before the `JaggedArray`." 
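 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a plain `JaggedArray` described by `starts` and `stops`, the counts can be computed as `stops - starts`. The cell below is a minimal sketch in plain Numpy; the hand-written `starts` and `stops` are an assumed layout for `[[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]`, not values read from the library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "# Assumed layout: each inner array is content[starts[i]:stops[i]].\n", "starts = numpy.array([0, 3, 3, 5])\n", "stops = numpy.array([3, 3, 5, 9])\n", "\n", "counts = stops - starts\n", "print(counts.tolist())         # [3, 0, 2, 4]\n", "print((counts >= 1).tolist())  # [True, False, True, True]"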
] }, { "cell_type": "code", "execution_count": 219, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([3, 0, 2, 4])" ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "a.counts" ] }, { "cell_type": "code", "execution_count": 220, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 3, 0, -1, 4])" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# MaskedArrays return -1 for missing values.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], None, [6.6, 7.7, 8.8, 9.9]])\n", "a.counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A missing inner array (counts is `-1`) is distinct from an empty inner array (counts is `0`), but if you want to ensure that you're working with data that have at least `N` elements, `counts >= N` works." 
] }, { "cell_type": "code", "execution_count": 221, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ True, False, False, True])" ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.counts >= 1" ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 222, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[a.counts >= 1]" ] }, { "cell_type": "code", "execution_count": 223, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 3, 0, -1, 4])" ] }, "execution_count": 223, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# UnionArrays return -1 for non-jagged arrays mixed with jagged arrays.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], 999, [6.6, 7.7, 8.8, 9.9]])\n", "a.counts" ] }, { "cell_type": "code", "execution_count": 224, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 3, 0, -1, 4])" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same for tabular data, regardless of whether they contain nested jagged arrays.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], {\"x\": 1, \"y\": [1.1, 1.2, 1.3]}, [6.6, 7.7, 8.8, 9.9]])\n", "a.counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note! This means that pure `Tables` will always return `-1` for counts, regardless of what they contain."
] }, { "cell_type": "code", "execution_count": 225, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([-1, -1, -1])" ] }, "execution_count": 225, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": [], \"y\": []}, {\"x\": [1], \"y\": [1.1]}, {\"x\": [1, 2], \"y\": [1.1, 2.2]}])\n", "a.counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If all of the columns of a `Table` are `JaggedArrays` with the same structure, you probably want to zip them into a single `JaggedArray`." ] }, { "cell_type": "code", "execution_count": 226, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "] [ ]] at 0x7bc663c99278>" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = awkward0.JaggedArray.zip(x=a.x, y=a.y)\n", "b" ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2])" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `flatten(axis=0)`: removes one level of structure (losing information about boundaries between inner arrays) at a depth of jaggedness given by `axis`." 
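 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conceptually, flattening one level concatenates the inner arrays in their logical order. The cell below is a minimal sketch in plain Numpy under an assumed, hand-written layout (`starts`, `stops`, and `content` are made up for illustration); the Python loop is only for clarity and is not how the library is expected to do it. Because the inner arrays are stored out of order and `content` holds an unreachable element, the flattened result differs from `content` itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "# Assumed layout for [[1.0, 2.0, 3.0], [], [4.0, 5.0]], stored out of order.\n", "# The value 999.0 is never referenced by any (start, stop) pair.\n", "content = numpy.array([4.0, 5.0, 999.0, 1.0, 2.0, 3.0])\n", "starts = numpy.array([3, 1, 0])\n", "stops = numpy.array([6, 1, 2])\n", "\n", "# Logical flattening: concatenate each inner array in order.\n", "flat = numpy.concatenate([content[i:j] for i, j in zip(starts, stops)])\n", "print(flat.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]"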
] }, { "cell_type": "code", "execution_count": 228, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])" ] }, "execution_count": 228, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "a.flatten()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike a `JaggedArray`'s `content`, which is part of its low-level layout, `flatten()` performs a high-level logical operation. Here's an example of the distinction." ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# JaggedArray with an unusual but valid structure.\n", "a = awkward0.JaggedArray([3, 100, 0, 6], [6, 100, 2, 10],\n", " [4.4, 5.5, 999, 1.1, 2.2, 3.3, 6.6, 7.7, 8.8, 9.9, 123])\n", "a" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten() # gives you a logically flattened array" ] }, { "cell_type": "code", "execution_count": 231, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 4.4, 5.5, 999. , 1.1, 2.2, 3.3, 6.6, 7.7, 8.8,\n", " 9.9, 123. 
])" ] }, "execution_count": 231, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.content # gives you an internal structure component of the array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many cases, the output of `flatten()` corresponds to the output of `content`, but be aware of the difference and use the one you want.\n", "\n", "With `flatten(axis=1)`, we can internally flatten nested `JaggedArrays`." ] }, { "cell_type": "code", "execution_count": 232, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[[1.1, 2.2], [3.3]], [], [[4.4, 5.5]], [[6.6, 7.7, 8.8], [], [9.9]]])\n", "a" ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten(axis=0)" ] }, { "cell_type": "code", "execution_count": 234, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even if a `JaggedArray`'s inner structure is due to a fixed-shape Numpy array, the `axis` parameter propagates down and does the right thing." 
] }, { "cell_type": "code", "execution_count": 235, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.JaggedArray.fromcounts(numpy.array([3, 0, 2]),\n", " numpy.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]))\n", "a" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(a.content)" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, unlike Numpy, we can't ask for an `axis` starting from the other end (with a negative index). The \"deepest array\" is not a well-defined concept for Awkward Arrays." 
] }, { "cell_type": "code", "execution_count": 238, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " axis must be a non-negative integer (can't count from the end)\n" ] } ], "source": [ "try:\n", " a.flatten(axis=-1)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[[1.1, 2.2], [3.3]], [], None, [[6.6, 7.7, 8.8], [], [9.9]]])\n", "a" ] }, { "cell_type": "code", "execution_count": 240, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 240, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `pad(length, maskedwhen=True, clip=False)`: ensures that each inner array has at least `length` elements by filling in the empty spaces with `None` (i.e. by inserting a `MaskedArray` layer). The `maskedwhen` parameter determines whether `mask[i] == True` means the element is `None` (`maskedwhen=True`) or not `None` (`maskedwhen=False`). Setting `maskedwhen` doesn't change the logical meaning of the array. If `clip=True`, then the inner arrays will have exactly `length` elements (by clipping the ones that are too long). Even though this results in regular sizes, they are still represented by a `JaggedArray`." 
] }, { "cell_type": "code", "execution_count": 241, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 241, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "a" ] }, { "cell_type": "code", "execution_count": 242, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 242, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3)" ] }, { "cell_type": "code", "execution_count": 243, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3, maskedwhen=False)" ] }, { "cell_type": "code", "execution_count": 244, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 244, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3, clip=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get rid of the `MaskedArray` layer, replace `None` with some value." 
] }, { "cell_type": "code", "execution_count": 245, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 245, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3).fillna(-999)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to make an effectively regular array into a real Numpy array, use `regular`." ] }, { "cell_type": "code", "execution_count": 246, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([[1.1, 2.2, 3.3],\n", " [0. , 0. , 0. ],\n", " [4.4, 5.5, 0. ],\n", " [6.6, 7.7, 8.8]])" ] }, "execution_count": 246, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3, clip=True).fillna(0).regular()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a `JaggedArray` is nested within some other type, `pad` will propagate down to it." 
] }, { "cell_type": "code", "execution_count": 247, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 247, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], None, [4.4, 5.5], None])\n", "a" ] }, { "cell_type": "code", "execution_count": 248, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 248, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3)" ] }, { "cell_type": "code", "execution_count": 249, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': [1, 1], 'y': [1.1, 2.2, 3.3]},\n", " {'x': [2, 2], 'y': []},\n", " {'x': [3, 3], 'y': [4.4, 5.5]},\n", " {'x': [4, 4], 'y': [6.6, 7.7, 8.8, 9.9]}]" ] }, "execution_count": 249, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=[[1, 1], [2, 2], [3, 3], [4, 4]],\n", " y=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 250, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': [1, 1, None], 'y': [1.1, 2.2, 3.3]},\n", " {'x': [2, 2, None], 'y': [None, None, None]},\n", " {'x': [3, 3, None], 'y': [4.4, 5.5, None]},\n", " {'x': [4, 4, None], 'y': [6.6, 7.7, 8.8, 9.9]}]" ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3).tolist()" ] }, { "cell_type": "code", "execution_count": 251, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { 
"outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': [1, 1, None], 'y': [1.1, 2.2, 3.3]},\n", " {'x': [2, 2, None], 'y': [None, None, None]},\n", " {'x': [3, 3, None], 'y': [4.4, 5.5, None]},\n", " {'x': [4, 4, None], 'y': [6.6, 7.7, 8.8]}]" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3, clip=True).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you pass a `pad` through a `Table`, be sure that every field in each record is a nested array (and therefore can be padded)." ] }, { "cell_type": "code", "execution_count": 252, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': [1.1, 2.2, 3.3]},\n", " {'x': 2, 'y': []},\n", " {'x': 3, 'y': [4.4, 5.5]},\n", " {'x': 4, 'y': [6.6, 7.7, 8.8, 9.9]}]" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=[1, 2, 3, 4],\n", " y=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pad cannot be applied to scalars\n" ] } ], "source": [ "try:\n", " a.pad(3)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same goes for `UnionArrays`." 
] }, { "cell_type": "code", "execution_count": 254, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3, [1, 2, 3]], [], [4.4, 5.5, [4, 5]]])\n", "a" ] }, { "cell_type": "code", "execution_count": 255, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 255, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(5)" ] }, { "cell_type": "code", "execution_count": 256, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],\n", " [awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),\n", " awkward0.fromiter([[100, 101], [102]])])\n", "a" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(3)" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],\n", " [awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),\n", " awkward0.fromiter([100, 200])])\n", "a" ] }, { 
"cell_type": "code", "execution_count": 259, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pad cannot be applied to scalars\n" ] } ], "source": [ "try:\n", " a.pad(3)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The general behavior of `pad` is to replace the shallowest `JaggedArray` with a `JaggedArray` containing a `MaskedArray`. The one exception to this type signature is that `StringArrays` are padded with characters." ] }, { "cell_type": "code", "execution_count": 260, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([\"one\", \"two\", \"three\"])\n", "a" ] }, { "cell_type": "code", "execution_count": 261, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(4, clip=True)" ] }, { "cell_type": "code", "execution_count": 262, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pad(4, maskedwhen=b\".\", clip=True)" ] }, { "cell_type": "code", "execution_count": 263, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 263, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "a.pad(4, maskedwhen=b\"\\x00\", clip=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `argmin()` and `argmax()`: returns the index of the minimum or maximum value in a non-jagged array or the indexes where each inner array is minimized or maximized. The jagged structure of the return value consists of empty arrays for each empty array and singleton arrays for non-empty ones, consisting of a single index in an inner array. This is the form needed to extract one element from each inner array using jagged indexing." ] }, { "cell_type": "code", "execution_count": 264, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[-3.3, 5.5, -8.8], [], [-6.6, 0.0, 2.2, 3.3], [], [2.2, -2.2, 4.4]])\n", "absa = abs(a)" ] }, { "cell_type": "code", "execution_count": 265, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 265, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 266, "metadata": {}, "output_type": "execute_result" } ], "source": [ "absa" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = absa.argmax()\n", "index" ] }, { "cell_type": "code", "execution_count": 268, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { 
"outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 268, "metadata": {}, "output_type": "execute_result" } ], "source": [ "absa[index]" ] }, { "cell_type": "code", "execution_count": 269, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 269, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `cross(other, nested=False)` and `argcross(other, nested=False)`: returns jagged tuples representing the [cross-join](https://en.wikipedia.org/wiki/Join_(SQL)#Cross_join) of `array[i]` and `other[i]` separately for each `i`. If `nested=True`, the result is doubly jagged so that each element of the output corresponds to exactly one element in the original `array`." ] }, { "cell_type": "code", "execution_count": 270, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "b = awkward0.fromiter([[\"one\", \"two\"], [\"three\"], [\"four\", \"five\", \"six\"], [\"seven\"]])\n", "a.cross(b)" ] }, { "cell_type": "code", "execution_count": 271, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 271, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.cross(b, nested=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"arg\" version returns indexes at which the appropriate objects may be found, as usual." 
] }, { "cell_type": "code", "execution_count": 272, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 272, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.argcross(b)" ] }, { "cell_type": "code", "execution_count": 273, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 273, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.argcross(b, nested=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method is good to use with `unzip`, which separates the `Table` of tuples into a left half and a right half." ] }, { "cell_type": "code", "execution_count": 274, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 274, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left, right = a.cross(b).unzip()\n", "left, right" ] }, { "cell_type": "code", "execution_count": 275, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 275, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "b = awkward0.fromiter([[1, 2], [3], [4, 5, 6], [7]])\n", "left, right = a.cross(b, nested=True).unzip()\n", "left, right" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be handy if a subsequent function takes two jagged arrays as arguments." 
] }, { "cell_type": "code", "execution_count": 276, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 276, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distance = round(abs(left - right), 1)\n", "distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross with `nested=True`, followed by some calculation on the pairs and then some reducer, is a common pattern. Because of the `nested=True` and the reducer, the resulting array has the same structure as the original." ] }, { "cell_type": "code", "execution_count": 277, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 277, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distance.min()" ] }, { "cell_type": "code", "execution_count": 278, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 278, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(a + distance.min(), 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `pairs(nested=False)` and `argpairs(nested=False)`: returns jagged tuples representing the [self-join](https://en.wikipedia.org/wiki/Join_(SQL)#Self-join) removing duplicates but not same-object pairs (i.e. a self-join with `i1 <= i2`) for each inner array separately." 
] }, { "cell_type": "code", "execution_count": 279, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 279, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[\"a\", \"b\", \"c\"], [], [\"d\", \"e\"]])\n", "a.pairs()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"arg\" and `nested=True` versions have the same meanings as with `cross` (above)." ] }, { "cell_type": "code", "execution_count": 280, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 280, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.argpairs()" ] }, { "cell_type": "code", "execution_count": 281, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 281, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pairs(nested=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as with `cross` (above), this is good to combine with `unzip` and maybe a reducer." ] }, { "cell_type": "code", "execution_count": 282, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 282, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.pairs().unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `distincts(nested=False)` and `argdistincts(nested=False)`: returns jagged tuples representing the [self-join](https://en.wikipedia.org/wiki/Join_(SQL)#Self-join) removing duplicates and same-object pairs (i.e. 
a self-join with `i1 < i2`) for each inner array separately." ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 283, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[\"a\", \"b\", \"c\"], [], [\"d\", \"e\"]])\n", "a.distincts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"arg\" and `nested=True` versions have the same meanings as with `cross` (above)." ] }, { "cell_type": "code", "execution_count": 284, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.argdistincts()" ] }, { "cell_type": "code", "execution_count": 285, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 285, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.distincts(nested=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as with `cross` (above), this is good to combine with `unzip` and maybe a reducer." ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 286, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.distincts().unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `choose(n)` and `argchoose(n)`: returns jagged tuples for distinct combinations of `n` elements from every inner array separately. 
`array.choose(2)` is the same as `array.distincts()` apart from the order of the results." ] }, { "cell_type": "code", "execution_count": 287, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 287, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[\"a\", \"b\", \"c\"], [], [\"d\", \"e\"], [\"f\", \"g\", \"h\", \"i\", \"j\"]])\n", "a" ] }, { "cell_type": "code", "execution_count": 288, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 288, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(2)" ] }, { "cell_type": "code", "execution_count": 289, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 289, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(3)" ] }, { "cell_type": "code", "execution_count": 290, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 290, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"arg\" version has the same meaning as with `cross` (above), but `choose` has no `nested=True` option because of the order in which it generates combinations."
] }, { "cell_type": "code", "execution_count": 291, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.argchoose(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as with `cross` (above), this is good to combine with `unzip` and maybe a reducer." ] }, { "cell_type": "code", "execution_count": 292, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " )" ] }, "execution_count": 292, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(2).unzip()" ] }, { "cell_type": "code", "execution_count": 293, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " ,\n", " )" ] }, "execution_count": 293, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(3).unzip()" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " ,\n", " ,\n", " )" ] }, "execution_count": 294, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.choose(4).unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `JaggedArray.zip(columns...)`: combines jagged arrays with the same structure into a single jagged array. The columns may be unnamed (resulting in a jagged array of tuples) or named with keyword arguments or dict keys (resulting in a jagged array of a table with named columns)." 
] }, { "cell_type": "code", "execution_count": 295, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 295, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[100, 200, 300], [], [400, 500]])\n", "awkward0.JaggedArray.zip(a, b)" ] }, { "cell_type": "code", "execution_count": 296, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1.1, 'y': 100}, {'x': 2.2, 'y': 200}, {'x': 3.3, 'y': 300}],\n", " [],\n", " [{'x': 4.4, 'y': 400}, {'x': 5.5, 'y': 500}]]" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.JaggedArray.zip(x=a, y=b).tolist()" ] }, { "cell_type": "code", "execution_count": 297, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1.1, 'y': 100}, {'x': 2.2, 'y': 200}, {'x': 3.3, 'y': 300}],\n", " [],\n", " [{'x': 4.4, 'y': 400}, {'x': 5.5, 'y': 500}]]" ] }, "execution_count": 297, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.JaggedArray.zip({\"x\": a, \"y\": b}).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not all of the arguments need to be jagged; those that aren't are broadcast to the right shape."
] }, { "cell_type": "code", "execution_count": 298, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 298, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([100, 200, 300])\n", "awkward0.JaggedArray.zip(a, b)" ] }, { "cell_type": "code", "execution_count": 299, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 299, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.JaggedArray.zip(a, 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Properties and methods for tabular columns\n", "\n", "All Awkward Arrays have these methods, but they provide information about the first nested `Table` within a structure. If, for instance, the `Table` is within some structure that doesn't affect high-level type (e.g. `IndexedArray`, `ChunkedArray`, `VirtualArray`), then the methods are passed through to the `Table`. If it's nested within something that does change type, but can meaningfully pass on the call, such as `MaskedArray`, then that's what they do.\n", "\n", "* `columns`: the names of the columns at the first tabular level of depth." 
] }, { "cell_type": "code", "execution_count": 300, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1, 'z': 'one'},\n", " {'x': 2, 'y': 2.2, 'z': 'two'},\n", " {'x': 3, 'y': 3.3, 'z': 'three'}]" ] }, "execution_count": 300, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1, \"z\": \"one\"}, {\"x\": 2, \"y\": 2.2, \"z\": \"two\"}, {\"x\": 3, \"y\": 3.3, \"z\": \"three\"}])\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 301, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['x', 'y', 'z']" ] }, "execution_count": 301, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.columns" ] }, { "cell_type": "code", "execution_count": 302, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1, 'z': {'a': 4, 'b': 4.4}},\n", " {'x': 2, 'y': 2.2, 'z': {'a': 5, 'b': 5.5}},\n", " {'x': 3, 'y': 3.3, 'z': {'a': 6, 'b': 6.6}}]" ] }, "execution_count": 302, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=[1, 2, 3],\n", " y=[1.1, 2.2, 3.3],\n", " z=awkward0.Table(a=[4, 5, 6], b=[4.4, 5.5, 6.6]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 303, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['x', 'y', 'z']" ] }, "execution_count": 303, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.columns" ] }, { "cell_type": "code", "execution_count": 304, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, 
"outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['a', 'b']" ] }, "execution_count": 304, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[\"z\"].columns" ] }, { "cell_type": "code", "execution_count": 305, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['a', 'b']" ] }, "execution_count": 305, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.z.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `unzip()`: returns a tuple of projections through each of the columns (in the same order as the `columns` property)." ] }, { "cell_type": "code", "execution_count": 306, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(array([1, 2, 3]),\n", " array([1.1, 2.2, 3.3]),\n", " )" ] }, "execution_count": 306, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1, \"z\": \"one\"}, {\"x\": 2, \"y\": 2.2, \"z\": \"two\"}, {\"x\": 3, \"y\": 3.3, \"z\": \"three\"}])\n", "a.unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `unzip` method is the opposite of the `Table` constructor," ] }, { "cell_type": "code", "execution_count": 307, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1, 'z': 'one'},\n", " {'x': 2, 'y': 2.2, 'z': 'two'},\n", " {'x': 3, 'y': 3.3, 'z': 'three'}]" ] }, "execution_count": 307, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=[1, 2, 3],\n", " y=[1.1, 2.2, 3.3],\n", " z=awkward0.fromiter([\"one\", \"two\", \"three\"]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 308, "metadata": { "collapsed": 
false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(array([1, 2, 3]),\n", " array([1.1, 2.2, 3.3]),\n", " )" ] }, "execution_count": 308, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but it is also the opposite of `JaggedArray.zip`." ] }, { "cell_type": "code", "execution_count": 309, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[{'x': 1, 'y': 1.1, 'z': 'a'},\n", " {'x': 2, 'y': 2.2, 'z': 'b'},\n", " {'x': 3, 'y': 3.3, 'z': 'c'}],\n", " [],\n", " [{'x': 4, 'y': 4.4, 'z': 'd'}, {'x': 5, 'y': 5.5, 'z': 'e'}]]" ] }, "execution_count": 309, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = awkward0.JaggedArray.zip(x=awkward0.fromiter([[1, 2, 3], [], [4, 5]]),\n", " y=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),\n", " z=awkward0.fromiter([[\"a\", \"b\", \"c\"], [], [\"d\", \"e\"]]))\n", "b.tolist()" ] }, { "cell_type": "code", "execution_count": 310, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "(,\n", " ,\n", " )" ] }, "execution_count": 310, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.unzip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`JaggedArray.zip` produces a jagged array of `Table`, whereas the `Table` constructor produces just a `Table`. These are distinct things, but both can be inverted by the same function because row indexes and column indexes commute:"
"array([1.1, 2.2, 3.3])" ] }, "execution_count": 311, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[0][\"y\"]" ] }, { "cell_type": "code", "execution_count": 312, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3])" ] }, "execution_count": 312, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[\"y\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `unzip` turns a flat `Table` into a tuple of flat arrays (opposite of the `Table` constructor) and it turns a jagged `Table` into a tuple of jagged arrays (opposite of `JaggedArray.zip`).\n", "\n", "* `istuple`: an array of tuples is a special kind of `Table`, one whose `rowname` is `\"tuple\"` and columns are `\"0\"`, `\"1\"`, `\"2\"`, etc. If these conditions are met, `istuple` is `True`; otherwise, `False`." ] }, { "cell_type": "code", "execution_count": 313, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1, 'z': 'one'},\n", " {'x': 2, 'y': 2.2, 'z': 'two'},\n", " {'x': 3, 'y': 3.3, 'z': 'three'}]" ] }, "execution_count": 313, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=[1, 2, 3],\n", " y=[1.1, 2.2, 3.3],\n", " z=awkward0.fromiter([\"one\", \"two\", \"three\"]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 314, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 314, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.istuple" ] }, { "cell_type": "code", "execution_count": 315, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, 
"outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1.1, 'one'), (2, 2.2, 'two'), (3, 3.3, 'three')]" ] }, "execution_count": 315, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table([1, 2, 3],\n", " [1.1, 2.2, 3.3],\n", " awkward0.fromiter([\"one\", \"two\", \"three\"]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 316, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.istuple" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the following tuples are inside of a jagged array, the first level of `Table` is a tuple, so `istuple` is `True`." ] }, { "cell_type": "code", "execution_count": 317, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 317, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = awkward0.JaggedArray.zip(awkward0.fromiter([[1, 2, 3], [], [4, 5]]),\n", " awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),\n", " awkward0.fromiter([[\"a\", \"b\", \"c\"], [], [\"d\", \"e\"]]))\n", "b" ] }, { "cell_type": "code", "execution_count": 318, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 318, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.istuple" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `i0` through `i9`: one of the two conditions for a `Table` to be a `tuple` is that columns are named `\"0\"`, `\"1\"`, `\"2\"`, etc. 
Columns like that could be selected with `[\"0\"]` at the risk of being misread as `[0]`, and they could not be selected with attribute dot-access because pure numbers are not valid Python attributes. However, `i0` through `i9` are provided as shortcuts (overriding any columns with these exact names) for the first 10 tuple slots." ] }, { "cell_type": "code", "execution_count": 319, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1.1, 'one'), (2, 2.2, 'two'), (3, 3.3, 'three')]" ] }, "execution_count": 319, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table([1, 2, 3],\n", " [1.1, 2.2, 3.3],\n", " awkward0.fromiter([\"one\", \"two\", \"three\"]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 320, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3])" ] }, "execution_count": 320, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.i0" ] }, { "cell_type": "code", "execution_count": 321, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3])" ] }, "execution_count": 321, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.i1" ] }, { "cell_type": "code", "execution_count": 322, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 322, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.i2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `flattentuple()`: calling `cross` repeatedly can result in tuples nested within tuples; this flattens them at all 
levels, turning all `(i, (j, k))` into `(i, j, k)`. Whereas `array.flatten()` removes one level of structure from the rows (losing information), `array.flattentuple()` removes all levels of structure from the columns (renaming them, but not losing information)." ] }, { "cell_type": "code", "execution_count": 323, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1, ((1, 1), 1)), (2, 2, ((2, 2), 2)), (3, 3, ((3, 3), 3))]" ] }, "execution_count": 323, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table([1, 2, 3], [1, 2, 3], awkward0.Table(awkward0.Table([1, 2, 3], [1, 2, 3]), [1, 2, 3]))\n", "a.tolist()" ] }, { "cell_type": "code", "execution_count": 324, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1, 1, 1, 1), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3)]" ] }, "execution_count": 324, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flattentuple().tolist()" ] }, { "cell_type": "code", "execution_count": 325, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])\n", "b = awkward0.fromiter([[100, 200], [300], [400, 500, 600], [700]])\n", "c = awkward0.fromiter([[\"a\"], [\"b\", \"c\"], [\"d\"], [\"e\", \"f\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cross` method internally calls `flattentuple()` if it detects that one of its arguments is the result of a `cross`."
] }, { "cell_type": "code", "execution_count": 326, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[[(1.1, 100, 'a'),\n", " (1.1, 200, 'a'),\n", " (2.2, 100, 'a'),\n", " (2.2, 200, 'a'),\n", " (3.3, 100, 'a'),\n", " (3.3, 200, 'a')],\n", " [],\n", " [(4.4, 400, 'd'),\n", " (4.4, 500, 'd'),\n", " (4.4, 600, 'd'),\n", " (5.5, 400, 'd'),\n", " (5.5, 500, 'd'),\n", " (5.5, 600, 'd')],\n", " [(6.6, 700, 'e'),\n", " (6.6, 700, 'f'),\n", " (7.7, 700, 'e'),\n", " (7.7, 700, 'f'),\n", " (8.8, 700, 'e'),\n", " (8.8, 700, 'f'),\n", " (9.9, 700, 'e'),\n", " (9.9, 700, 'f')]]" ] }, "execution_count": 326, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.cross(b).cross(c).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Properties and methods for missing values\n", "\n", "All Awkward Arrays have these methods, but they provide information about the first nested `MaskedArray` within a structure. If, for instance, the `MaskedArray` is within some structure that doesn't affect high-level type (e.g. `IndexedArray`, `ChunkedArray`, `VirtualArray`), then the methods are passed through to the `MaskedArray`. If it's nested within something that does change type, but can meaningfully pass on the call, such as `JaggedArray`, then that's what they do.\n", "\n", "* `boolmask(maskedwhen=None)`: returns a Numpy array of booleans indicating which elements are missing (\"masked\") and which are not. If `maskedwhen=True`, a `True` value in the Numpy array means missing/masked; if `maskedwhen=False`, a `False` value in the Numpy array means missing/masked. If no value is passed (or `None`), the `MaskedArray`'s own `maskedwhen` property is used (which is by default `True`). Non-`MaskedArrays` are assumed to have a `maskedwhen` of `True` (the default)." 
] }, { "cell_type": "code", "execution_count": 327, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, False, False, True, True, False])" ] }, "execution_count": 327, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([1, 2, None, 3, 4, None, None, 5])\n", "a.boolmask()" ] }, { "cell_type": "code", "execution_count": 328, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ True, True, False, True, True, False, False, True])" ] }, "execution_count": 328, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.boolmask(maskedwhen=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`MaskedArrays` inside of `JaggedArrays` or `Tables` are hidden." ] }, { "cell_type": "code", "execution_count": 329, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, False, False])" ] }, "execution_count": 329, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, None, 2.2], [], [3.3, 4.4, None, 5.5]])\n", "a.boolmask()" ] }, { "cell_type": "code", "execution_count": 330, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, False, False, False, True, False])" ] }, "execution_count": 330, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.flatten().boolmask()" ] }, { "cell_type": "code", "execution_count": 331, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { 
"text/plain": [ "array([False, False, False, False])" ] }, "execution_count": 331, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": None, \"y\": 2.2}, {\"x\": None, \"y\": 3.3}, {\"x\": 4, \"y\": None}])\n", "a.boolmask()" ] }, { "cell_type": "code", "execution_count": 332, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, True, True, False])" ] }, "execution_count": 332, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.x.boolmask()" ] }, { "cell_type": "code", "execution_count": 333, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, False, False, True])" ] }, "execution_count": 333, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.y.boolmask()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `ismasked` and `isunmasked`: shortcut for `boolmask(maskedwhen=True)` and `boolmask(maskedwhen=False)` as a property, which is more appropriate for analysis." 
] }, { "cell_type": "code", "execution_count": 334, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, False, False, True, True, False])" ] }, "execution_count": 334, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([1, 2, None, 3, 4, None, None, 5])\n", "a.ismasked" ] }, { "cell_type": "code", "execution_count": 335, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ True, True, False, True, True, False, False, True])" ] }, "execution_count": 335, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.isunmasked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `fillna(value)`: turn a `MaskedArray` into a non-`MaskedArray` by replacing `None` with `value`. Applies to the outermost `MaskedArray`, but it passes through `JaggedArrays` and into all `Table` columns." 
] }, { "cell_type": "code", "execution_count": 336, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1, 2, 999, 3, 4, 999, 999, 5])" ] }, "execution_count": 336, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([1, 2, None, 3, 4, None, None, 5])\n", "a.fillna(999)" ] }, { "cell_type": "code", "execution_count": 337, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 337, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, None, 2.2], [], [3.3, 4.4, None, 5.5]])\n", "a.fillna(999)" ] }, { "cell_type": "code", "execution_count": 338, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1},\n", " {'x': 999, 'y': 2.2},\n", " {'x': 999, 'y': 3.3},\n", " {'x': 4, 'y': 999.0}]" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": None, \"y\": 2.2}, {\"x\": None, \"y\": 3.3}, {\"x\": 4, \"y\": None}])\n", "a.fillna(999).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions for structure manipulation\n", "\n", "Only one structure-manipulation function (for now) is defined at top-level in Awkward Array: `awkward0.concatenate`.\n", "\n", "* `awkward0.concatenate(arrays, axis=0)`: concatenate two or more `arrays`. If `axis=0`, the arrays are concatenated lengthwise (the resulting length is the sum of the lengths of each of the `arrays`). If `axis=1`, each inner array is concatenated: the input `arrays` must all be jagged with the same outer array length. 
(Values of `axis` greater than `1` are not yet supported.)" ] }, { "cell_type": "code", "execution_count": 339, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[100, 200], [300], [400, 500, 600]])\n", "awkward0.concatenate([a, b])" ] }, { "cell_type": "code", "execution_count": 340, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.concatenate([a, b], axis=1)" ] }, { "cell_type": "code", "execution_count": 341, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1},\n", " {'x': 2, 'y': 2.2},\n", " {'x': 3, 'y': 3.3},\n", " {'x': 4, 'y': 4.4},\n", " {'x': 5, 'y': 5.5}]" ] }, "execution_count": 341, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}])\n", "b = awkward0.fromiter([{\"x\": 4, \"y\": 4.4}, {\"x\": 5, \"y\": 5.5}])\n", "awkward0.concatenate([a, b]).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the arrays have different types, their concatenation is a `UnionArray`." 
] }, { "cell_type": "code", "execution_count": 342, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'x': 1, 'y': 1.1},\n", " {'x': 2, 'y': 2.2},\n", " {'x': 3, 'y': 3.3},\n", " [1.1, 2.2, 3.3],\n", " [],\n", " [4.4, 5.5]]" ] }, "execution_count": 342, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": 3.3}])\n", "b = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "awkward0.concatenate([a, b]).tolist()" ] }, { "cell_type": "code", "execution_count": 343, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 343, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([1, None, 2])\n", "b = awkward0.fromiter([None, 3, None])\n", "awkward0.concatenate([a, b])" ] }, { "cell_type": "code", "execution_count": 344, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 344, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import awkward0, numpy\n", "a = awkward0.fromiter([\"one\", \"two\", \"three\"])\n", "b = awkward0.fromiter([\"four\", \"five\", \"six\"])\n", "awkward0.concatenate([a, b])" ] }, { "cell_type": "code", "execution_count": 345, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 345, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.concatenate([a, b], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Functions for input/output and 
conversion\n", "\n", "Most of the functions defined at the top level of the library are conversion functions.\n", "\n", "* `awkward0.fromiter(iterable, dictencoding=False, maskedwhen=True)`: convert Python or JSON data into Awkward Arrays. This function is not fast: it necessarily involves a Python for loop. If `dictencoding` is `True`, bytes and strings will be \"dictionary-encoded\" in Arrow/Parquet terms; in Awkward, this is an `IndexedArray`. The `maskedwhen` parameter determines whether `MaskedArrays` have a mask that is `True` when data are missing or `False` when data are missing." ] }, { "cell_type": "code", "execution_count": 346, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " None ]] at 0x7bc6cc155240>" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We have been using this function all along, but why not another example?\n", "complicated = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])\n", "complicated" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That these nested, row-wise data have been converted into columnar arrays can be seen by inspecting the array's `layout`."
] }, { "cell_type": "code", "execution_count": 347, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])\n", "[ 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2] IndexedMaskedArray(mask=layout[2, 0], content=layout[2, 1], maskedwhen=-1)\n", "[ 2, 0] ndarray(shape=10, dtype=dtype('int64'))\n", "[ 2, 1] UnionArray(tags=layout[2, 1, 0], index=layout[2, 1, 1], contents=[layout[2, 1, 2], layout[2, 1, 3], layout[2, 1, 4]])\n", "[ 2, 1, 0] ndarray(shape=7, dtype=dtype('uint8'))\n", "[ 2, 1, 1] ndarray(shape=7, dtype=dtype('int64'))\n", "[ 2, 1, 2] ndarray(shape=4, dtype=dtype('float64'))\n", "[ 2, 1, 3] JaggedArray(starts=layout[2, 1, 3, 0], stops=layout[2, 1, 3, 1], content=layout[2, 1, 3, 2])\n", "[ 2, 1, 3, 0] ndarray(shape=1, dtype=dtype('int64'))\n", "[ 2, 1, 3, 1] ndarray(shape=1, dtype=dtype('int64'))\n", "[ 2, 1, 3, 2] ndarray(shape=1, dtype=dtype('float64'))\n", "[ 2, 1, 4] Table(x=layout[2, 1, 4, 0], y=layout[2, 1, 4, 1])\n", "[ 2, 1, 4, 0] ndarray(shape=2, dtype=dtype('int64'))\n", "[ 2, 1, 4, 1] Table(z=layout[2, 1, 4, 1, 0])\n", "[2, 1, 4, 1, 0] ndarray(shape=2, dtype=dtype('int64'))" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "complicated.layout" ] }, { "cell_type": "code", "execution_count": 348, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0] array([0, 5, 7])\n", "[ 1] array([ 5, 7, 10])\n", "[ 2, 0] array([ 0, 1, -1, 2, -1, 3, 4, 5, -1, 6])\n", "[ 2, 1, 0] array([0, 0, 0, 0, 1, 2, 2], dtype=uint8)\n", "[ 2, 1, 1] array([0, 1, 2, 3, 0, 0, 1])\n", "[ 2, 1, 2] array([1.1, 2.2, 3.3, 4.4])\n", "[ 2, 1, 3, 0] array([0])\n", 
"[ 2, 1, 3, 1] array([1])\n", "[ 2, 1, 3, 2] array([5.5])\n", "[ 2, 1, 4, 0] array([6, 8])\n", "[2, 1, 4, 1, 0] array([7, 9])\n" ] } ], "source": [ "for index, node in complicated.layout.items():\n", " if node.cls == numpy.ndarray:\n", " print(\"[{0:>13s}] {1}\".format(\", \".join(repr(i) for i in index), repr(node.array)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of arrays in this object scales with the complexity of its data type, but not with the size of the dataset. If it were as complicated as it is now but billions of elements long, it would still contain 11 Numpy arrays, and operations on it would scale as Numpy scales. However, converting a billion Python objects to these 11 arrays would be a large up-front cost.\n", "\n", "More detail on the row-wise to columnar conversion process is given in [docs/fromiter.adoc](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/fromiter.adoc)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `load(file, whitelist=awkward0.persist.whitelist, cache=None, schemasuffix=\".json\")`: loads data from an \"awkd\" (special ZIP) file. This function is like `numpy.load`, but for Awkward Arrays. If the file contains a single object, that object will be read immediately; if it has a collection of named arrays, it will return a loader that loads those arrays on demand. The `whitelist` is where you can provide a list of functions that may be called in this process, `cache` is a global cache object assigned to `VirtualArrays`, and `schemasuffix` determines the file name pattern to look for objects inside the ZIP file.\n", "\n", "* `save(file, array, name=None, mode=\"a\", compression=awkward0.persist.compression, delimiter=\"-\", suffix=\".raw\", schemasuffix=\".json\")`: saves data to an \"awkd\" (special ZIP) file. This function is like `numpy.savez` and is the reverse of `load` (above). 
The `array` may be a single object or a dict of named arrays; `name` is a name to use inside the file; `mode=\"a\"` creates a new file or appends to an existing one, refusing to overwrite data, while `mode=\"w\"` overwrites data; `compression` is a compression policy (a set of rules determining which arrays to compress and how); and the remaining arguments determine file names within the ZIP: `delimiter` between name components, `suffix` for array data, and `schemasuffix` for the schemas that tell `load` how to find all other data." ] }, { "cell_type": "code", "execution_count": 349, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])" ] }, { "cell_type": "code", "execution_count": 350, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "awkward0.save(\"single.awkd\", a, mode=\"w\")" ] }, { "cell_type": "code", "execution_count": 351, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 351, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.load(\"single.awkd\")" ] }, { "cell_type": "code", "execution_count": 352, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "awkward0.save(\"multi.awkd\", {\"a\": a, \"b\": b}, mode=\"w\")" ] }, { "cell_type": "code", "execution_count": 353, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": 
[], "source": [ "multi = awkward0.load(\"multi.awkd\")" ] }, { "cell_type": "code", "execution_count": 354, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 354, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multi[\"a\"]" ] }, { "cell_type": "code", "execution_count": 355, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " None ]] at 0x7bc6ded8cf28>" ] }, "execution_count": 355, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multi[\"b\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only `save` has a `compression` parameter because only the writing process gets to decide how arrays are compressed. We don't use ZIP's built-in compression, but use Python compression functions and encode the choice in the metadata. If `compression=True`, all arrays will be compressed with zlib; if `compression=False`, `None`, or `[]`, none will. In general, `compression` is a list of rules; the first rule that is satisfied by a given array uses the specified compress/decompress pair of functions. Here's the default policy:" ] }, { "cell_type": "code", "execution_count": 356, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[{'minsize': 8192,\n", " 'types': [numpy.bool_, bool, numpy.integer],\n", " 'contexts': '*',\n", " 'pair': (,\n", " ('zlib', 'decompress'))}]" ] }, "execution_count": 356, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.persist.compression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default policy has only one rule. 
If an array is at least 8 kB in size (`minsize` of `8192` bytes), has a numeric type (`array.dtype.type`) that is a subclass of `numpy.bool_`, `bool`, or `numpy.integer`, and appears in any Awkward Array context (`JaggedArray.starts`, `MaskedArray.mask`, etc.), then it will be compressed with `zlib.compress` and decompressed with `('zlib', 'decompress')`. The compression function is given as an object (the Python function that will be called to transform byte strings into compressed byte strings), but the decompression function is given as a location in Python's namespace: a tuple of nested objects, the first of which is a fully qualified module name (submodules separated by dots). This is because only the *location* of the decompression function needs to be written to the file.\n", "\n", "The saved Awkward Array consists of a collection of byte strings for Numpy arrays (2 for object `a` and 11 for object `b`, above) and JSON-formatted metadata that reconstructs the nested hierarchy of Awkward classes around those Numpy arrays. This metadata includes information such as which byte strings should be decompressed and how, but also which Awkward constructors to call to fit everything together. As such, the JSON metadata is code, a limited language without looping or function definitions (i.e. not Turing complete) but with the ability to call any Python function.\n", "\n", "Using a mini-language as metadata gives us great capacity for backward and forward compatibility (new or old ways of encoding things are simply calling different functions), but it does raise the danger of malicious array files calling unwanted Python functions. For this reason, `load` refuses to call any functions not specified in a `whitelist`. 
The default whitelist consists of functions known to be safe:" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[['numpy', 'frombuffer'],\n", " ['zlib', 'decompress'],\n", " ['lzma', 'decompress'],\n", " ['backports.lzma', 'decompress'],\n", " ['lz4.block', 'decompress'],\n", " ['awkward0', '*Array'],\n", " ['awkward0', 'Table'],\n", " ['awkward0', 'numpy', 'frombuffer'],\n", " ['awkward0.util', 'frombuffer'],\n", " ['awkward0.persist'],\n", " ['awkward0.arrow', '_ParquetFile', 'fromjson'],\n", " ['uproot_methods.classes.*'],\n", " ['uproot_methods.profiles.*'],\n", " ['uproot.tree', '_LazyFiles'],\n", " ['uproot.tree', '_LazyTree'],\n", " ['uproot.tree', '_LazyBranch']]" ] }, "execution_count": 357, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.persist.whitelist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each item in the whitelist is a list of nested objects, the first of which is a fully qualified module name (submodules separated by dots). For instance, in the `awkward0.arrow` submodule, there is a class named `_ParquetFile` and it has a static method `fromjson` that is deemed to be safe. Patterns of safe names can be wildcarded, such as `['awkward0', '*Array']` and `['uproot_methods.classes.*']`.\n", "\n", "You can add your own functions, and forward compatibility (using data made by a new version in an old version of Awkward Array) often dictates that you must add a function manually. The error message explains how to do this.\n", "\n", "The same serialization format is used when you pickle an Awkward Array or save it in an HDF5 file. More detail on the metadata mini-language is given in [docs/serialization.adoc](https://github.com/scikit-hep/awkward-0.x/blob/master/docs/serialization.adoc)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `hdf5(group, compression=awkward0.persist.compression, whitelist=awkward0.persist.whitelist, cache=None)`: wrap a `h5py.Group` as an Awkward-aware group, to save Awkward Arrays to HDF5 files and to read them back again. The options have the same meaning as `load` and `save`.\n", "\n", "Unlike \"awkd\" (special ZIP) files, HDF5 files can be written and overwritten like a database, rather than write-once files." ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])" ] }, { "cell_type": "code", "execution_count": 359, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 359, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import h5py\n", "f = h5py.File(\"awkward0.hdf5\", \"w\")\n", "f" ] }, { "cell_type": "code", "execution_count": 360, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 360, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = awkward0.hdf5(f)\n", "g" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "g[\"array\"] = a" ] }, { "cell_type": "code", "execution_count": 362, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": 
false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 362, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g[\"array\"]" ] }, { "cell_type": "code", "execution_count": 363, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "del g[\"array\"]" ] }, { "cell_type": "code", "execution_count": 364, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "g[\"array\"] = b" ] }, { "cell_type": "code", "execution_count": 365, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " None ]] at 0x7bc6cc0d8e48>" ] }, "execution_count": 365, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g[\"array\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The HDF5 format does not include columnar representations of arbitrary nested data, as Awkward Array does, so what we're actually storing are plain Numpy arrays and the metadata necessary to reconstruct the Awkward Array." 
] }, { "cell_type": "code", "execution_count": 366, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 366, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reopen file, without wrapping it as awkward0.hdf5 this time.\n", "f = h5py.File(\"awkward0.hdf5\", \"r\")\n", "f" ] }, { "cell_type": "code", "execution_count": 367, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 367, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f[\"array\"]" ] }, { "cell_type": "code", "execution_count": 368, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 368, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f[\"array\"].keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"schema.json\" array is the JSON metadata, containing directives like `{\"call\": [\"awkward0\", \"JaggedArray\", \"fromcounts\"]}` and `{\"read\": \"1\"}` meaning the array named `\"1\"`, etc." 
] }, { "cell_type": "code", "execution_count": 369, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'awkward0': '0.12.0rc1',\n", " 'schema': {'call': ['awkward0', 'JaggedArray', 'fromcounts'],\n", " 'args': [{'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '1'}, {'dtype': 'int64'}, {'json': 3, 'id': 2}],\n", " 'id': 1},\n", " {'call': ['awkward0', 'IndexedMaskedArray'],\n", " 'args': [{'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '4'}, {'dtype': 'int64'}, {'json': 10, 'id': 5}],\n", " 'id': 4},\n", " {'call': ['awkward0', 'UnionArray', 'fromtags'],\n", " 'args': [{'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '7'}, {'dtype': 'uint8'}, {'json': 7, 'id': 8}],\n", " 'id': 7},\n", " {'list': [{'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '9'}, {'dtype': 'float64'}, {'json': 4, 'id': 10}],\n", " 'id': 9},\n", " {'call': ['awkward0', 'JaggedArray', 'fromcounts'],\n", " 'args': [{'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '12'},\n", " {'dtype': 'int64'},\n", " {'json': 1, 'id': 13}],\n", " 'id': 12},\n", " {'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '14'}, {'dtype': 'float64'}, {'ref': 13}],\n", " 'id': 14}],\n", " 'id': 11},\n", " {'call': ['awkward0', 'Table', 'frompairs'],\n", " 'args': [{'pairs': [['x',\n", " {'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '16'},\n", " {'dtype': 'int64'},\n", " {'json': 2, 'id': 17}],\n", " 'id': 16}],\n", " ['y',\n", " {'call': ['awkward0', 'Table', 'frompairs'],\n", " 'args': [{'pairs': [['z',\n", " {'call': ['awkward0', 'numpy', 'frombuffer'],\n", " 'args': [{'read': '19'}, {'dtype': 'int64'}, {'ref': 17}],\n", " 'id': 19}]]},\n", " {'json': 0}],\n", " 'id': 18}]]},\n", " {'json': 0}],\n", " 'id': 15}]}],\n", " 'id': 6},\n", " {'json': -1}],\n", " 'id': 
3}],\n", " 'id': 0},\n", " 'prefix': 'array/'}" ] }, "execution_count": 369, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "json.loads(f[\"array\"][\"schema.json\"][:].tobytes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without Awkward Array, these objects can't be meaningfully read back from the HDF5 file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `awkward0.fromarrow(arrow)`: convert an [Apache Arrow](https://arrow.apache.org) formatted buffer to an Awkward Array (zero-copy).\n", "\n", "* `awkward0.toarrow(array)`: convert an Awkward Array to an Apache Arrow buffer, if possible (involving a data copy, but no Python loops)." ] }, { "cell_type": "code", "execution_count": 370, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])" ] }, { "cell_type": "code", "execution_count": 371, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "\n", "[\n", " [\n", " 1.1,\n", " 2.2,\n", " 3.3\n", " ],\n", " [],\n", " [\n", " 4.4,\n", " 5.5\n", " ]\n", "]" ] }, "execution_count": 371, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.toarrow(a)" ] }, { "cell_type": "code", "execution_count": 372, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 372, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromarrow(awkward0.toarrow(a))" ] }, { "cell_type": "code", "execution_count": 373, "metadata": { 
"collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "\n", "[\n", " -- is_valid: all not null\n", " -- type_ids: [\n", " 0,\n", " 0,\n", " 2,\n", " 0,\n", " 2\n", " ]\n", " -- value_offsets: [\n", " 0,\n", " 1,\n", " 1,\n", " 2,\n", " 1\n", " ]\n", " -- child 0 type: double\n", " [\n", " 1.1,\n", " 2.2,\n", " 3.3,\n", " 4.4\n", " ]\n", " -- child 1 type: list\n", " [\n", " [\n", " 5.5\n", " ]\n", " ]\n", " -- child 2 type: struct>\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 6,\n", " 8\n", " ]\n", " -- child 1 type: struct\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 7,\n", " 9\n", " ],\n", " -- is_valid: all not null\n", " -- type_ids: [\n", " 0,\n", " 1\n", " ]\n", " -- value_offsets: [\n", " 3,\n", " 0\n", " ]\n", " -- child 0 type: double\n", " [\n", " 1.1,\n", " 2.2,\n", " 3.3,\n", " 4.4\n", " ]\n", " -- child 1 type: list\n", " [\n", " [\n", " 5.5\n", " ]\n", " ]\n", " -- child 2 type: struct>\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 6,\n", " 8\n", " ]\n", " -- child 1 type: struct\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 7,\n", " 9\n", " ],\n", " -- is_valid: all not null\n", " -- type_ids: [\n", " 2,\n", " 2,\n", " 2\n", " ]\n", " -- value_offsets: [\n", " 0,\n", " 1,\n", " 1\n", " ]\n", " -- child 0 type: double\n", " [\n", " 1.1,\n", " 2.2,\n", " 3.3,\n", " 4.4\n", " ]\n", " -- child 1 type: list\n", " [\n", " [\n", " 5.5\n", " ]\n", " ]\n", " -- child 2 type: struct>\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 6,\n", " 8\n", " ]\n", " -- child 1 type: struct\n", " -- is_valid: all not null\n", " -- child 0 type: int64\n", " [\n", " 7,\n", " 9\n", " ]\n", "]" ] }, "execution_count": 373, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.toarrow(b)" ] }, { "cell_type": "code", 
"execution_count": 374, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " 3.3 ] [4.4 [5.5]] [ ]] at 0x7bc6634bccc0>" ] }, "execution_count": 374, "metadata": {}, "output_type": "execute_result" } ], "source": [ "awkward0.fromarrow(awkward0.toarrow(b))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike HDF5, Arrow is capable of columnar jagged arrays, nullable values, nested structures, etc. If you save an Awkward Array in Arrow format, someone else can read it without the Awkward Array library. There are a few Awkward Array classes that don't have an Arrow equivalent, though. Below is a list of all translations.\n", "\n", "* Numpy array → Arrow [BooleanArray](https://arrow.apache.org/docs/python/generated/pyarrow.BooleanArray.html), [IntegerArray](https://arrow.apache.org/docs/python/generated/pyarrow.IntegerArray.html), or [FloatingPointArray](https://arrow.apache.org/docs/python/generated/pyarrow.FloatingPointArray.html).\n", "* `JaggedArray` → Arrow [ListArray](https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html).\n", "* `StringArray` → Arrow [StringArray](https://arrow.apache.org/docs/python/generated/pyarrow.StringArray.html).\n", "* `Table` → Arrow [Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html) at top-level, but an Arrow [StructArray](https://arrow.apache.org/docs/python/generated/pyarrow.StructArray.html) if nested.\n", "* `MaskedArray` → missing data mask (nullability in Arrow is an array attribute, rather than an array wrapper).\n", "* `IndexedMaskedArray` → unfolded into a simple mask before the Arrow translation.\n", "* `IndexedArray` → Arrow [DictionaryArray](https://arrow.apache.org/docs/python/generated/pyarrow.DictionaryArray.html).\n", "* `SparseArray` → converted to a dense array before the Arrow translation.\n", "* `ObjectArray` → Pythonic interpretation is discarded 
before the Arrow translation.\n", "* `UnionArray` → Arrow dense [UnionArray](https://arrow.apache.org/docs/python/generated/pyarrow.UnionArray.html) if possible, sparse UnionArray if necessary.\n", "* `ChunkedArray` (including `AppendableArray`) → Arrow [RecordBatches](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html), but only at top-level: nested `ChunkedArrays` cannot be converted.\n", "* `VirtualArray` → array gets materialized before the Arrow translation (i.e. the lazy-loading is not preserved)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since Arrow is an in-memory format, both `toarrow` and `fromarrow` are side-effect-free functions with a return value. Functions that write to files have a side-effect (the state of your disk changing) and no return value. Once you've made your Arrow buffer, you have to figure out what to do with it. (You may want to [write it to a stream](https://arrow.apache.org/docs/python/ipc.html) for interprocess communication.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `awkward0.fromparquet(where)`: reads from a Parquet file (at filename/URI `where`) into an Awkward Array, through pyarrow.\n", "\n", "* `awkward0.toparquet(where, array, schema=None)`: writes an Awkward Array to a Parquet file (at filename/URI `where`), through pyarrow. The Parquet `schema` may be inferred from the Awkward Array or explicitly specified.\n", "\n", "Like Arrow and unlike HDF5, Parquet natively stores complex data structures in a columnar format and doesn't need to be wrapped by an interpretation layer like `awkward0.hdf5`. Like HDF5 and unlike Arrow, Parquet is a file format, intended for storage." 
] }, { "cell_type": "code", "execution_count": 375, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],\n", " [4.4, [5.5]],\n", " [{\"x\": 6, \"y\": {\"z\": 7}}, None, {\"x\": 8, \"y\": {\"z\": 9}}]\n", " ])" ] }, { "cell_type": "code", "execution_count": 376, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "awkward0.toparquet(\"dataset.parquet\", a)" ] }, { "cell_type": "code", "execution_count": 377, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 377, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a2 = awkward0.fromparquet(\"dataset.parquet\")\n", "a2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we get a `ChunkedArray` back. This is because `awkward0.fromparquet` is lazy-loading the Parquet file, which might be very large (not in this case, obviously). It's actually a `ChunkedArray` (one [row group](https://parquet.apache.org/documentation/latest/#unit-of-parallelization) per chunk) of `VirtualArrays`, and each `VirtualArray` is read when it is accessed (for instance, to print it above)." 
] }, { "cell_type": "code", "execution_count": 378, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 378, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a2.chunks" ] }, { "cell_type": "code", "execution_count": 379, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 379, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a2.chunks[0].array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data." ] }, { "cell_type": "code", "execution_count": 380, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 380, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a2.chunks[0].array.content" ] }, { "cell_type": "code", "execution_count": 381, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] ChunkedArray(chunks=[layout[0]], chunksizes=[3])\n", "[ 0] VirtualArray(generator=, args=(0, ''), kwargs={}, array=layout[0, 0])\n", "[ 0, 0] BitMaskedArray(mask=layout[0, 0, 0], content=layout[0, 0, 1], maskedwhen=False, lsborder=True)\n", "[ 0, 0, 0] ndarray(shape=1, dtype=dtype('uint8'))\n", "[ 0, 0, 1] JaggedArray(starts=layout[0, 0, 1, 0], stops=layout[0, 0, 1, 1], content=layout[0, 0, 1, 2])\n", "[ 0, 0, 1, 0] ndarray(shape=3, dtype=dtype('int32'))\n", "[ 0, 0, 1, 1] ndarray(shape=3, dtype=dtype('int32'))\n", "[ 0, 
0, 1, 2] BitMaskedArray(mask=layout[0, 0, 1, 2, 0], content=layout[0, 0, 1, 2, 1], maskedwhen=False, lsborder=True)\n", "[0, 0, 1, 2, 0] ndarray(shape=1, dtype=dtype('uint8'))\n", "[0, 0, 1, 2, 1] ndarray(shape=5, dtype=dtype('float64'))" ] }, "execution_count": 381, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a2.layout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fewer types can be written to Parquet files than Arrow buffers, since pyarrow does not yet have a complete Arrow → Parquet transformation." ] }, { "cell_type": "code", "execution_count": 382, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Unhandled type for Arrow to Parquet schema conversion: union[dense]<0: double=0, 1: list=1, 2: struct>=2>\n" ] } ], "source": [ "try:\n", " awkward0.toparquet(\"dataset2.parquet\", b)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `awkward0.topandas(array, flatten=False)`: convert the array into a Pandas DataFrame (if tabular) or a Pandas Series (otherwise). If `flatten=False`, wrap the Awkward Arrays as a new Pandas extension type (not fully implemented). If `flatten=True`, convert the jaggedness and nested tables into row and column `pandas.MultiIndex` without introducing any new types (not always possible)." ] }, { "cell_type": "code", "execution_count": 383, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>x</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>[1.1 2.2 3.3]</td><td>100</td></tr>\n",
"    <tr><th>1</th><td>[]</td><td>200</td></tr>\n",
"    <tr><th>2</th><td>[4.4 5.5]</td><td>300</td></tr>\n",
"    <tr><th>3</th><td>[6.6 7.7 8.8 9.9]</td><td>400</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "" ], "text/plain": [ " x y\n", "0 [1.1 2.2 3.3] 100\n", "1 [] 200\n", "2 [4.4 5.5] 300\n", "3 [6.6 7.7 8.8 9.9] 400" ] }, "execution_count": 383, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.Table(x=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]),\n", " y=awkward0.fromiter([100, 200, 300, 400]))\n", "df = awkward0.topandas(a)\n", "df" ] }, { "cell_type": "code", "execution_count": 384, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "0 [1.1 2.2 3.3]\n", "1 []\n", "2 [4.4 5.5]\n", "3 [6.6 7.7 8.8 9.9]\n", "Name: x, dtype: awkward0" ] }, "execution_count": 384, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `dtype` is `awkward0`. The array has not been converted into Numpy `dtype=object` (which would imply a performance loss); it has been wrapped as a container that Pandas recognizes. You can get the Awkward Array back the same way you would a Numpy array:" ] }, { "cell_type": "code", "execution_count": 385, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 385, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(`JaggedSeries` is a thin wrapper on `JaggedArray`; they behave the same way.)\n", "\n", "The value of this is that Awkward slice semantics can be applied to data in Pandas." ] }, { "cell_type": "code", "execution_count": 386, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>x</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>[]</td><td>200</td></tr>\n",
"    <tr><th>2</th><td>[4.4 5.5]</td><td>300</td></tr>\n",
"    <tr><th>3</th><td>[6.6 7.7 8.8 9.9]</td><td>400</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " x y\n", "1 [] 200\n", "2 [4.4 5.5] 300\n", "3 [6.6 7.7 8.8 9.9] 400" ] }, "execution_count": 386, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[1:]" ] }, { "cell_type": "code", "execution_count": 387, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "0 [1.1 2.2 3.3]\n", "2 [4.4 5.5]\n", "3 [6.6 7.7 8.8 9.9]\n", "Name: x, dtype: awkward0" ] }, "execution_count": 387, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x[df.x.values.counts > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, Pandas has a (limited) way of handling jaggedness and nested tables, with `pandas.MultiIndex` rows and columns, respectively." ] }, { "cell_type": "code", "execution_count": 388, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th></th><th colspan=\"2\">a</th><th>e</th></tr>\n",
"    <tr><th></th><th></th><th>b</th><th>c</th><th></th></tr>\n",
"    <tr><th></th><th></th><th></th><th>d</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><th>0</th><td>1</td><td>2.0</td><td>3</td></tr>\n",
"    <tr><th rowspan=\"2\">1</th><th>0</th><td>4</td><td>5.0</td><td>6</td></tr>\n",
"    <tr><th>1</th><td>4</td><td>5.1</td><td>6</td></tr>\n",
"    <tr><th rowspan=\"3\">2</th><th>0</th><td>7</td><td>8.0</td><td>9</td></tr>\n",
"    <tr><th>1</th><td>7</td><td>8.1</td><td>9</td></tr>\n",
"    <tr><th>2</th><td>7</td><td>8.2</td><td>9</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " a e\n", " b c \n", " d \n", "0 0 1 2.0 3\n", "1 0 4 5.0 6\n", " 1 4 5.1 6\n", "2 0 7 8.0 9\n", " 1 7 8.1 9\n", " 2 7 8.2 9" ] }, "execution_count": 388, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Nested tables become MultiIndex-valued column names.\n", "array = awkward0.fromiter([{\"a\": {\"b\": 1, \"c\": {\"d\": [2]}}, \"e\": 3},\n", " {\"a\": {\"b\": 4, \"c\": {\"d\": [5, 5.1]}}, \"e\": 6},\n", " {\"a\": {\"b\": 7, \"c\": {\"d\": [8, 8.1, 8.2]}}, \"e\": 9}])\n", "df = awkward0.topandas(array, flatten=True)\n", "df" ] }, { "cell_type": "code", "execution_count": 389, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th></th><th></th><th>a</th><th>b</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th rowspan=\"5\">0</th><th rowspan=\"3\">0</th><th>0</th><td>1</td><td>2.2</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>3.3</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>4.4</td></tr>\n",
"    <tr><th rowspan=\"2\">2</th><th>0</th><td>1</td><td>5.5</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>6.6</td></tr>\n",
"    <tr><th rowspan=\"4\">1</th><th>0</th><th>0</th><td>10</td><td>1.1</td></tr>\n",
"    <tr><th rowspan=\"2\">1</th><th>0</th><td>10</td><td>2.2</td></tr>\n",
"    <tr><th>1</th><td>10</td><td>3.3</td></tr>\n",
"    <tr><th>3</th><th>0</th><td>10</td><td>4.4</td></tr>\n",
"    <tr><th>2</th><th>1</th><th>0</th><td>100</td><td>9.9</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " a b\n", "0 0 0 1 2.2\n", " 1 1 3.3\n", " 2 1 4.4\n", " 2 0 1 5.5\n", " 1 1 6.6\n", "1 0 0 10 1.1\n", " 1 0 10 2.2\n", " 1 10 3.3\n", " 3 0 10 4.4\n", "2 1 0 100 9.9" ] }, "execution_count": 389, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Jagged arrays become MultiIndex-valued rows (index).\n", "array = awkward0.fromiter([{\"a\": 1, \"b\": [[2.2, 3.3, 4.4], [], [5.5, 6.6]]},\n", " {\"a\": 10, \"b\": [[1.1], [2.2, 3.3], [], [4.4]]},\n", " {\"a\": 100, \"b\": [[], [9.9]]}])\n", "df = awkward0.topandas(array, flatten=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of this is that no new column types are introduced, and Pandas already has functions for managing structure in its `MultiIndex`. For instance, this structure can be unstacked into Pandas's columns." ] }, { "cell_type": "code", "execution_count": 390, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b \n", " 0 1 2 0 1 2\n", "0 0 1.0 1.0 1.0 2.2 3.3 4.4\n", " 2 1.0 1.0 NaN 5.5 6.6 NaN\n", "1 0 10.0 NaN NaN 1.1 NaN NaN\n", " 1 10.0 10.0 NaN 2.2 3.3 NaN\n", " 3 10.0 NaN NaN 4.4 NaN NaN\n", "2 1 100.0 NaN NaN 9.9 NaN NaN" ] }, "execution_count": 390, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.unstack()" ] }, { "cell_type": "code", "execution_count": 391, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a ... b \\\n", " 0 1 2 ... 0 1 \n", " 0 1 2 3 0 1 2 3 0 1 ... 2 3 0 \n", "0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN ... 5.5 NaN 3.3 \n", "1 10.0 10.0 NaN 10.0 NaN 10.0 NaN NaN NaN NaN ... NaN 4.4 NaN \n", "2 NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN \n", "\n", " \n", " 2 \n", " 1 2 3 0 1 2 3 \n", "0 NaN 6.6 NaN 4.4 NaN NaN NaN \n", "1 3.3 NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 391, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.unstack().unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to get [Pandas Series and DataFrames through Arrow](https://arrow.apache.org/docs/python/pandas.html), though this doesn't handle jagged arrays well: they get converted into Numpy `dtype=object` arrays." ] }, { "cell_type": "code", "execution_count": 392, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 1 [[2.2, 3.3, 4.4], [], [5.5, 6.6]]\n", "1 10 [[1.1], [2.2, 3.3], [], [4.4]]\n", "2 100 [[], [9.9]]" ] }, "execution_count": 392, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = awkward0.toarrow(array).to_pandas()\n", "df" ] }, { "cell_type": "code", "execution_count": 393, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "0 [[2.2, 3.3, 4.4], [], [5.5, 6.6]]\n", "1 [[1.1], [2.2, 3.3], [], [4.4]]\n", "2 [[], [9.9]]\n", "Name: b, dtype: object" ] }, "execution_count": 393, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.b" ] }, { "cell_type": "code", "execution_count": 394, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([array([2.2, 3.3, 4.4]), array([], dtype=float64),\n", " array([5.5, 6.6])], dtype=object)" ] }, "execution_count": 394, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.b[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# High-level types\n", "\n", "The high-level type of an array describes its characteristics in terms of what it *represents*, a *logical* view of the data. By contrast, the layouts (below) describe the nested arrays themselves, a *physical* view of the data.\n", "\n", "The logical view of Numpy arrays is described in terms of `shape` and `dtype`. The Awkward type of a Numpy array is presented a little differently." 
] }, { "cell_type": "code", "execution_count": 395, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "ArrayType(3, 2, dtype('float64'))" ] }, "execution_count": 395, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = numpy.array([[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]])\n", "t = awkward0.type.fromarray(a)\n", "t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above is the object form of the high-level type: an object that `takes` arguments `to` return values." ] }, { "cell_type": "code", "execution_count": 396, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 396, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.takes" ] }, { "cell_type": "code", "execution_count": 397, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "ArrayType(2, dtype('float64'))" ] }, "execution_count": 397, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.to" ] }, { "cell_type": "code", "execution_count": 398, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "dtype('float64')" ] }, "execution_count": 398, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.to.to" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "High-level type objects also have a printable form for human readability."
] }, { "cell_type": "code", "execution_count": 399, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> [0, 2) -> float64\n" ] } ], "source": [ "print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above should be read like a function's data type: `argument type -> return type` for the function that takes an index in square brackets and returns something else. For example, the first `[0, 3)` means that you could put any non-negative integer less than `3` in square brackets after the array, like this:" ] }, { "cell_type": "code", "execution_count": 400, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([5.5, 6.6])" ] }, "execution_count": 400, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second `[0, 2)` means that the next argument can be any non-negative integer less than `2`." ] }, { "cell_type": "code", "execution_count": 401, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "6.6" ] }, "execution_count": 401, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then you have a Numpy `dtype`.\n", "\n", "High-level types are expressed like this, rather than as a Numpy `shape` and `dtype`, so that they generalize to arbitrary objects."
] }, { "cell_type": "code", "execution_count": 402, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> 'x' -> int64\n", " 'y' -> [0, inf) -> float64\n" ] } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": []}, {\"x\": 2, \"y\": [1.1, 2.2]}, {\"x\": 3, \"y\": [1.1, 2.2, 3.3]}])\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, you could call `a[2][\"x\"]` to get `3` or `a[2][\"y\"][1]` to get `2.2`, but the types and even number of allowed arguments depend on which path you take. Numpy's `shape` and `dtype` have no equivalent.\n", "\n", "Also in the above, the allowed argument for the jagged array is specified as `[0, inf)`, which doesn't literally mean any value up to infinity is allowed—the constraint simply isn't specific because it depends on the details of the jagged array. Even specifying the maximum length of any sublist (`a[\"y\"].counts.max()`) would require a calculation that scales with the size of the dataset, which can be infeasible in some cases. 
Instead, `[0, inf)` simply means \"jagged.\"\n", "\n", "Fixed-length arrays inside of `JaggedArrays` or `Tables` are presented with known upper limits:" ] }, { "cell_type": "code", "execution_count": 403, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> 'x' -> [0, 2) -> float64\n", " 'y' -> [0, inf) -> int64\n" ] } ], "source": [ "a = awkward0.Table(x=[[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]],\n", " y=awkward0.fromiter([[1, 2, 3], [], [4, 5]]))\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas each value of a `Table` row ([product type](https://en.wikipedia.org/wiki/Product_type)) contains a member of every one of its fields, each value of a `UnionArray` item ([sum type](https://en.wikipedia.org/wiki/Tagged_union)) contains a member of exactly one of its possibilities. The distinction is drawn as the lack or presence of a vertical bar (meaning \"or\": `|`)." 
] }, { "cell_type": "code", "execution_count": 404, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> 'x' -> int64\n", " 'y' -> <class 'str'>\n" ] } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": \"one\"}, {\"x\": 2, \"y\": \"two\"}, {\"x\": 3, \"y\": \"three\"}])\n", "print(a.type)" ] }, { "cell_type": "code", "execution_count": 405, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 6) -> (int64 |\n", " <class 'str'>)\n" ] } ], "source": [ "a = awkward0.fromiter([1, 2, 3, \"four\", \"five\", \"six\"])\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parentheses keep `Table` fields from being mixed up with `UnionArray` possibilities." ] }, { "cell_type": "code", "execution_count": 406, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 4) -> 'x' -> int64\n", " 'y' -> (float64 |\n", " <class 'str'>)\n" ] } ], "source": [ "a = awkward0.fromiter([{\"x\": 1, \"y\": 1.1}, {\"x\": 2, \"y\": 2.2}, {\"x\": 3, \"y\": \"three\"}, {\"x\": 4, \"y\": \"four\"}])\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in mathematics, products and the adjacency operator take precedence over sums."
] }, { "cell_type": "code", "execution_count": 407, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 6) -> (int64 |\n", " 'x' -> float64\n", " 'y' -> <class 'str'>)\n" ] } ], "source": [ "a = awkward0.fromiter([1, 2, 3, {\"x\": 4.4, \"y\": \"four\"}, {\"x\": 5.5, \"y\": \"five\"}, {\"x\": 6.6, \"y\": \"six\"}])\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing data, represented by `MaskedArrays`, `BitMaskedArrays`, or `IndexedMaskedArrays`, are called \"option types\" in the high-level type language." ] }, { "cell_type": "code", "execution_count": 408, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 7) -> ?(int64)\n" ] } ], "source": [ "a = awkward0.fromiter([1, 2, 3, None, None, 4, 5])\n", "print(a.type)" ] }, { "cell_type": "code", "execution_count": 409, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> ?([0, inf) -> float64)\n" ] } ], "source": [ "# Inner arrays could be missing values.\n", "a = awkward0.fromiter([[1.1, 2.2, 3.3], None, [4.4, 5.5]])\n", "print(a.type)" ] }, { "cell_type": "code", "execution_count": 410, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 3) -> [0, inf) -> ?(float64)\n" ] } ], "source": [ "# Numbers in those arrays could be missing values.\n", "a = awkward0.fromiter([[1.1, 2.2, None], [], [4.4, 5.5]])\n", "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-references and cyclic references are 
expressed in Awkward type objects by creating the same graph structure among the type objects as the arrays. Thus," ] }, { "cell_type": "code", "execution_count": 411, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'left': {'left': {'left': {'left': None, 'right': None, 'value': 9.0},\n", " 'right': None,\n", " 'value': 3.14},\n", " 'right': {'left': None,\n", " 'right': {'left': None, 'right': None, 'value': 0.0},\n", " 'value': 2.71},\n", " 'value': 3.21},\n", " 'right': {'left': {'left': None, 'right': None, 'value': 5.55},\n", " 'right': {'left': None, 'right': None, 'value': 8.0},\n", " 'value': 9.99},\n", " 'value': 1.23}" ] }, "execution_count": 411, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = awkward0.fromiter([\n", " {\"value\": 1.23, \"left\": 1, \"right\": 2}, # node 0\n", " {\"value\": 3.21, \"left\": 3, \"right\": 4}, # node 1\n", " {\"value\": 9.99, \"left\": 5, \"right\": 6}, # node 2\n", " {\"value\": 3.14, \"left\": 7, \"right\": None}, # node 3\n", " {\"value\": 2.71, \"left\": None, \"right\": 8}, # node 4\n", " {\"value\": 5.55, \"left\": None, \"right\": None}, # node 5\n", " {\"value\": 8.00, \"left\": None, \"right\": None}, # node 6\n", " {\"value\": 9.00, \"left\": None, \"right\": None}, # node 7\n", " {\"value\": 0.00, \"left\": None, \"right\": None}, # node 8\n", "])\n", "left = tree.contents[\"left\"].content\n", "right = tree.contents[\"right\"].content\n", "left[(left < 0) | (left > 8)] = 0 # satisfy overzealous validity checks\n", "right[(right < 0) | (right > 8)] = 0\n", "tree.contents[\"left\"].content = awkward0.IndexedArray(left, tree)\n", "tree.contents[\"right\"].content = awkward0.IndexedArray(right, tree)\n", "\n", "tree[0].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the print-out, labels (`T0 :=`, `T1 :=`, `T2 :=`) are inserted to indicate where 
cross-references begin and end." ] }, { "cell_type": "code", "execution_count": 412, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 9) -> 'left' -> T0 := ?(T1 := 'left' -> T0\n", " 'right' -> T2 := ?(T1)\n", " 'value' -> float64)\n", " 'right' -> T2\n", " 'value' -> float64\n" ] } ], "source": [ "print(tree.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ObjectArray` class turns Awkward Array structures into Python objects on demand. From an analysis point of view, the elements of the array *are* Python objects, and that is reflected in the type." ] }, { "cell_type": "code", "execution_count": 413, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 413, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class Point:\n", " def __init__(self, x, y):\n", " self.x, self.y = x, y\n", " def __repr__(self):\n", " return \"Point({0}, {1})\".format(self.x, self.y)\n", "\n", "a = awkward0.fromiter([Point(0, 0), Point(3, 2), Point(1, 1), Point(2, 4), Point(0, 0)])\n", "a" ] }, { "cell_type": "code", "execution_count": 414, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 5) -> .make at 0x7bc66343ee18>\n" ] } ], "source": [ "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary,\n", "\n", "* each element of a Numpy `shape` like `(i, j, k)` becomes a functional argument: `[0, i) -> [0, j) -> [0, k)`;\n", "* high-level types terminate on Numpy `dtypes` or `ObjectArray` functions;\n", "* columns of a `Table` are presented adjacent to one another: the type is field 1 *and* field 2 *and* 
field 3, etc.;\n", "* possibilities of a `UnionArray` are separated by vertical bars `|`: the type is possibility 1 *or* possibility 2 *or* possibility 3, etc.;\n", "* nullable types are indicated by a question mark;\n", "* cross-references and cyclic references are maintained in the type objects, printed with labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Low-level layouts\n", "\n", "The layout of an array describes how it is constructed in terms of Numpy arrays and other parameters. It has more information than a high-level type (above): more than would typically be needed for data analysis, but necessary for data engineering.\n", "\n", "A `Layout` object is a mapping from position tuples to `LayoutNodes`. The screen representation is sufficient for reading." ] }, { "cell_type": "code", "execution_count": 415, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])\n", "[ 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2] ndarray(shape=5, dtype=dtype('float64'))" ] }, "execution_count": 415, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])\n", "t = a.layout\n", "t" ] }, { "cell_type": "code", "execution_count": 416, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 416, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[2]" ] }, { "cell_type": "code", "execution_count": 417, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3, 4.4, 5.5])" ] 
}, "execution_count": 417, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[2].array" ] }, { "cell_type": "code", "execution_count": 418, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])\n", "[ 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2] JaggedArray(starts=layout[2, 0], stops=layout[2, 1], content=layout[2, 2])\n", "[ 2, 0] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2, 1] ndarray(shape=3, dtype=dtype('int64'))\n", "[ 2, 2] ndarray(shape=5, dtype=dtype('float64'))" ] }, "execution_count": 418, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = awkward0.fromiter([[[1.1, 2.2], [3.3]], [], [[4.4, 5.5]]])\n", "t = a.layout\n", "t" ] }, { "cell_type": "code", "execution_count": 419, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 419, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[2]" ] }, { "cell_type": "code", "execution_count": 420, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 420, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[2].array" ] }, { "cell_type": "code", "execution_count": 421, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "array([1.1, 2.2, 3.3, 4.4, 5.5])" ] }, "execution_count": 421, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[2, 2].array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Classes like `IndexedArray`, `SparseArray`, `ChunkedArray`, `AppendableArray`, and `VirtualArray` don't change the high-level type of an array, but they do change the layout. Consider, for instance, an array made with `awkward0.fromiter` and an array read by `awkward0.fromparquet`." ] }, { "cell_type": "code", "execution_count": 422, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "a = awkward0.fromiter([[1.1, 2.2, None, 3.3], [], None, [4.4, 5.5]])" ] }, { "cell_type": "code", "execution_count": 423, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "awkward0.toparquet(\"tmp.parquet\", a)" ] }, { "cell_type": "code", "execution_count": 424, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [], "source": [ "b = awkward0.fromparquet(\"tmp.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first, it terminates at `VirtualArray` because the data haven't been read—we don't know what arrays are associated with it." 
] }, { "cell_type": "code", "execution_count": 425, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] ChunkedArray(chunks=[layout[0]], chunksizes=[4])\n", "[ 0] VirtualArray(generator=, args=(0, ''), kwargs={})" ] }, "execution_count": 425, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.layout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But after reading," ] }, { "cell_type": "code", "execution_count": 426, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 426, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The layout shows that it has more structure than `a`." ] }, { "cell_type": "code", "execution_count": 427, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] ChunkedArray(chunks=[layout[0]], chunksizes=[4])\n", "[ 0] VirtualArray(generator=, args=(0, ''), kwargs={}, array=layout[0, 0])\n", "[ 0, 0] BitMaskedArray(mask=layout[0, 0, 0], content=layout[0, 0, 1], maskedwhen=False, lsborder=True)\n", "[ 0, 0, 0] ndarray(shape=1, dtype=dtype('uint8'))\n", "[ 0, 0, 1] JaggedArray(starts=layout[0, 0, 1, 0], stops=layout[0, 0, 1, 1], content=layout[0, 0, 1, 2])\n", "[ 0, 0, 1, 0] ndarray(shape=4, dtype=dtype('int32'))\n", "[ 0, 0, 1, 1] ndarray(shape=4, dtype=dtype('int32'))\n", "[ 0, 0, 1, 2] BitMaskedArray(mask=layout[0, 0, 1, 2, 0], content=layout[0, 0, 1, 2, 1], maskedwhen=False, lsborder=True)\n", "[0, 0, 1, 2, 0] ndarray(shape=1, dtype=dtype('uint8'))\n", "[0, 0, 1, 2, 1] ndarray(shape=6, dtype=dtype('float64'))" ] }, "execution_count": 427, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "b.layout" ] }, { "cell_type": "code", "execution_count": 428, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] MaskedArray(mask=layout[0], content=layout[1], maskedwhen=True)\n", "[ 0] ndarray(shape=4, dtype=dtype('bool'))\n", "[ 1] JaggedArray(starts=layout[1, 0], stops=layout[1, 1], content=layout[1, 2])\n", "[ 1, 0] ndarray(shape=4, dtype=dtype('int64'))\n", "[ 1, 1] ndarray(shape=4, dtype=dtype('int64'))\n", "[ 1, 2] MaskedArray(mask=layout[1, 2, 0], content=layout[1, 2, 1], maskedwhen=True)\n", "[1, 2, 0] ndarray(shape=6, dtype=dtype('bool'))\n", "[1, 2, 1] ndarray(shape=6, dtype=dtype('float64'))" ] }, "execution_count": 428, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.layout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, they have the same high-level type." ] }, { "cell_type": "code", "execution_count": 429, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 4) -> ?([0, inf) -> ?(float64))\n" ] } ], "source": [ "print(b.type)" ] }, { "cell_type": "code", "execution_count": 430, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 4) -> ?([0, inf) -> ?(float64))\n" ] } ], "source": [ "print(a.type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-references and cyclic references are also encoded in the `layout`, as references to previously seen indexes." 
] }, { "cell_type": "code", "execution_count": 431, "metadata": { "collapsed": false, "inputHidden": false, "jupyter": { "outputs_hidden": false }, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ " layout \n", "[ ()] Table(left=layout[0], right=layout[1], value=layout[2])\n", "[ 0] MaskedArray(mask=layout[0, 0], content=layout[0, 1], maskedwhen=True)\n", "[ 0, 0] ndarray(shape=9, dtype=dtype('bool'))\n", "[ 0, 1] IndexedArray(index=layout[0, 1, 0], content=layout[0, 1, 1])\n", "[0, 1, 0] ndarray(shape=9, dtype=dtype('int64'))\n", "[0, 1, 1] -> layout[()]\n", "[ 1] MaskedArray(mask=layout[1, 0], content=layout[1, 1], maskedwhen=True)\n", "[ 1, 0] ndarray(shape=9, dtype=dtype('bool'))\n", "[ 1, 1] IndexedArray(index=layout[1, 1, 0], content=layout[1, 1, 1])\n", "[1, 1, 0] ndarray(shape=9, dtype=dtype('int64'))\n", "[1, 1, 1] -> layout[()]\n", "[ 2] ndarray(shape=9, dtype=dtype('float64'))" ] }, "execution_count": 431, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree.layout" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }