{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Uproot and Awkward Arrays\n", "\n", "## Tutorial for Electron Ion Collider users\n", "\n", "Jim Pivarski (Princeton University)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "\n", "This demo presented was to the Electron Ion Collider group on April 8, 2020, before the final 1.0 version was released. Some interfaces may have changed. If you are running this notebook on Binder (i.e. you clicked on the \"Launch Binder\" button), then you have all the dependencies and the data file it needs to run.\n", "\n", "Otherwise, make sure you have version 0.2.10 ([GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.2.10), [pip](https://pypi.org/project/awkward1/0.2.10/)) by installing\n", "\n", "```bash\n", "pip install 'awkward1==0.2.10'\n", "```\n", "\n", "as well as the other libraries this notebook uses,\n", "\n", "```bash\n", "pip install 'uproot<4.0' 'particle' 'boost-histogram' 'matplotlib' 'mplhep' 'numba<0.49' 'pandas' 'numexpr' 'autograd'\n", "```\n", "\n", "and download the data file:\n", "\n", "```bash\n", "wget https://github.com/jpivarski/2020-04-08-eic-jlab/raw/master/open_charm_18x275_10k.root\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "* [Uproot: getting data](#uproot)\n", " - [Exploring a TFile](#Exploring-a-TFile)\n", " - [Exploring a TTree](#Exploring-a-TTree)\n", " - [Turning ROOT branches into NumPy arrays](#Turning-ROOT-branches-into-NumPy-arrays)\n", " - [Memory management; caching and iteration](#Memory-management;-caching-and-iteration)\n", " - [Jagged arrays (segue)](#Jagged-arrays-(segue))\n", "* [Awkward Array: manipulating data](#awkward)\n", " - [Using Uproot data in Awkward 1.0](#Using-Uproot-data-in-Awkward-1.0)\n", " - [Iteration in Python vs array-at-a-time operations](#Iteration-in-Python-vs-array-at-a-time-operations)\n", " - [Zipping arrays into records and projecting them back out](#Zipping-arrays-into-records-and-projecting-them-back-out)\n", " - [Filtering (cutting) events and particles with advanced selections](#Filtering-(cutting)-events-and-particles-with-advanced-selections)\n", " - [Flattening for plots and regularizing to NumPy for machine learning](#Flattening-for-plots-and-regularizing-to-NumPy-for-machine-learning)\n", " - [Broadcasting flat arrays and jagged arrays](#Broadcasting-flat-arrays-and-jagged-arrays)\n", " - [Combinatorics: cartesian and combinations](#Combinatorics:-cartesian-and-combinations)\n", " - [Reducing from combinations](#Reducing-from-combinations)\n", " - [Imperative, but still fast, programming in Numba](#Imperative,-but-still-fast,-programming-in-Numba)\n", " - [Grafting jagged data onto Pandas](#Grafting-jagged-data-onto-Pandas)\n", " - [NumExpr, Autograd, and other third-party libraries](#NumExpr,-Autograd,-and-other-third-party-libraries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uproot is a pure Python reimplementation of a significant part of ROOT I/O.\n", "\n", "
\n", "\n", "
\n", "\n", "You can read TTrees containing basic data types, STL vectors, strings, and some more complex data, especially if it was written with a high \"splitLevel\".\n", "\n", "You can also read histograms and other objects into generic containers, but the C++ methods that give those objects functionality are not available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring a TFile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uproot was designed to be Pythonic, so the way we interact with ROOT files is different than it is in ROOT." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uproot\n", "file = uproot.open(\"open_charm_18x275_10k.root\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A ROOT file may be thought of as a dict of key-value pairs, like a Python dict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file.values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What's the `b` before the name?** All strings retrieved from ROOT are unencoded, which Python 3 treats differently from Python 2. In the near future, Uproot will automatically interpret all strings from ROOT as UTF-8 and this cosmetic issue will be gone.\n", "\n", "**What's the `;1` at the end of the name?** It's the cycle number, something ROOT uses to track multiple versions of an object. You can usually ignore it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nested directories are a dict of dicts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file[\"events\"].keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file[\"events\"][\"tree\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there are shortcuts:\n", "\n", " * use a `/` to navigate through the levels in a single string;\n", " * use `allkeys` to recursively show all keys in all directories." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file.allkeys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file[\"events/tree\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring a TTree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A TTree can also be thought of as a dict of dicts, this time to navigate through TBranches." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree = file[\"events/tree\"]\n", "tree.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, the first thing I do when I look at a TTree is `show` to see how each branch is going to be interpreted." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"branch name streamer (for complex data) interpretation in Python\")\n", "print(\"==============================================================================\")\n", "\n", "tree.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the information you'd expect to find in a TTree is there. See [uproot.readthedocs.io](https://uproot.readthedocs.io/en/latest/ttree-handling.html) for a complete list." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.numentries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree[\"evt_id\"].compressedbytes(), tree[\"evt_id\"].uncompressedbytes(), tree[\"evt_id\"].compressionratio()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree[\"evt_id\"].numbaskets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[tree[\"evt_id\"].basket_entrystart(i) for i in range(3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Turning ROOT branches into NumPy arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several methods for this; they differ only in convenience." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TBranch → array\n", "tree[\"evt_id\"].array()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TTree + branch name → array\n", "tree.array(\"evt_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TTree + branch names → arrays\n", "tree.arrays([\"evt_id\", \"evt_prt_count\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TTree + branch name pattern(s) → arrays\n", "tree.arrays(\"evt_*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TTree + branch name regex(s) → arrays\n", "tree.arrays(\"/evt_[A-Z_0-9]*/i\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Convenience #1:** turn the bytestrings into real strings (will soon be unnecessary)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.arrays(\"evt_*\", namedecode=\"utf-8\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Convenience #2:** output a tuple instead of a dict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.arrays([\"evt_id\", \"evt_prt_count\"], outputtype=tuple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... to use it in assignment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evt_id, evt_prt_count = tree.arrays([\"evt_id\", \"evt_prt_count\"], outputtype=tuple)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evt_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory management; caching and iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `array` methods read an entire branch into memory. Sometimes, you might not have enough memory to do that.\n", "\n", "The simplest solution is to set `entrystart` (inclusive) and `entrystop` (exclusive) to read a small batch at a time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.array(\"evt_id\", entrystart=500, entrystop=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uproot is _not_ agressive about caching: if you call `arrays` many times (for many small batches), it will read from the file every time.\n", "\n", "You can avoid frequent re-reading by assigning arrays to variables, but then you'd have to manage all those variables.\n", "\n", "**Instead, use explicit caching:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a cache with an acceptable limit.\n", "gigabyte_cache = uproot.ArrayCache(\"1 GB\")\n", "\n", "# Read the array from disk:\n", "tree.array(\"evt_id\", cache=gigabyte_cache)\n", "\n", "# Get the array from the cache:\n", "tree.array(\"evt_id\", cache=gigabyte_cache)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage is that the same code can be used for first time and subsequent times. You can put this in a loop.\n", "\n", "Naturally, fetching from the cache is much faster than reading from disk (though our file isn't very big!)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "\n", "tree.arrays(\"*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "\n", "tree.arrays(\"*\", cache=gigabyte_cache)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value of an explicit cache is that you get to control it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(gigabyte_cache)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gigabyte_cache.clear()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(gigabyte_cache)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting `entrystart` and `entrystop` can get annoying; we probably want to do it in a loop.\n", "\n", "There's a method, `iterate`, for that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for arrays in tree.iterate(\"evt_*\", entrysteps=1000):\n", " print({name: len(array) for name, array in arrays.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that this is a loop over _batches_, not _events_.\n", "\n", "What goes in the loop is code that applies to _arrays_.\n", "\n", "You can also set the `entrysteps` to be a size in memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for arrays in tree.iterate(\"evt_*\", entrysteps=\"100 kB\"):\n", " print({name: len(array) for name, array in arrays.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same size in memory covers more events if you read fewer branches." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for arrays in tree.iterate(\"evt_id\", entrysteps=\"100 kB\"):\n", " print({name: len(array) for name, array in arrays.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `TTree.iterate` method is similar in form to the `uproot.iterate` function, which iterates in batches over a collection of files." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for arrays in uproot.iterate([\"open_charm_18x275_10k.root\",\n", " \"open_charm_18x275_10k.root\",\n", " \"open_charm_18x275_10k.root\"], \"events/tree\", \"evt_*\", entrysteps=\"100 kB\"):\n", " print({name: len(array) for name, array in arrays.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Jagged arrays (segue)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the branches in this file have an \"asjagged\" interpretation, instead of \"asdtype\" (NumPy)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree[\"evt_id\"].interpretation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree[\"pdg\"].interpretation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that they have multiple values per entry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree[\"pdg\"].array()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jagged arrays (lists of variable-length sublists) are very common in particle physics, and surprisingly uncommon in other fields.\n", "\n", "We need them because we almost always have a variable number of particles per event." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from particle import Particle # https://github.com/scikit-hep/particle\n", "\n", "counter = 0\n", "for event in tree[\"pdg\"].array():\n", " print(len(event), \"particles:\", \" \".join(Particle.from_pdgid(x).name for x in event))\n", " counter += 1\n", " if counter == 30:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although you can iterate over jagged arrays with for loops, the idiomatic and faster way to do it is with array-at-a-time functions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "vtx_x, vtx_y, vtx_z = tree.arrays([\"vtx_[xyz]\"], outputtype=tuple)\n", "\n", "vtx_dist = np.sqrt(vtx_x**2 + vtx_y**2 + vtx_z**2)\n", "\n", "vtx_dist" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mplhep as hep # https://github.com/scikit-hep/mplhep\n", "import boost_histogram as bh # https://github.com/scikit-hep/boost-histogram\n", "\n", "vtx_hist = bh.Histogram(bh.axis.Regular(100, 0, 0.1))\n", "\n", "vtx_hist.fill(vtx_dist.flatten())\n", "\n", "hep.histplot(vtx_hist)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vtx_dist > 0.01" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdg = tree[\"pdg\"].array()\n", "pdg[vtx_dist > 0.01]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "counter = 0\n", "for event in pdg[vtx_dist > 0.10]:\n", " print(len(event), \"particles:\", \" \".join(Particle.from_pdgid(x).name for x in event))\n", " counter += 1\n", " if counter == 30:\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Particle.from_string(\"p~\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Particle.from_string(\"p~\").pdgid" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "is_antiproton = (pdg == Particle.from_string(\"p~\").pdgid)\n", "is_antiproton" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0, 0.1)).fill(\n", " vtx_dist[is_antiproton].flatten()\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that's a topic for the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awkward Array is a library for manipulating arbitrary data structures in a NumPy-like way.\n", "\n", "The idea is that you have a large number of identically typed, nested objects in variable-length lists.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Uproot data in Awkward 1.0\n", "\n", "Awkward Array is in transition from\n", "\n", " * version 0.x, which is in use at the LHC but has revealed some design flaws, to\n", " * version 1.x, which is well-architected and has completed development, but is not in widespread use yet.\n", "\n", "Awkward 1.0 hasn't been incorporated into Uproot yet, which is how it will get in front of most users.\n", "\n", "Since development is complete and the interface is (intentionally) different, I thought it better to show you the new version." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import awkward1 as ak" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Old-style arrays can be converted into the new framework with [ak.from_awkward0](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_awkward0.html). This won't be a necessary step for long." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.from_awkward0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.to_awkward0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.from_awkward0(tree.array(\"pdg\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arrays = {name: ak.from_awkward0(array) for name, array in tree.arrays(namedecode=\"utf-8\").items()}\n", "arrays" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.Array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iteration in Python vs array-at-a-time operations\n", "\n", "As before, you _can_ iterate over them in Python, but only do that for small-scale exploration." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit -n1 -r1\n", "\n", "vtx_dist = []\n", "for xs, xy, xz in zip(arrays[\"vtx_x\"], arrays[\"vtx_y\"], arrays[\"vtx_z\"]):\n", " out = []\n", " for x, y, z in zip(xs, xy, xz):\n", " out.append(np.sqrt(x**2 + y**2 + z**2))\n", " vtx_dist.append(out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit -n100 -r1\n", "\n", "vtx_dist = np.sqrt(arrays[\"vtx_x\"]**2 + arrays[\"vtx_y\"]**2 + arrays[\"vtx_z\"]**2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zipping arrays into records and projecting them back out\n", "\n", "Instead of having all these arrays floating around, let's [ak.zip](https://awkward-array.readthedocs.io/en/latest/_auto/ak.zip.html) them into a structure.\n", "\n", "(This is the sort of thing that a framework developer might do for the data analysts.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events = ak.zip({\"id\": arrays[\"evt_id\"],\n", " \"true\": ak.zip({\"q2\": arrays[\"evt_true_q2\"],\n", " \"x\": arrays[\"evt_true_x\"],\n", " \"y\": arrays[\"evt_true_y\"],\n", " \"w2\": arrays[\"evt_true_w2\"],\n", " \"nu\": arrays[\"evt_true_nu\"]}),\n", " \"has_dis_info\": arrays[\"evt_has_dis_info\"],\n", " \"prt_count\": arrays[\"evt_prt_count\"],\n", " \"prt\": ak.zip({\"id\": arrays[\"id\"],\n", " \"pdg\": arrays[\"pdg\"],\n", " \"trk_id\": arrays[\"trk_id\"],\n", " \"charge\": arrays[\"charge\"],\n", " \"dir\": ak.zip({\"x\": arrays[\"dir_x\"],\n", " \"y\": arrays[\"dir_y\"],\n", " \"z\": arrays[\"dir_z\"]}, with_name=\"point3\"),\n", " \"p\": arrays[\"p\"],\n", " \"px\": arrays[\"px\"],\n", " \"py\": arrays[\"py\"],\n", " \"pz\": arrays[\"pz\"],\n", " \"m\": arrays[\"m\"],\n", " \"time\": arrays[\"time\"],\n", " \"is_beam\": arrays[\"is_beam\"],\n", " \"is_stable\": arrays[\"is_stable\"],\n", " \"gen_code\": arrays[\"gen_code\"],\n", " \"mother\": ak.zip({\"id\": arrays[\"mother_id\"],\n", " \"second_id\": arrays[\"mother_second_id\"]}),\n", " \"pol\": ak.zip({\"has_info\": arrays[\"has_pol_info\"],\n", " \"x\": arrays[\"pol_x\"],\n", " \"y\": arrays[\"pol_y\"],\n", " \"z\": arrays[\"pol_z\"]}, with_name=\"point3\"),\n", " \"vtx\": ak.zip({\"has_info\": arrays[\"has_vtx_info\"],\n", " \"id\": arrays[\"vtx_id\"],\n", " \"x\": arrays[\"vtx_x\"],\n", " \"y\": arrays[\"vtx_y\"],\n", " \"z\": arrays[\"vtx_z\"],\n", " \"t\": arrays[\"vtx_t\"]}, with_name=\"point3\"),\n", " \"smear\": ak.zip({\"has_info\": arrays[\"has_smear_info\"],\n", " \"has_e\": arrays[\"smear_has_e\"],\n", " \"has_p\": arrays[\"smear_has_p\"],\n", " \"has_pid\": arrays[\"smear_has_pid\"],\n", " \"has_vtx\": arrays[\"smear_has_vtx\"],\n", " \"has_any_eppid\": arrays[\"smear_has_any_eppid\"],\n", " \"orig\": ak.zip({\"tot_e\": arrays[\"smear_orig_tot_e\"],\n", " \"p\": arrays[\"smear_orig_p\"],\n", " \"px\": arrays[\"smear_orig_px\"],\n", " \"py\": arrays[\"smear_orig_py\"],\n", " \"pz\": arrays[\"smear_orig_pz\"],\n", " \"vtx\": ak.zip({\"x\": arrays[\"smear_orig_vtx_x\"],\n", " \"y\": arrays[\"smear_orig_vtx_y\"],\n", " \"z\": arrays[\"smear_orig_vtx_z\"]},\n", " with_name=\"point3\")})})}, with_name=\"particle\")},\n", " depthlimit=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.type" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.type(events)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type written with better formatting:\n", "\n", "```javascript\n", "10000 * {\"id\": uint64,\n", " \"true\": {\"q2\": float64,\n", " \"x\": float64,\n", " \"y\": float64,\n", " \"w2\": float64,\n", " \"nu\": float64},\n", " \"has_dis_info\": int8,\n", " \"prt_count\": uint64,\n", "\n", " \"prt\": var * particle[\"id\": uint64,\n", " \"pdg\": int64,\n", " \"trk_id\": float64,\n", " \"charge\": float64,\n", " \"dir\": point3[\"x\": float64, \"y\": float64, \"z\": float64],\n", " \"p\": float64,\n", " \"px\": float64,\n", " \"py\": float64,\n", " \"pz\": float64,\n", " \"m\": float64,\n", " \"time\": float64,\n", " \"is_beam\": bool,\n", " \"is_stable\": bool,\n", " \"gen_code\": bool,\n", " \"mother\": {\"id\": uint64, \"second_id\": uint64},\n", " \"pol\": point3[\"has_info\": float64,\n", " \"x\": float64,\n", " \"y\": float64,\n", " \"z\": float64],\n", " \"vtx\": point3[\"has_info\": bool,\n", " \"id\": uint64,\n", " \"x\": float64,\n", " \"y\": float64,\n", " \"z\": float64,\n", " \"t\": float64],\n", " \"smear\": {\"has_info\": bool,\n", " \"has_e\": bool,\n", " \"has_p\": bool,\n", " \"has_pid\": bool,\n", " \"has_vtx\": bool,\n", " \"has_any_eppid\": bool,\n", " \"orig\": {\"tot_e\":\n", " float64,\n", " \"p\": float64,\n", " \"px\": float64,\n", " \"py\": float64,\n", " \"pz\": float64,\n", " \"vtx\": point3[\"x\": float64,\n", " \"y\": float64,\n", " \"z\": float64]}}]}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It means that these are now nested objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.to_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(events[0].prt[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(events[-1].prt[:10].smear.orig.vtx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(events[-1, \"prt\", :10, \"smear\", \"orig\", \"vtx\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Zipping\" arrays together into structures costs nothing (time does not scale with size of data), nor does \"projecting\" arrays out of structures." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt.px" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt.pz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is called \"projection\" because the request for `\"pz\"` is slicing through arrays and jagged arrays.\n", "\n", "The following are equivalent:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[999, \"prt\", 12, \"pz\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[\"prt\", 999, 12, \"pz\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[999, \"prt\", \"pz\", 12]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[\"prt\", 999, \"pz\", 12]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This \"object oriented view\" is a conceptual aid, not a constraint on computation. It's very fluid.\n", "\n", "Moreover, we can add behaviors to named records, like methods in object oriented programming. (See [ak.behavior](https://awkward-array.readthedocs.io/en/latest/ak.behavior.html) in the documentation.)\n", "\n", "(This is the sort of thing that a framework developer might do for the data analysts.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def point3_absolute(data):\n", " return np.sqrt(data.x**2 + data.y**2 + data.z**2)\n", "\n", "def point3_distance(left, right):\n", " return np.sqrt((left.x - right.x)**2 + (left.y - right.y)**2 + (left.z - right.z)**2)\n", "\n", "ak.behavior[np.absolute, \"point3\"] = point3_absolute\n", "ak.behavior[np.subtract, \"point3\", \"point3\"] = point3_distance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Absolute value of all smear origin vertexes\n", "abs(events.prt.smear.orig.vtx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Absolute value of the last smear origin vertex\n", "abs(events[-1].prt[-1].smear.orig.vtx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distance between each particle vertex and itself\n", "events.prt.vtx - events.prt.vtx" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distances between the first and last particle vertexes in the first 100 events\n", "events.prt.vtx[:100, 0] - events.prt.vtx[:100, -1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More methods can be added by declaring subclasses of arrays and records." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class ParticleRecord(ak.Record): \n", " @property\n", " def pt(self):\n", " return np.sqrt(self.px**2 + self.py**2)\n", "\n", "class ParticleArray(ak.Array):\n", " __name__ = \"Array\" # prevent it from writing \n", " # instead of \n", " @property\n", " def pt(self):\n", " return np.sqrt(self.px**2 + self.py**2)\n", "\n", "ak.behavior[\"particle\"] = ParticleRecord\n", "ak.behavior[\"*\", \"particle\"] = ParticleArray" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(events[0].prt[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[0].prt[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[0].prt[0].pt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(events.prt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt.pt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering (cutting) events and particles with advanced selections\n", "\n", "NumPy has a versatile selection mechanism:\n", "\n", "\n", "\n", "The same expressions apply to Awkward Arrays, and more." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First particle momentum in the first 5 events\n", "events.prt.p[:5, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First two particles in all events\n", "events.prt.pdg[:, :2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First direction of the last event\n", "events.prt.dir[-1, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy also lets you filter (cut) using an array of booleans." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt_count > 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.count_nonzero(events.prt_count > 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[events.prt_count > 100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One dimension can be selected with an array while another is selected with a slice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select events with at least two particles, then select the first two particles\n", "events.prt[events.prt_count >= 2, :2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be a good way to avoid errors from trying to select what isn't there." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " events.prt[:, 0]\n", "except Exception as err:\n", " print(type(err).__name__, str(err))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt[events.prt_count > 0, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See also [awkward-array.readthedocs.io](https://awkward-array.readthedocs.io/) for a list of operations like [ak.num](https://awkward-array.readthedocs.io/en/latest/_auto/ak.num.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.num" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(events.prt), events.prt_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even use an array of integers to select a set of indexes at once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First and last particle in each event that has at least two\n", "events.prt.pdg[ak.num(events.prt) >= 2][:, [0, -1]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But beyond NumPy, we can also use arrays of nested lists as boolean or integer selectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Array of lists of True and False\n", "abs(events.prt.vtx) > 0.10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Particles that have vtx > 0.10 for all events (notice that there's still 10000)\n", "events.prt[abs(events.prt.vtx) > 0.10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See [awkward-array.readthedocs.io](https://awkward-array.readthedocs.io/) for more, but there are functions like [ak.max](https://awkward-array.readthedocs.io/en/latest/_auto/ak.max.html), which picks the maximum in a groups.\n", "\n", " * With `axis=0`, the group is the set of all events.\n", " * With `axis=1`, the groups are particles in each event." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.max" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.max(abs(events.prt.vtx), axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Selects *events* that have a maximum *particle vertex* greater than 100\n", "events[ak.max(abs(events.prt.vtx), axis=1) > 100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between \"select particles\" and \"select events\" is the number of jagged dimensions in the array; \"reducers\" like ak.max reduce the dimensionality by one.\n", "\n", "There are other reducers like ak.any, ak.all, ak.sum..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.sum" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Is this particle an antineutron?\n", "events.prt.pdg == Particle.from_string(\"n~\").pdgid" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Are any particles in the event antineutrons?\n", "ak.any(events.prt.pdg == Particle.from_string(\"n~\").pdgid, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select events that contain an antineutron\n", "events[ak.any(events.prt.pdg == Particle.from_string(\"n~\").pdgid, axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use these techniques to make subcollections for specific particle types and attach them to the same `events` array for easy access." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"p\").pdgid)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assignments have to be through __setitem__ (brackets), not __setattr__ (as an attribute).\n", "# Is that a problem? (Assigning as an attribute would have to be implemented with care, if at all.)\n", "\n", "events[\"pions\"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"pi+\").pdgid)]\n", "events[\"kaons\"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"K+\").pdgid)]\n", "events[\"protons\"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"p\").pdgid)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.pions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.kaons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.protons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(events.prt), ak.num(events.pions), ak.num(events.kaons), ak.num(events.protons)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Flattening for plots and regularizing to NumPy for machine learning\n", "\n", "All of this structure is great, but eventually, we need to plot the data or ship it to some statistical process, such as machine learning.\n", "\n", "Most of these tools know about NumPy arrays and rectilinear data, but not Awkward Arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a design choice, Awkward Array **does not implicitly flatten**; you need to do this yourself, and you might make different choices of how to apply this lossy transformation in different circumstances.\n", "\n", "The basic tool for removing structure is [ak.flatten](https://awkward-array.readthedocs.io/en/latest/_auto/ak.flatten.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.flatten" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Turn particles-grouped-by-event into one big array of particles\n", "ak.flatten(events.prt, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Eliminate structure at all levels; produce one numerical array\n", "ak.flatten(events.prt, axis=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For plotting, you probably want to pick one field and flatten it. Flattening with `axis=1` (the default) works for one level of structure and is safer than `axis=None`.\n", "\n", "The flattening is explicit as a reminder that a histogram whose entries are particles is different from a histogram whose entries are events." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Directly through Matplotlib\n", "plt.hist(ak.flatten(events.kaons.p), bins=100, range=(0, 10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Through mplhep and boost-histgram, which are more HEP-friendly\n", "\n", "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0, 10)).fill(\n", " \n", " ak.flatten(events.kaons.p)\n", " \n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the particles are sorted (`ak.sort`/`ak.argsort` is [in development](https://github.com/scikit-hep/awkward-1.0/pull/168)), you might want to pick the first kaon from every event that has them (i.e. *use* the event structure).\n", "\n", "This is an analysis choice: *you* have to decide you want this.\n", "\n", "The `ak.num(events.kaons) > 0` selection is explicit as a reminder that empty events are not counted in the histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0, 10)).fill(\n", " \n", " events.kaons.p[ak.num(events.kaons) > 0, 0]\n", " \n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or perhaps the maximum pion momentum in each event. This one must be flattened (with `axis=0`) to remove `None` values.\n", "\n", "This flattening is explicit as a reminder that empty events are not counted in the histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.max(events.kaons.p, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.flatten(ak.max(events.kaons.p, axis=1), axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0, 10)).fill(\n", " \n", " ak.flatten(ak.max(events.kaons.p, axis=1), axis=0)\n", " \n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or perhaps the momentum of the kaon with the farthest vertex. [ak.argmax](https://awkward-array.readthedocs.io/en/latest/_auto/ak.argmax.html) creates an array of integers selecting from each event." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.argmax" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.argmax(abs(events.kaons.vtx), axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.singletons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get a length-1 list containing the index of the biggest vertex when there is one\n", "# And a length-0 list when there isn't one\n", "ak.singletons(ak.argmax(abs(events.kaons.vtx), axis=1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A nested integer array like this is what we need to select kaons with the biggest vertex\n", "events.kaons[ak.singletons(ak.argmax(abs(events.kaons.vtx), axis=1))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events.kaons[ak.singletons(ak.argmax(abs(events.kaons.vtx), axis=1))].p" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Flatten the distinction between length-1 lists and length-0 lists\n", "ak.flatten(events.kaons[ak.singletons(ak.argmax(abs(events.kaons.vtx), axis=1))].p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Putting it all together...\n", "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0, 10)).fill(\n", " \n", " ak.flatten(events.kaons[ak.singletons(ak.argmax(abs(events.kaons.vtx), axis=1))].p)\n", " \n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're sending the data to a library that expects rectilinear structure, you might need to pad and clip the variable length lists.\n", "\n", "[ak.pad_none](https://awkward-array.readthedocs.io/en/latest/_auto/ak.pad_none.html) puts `None` values at the end of each list to reach a minimum length." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.pad_none" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pad them look at the first 30\n", "ak.pad_none(events.kaons.id, 3)[:30].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lengths are still irregular, so you can also `clip=True` them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pad them look at the first 30\n", "ak.pad_none(events.kaons.id, 3, clip=True)[:30].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The library we're sending this to might not be able to deal with missing values, so choose a replacement to fill them with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.fill_none" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fill with -1 <- pad them look at the first 30\n", "ak.fill_none(ak.pad_none(events.kaons.id, 3, clip=True), -1)[:30].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are still Awkward-brand arrays; the downstream library might complain if they're not NumPy-brand, so use [ak.to_numpy](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_numpy.html) or simply cast it with NumPy's `np.asarray`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.to_numpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.asarray(ak.fill_none(ak.pad_none(events.kaons.id, 3, clip=True), -1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you try to convert an Awkward Array as NumPy and structure would be lost, you get an error. (You won't accidentally eliminate structure.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " np.asarray(events.kaons.id)\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Broadcasting flat arrays and jagged arrays\n", "\n", "NumPy lets you combine arrays and scalars in a mathematical expression by first \"broadcasting\" the scalar to an array of the same length." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(np.array([1, 2, 3, 4, 5]) + 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awkward Array does the same thing, except that each element of a flat array can be broadcasted to each nested list of a jagged array.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(ak.Array([[1, 2, 3], [], [4, 5], [6]]) + np.array([100, 200, 300, 400]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful for emulating\n", "\n", "```python\n", "all_vertices = []\n", "for event in events:\n", " vertices = []\n", " for kaon in events.kaons:\n", " all_vertices.append((kaon.vtx.x - event.true.x,\n", " kaon.vtx.y - event.true.y))\n", " all_vertices.append(vertices)\n", "```\n", "\n", "where `event.true.x` and `y` have only one value per event but `kaon.vtx.x` and `y` have one per kaon." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one value per kaon one per event\n", "ak.zip([events.kaons.vtx.x - events.true.x,\n", " events.kaons.vtx.y - events.true.y])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You don't have to do anything special for this: broadcasting is a common feature of all functions that apply to more than one array.\n", "\n", "You can get it explicitly with [ak.broadcast_arrays](https://awkward-array.readthedocs.io/en/latest/_auto/ak.broadcast_arrays.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.broadcast_arrays" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.broadcast_arrays(events.true.x, events.kaons.vtx.x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combinatorics: cartesian and combinations\n", "\n", "At all levels of a physics analysis, we need to compare objects drawn from different collections.\n", "\n", " * **Gen-reco matching:** to associate a reconstructed particle with its generator-level parameters.\n", " * **Cleaning:** assocating soft photons with a reconstructed electron or leptons to a jet.\n", " * **Bump-hunting:** looking for mass peaks in pairs of particles.\n", " * **Dalitz analysis:** looking for resonance structure in triples of particles.\n", "\n", "To do this with array-at-a-time operations, use one function to generate all the combinations, \n", "\n", "\n", "\n", "apply \"flat\" operations,\n", "\n", "\n", "\n", "then use \"reducers\" to get one value per particle or per event again.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cartesian and combinations\n", "\n", "The two main \"explode\" operations are [ak.cartesian](https://awkward-array.readthedocs.io/en/latest/_auto/ak.cartesian.html) and [ak.combinations](https://awkward-array.readthedocs.io/en/latest/_auto/ak.combinations.html).\n", "\n", "The first generates the **Cartesian product** (a.k.a. cross product) of two collections **per nested list**.\n", "\n", "\n", "\n", "The second generates **distinct combinations** (i.e. \"n choose k\") of a collection with itself **per nested list**.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.cartesian" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.combinations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(ak.cartesian(([[1, 2, 3], [], [4]],\n", " [[\"a\", \"b\"], [\"c\"], [\"d\", \"e\"]])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(ak.combinations([[\"a\", \"b\", \"c\", \"d\"], [], [1, 2]], 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search for $\\Lambda^0 \\to \\pi p$, we need to compute the mass of pairs drawn from these two collections." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pairs = ak.cartesian([events.pions, events.protons])\n", "pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.unzip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def mass(pairs, left_mass, right_mass):\n", " left, right = ak.unzip(pairs)\n", " left_energy = np.sqrt(left.p**2 + left_mass**2)\n", " right_energy = np.sqrt(right.p**2 + right_mass**2)\n", " return np.sqrt((left_energy + right_energy)**2 -\n", " (left.px + right.px)**2 -\n", " (left.py + right.py)**2 -\n", " (left.pz + right.pz)**2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mass(pairs, 0.139570, 0.938272)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(\n", " ak.flatten(mass(pairs, 0.139570, 0.938272))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can improve the peak by selecting for opposite charges and large vertexes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def opposite(pairs):\n", " left, right = ak.unzip(pairs)\n", " return pairs[left.charge != right.charge]\n", "\n", "def distant(pairs):\n", " left, right = ak.unzip(pairs)\n", " return pairs[np.logical_and(abs(left.vtx) > 0.10, abs(right.vtx) > 0.10)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(\n", " ak.flatten(mass(distant(opposite(pairs)), 0.139570, 0.938272))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, all of these functions could have been methods on the pair objects for reuse.\n", "\n", "(This is to make the point that any kind of object can have methods, not just particles.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class ParticlePairArray(ak.Array):\n", " __name__ = \"Pairs\"\n", " \n", " def mass(self, left_mass, right_mass):\n", " left, right = self.slot0, self.slot1\n", " left_energy = np.sqrt(left.p**2 + left_mass**2)\n", " right_energy = np.sqrt(right.p**2 + right_mass**2)\n", " return np.sqrt((left_energy + right_energy)**2 -\n", " (left.px + right.px)**2 -\n", " (left.py + right.py)**2 -\n", " (left.pz + right.pz)**2)\n", " \n", " def opposite(self):\n", " return self[self.slot0.charge != self.slot1.charge]\n", " \n", " def distant(self, cut):\n", " return self[np.logical_and(abs(self.slot0.vtx) > cut, abs(self.slot1.vtx) > cut)]\n", "\n", "ak.behavior[\"*\", \"pair\"] = ParticlePairArray" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pairs = ak.cartesian([events.pions, events.protons], with_name=\"pair\")\n", "pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(\n", " ak.flatten(pairs.opposite().distant(0.10).mass(0.139570, 0.938272))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Self-study question:** why does the call to `mass` have to be last?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example for `ak.combinations`: $K_S \\to \\pi\\pi$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pairs = ak.combinations(events.pions, 2, with_name=\"pair\")\n", "pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 0.497611 - 0.015, 0.497611 + 0.015)).fill(\n", " ak.flatten(pairs.opposite().distant(0.10).mass(0.139570, 0.139570))\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bonus problem:** $D^0 \\to K^- \\pi^+ \\pi^0$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pizero_candidates = ak.combinations(events.prt[events.prt.pdg == Particle.from_string(\"gamma\").pdgid], 2, with_name=\"pair\")\n", "pizero = pizero_candidates[pizero_candidates.mass(0, 0) - 0.13498 < 0.000001]\n", "pizero[\"px\"] = pizero.slot0.px + pizero.slot1.px\n", "pizero[\"py\"] = pizero.slot0.py + pizero.slot1.py\n", "pizero[\"pz\"] = pizero.slot0.pz + pizero.slot1.pz\n", "pizero[\"p\"] = np.sqrt(pizero.px**2 + pizero.py**2 + pizero.pz**2)\n", "pizero" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kminus = events.prt[events.prt.pdg == Particle.from_string(\"K-\").pdgid]\n", "piplus = events.prt[events.prt.pdg == Particle.from_string(\"pi+\").pdgid]\n", "\n", "triples = ak.cartesian({\"kminus\": kminus[abs(kminus.vtx) > 0.03],\n", " \"piplus\": piplus[abs(piplus.vtx) > 0.03],\n", " \"pizero\": pizero[np.logical_and(abs(pizero.slot0.vtx) > 0.03, abs(pizero.slot1.vtx) > 0.03)]})\n", "triples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(triples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def mass2(left, left_mass, right, right_mass):\n", " left_energy = np.sqrt(left.p**2 + left_mass**2)\n", " right_energy = np.sqrt(right.p**2 + right_mass**2)\n", " return ((left_energy + right_energy)**2 -\n", " (left.px + right.px)**2 -\n", " (left.py + right.py)**2 -\n", " (left.pz + right.pz)**2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mKpi = mass2(triples.kminus, 0.493677, triples.piplus, 0.139570)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpipi = mass2(triples.piplus, 0.139570, triples.pizero, 0.1349766)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Dalitz plot doesn't look right (doesn't cut off at kinematic limits), but I'm going to leave it as an exercise for the reader." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dalitz = bh.Histogram(bh.axis.Regular(50, 0, 3), bh.axis.Regular(50, 0, 2))\n", "dalitz.fill(ak.flatten(mKpi), ak.flatten(mpipi))\n", "\n", "X, Y = dalitz.axes.edges\n", "\n", "fig, ax = plt.subplots()\n", "mesh = ax.pcolormesh(X.T, Y.T, dalitz.view().T)\n", "fig.colorbar(mesh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reducing from combinations\n", "\n", "The mass-peak examples above don't need to \"reduce\" combinations, but many applications do.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that we want to find the \"nearest photon to each electron\" (e.g. bremsstrahlung)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "electrons = events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"e-\").pdgid)]\n", "photons = events.prt[events.prt.pdg == Particle.from_string(\"gamma\").pdgid]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem with the raw output of `ak.cartesian` is that all the combinations are mixed together in the same lists." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(ak.cartesian([electrons[[\"pdg\", \"id\"]], photons[[\"pdg\", \"id\"]]])[8])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can fix this by asking for `nested=True`, which adds another level of nesting to the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(ak.cartesian([electrons[[\"pdg\", \"id\"]], photons[[\"pdg\", \"id\"]]], nested=True)[8])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All electron-photon pairs associated with a given electron are grouped in a list-within-each-list.\n", "\n", "Now we can apply reducers to this inner dimension to sum over some quantity, pick the best one, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cos_angle(pairs):\n", " left, right = ak.unzip(pairs)\n", " return left.dir.x*right.dir.x + left.dir.y*right.dir.y + left.dir.z*right.dir.z" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "electron_photons = ak.cartesian([electrons, photons], nested=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cos_angle(electron_photons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, -1, 1)).fill(\n", " ak.flatten(cos_angle(electron_photons), axis=None) # a good reason to use flatten axis=None\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pick the \"maximum according to a function\" using the same `ak.singletons(ak.argmax(f(x))` trick as above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_electron_photons = electron_photons[ak.singletons(ak.argmax(cos_angle(electron_photons), axis=2))]\n", "\n", "hep.histplot(bh.Histogram(bh.axis.Regular(100, -1, 1)).fill(\n", " ak.flatten(cos_angle(best_electron_photons), axis=None)\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By construction, `best_electron_photons` has zero or one elements in each *inner* nested list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(electron_photons, axis=2), ak.num(best_electron_photons, axis=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we no longer care about that *inner* structure, we could flatten it at `axis=2` (leaving `axis=1` untouched)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_electron_photons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.flatten(best_electron_photons, axis=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it would be better to invert the `ak.singletons` by calling `ak.firsts`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.singletons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.firsts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.firsts(best_electron_photons, axis=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because then we can get back one value for each electron (with `None` if `ak.argmax` resulted in `None` because there were no pairs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(electrons), ak.num(ak.firsts(best_electron_photons, axis=2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.all(ak.num(electrons) == ak.num(ak.firsts(best_electron_photons, axis=2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that means that we can make this \"closest photon\" an attribute of the electrons. We have now performed electron-photon matching." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "electrons[\"photon\"] = ak.firsts(best_electron_photons, axis=2)\n", "ak.to_list(electrons[8])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Current set of reducers:\n", "\n", " * [ak.count](https://awkward-array.readthedocs.io/en/latest/_auto/ak.count.html): counts the number in each group (subtly different from [ak.num](https://awkward-array.readthedocs.io/en/latest/_auto/ak.num.html) because `ak.count` is a reducer)\n", " * [ak.count_nonzero](https://awkward-array.readthedocs.io/en/latest/_auto/ak.count_nonzero.html): counts the number of non-zero elements in each group\n", " * [ak.sum](https://awkward-array.readthedocs.io/en/latest/_auto/ak.sum.html): adds up values in the group, the quintessential reducer\n", " * [ak.prod](https://awkward-array.readthedocs.io/en/latest/_auto/ak.prod.html): multiplies values in the group (e.g. 
for charges or probabilities)\n", " * [ak.any](https://awkward-array.readthedocs.io/en/latest/_auto/ak.any.html): boolean reducer for logical `or` (\"do *any* in this group satisfy a constraint?\")\n", " * [ak.all](https://awkward-array.readthedocs.io/en/latest/_auto/ak.all.html): boolean reducer for logical `and` (\"do *all* in this group satisfy a constraint?\")\n", " * [ak.min](https://awkward-array.readthedocs.io/en/latest/_auto/ak.min.html): minimum value in each group (`None` for empty groups)\n", " * [ak.max](https://awkward-array.readthedocs.io/en/latest/_auto/ak.max.html): maximum value in each group (`None` for empty groups)\n", " * [ak.argmin](https://awkward-array.readthedocs.io/en/latest/_auto/ak.argmin.html): index of the minimum value in each group (`None` for empty groups)\n", " * [ak.argmax](https://awkward-array.readthedocs.io/en/latest/_auto/ak.argmax.html): index of the maximum value in each group (`None` for empty groups)\n", "\n", "And other functions that have the same interface as a reducer (they reduce a dimension):\n", "\n", " * [ak.moment](https://awkward-array.readthedocs.io/en/latest/_auto/ak.moment.html): computes the $n^{th}$ moment in each group\n", " * [ak.mean](https://awkward-array.readthedocs.io/en/latest/_auto/ak.mean.html): computes the mean in each group\n", " * [ak.var](https://awkward-array.readthedocs.io/en/latest/_auto/ak.var.html): computes the variance in each group\n", " * [ak.std](https://awkward-array.readthedocs.io/en/latest/_auto/ak.std.html): computes the standard deviation in each group\n", " * [ak.covar](https://awkward-array.readthedocs.io/en/latest/_auto/ak.covar.html): computes the covariance in each group\n", " * [ak.corr](https://awkward-array.readthedocs.io/en/latest/_auto/ak.corr.html): computes the correlation in each group\n", " * [ak.linear_fit](https://awkward-array.readthedocs.io/en/latest/_auto/ak.linear_fit.html): computes a linear fit in each group\n", " * [ak.softmax](https://awkward-array.readthedocs.io/en/latest/_auto/ak.softmax.html): computes the softmax function in each group" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imperative, but still fast, programming in Numba\n", "\n", "Array-at-a-time operations let us manipulate dynamically typed data with compiled code (and in some cases, benefit from hardware vectorization). However, they can be complicated: finding the closest photon to each electron took more steps than it seems it ought to.\n", "\n", "Some of these things are simpler in imperative (step-by-step, scalar-at-a-time) code. Imperative Python code is slow because it has to check the data type of every object it encounters (among other things); compiled code is faster because these checks are performed once, during a compilation step, for any number of identically typed values.\n", "\n", "We can get the best of both worlds by Just-In-Time (JIT) compiling the code. [Numba](http://numba.pydata.org/) is a NumPy-centric JIT compiler for Python."
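] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, a standalone example with no Awkward Arrays involved: the classic Monte Carlo estimate of $\\pi$, timed below as interpreted Python and as Numba-compiled code."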
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numba as nb\n", "\n", "@nb.jit\n", "def monte_carlo_pi(nsamples):\n", " acc = 0\n", " for i in range(nsamples):\n", " x = np.random.random()\n", " y = np.random.random()\n", " if (x**2 + y**2) < 1.0:\n", " acc += 1\n", " return 4.0 * acc / nsamples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "\n", "# Run the pure Python function (without nb.jit)\n", "monte_carlo_pi.py_func(1000000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "\n", "# Run the compiled function\n", "monte_carlo_pi(1000000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The price for this magical speedup is that not all Python code can be accelerated; you have to be conservative with the functions and language features you use, and Numba has to recognize the data types.\n", "\n", "Numba recognizes Awkward Arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@nb.jit\n", "def lambda_mass(events):\n", " num_lambdas = 0\n", " for event in events:\n", " num_lambdas += len(event.pions) * len(event.protons)\n", "\n", " lambda_masses = np.empty(num_lambdas, np.float64)\n", " i = 0\n", " for event in events:\n", " for pion in event.pions:\n", " for proton in event.protons:\n", " pion_energy = np.sqrt(pion.p**2 + 0.139570**2)\n", " proton_energy = np.sqrt(proton.p**2 + 0.938272**2)\n", " mass = np.sqrt((pion_energy + proton_energy)**2 -\n", " (pion.px + proton.px)**2 -\n", " (pion.py + proton.py)**2 -\n", " (pion.pz + proton.pz)**2)\n", " lambda_masses[i] = mass\n", " i += 1\n", " \n", " return lambda_masses\n", "\n", "lambda_mass(events)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(\n", " lambda_mass(events)\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some constraints:\n", "\n", " * Awkward arrays are read-only structures (always true, even outside of Numba)\n", " * Awkward arrays can't be created inside a Numba-compiled function\n", "\n", "That was fine for a function that creates and returns a NumPy array, but what if we want to create something with structure?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [ak.ArrayBuilder](https://awkward-array.readthedocs.io/en/latest/_auto/ak.ArrayBuilder.html) is a general way to make data structures." 
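] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(A hedged aside: if your data already exist as Python lists and dicts, [ak.from_iter](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_iter.html) builds the same kinds of structures in one call; it drives an ArrayBuilder internally. A minimal sketch with made-up values:)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy example: ak.from_iter infers structure and types from Python objects.\n", "ak.type(ak.from_iter([[{\"x\": 1, \"y\": 1.1}], [], [{\"x\": 3, \"y\": 3.3}]]))"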
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.ArrayBuilder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder = ak.ArrayBuilder()\n", "\n", "builder.begin_list()\n", "\n", "builder.begin_record()\n", "builder.field(\"x\").integer(1)\n", "builder.field(\"y\").real(1.1)\n", "builder.field(\"z\").string(\"one\")\n", "builder.end_record()\n", "\n", "builder.begin_record()\n", "builder.field(\"x\").integer(2)\n", "builder.field(\"y\").real(2.2)\n", "builder.field(\"z\").string(\"two\")\n", "builder.end_record()\n", "\n", "builder.end_list()\n", "\n", "builder.begin_list()\n", "builder.end_list()\n", "\n", "builder.begin_list()\n", "\n", "builder.begin_record()\n", "builder.field(\"x\").integer(3)\n", "builder.field(\"y\").real(3.3)\n", "builder.field(\"z\").string(\"three\")\n", "builder.end_record()\n", "\n", "builder.end_list()\n", "\n", "ak.to_list(builder.snapshot())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ArrayBuilders can be used in Numba, albeit with some constraints:\n", "\n", " * ArrayBuilders can't be created inside a Numba-compiled function (pass them in)\n", " * The `snapshot` method (to turn it into an array) can't be used in a Numba-compiled function (use it outside)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@nb.jit(nopython=True)\n", "def make_electron_photons(events, builder):\n", "    for event in events:\n", "        builder.begin_list()\n", "        for electron in event.electrons:\n", "            best_i = -1\n", "            best_angle = -1.0\n", "            for i in range(len(event.photons)):\n", "                photon = event.photons[i]\n", "                angle = photon.dir.x*electron.dir.x + photon.dir.y*electron.dir.y + photon.dir.z*electron.dir.z\n", "                if angle > best_angle:\n", "                    best_i = i\n", "                    best_angle = angle\n", "            if best_i == -1:\n", "                builder.null()\n", "            else:\n", "                builder.append(event.photons[best_i])   # append the best match, not the last photon visited\n", "        builder.end_list()\n", "\n", "events[\"electrons\"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string(\"e-\").pdgid)]\n", "events[\"photons\"] = events.prt[events.prt.pdg == Particle.from_string(\"gamma\").pdgid]\n", "\n", "builder = ak.ArrayBuilder()\n", "make_electron_photons(events, builder)\n", "builder.snapshot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few of them are `None` (from `builder.null()`, used when there were no photons to attach to the electron)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.count_nonzero(ak.is_none(ak.flatten(builder.snapshot())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the `builder.snapshot()` otherwise matches up with `events.electrons`, so it's something we could attach to them, as before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.with_field" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events[\"electrons\"] = ak.with_field(events.electrons, builder.snapshot(), \"photon\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(events[8].electrons)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grafting jagged data onto Pandas\n", "\n", "Awkward Arrays can be Pandas columns."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({\"pions\": events.pions,\n", "                   \"kaons\": events.kaons,\n", "                   \"protons\": events.protons})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"pions\"].dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that's unlikely to be useful for very complex data structures because there aren't any Pandas functions for deeply nested structure.\n", "\n", "Instead, you'll probably want to *convert* the nested structures into the corresponding Pandas [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.pandas.df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.pandas.df(events.pions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the nested lists are represented as MultiIndex rows and the nested records are represented as MultiIndex columns, which are structures that Pandas knows how to deal with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But what about two types of particles, pions and kaons? (And let's simplify to just `\"px\", \"py\", \"pz\", \"vtx\"`.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simpler = ak.zip({\"pions\": events.pions[[\"px\", \"py\", \"pz\", \"vtx\"]],\n", "                  \"kaons\": events.kaons[[\"px\", \"py\", \"pz\", \"vtx\"]]}, depthlimit=1)\n", "ak.type(simpler)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.pandas.df(simpler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's only one row MultiIndex, so pion #1 in each event is in the same row as kaon #1. That association is probably meaningless.\n", "\n", "The issue is that a single Pandas DataFrame represents *less* information than an Awkward Array. In general, we would need a collection of DataFrames to losslessly encode an Awkward Array. (Pandas represents the data in [database normal form](https://en.wikipedia.org/wiki/Database_normalization); Awkward represents it in objects.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ak.pandas.dfs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This array corresponds to *two* Pandas DataFrames.\n", "pions_df, kaons_df = ak.pandas.dfs(simpler)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pions_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kaons_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumExpr, Autograd, and other third-party libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[NumExpr](https://numexpr.readthedocs.io/en/latest/user_guide.html) can calculate pure numerical expressions faster than NumPy because it does so in one pass. (It has a low-overhead virtual machine.)\n", "\n", "NumExpr doesn't recognize Awkward Arrays, but we have a wrapper for it."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numexpr\n", "\n", "# This works because px, py, pz are flat, like NumPy\n", "px = ak.flatten(events.pions.px)\n", "py = ak.flatten(events.pions.py)\n", "pz = ak.flatten(events.pions.pz)\n", "numexpr.evaluate(\"px**2 + py**2 + pz**2\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This doesn't work because px, py, pz have structure\n", "px = events.pions.px\n", "py = events.pions.py\n", "pz = events.pions.pz\n", "try:\n", " numexpr.evaluate(\"px**2 + py**2 + pz**2\")\n", "except Exception as err:\n", " print(type(err), str(err))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# But in this wrapped version, we broadcast and maintain structure\n", "ak.numexpr.evaluate(\"px**2 + py**2 + pz**2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly for [Autograd](https://github.com/HIPS/autograd#readme), which has an `elementwise_grad` for differentiating expressions with respect to NumPy [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html), but not Awkward Arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@ak.autograd.elementwise_grad\n", "def tanh(x):\n", " y = np.exp(-2.0 * x)\n", " return (1.0 - y) / (1.0 + y)\n", "\n", "ak.to_list(tanh([{\"x\": 0.0, \"y\": []}, {\"x\": 0.1, \"y\": [1]}, {\"x\": 0.2, \"y\": [2, 2]}, {\"x\": 0.3, \"y\": [3, 3, 3]}]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The set of third-party libraries supported this way will continue to grow. Some major plans on the horizon:\n", "\n", " * [Apache Arrow](https://arrow.apache.org/), and through it the [Parquet](https://parquet.apache.org/) file format.\n", " * The [Zarr](https://zarr.readthedocs.io/en/stable/) array delivery system.\n", " * [CuPy](https://cupy.chainer.org/) and Awkward Arrays on the GPU." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }