{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Files, data, event processing

\n", "\n", "#### **Quick intro to the following packages**\n", "- `uproot` - ROOT I/O library in pure Python and NumPy.\n", "- `awkward-array` - manipulate arrays of complex data structures as easily as NumPy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "
\n", " \n", "

ROOT I/O in pure Python and NumPy

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **What is ``uproot``?**\n", "\n", "**Effectively what connects HEP data (ROOT format) with the Python scientific ecosystem around NumPy!**\n", "\n", "`uproot` provides very fast, efficient, and convenient access to ROOT trees.\n", "\n", "- Pure Python + NumPy implementation of ROOT I/O.\n", "- An array-centric view of ROOT TTree data:\n", " - Branches of simple types are simple arrays.\n", " - Branches of complex types are “jagged arrays”.\n", "- High performance for large baskets, despite Python’s slowness (because all per-entry operations are performed in NumPy).\n", "- Greatest benefits: simplicity, minimal installation, set-up, and affinity with machine learning interfaces." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Why does it exist?**\n", "1. To extract columnar data (branches) from a ROOT file without invoking the\n", "event-handling infrastructure of the ROOT framework.\n", "3. To express the semantics and conventions of the ROOT file format independently\n", "of ROOT, in lieu of a formal specification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Why Python + NumPy?**\n", "- As stressed several times, the scientific Python ecosystem, including much of ML, is designed around a fundamental abstraction called the NumPy array.\n", "- Working with computer scientists is easier when you can say, \"*pip install uproot*\".\n", "- Implemented correctly, Python + NumPy doesn't have to be slow.\n", " - Finding the columnar data in a ROOT file may be done in slow Python, as long as\n", "decompression and array manipulations are done by compiled code, see the now-old-ish performance study below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **1. Getting data from a simple ROOT TTree**\n", "\n", "By \"simple\" we mean a file without *jagged structures*, or nested structures with branch sizes depending on an event-by-event basis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uproot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = uproot.open('data/sample_simple-example.root')\n", "\n", "f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ROOT files, directories, and trees are like Python dicts with keys() and values()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = f[\"events\"]\n", "t.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t['M']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uproot's main purpose is to read branches from ROOT files as NumPy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t['M'].array()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All branches can be looked at with `t.arrays()`. A subset is specified e.g. as `t.arrays(['Run', 'Event'])`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t.arrays()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can now start performing calculations. But let's avoid explicit loops and rather exploit the ecosystem ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create a Pandas DataFrame**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "\n", "df = t.arrays(library=\"pd\")\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t.arrays(['Run', 'Event', 'pt1', 'pt2'], library=\"pd\").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **2. Data with jagged structure**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = uproot.open('data/sample_muons.root')\n", "\n", "f.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches = f['Events'].arrays()\n", " \n", "branches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The jagged structure here comes from the number of muons per event, which is variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches['nMuon']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This becomes evident when checking for example the $p_T$ of all muons:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches['Muon_pt']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the $p_T$ for the muons in the first 10 events to trivially see the jagged structure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(' \\n'.join([str(elm) for elm in branches['Muon_pt'][:10]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will get back to jagged arrays in a sec. Let's first show that `uproot` also has (limited) writing functionality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 
\n", "\n", "

Manipulate arrays of complex data structures as easily as NumPy

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **What is ``awkward-array``?**\n", "\n", "Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Example - di-muons in a sample of selected muons**\n", "\n", "As seen above, the number of muons varies on an event-by-event basis:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches['nMuon']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we see any structure in those events containing a $\\mu^+ \\mu^-$ pair? Interesting since many particles are known to decay as such, e.g. $J/\\psi \\to \\mu^+ \\mu^-$.\n", "\n", "Let's first investigate the sample a bit further." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import awkward as ak" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.type(branches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`1000000 *` means that there are a 100k events, `\"Muon_pt\": var *` means that the contents of the `\"Muon_pt\"` field are jagged: there's a variable number of them per event.\n", "\n", "We could look at a few of these as Python lists and dicts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.to_list(branches[:3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches['nMuon'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing like a visual inspection, though:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.hist(branches['nMuon'], bins=10, range=(0, 10))\n", "plt.xlabel('Number of muons in event')\n", "plt.ylabel('Number of events');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quickly do the same with Scikit-HEP histogramming tools (raison d'être of the `Hist` package)?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hist import Hist\n", "Hist.new.Reg(10, 0 ,10, name=\"nMuon\", label=\"Number of muons in event\").Double().fill(branches['nMuon'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above even tells us something we did not realise from the other plot - there are 110 overflow entries. Indeed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.count(branches['nMuon'][branches['nMuon'] >=10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many muon entries are there in total?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ak.num gives the number of elements in each nested list\n", "len(branches['Muon_pt']), ak.sum(branches['nMuon']), sum(ak.num(branches['Muon_pt'])) # 235286 muons in 1e5 events" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sum(branches.nMuon)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.type(branches.Muon_pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the $p_T$ and $\\eta$ of all muons:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(ak.flatten(branches.Muon_pt), bins=100, range=(0, 100))\n", "plt.xlabel('Muon pT')\n", "plt.ylabel('Number of candidates')\n", "plt.yscale('log');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(ak.flatten(branches.Muon_eta), bins=100, range=(-2.5, 2.5))\n", "plt.xlabel('Muon $\\eta$')\n", "plt.ylabel('Number of candidates');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "len(ak.flatten(branches.Muon_pt))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "sum(branches.nMuon)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selections are done via masks. Let's create one that singles out events with a single muon:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "branches['nMuon'] == 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "single_muon_mask = branches['nMuon'] == 1\n", "\n", "print(\"There are {} single-muon events.\".format(sum(single_muon_mask)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just checking:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(branches['Muon_pt'][single_muon_mask]), branches['Muon_pt'][single_muon_mask]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))\n", "plt.xlabel('Muon $p_{\\mathrm{T}}$ [MeV]')\n", "plt.ylabel('Number of single muons / 1 MeV')\n", "plt.yscale('log')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mask to select muons within $|\\eta| <2$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eta_mask = abs(branches['Muon_eta']) < 2\n", "eta_mask" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.sum(ak.flatten(eta_mask))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.num(eta_mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ak.sum(eta_mask)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, a visual inspection never harms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))\n", "plt.title('No selection')\n", "plt.xlabel('Muon $\\eta$')\n", "plt.ylabel('Number of muons')\n", "plt.show()\n", "\n", "plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))\n", "plt.title('With $|\\eta| < 2$ selection')\n", "plt.xlabel('Muon $\\eta$')\n", "plt.ylabel('Number of muons')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(single_muon_mask & eta_mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),\n", " ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],\n", " label=['$|\\eta| < 2$', '$|\\eta| \\geq 2$'],\n", " density=True,\n", " bins=25, range=(0, 50))\n", "plt.xlabel('Muon $p_{\\mathrm{T}}$ [GeV]')\n", "plt.ylabel('Number of single muons / 2 GeV')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At last, concentrate on 2-muon events:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "two_muons_mask = branches['nMuon'] == 2\n", "two_muons_sample = branches[two_muons_mask]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sum(two_muons_mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "two_muons_sample['Muon_pt']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uproot3_methods\n", "\n", "first_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,0],\n", " two_muons_sample['Muon_eta'][:,0],\n", " two_muons_sample['Muon_phi'][:,0],\n", " two_muons_sample['Muon_mass'][:,0]\n", " )\n", "first_muon_p4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "second_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,1],\n", " two_muons_sample['Muon_eta'][:,1],\n", " two_muons_sample['Muon_phi'][:,1],\n", " two_muons_sample['Muon_mass'][:,1]\n", " )\n", "second_muon_p4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(first_muon_p4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_muon_p4.delta_r(second_muon_p4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(first_muon_p4.delta_r(second_muon_p4), bins=100)\n", "plt.xlabel('$\\Delta R$ between muons')\n", "plt.ylabel('Number of two-muon events')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are we done? No! We have not checked that the 2 muons have opposite charge ...\n", "Further refine to $\\mu^+\\mu^-$ pairs - you see where're getting ;-):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sum_p4 = first_muon_p4 + second_muon_p4\n", "opposite_sign_muons_mask = two_muons_sample['Muon_charge'][:, 0] != two_muons_sample['Muon_charge'][:, 1]\n", "dimuon_p4 = sum_p4[opposite_sign_muons_mask]\n", "dimuon_p4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "figsize_l, figsize_h = plt.rcParams[\"figure.figsize\"]\n", "plt.figure(figsize=(figsize_l*2.5, figsize_h*3.))\n", "\n", "(yvals, binedges, patches) = plt.hist(dimuon_p4.mass, bins=np.logspace(np.log10(0.1), np.log10(1000), 200))\n", "\n", "plt.xlabel('Dimuon invariant mass [GeV]')\n", "plt.ylabel('Number of dimuon candidates')\n", "plt.xscale('log')\n", "plt.yscale('log')\n", "\n", "import particle.literals as lpart\n", "from hepunits import GeV\n", " \n", "list_particles = [getattr(lpart,name) for name in ('eta', 'rho_770_0', 'omega_782','phi_1020','Jpsi_1S', 'Z_0')]\n", "\n", "for p in list_particles:\n", " # Not a very clever way to \"distribute\" the particle labels but it will do\n", " binnumber = np.searchsorted(binedges, p.mass/GeV)\n", " scaling = 1.02\n", " if p.name == 'rho(770)0':\n", " x = p.mass/GeV - 0.2\n", " scaling = 0.9\n", " elif p.name == 'omega(782)':\n", " x = p.mass/GeV + 0.3\n", " else:\n", " x = p.mass/GeV\n", " plt.text(x, yvals[binnumber-1]*scaling, '${}$'.format(p.latex_name), horizontalalignment='center', fontsize=15)\n", " \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "
\n", "\n", "

Important note

\n", "\n", "These 2 packages provide a very extensive functionality set! Be sure to check their repositories and documentation.\n", "
\n", "\n", "- `awkward-array`: https://github.com/scikit-hep/awkward-1.0.\n", "- `uproot`: https://github.com/scikit-hep/uproot4." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }