{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><h1><b><span style=\"color:blue\">Files, data, event processing</span></b></h1></center>\n",
    "\n",
    "#### **Quick intro to the following packages**\n",
    "- `uproot` - ROOT I/O library in pure Python and NumPy.\n",
    "- `awkward-array` - manipulate arrays of complex data structures as easily as NumPy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "<center>\n",
    "    <img src=\"images/logo_uproot.png\" style=\"width:150px;\"/>\n",
    "    <h2><b><span style=\"color:green\">ROOT I/O in pure Python and NumPy</span></b></h2>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **What is ``uproot``?**\n",
    "\n",
    "<span style=\"color:green\">**Effectively what connects HEP data (ROOT format) with the Python scientific ecosystem around NumPy!**</span>\n",
    "\n",
    "`uproot` provides very fast, efficient, and convenient access to ROOT trees.\n",
    "\n",
    "- Pure Python + NumPy implementation of ROOT I/O.\n",
    "- An array-centric view of ROOT TTree data:\n",
    "  - Branches of simple types are simple arrays.\n",
    "  - Branches of complex types are “jagged arrays”.\n",
    "-  High performance for large baskets, despite Python’s slowness (because all per-entry operations are performed in NumPy).\n",
    "- Greatest benefits: simplicity, minimal installation, set-up, and affinity with machine learning interfaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Why does it exist?**\n",
    "1. To extract columnar data (branches) from a ROOT file without invoking the\n",
    "event-handling infrastructure of the ROOT framework.\n",
    "3. To express the semantics and conventions of the ROOT file format independently\n",
    "of ROOT, in lieu of a formal specification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Why Python + NumPy?**\n",
    "- As stressed several times, the scientific Python ecosystem, including much of ML, is designed around a fundamental abstraction called the NumPy array.\n",
    "- Working with computer scientists is easier when you can say, \"*pip install uproot*\".\n",
    "- Implemented correctly, Python + NumPy doesn't have to be slow.\n",
    "  - Finding the columnar data in a ROOT file may be done in slow Python, as long as\n",
    "decompression and array manipulations are done by compiled code, see the now-old-ish performance study below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **1. Getting data from a simple ROOT TTree**\n",
    "\n",
    "By \"simple\" we mean a file without *jagged structures*, or nested structures with branch sizes depending on an event-by-event basis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import uproot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = uproot.open('data/sample_simple-example.root')\n",
    "\n",
    "f"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ROOT files, directories, and trees are like Python dicts with keys() and values()."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = f[\"events\"]\n",
    "t.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t['M']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uproot's main purpose is to read branches from ROOT files as NumPy arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t['M'].array()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All branches can be looked at with `t.arrays()`. A subset is specified e.g. as `t.arrays(['Run', 'Event'])`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t.arrays()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One can now start performing calculations. But let's avoid explicit loops and rather exploit the ecosystem ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Create a Pandas DataFrame**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "\n",
    "df = t.arrays(library=\"pd\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t.arrays(['Run', 'Event', 'pt1', 'pt2'], library=\"pd\").head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **2. Data with jagged structure**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = uproot.open('data/sample_muons.root')\n",
    "\n",
    "f.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches = f['Events'].arrays()\n",
    "                               \n",
    "branches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The jagged structure here comes from the number of muons per event, which is variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches['nMuon']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This becomes evident when checking for example the $p_T$ of all muons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches['Muon_pt']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the $p_T$ for the muons in the first 10 events to trivially see the jagged structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(' \\n'.join([str(elm) for elm in branches['Muon_pt'][:10]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will get back to jagged arrays in a sec. Let's first show that `uproot` also has (limited) writing functionality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&nbsp;<br><center><img src=\"images/logo_awkward-array.png\" style=\"width: 150px;\"/></center>\n",
    "\n",
    "<center><h2><b><span style=\"color:green\">Manipulate arrays of complex data structures as easily as NumPy</span></b></h2></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **What is ``awkward-array``?**\n",
    "\n",
    "Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Example - di-muons in a sample of selected muons**\n",
    "\n",
    "As seen above, the number of muons varies on an event-by-event basis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches['nMuon']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can we see any structure in those events containing a $\\mu^+ \\mu^-$ pair? Interesting since many particles are known to decay as such, e.g. $J/\\psi \\to \\mu^+ \\mu^-$.\n",
    "\n",
    "Let's first investigate the sample a bit further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import awkward as ak"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.type(branches)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`1000000 *` means that there are a 100k events, `\"Muon_pt\": var *` means that the contents of the `\"Muon_pt\"` field are jagged: there's a variable number of them per event.\n",
    "\n",
    "We could look at a few of these as Python lists and dicts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.to_list(branches[:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches['nMuon'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nothing like a visual inspection, though:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.hist(branches['nMuon'], bins=10, range=(0, 10))\n",
    "plt.xlabel('Number of muons in event')\n",
    "plt.ylabel('Number of events');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quickly do the same with Scikit-HEP histogramming tools (raison d'être of the `Hist` package)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hist import Hist\n",
    "Hist.new.Reg(10, 0 ,10, name=\"nMuon\", label=\"Number of muons in event\").Double().fill(branches['nMuon'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above even tells us something we did not realise from the other plot - there are 110 overflow entries. Indeed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.count(branches['nMuon'][branches['nMuon'] >=10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many muon entries are there in total?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ak.num gives the number of elements in each nested list\n",
    "len(branches['Muon_pt']), ak.sum(branches['nMuon']), sum(ak.num(branches['Muon_pt']))  # 235286 muons in 1e5 events"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(branches.nMuon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.type(branches.Muon_pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the $p_T$ and $\\eta$ of all muons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(ak.flatten(branches.Muon_pt), bins=100, range=(0, 100))\n",
    "plt.xlabel('Muon pT')\n",
    "plt.ylabel('Number of candidates')\n",
    "plt.yscale('log');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(ak.flatten(branches.Muon_eta), bins=100, range=(-2.5, 2.5))\n",
    "plt.xlabel('Muon $\\eta$')\n",
    "plt.ylabel('Number of candidates');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "len(ak.flatten(branches.Muon_pt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "sum(branches.nMuon)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selections are done via masks. Let's create one that singles out events with a single muon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "branches['nMuon'] == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "single_muon_mask = branches['nMuon'] == 1\n",
    "\n",
    "print(\"There are {} single-muon events.\".format(sum(single_muon_mask)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just checking:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(branches['Muon_pt'][single_muon_mask]), branches['Muon_pt'][single_muon_mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))\n",
    "plt.xlabel('Muon $p_{\\mathrm{T}}$ [MeV]')\n",
    "plt.ylabel('Number of single muons / 1 MeV')\n",
    "plt.yscale('log')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mask to select muons within $|\\eta| <2$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eta_mask = abs(branches['Muon_eta']) < 2\n",
    "eta_mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.sum(ak.flatten(eta_mask))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.num(eta_mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ak.sum(eta_mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, a visual inspection never harms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))\n",
    "plt.title('No selection')\n",
    "plt.xlabel('Muon $\\eta$')\n",
    "plt.ylabel('Number of muons')\n",
    "plt.show()\n",
    "\n",
    "plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))\n",
    "plt.title('With $|\\eta| < 2$ selection')\n",
    "plt.xlabel('Muon $\\eta$')\n",
    "plt.ylabel('Number of muons')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(single_muon_mask & eta_mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),\n",
    "          ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],\n",
    "         label=['$|\\eta| < 2$', '$|\\eta| \\geq 2$'],\n",
    "         density=True,\n",
    "         bins=25, range=(0, 50))\n",
    "plt.xlabel('Muon $p_{\\mathrm{T}}$ [GeV]')\n",
    "plt.ylabel('Number of single muons / 2 GeV')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At last, concentrate on 2-muon events:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "two_muons_mask = branches['nMuon'] == 2\n",
    "two_muons_sample = branches[two_muons_mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(two_muons_mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "two_muons_sample['Muon_pt']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import uproot3_methods\n",
    "\n",
    "first_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,0],\n",
    "                                                                   two_muons_sample['Muon_eta'][:,0],\n",
    "                                                                   two_muons_sample['Muon_phi'][:,0],\n",
    "                                                                   two_muons_sample['Muon_mass'][:,0]\n",
    "                                                                 )\n",
    "first_muon_p4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "second_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,1],\n",
    "                                                                    two_muons_sample['Muon_eta'][:,1],\n",
    "                                                                    two_muons_sample['Muon_phi'][:,1],\n",
    "                                                                    two_muons_sample['Muon_mass'][:,1]\n",
    "                                                                 )\n",
    "second_muon_p4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(first_muon_p4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_muon_p4.delta_r(second_muon_p4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(first_muon_p4.delta_r(second_muon_p4), bins=100)\n",
    "plt.xlabel('$\\Delta R$ between muons')\n",
    "plt.ylabel('Number of two-muon events')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are we done? No! We have not checked that the 2 muons have opposite charge ...\n",
    "Further refine to $\\mu^+\\mu^-$ pairs - you see where're getting ;-):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sum_p4 = first_muon_p4 + second_muon_p4\n",
    "opposite_sign_muons_mask = two_muons_sample['Muon_charge'][:, 0] != two_muons_sample['Muon_charge'][:, 1]\n",
    "dimuon_p4 = sum_p4[opposite_sign_muons_mask]\n",
    "dimuon_p4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "figsize_l, figsize_h = plt.rcParams[\"figure.figsize\"]\n",
    "plt.figure(figsize=(figsize_l*2.5, figsize_h*3.))\n",
    "\n",
    "(yvals, binedges, patches) = plt.hist(dimuon_p4.mass, bins=np.logspace(np.log10(0.1), np.log10(1000), 200))\n",
    "\n",
    "plt.xlabel('Dimuon invariant mass [GeV]')\n",
    "plt.ylabel('Number of dimuon candidates')\n",
    "plt.xscale('log')\n",
    "plt.yscale('log')\n",
    "\n",
    "import particle.literals as lpart\n",
    "from hepunits import GeV\n",
    "    \n",
    "list_particles = [getattr(lpart,name) for name in ('eta', 'rho_770_0', 'omega_782','phi_1020','Jpsi_1S', 'Z_0')]\n",
    "\n",
    "for p in list_particles:\n",
    "    # Not a very clever way to \"distribute\" the particle labels but it will do\n",
    "    binnumber = np.searchsorted(binedges, p.mass/GeV)\n",
    "    scaling = 1.02\n",
    "    if p.name == 'rho(770)0':\n",
    "        x = p.mass/GeV - 0.2\n",
    "        scaling = 0.9\n",
    "    elif p.name == 'omega(782)':\n",
    "        x = p.mass/GeV + 0.3\n",
    "    else:\n",
    "        x = p.mass/GeV\n",
    "    plt.text(x, yvals[binnumber-1]*scaling, '${}$'.format(p.latex_name), horizontalalignment='center', fontsize=15)\n",
    "        \n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "<h4><b>Important note</b></h4>\n",
    "\n",
    "These 2 packages provide a very extensive functionality set! Be sure to check their repositories and documentation.\n",
    "</div>\n",
    "\n",
    "- `awkward-array`: https://github.com/scikit-hep/awkward-1.0.\n",
    "- `uproot`: https://github.com/scikit-hep/uproot4."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}