{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis tools and accumulators\n", "\n", "This is a rendered copy of [accumulators.ipynb](https://github.com/CoffeaTeam/coffea/blob/master/binder/accumulators.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Faccumulators.ipynb)\n", "\n", "Now that we know how to access data with NanoEvents and apply corrections, let's go through some useful columnar analysis tools and idioms for building _accumulators_, namely, the eventual output of a coffea processor that has natural \"merge\" semantics. The most familiar type of accumulator is the histogram, but other types are useful in certain contexts.\n", "\n", "We'll use our small sample file to demonstrate the utilities, although it won't be very interesting to analyze" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import awkward as ak\n", "from coffea.nanoevents import NanoEventsFactory, NanoAODSchema\n", "\n", "fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n", "events = NanoEventsFactory.from_root(\n", " fname,\n", " schemaclass=NanoAODSchema.v6,\n", " metadata={\"dataset\": \"DYJets\"},\n", ").events()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate some mock systematics, we'll use one of the scale factors from the applying_corrections notebook (note you will have to at least execute the cell that downloads test data in that notebook for this to work)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['scalefactors_Tight_Electron', 'scalefactors_Tight_Electron_error'])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from coffea.lookup_tools import extractor\n", "\n", "ext = extractor()\n", "ext.add_weight_sets([\"* * data/testSF2d.histo.root\"])\n", "ext.finalize()\n", "evaluator = ext.make_evaluator()\n", "evaluator.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weights\n", "\n", "This is a container for event weights and associated systematic shifts, which helps track the product of the weights (i.e. the total event weight to be used for filling histograms) as well as systematic variations to that product. Here we demo its use by constructing an event weight consisting of the generator weight, the $\\alpha_s$ uncertainty variation, and the electron ID scale factor with its associated systematic." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from coffea.analysis_tools import Weights\n", "\n", "weights = Weights(len(events))\n", "\n", "weights.add(\"genWeight\", events.genWeight)\n", "\n", "weights.add(\n", " \"alphaS\",\n", " # in NanoAOD, the generator weights are already stored with respect to nominal\n", " weight=np.ones(len(events)),\n", " # 31 => alphas(MZ)=0.1165 central value; 32 => alphas(MZ)=0.1195\n", " # per https://lhapdfsets.web.cern.ch/current/PDF4LHC15_nnlo_30_pdfas/PDF4LHC15_nnlo_30_pdfas.info\n", " # which was found by looking up the LHA ID in events.LHEPdfWeight.__doc__\n", " weightUp=events.LHEPdfWeight[:, 32],\n", " weightDown=events.LHEPdfWeight[:, 31],\n", ")\n", "\n", "eleSF = evaluator[\"scalefactors_Tight_Electron\"](events.Electron.eta, events.Electron.pt)\n", "eleSFerror = evaluator[\"scalefactors_Tight_Electron_error\"](events.Electron.eta, events.Electron.pt)\n", "weights.add(\n", " \"eleSF\",\n", " # the event weight is the product of the per-electron weights\n", " # note, in a real analysis we would first have to select electrons of interest\n", " weight=ak.prod(eleSF, axis=1),\n", " weightUp=ak.prod(eleSF + eleSFerror, axis=1),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [WeightStatistics](https://coffeateam.github.io/coffea/api/coffea.analysis_tools.WeightStatistics.html) object tracks the smallest and largest weights seen per type, as well as some other summary statistics. It is kept internally and can be accessed via `weights.weightStatistics`. This object is addable, so it can be used in an accumulator." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'genWeight': WeightStatistics(sumw=578762.4375, sumw2=27678371840.0, minw=-26331.201171875, maxw=26331.201171875, n=40),\n", " 'alphaS': WeightStatistics(sumw=40.0, sumw2=40.0, minw=1.0, maxw=1.0, n=40),\n", " 'eleSF': WeightStatistics(sumw=38.26972579956055, sumw2=36.81547546386719, minw=0.6674144268035889, maxw=1.0077519416809082, n=40)}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights.weightStatistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then the total event weight is available via" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 26331.20117188, 23933.4196682 , 24385.06528643, 17224.10489525,\n", " 26331.20117188, 26535.31910775, -23236.21447283, 26067.890625 ,\n", " 26331.20117188, 25105.68057538, 26331.20117188, 26061.82814944,\n", " 26061.82814944, -26331.20117188, 26331.20117188, 24421.85661211,\n", " -24854.62516629, 26331.20117188, -19978.69273076, 26331.20117188,\n", " 22168.68846612, -25134.17258657, 26331.20117188, 26331.20117188,\n", " 26331.20117188, 24770.33051391, 26331.20117188, 22732.29949574,\n", " 26331.20117188, 24421.85661211, 26331.20117188, 25857.1530546 ,\n", " 26331.20117188, -23903.50101615, 26331.20117188, -24770.33051391,\n", " 26331.20117188, -24906.05914783, 26331.20117188, -26331.20117188])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights.weight()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the total event weight with a given variation is available via" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 26331.20117188, 24397.53583916, 25294.0252539 , 17997.89462374,\n", " 26331.20117188, 27253.27428405, -23502.68164558, 26067.890625 ,\n", " 26331.20117188, 25230.22218679, 26331.20117188, 26779.78332574,\n", " 26779.78332574, -26331.20117188, 26331.20117188, 24885.97278307,\n", " -24977.92136851, 26331.20117188, -20983.93483174, 26331.20117188,\n", " 22734.27398292, -25869.18763094, 26331.20117188, 26331.20117188,\n", " 26331.20117188, 25448.12088952, 26331.20117188, 22998.76666849,\n", " 26331.20117188, 24885.97278307, 26331.20117188, 26254.27086417,\n", " 26331.20117188, -24141.51091823, 26331.20117188, -25448.12088952,\n", " 26331.20117188, -25583.84952343, 26331.20117188, -26331.20117188])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights.weight(\"eleSFUp\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "all variations tracked by the `weights` object are available via" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'alphaSDown', 'alphaSUp', 'eleSFDown', 'eleSFUp'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights.variations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PackedSelection\n", "\n", "This class can store several boolean arrays in a memory-efficient mannner and evaluate arbitrary combinations of boolean requirements in an CPU-efficient way. Supported inputs include 1D numpy or awkward arrays. This makes it a good tool to form analysis signal and control regions, and to implement cutflow or \"N-1\" plots.\n", "\n", "Below we create a packed selection with some typical selections for a Z+jets study, to be used later to form same-sign and opposite-sign $ee$ and $\\mu\\mu$ event categories/regions." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['twoElectron', 'eleOppSign', 'noElectron', 'twoMuon', 'muOppSign', 'noMuon', 'leadPt20']\n" ] } ], "source": [ "from coffea.analysis_tools import PackedSelection\n", "\n", "selection = PackedSelection()\n", "\n", "selection.add(\"twoElectron\", ak.num(events.Electron) == 2)\n", "selection.add(\"eleOppSign\", ak.sum(events.Electron.charge, axis=1) == 0)\n", "selection.add(\"noElectron\", ak.num(events.Electron) == 0)\n", "\n", "selection.add(\"twoMuon\", ak.num(events.Muon) == 2)\n", "selection.add(\"muOppSign\", ak.sum(events.Muon.charge, axis=1) == 0)\n", "selection.add(\"noMuon\", ak.num(events.Muon) == 0)\n", "\n", "\n", "selection.add(\n", " \"leadPt20\",\n", " # assuming one of `twoElectron` or `twoMuon` is imposed, this implies at least one is above threshold\n", " ak.any(events.Electron.pt >= 20.0, axis=1) | ak.any(events.Muon.pt >= 20.0, axis=1)\n", ")\n", "\n", "print(selection.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate a boolean mask (e.g. to filter events) we can use the `selection.all(*names)` function, which will compute the AND of all listed boolean selections" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, True, True, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection.all(\"twoElectron\", \"noMuon\", \"leadPt20\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also be more specific and require that a specific set of selections have a given value (with the unspecified ones allowed to be either true or false) using `selection.require`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, False, True, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection.require(twoElectron=True, noMuon=True, eleOppSign=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the python syntax for passing an arguments variable, we can easily implement a \"N-1\" style selection" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Events passing all cuts, ignoring noMuon: 3\n", "Events passing all cuts, ignoring twoElectron: 10\n", "Events passing all cuts, ignoring leadPt20: 5\n", "Events passing all cuts: 3\n" ] } ], "source": [ "allCuts = {\"twoElectron\", \"noMuon\", \"leadPt20\"}\n", "for cut in allCuts:\n", " nev = selection.all(*(allCuts - {cut})).sum()\n", " print(f\"Events passing all cuts, ignoring {cut}: {nev}\")\n", "\n", "nev = selection.all(*allCuts).sum()\n", "print(f\"Events passing all cuts: {nev}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accumulator semantics\n", "\n", "Prior to coffea 0.7.2, accumulators were more explicit in that they inherited from [AccumulatorABC](https://coffeateam.github.io/coffea/api/coffea.processor.AccumulatorABC.html). Such accumulators are still supported, but now the requirements for an acuumulator output of a processor have been relaxed to the protocol:\n", "```python\n", "class Addable(Protocol):\n", " def __add__(self: T, other: T) -> T:\n", " ...\n", "\n", "\n", "Accumulatable = Union[Addable, MutableSet, MutableMapping]\n", "```\n", "\n", "Essentially, any item that can be added, any set, or any dictionary of the above, can be used as an accumulator:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'num': 4, 'val': 11.1, 'thing': {'a', 'b', 'c'}}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from coffea.processor import accumulate\n", "\n", "accumulate([\n", " {\"num\": 0, \"val\": 3.1},\n", " {\"num\": 2, \"val\": 4.0, \"thing\": {\"a\", \"b\"}},\n", " {\"num\": 2, \"val\": 4.0, \"thing\": {\"a\", \"c\"}},\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `hist` histograms are addable, they can be used as well:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0. 3. 0. 2. 0. 1. 0. 0. 0. 0.]\n" ] }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", "\n", "0\n", "\n", "\n", "1\n", "\n", "\n", "Axis 0\n", "\n", "\n", "\n", "
\n", "
\n", "Regular(10, 0, 1, label='Axis 0')
\n", "
\n", "Double() Σ=6.0\n", "\n", "
\n", "
\n", "" ], "text/plain": [ "Hist(Regular(10, 0, 1, label='Axis 0'), storage=Double()) # Sum: 6.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import hist\n", "\n", "def makehist():\n", " return hist.Hist.new.Reg(10, 0, 1).Double()\n", "\n", "h = accumulate([\n", " makehist().fill([0.1, 0.1, 0.3]),\n", " makehist().fill([0.1, 0.3, 0.5]),\n", "])\n", "print(h.values())\n", "h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bringing it together\n", "\n", "Let's build an output accumulator that stores, per dataset:\n", " - the sum of weights for the events processed, to use for later luminosity-normalizing the yields;\n", " - a histogram of the dilepton invariant mass, with category axes for various selection regions of interest and systematics; and\n", " - the weight statistics, for debugging purposes\n", " " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'DYJets': {'sumw': 578762.44,\n", " 'mass': Hist(\n", " StrCategory(['ee', 'eeSS', 'mm', 'mmSS'], name='region', label='region'),\n", " StrCategory(['nominal', 'alphaSUp', 'eleSFUp', 'alphaSDown', 'eleSFDown'], name='systematic', label='systematic'),\n", " Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]'),\n", " storage=Weight()) # Sum: WeightedSum(value=621215, variance=2.20829e+10),\n", " 'weightStats': {'genWeight': WeightStatistics(sumw=578762.4375, sumw2=27678371840.0, minw=-26331.201171875, maxw=26331.201171875, n=40),\n", " 'alphaS': WeightStatistics(sumw=40.0, sumw2=40.0, minw=1.0, maxw=1.0, n=40),\n", " 'eleSF': WeightStatistics(sumw=38.26972579956055, sumw2=36.81547546386719, minw=0.6674144268035889, maxw=1.0077519416809082, n=40)}}}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regions = {\n", " \"ee\": {\"twoElectron\": True, \"noMuon\": True, \"leadPt20\": True, \"eleOppSign\": True},\n", " \"eeSS\": {\"twoElectron\": True, \"noMuon\": True, \"leadPt20\": True, \"eleOppSign\": False},\n", " \"mm\": {\"twoMuon\": True, \"noElectron\": True, \"leadPt20\": True, \"muOppSign\": True},\n", " \"mmSS\": {\"twoMuon\": True, \"noElectron\": True, \"leadPt20\": True, \"muOppSign\": False},\n", "}\n", "\n", "masshist = (\n", " hist.Hist.new\n", " .StrCat(regions.keys(), name=\"region\")\n", " .StrCat([\"nominal\"] + list(weights.variations), name=\"systematic\")\n", " .Reg(60, 60, 120, name=\"mass\", label=\"$m_{ll}$ [GeV]\")\n", " .Weight()\n", ")\n", "\n", "for region, cuts in regions.items():\n", " goodevent = selection.require(**cuts)\n", "\n", " if region.startswith(\"ee\"):\n", " mass = events.Electron[goodevent].sum().mass\n", " elif region.startswith(\"mm\"):\n", " mass = events.Muon[goodevent].sum().mass\n", "\n", " masshist.fill(\n", " region=region,\n", " systematic=\"nominal\",\n", " mass=mass,\n", " weight=weights.weight()[goodevent],\n", " )\n", " for syst in weights.variations:\n", " masshist.fill(\n", " region=region,\n", " systematic=syst,\n", " mass=mass,\n", " weight=weights.weight(syst)[goodevent],\n", " )\n", "\n", "out = {\n", " events.metadata[\"dataset\"]: {\n", " \"sumw\": ak.sum(events.genWeight),\n", " \"mass\": masshist,\n", " \"weightStats\": weights.weightStatistics,\n", " }\n", "}\n", "\n", "out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below demonstrates that indeed this output is accumulatable. So if we were to return such a result from a coffea processor, we would be all ready to process thousands of files!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'DYJets': {'mass': Hist(\n", " StrCategory(['ee', 'eeSS', 'mm', 'mmSS'], name='region', label='region'),\n", " StrCategory(['nominal', 'alphaSUp', 'eleSFUp', 'alphaSDown', 'eleSFDown'], name='systematic', label='systematic'),\n", " Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]'),\n", " storage=Weight()) # Sum: WeightedSum(value=1.24243e+06, variance=4.41659e+10),\n", " 'weightStats': {'genWeight': WeightStatistics(sumw=1157524.875, sumw2=55356743680.0, minw=-26331.201171875, maxw=26331.201171875, n=80),\n", " 'alphaS': WeightStatistics(sumw=80.0, sumw2=80.0, minw=1.0, maxw=1.0, n=80),\n", " 'eleSF': WeightStatistics(sumw=76.5394515991211, sumw2=73.63095092773438, minw=0.6674144268035889, maxw=1.0077519416809082, n=80)},\n", " 'sumw': 1157524.9}}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accumulate([out, out])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mass histogram itself is not very interesting with only 40 input events, however" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", "\n", "60\n", "\n", "\n", "120\n", "\n", "\n", "$m_{ll}$ [GeV]\n", "\n", "\n", "\n", "
\n", "
\n", "Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]')
\n", "
\n", "Weight() Σ=WeightedSum(value=126744, variance=4.49114e+09)\n", "\n", "
\n", "
\n", "" ], "text/plain": [ "Hist(Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]'), storage=Weight()) # Sum: WeightedSum(value=126744, variance=4.49114e+09)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out[\"DYJets\"][\"mass\"][sum, \"nominal\", :]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }