{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coffea Casa Analysis Template" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "- [Introduction](#Introduction)\n", "- [Loading Files](#Loading-Files)\n", "- [The Processor Class](#The-Processor-Class)\n", " - [Skeleton](#Skeleton)\n", " - [Specifications](#Specifications)\n", " - [Minimal Processor](#Minimal-Processor)\n", "- [The Dask Executor](#The-Dask-Executor)\n", " - [Scheduler Setup](#Scheduler-Setup)\n", " - [Running the Analysis](#Running-the-Analysis)\n", "- [Miscellaneous](#Miscellaneous)\n", " - [ServiceX](#ServiceX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This is a template file to elucidate the structure of a typical analysis notebook on coffe-casa. We will load in sample data, create a minimal processor class, and run the Dask executor." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import coffea\n", "import coffea.processor as processor\n", "from coffea import hist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Files\n", "\n", "A dataset is parsed as a dictionary where each key is a dataset name, and each value is a list of files in that dataset. You can have multiple datasets (multiple keys), and you can have multiple files in a dataset (multiple pointers in the list). Typically, CMS files will require authentication, but coffea-casa does away with this by implementation of tokens. In order to bypass authentication, replace the redirector portion of your file with xcache; i.e., the file:\n", "\n", "`root://`**xrootd.unl.edu**`//eos/cms/store/mc/RunIIAutumn18NanoAODv7/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21_ext2-v1/260000/47DA174D-9F5A-F745-B2AA-B9F66CDADB1A.root`\n", "\n", "becomes\n", "\n", "`root://`**xcache**`//eos/cms/store/mc/RunIIAutumn18NanoAODv7/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21_ext2-v1/260000/47DA174D-9F5A-F745-B2AA-B9F66CDADB1A.root`\n", "\n", "Below, we load in two datasets. The first has six files, and the second has four." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "fileset = {'tHq': ['root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/38E83594-51BD-7D46-B96D-620DD60078A7.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/3A3BA22C-AA71-2544-810A-6DF4C6BA96FC.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/3AFB1F42-BC6D-D44E-86FD-DB93C83F88FF.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/A37B4B7A-FB5B-484D-8577-40B860D77D23.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/E3C7548E-EE40-BA45-9130-17DF56FBE537.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THQ_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/100000/F9EFC559-09E9-BB48-8150-9AA8B7F02C1C.root'],\n", " 'tHW': ['root://xcache//store/mc/RunIISummer16NanoAODv5/THW_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/70000/2806293E-D1DD-4A49-A274-0CC3BA57BBDF.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THW_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/70000/2F19962E-1DFB-A14A-91C2-30B69D5651D3.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THW_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/70000/D9744111-ED04-3F47-A52A-C18424F01609.root',\n", " 'root://xcache//store/mc/RunIISummer16NanoAODv5/THW_Hincl_13TeV-madgraph-pythia8_TuneCUETP8M1/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/70000/E4CFA095-E7DB-B449-986D-1A5D21FD1D50.root']}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Processor Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Skeleton" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class Processor(processor.ProcessorABC):\n", " def __init__(self):\n", " ''' Define histogram properties here. '''\n", " \n", " @property\n", " def accumulator(self):\n", " return self._accumulator\n", " \n", " def process(self, events):\n", " ''' Define analysis details here. '''\n", " return output\n", "\n", " def postprocess(self, accumulator):\n", " ''' Define weights, scaling, rebinning here.'''\n", " return accumulator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifications\n", "This part is pure [Coffea](https://coffeateam.github.io/coffea/reference.html). The processor class encapsulates all of our analysis. It is what we send to our executor, which forwards it to our workers. For detailed instructions on how to create the processor class, see the Coffea examples and documentation, or refer to the benchmarks and analysis in this repository. In short:\n", "\n", "`__init__`: This is where we define our histograms. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Minimal Processor\n", "We'll just plot the MET of our sample datasets. MET is an event-level property, so our arrays are flat rather than jagged, which makes things a little simpler." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class Processor(processor.ProcessorABC):\n", "    def __init__(self):\n", "        dataset_axis = hist.Cat(\"dataset\", \"\")\n", "        # Split data into 50 bins, ranging from 0 to 100.\n", "        MET_axis = hist.Bin(\"MET\", \"MET [GeV]\", 50, 0, 100)\n", "\n", "        self._accumulator = processor.dict_accumulator({\n", "            'MET': hist.Hist(\"Counts\", dataset_axis, MET_axis),\n", "        })\n", "\n", "    @property\n", "    def accumulator(self):\n", "        return self._accumulator\n", "\n", "    def process(self, events):\n", "        output = self.accumulator.identity()\n", "\n", "        dataset = events.metadata[\"dataset\"]\n", "        MET = events.MET.pt\n", "\n", "        output['MET'].fill(dataset=dataset, MET=MET)\n", "        return output\n", "\n", "    def postprocess(self, accumulator):\n", "        return accumulator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Dask Executor\n", "\n", "### Scheduler Setup\n", "This is where [Dask](https://dask.org/) comes in. Now that we have a minimal processor put together, we can execute it on our sample data. This requires an executor. Coffea comes with basic executors such as `futures_executor` and `iterative_executor`, which use strictly Pythonic tools. The Dask executor (`dask_executor`), however, is built for cluster computing, and coffea-casa enables its usage.\n", "\n", "In the JupyterLab sidebar, you should see a sidecar dedicated to Dask:\n", "\n", "*(Screenshot: the Dask sidecar in the JupyterLab sidebar.)*\n", "\n", "Drag the UNL HTCondor Cluster button into a notebook cell, and it will paste everything necessary to connect to the Dask scheduler. It should look something like this (of course, the IP will be different):\n", "\n", "*(Screenshot: the auto-generated scheduler connection code.)*\n", "\n", "The Dask workers will then connect to this scheduler when the executor is run. **Do this in the code block below!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
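, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the pasted block boils down to creating a `dask.distributed.Client` pointed at the scheduler. The sketch below shows the general shape; the address is a placeholder, and you should use the one the sidecar pastes for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "# Placeholder address; the sidecar pastes the real scheduler address.\n", "client = Client(\"tls://localhost:8786\")" ] }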
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running the Analysis\n", "\n", "Then, all we have to do is run the executor. This is done through the `processor.run_uproot_job` method. It requires the following arguments:\n", "\n", "`fileset`: The files we want to run our analysis on. In our case, the sample datasets defined earlier.\n", "\n", "`treename`: The name of the tree inside the root file. For NanoAOD files, this is always 'Events'.\n", "\n", "`executor`: The executor that we wish to use; coffea-casa is intended to be used with the Dask executor. You can also try `futures_executor` and `iterative_executor`; both can be useful for debugging when workers are throwing errors.\n", "\n", "`executor_args`: There are many optional arguments you can put in this dictionary; see the run_uproot_job [documentation](https://coffeateam.github.io/coffea/api/coffea.processor.run_uproot_job.html). At minimum, we need to point to a Dask scheduler (`'client': client`) if we're using the Dask executor; this is not needed for the futures or iterative executors. If you're using NanoEvents, then you need to say so (`'schema': processor.NanoAODSchema`).\n", "\n", "`chunksize`: Coffea will split your data into chunks with this many events. If your data has a million events and your chunksize is 250000, you'll have four chunks. There is also a `maxchunks` argument, which stops the analysis after the given number of chunks has been processed; in other words, `maxchunks=2` will only process 500000 of your million events. This can be useful for debugging." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[########################################] | 100% Completed | 30.9s\r" ] } ], "source": [ "output = processor.run_uproot_job(fileset=fileset,\n", "                                  treename=\"Events\",\n", "                                  processor_instance=Processor(),\n", "                                  executor=processor.dask_executor,\n", "                                  executor_args={'client': client, 'schema': processor.NanoAODSchema},\n", "                                  chunksize=250000)" ] }
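, { "cell_type": "markdown", "metadata": {}, "source": [ "With the output accumulated, we can draw the MET histogram we set out to plot. A minimal sketch using Coffea's `hist.plot1d` (which draws with matplotlib under the hood); overlaying on the dataset axis puts both datasets on one plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Overlay the two datasets on the dataset axis of our MET histogram.\n", "hist.plot1d(output['MET'], overlay='dataset')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Miscellaneous\n", "\n", "### ServiceX\n", "[ServiceX](https://servicex.readthedocs.io/en/latest/introduction/) is a data delivery package which uses [func_adl](https://pypi.org/project/func-adl/) to fetch data.\n", "\n", "**The coffea-casa facility is built to support ServiceX, though it is currently in experimental stages. This section will be updated as ServiceX implementation becomes more stable.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }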