{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HistFactory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pyhf` stands for **py**thon-based **H**ist**F**actory.\n", "\n", "It's a tool for statistical analysis of data in High Energy Physics.\n", "\n", "In this chapter, we will cover\n", "* What HistFactory is in general\n", "* What pyhf is specifically (and what it is not)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistical Analysis\n", "\n", "We divide analyses into the type of fit being performed:\n", "* unbinned analysis (based on individual observed events)\n", "* binned analyses (based on aggregation of events)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <img alt=\"WHgamgam unbinned distribution\" src=\"https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-23/fig_04d.png\" width=300 style=\"display: inline\" />\n", " <img alt=\"SUSY MBJ binned distribution\" src=\"https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CONFNOTES/ATLAS-CONF-2018-041/fig_08a.png\" width=400 style=\"display: inline\" />\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n", "\n", "So what uses HistFactory?\n", "* TRexFitter\n", "* WSMaker\n", "* HistFitter\n", "\n", "Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Binned?\n", "\n", "Most likely, one performs a binned analysis if no functional form of the p.d.f. is known. Instead, you make approximations (re: educated guesses) as to this functional form through histograms.\n", "\n", "What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n", "* binning\n", "* counts\n", "* errors\n", "\n", "Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are the ingredients?\n", "\n", "Once you have a model, you can perform inference such as\n", "* exclusion fit (upper limits)\n", "* discovery fit (lower limits)\n", "* measurement (two-sided intervals)\n", "* parameter scans\n", "* impact plots\n", "* pull plots\n", "* ...\n", "\n", "<img src=\"https://raw.githubusercontent.com/scikit-hep/pyhf/master/docs/_static/img/README_1bin_example.png\" alt=\"common operation - parameter scan\" width=400 />\n", "\n", "Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n", "\n", "$$\n", "p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b)\n", "$$\n", "\n", "Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = [1, 2, 3]\n", "observed = [3, 4, 4]\n", "expected_yields = [3.7, 3.2, 2.5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n", "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", "ax.set_ylim(0, 6)\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we don't always often have just a single (MC) sample, and $\\lambda$ is often the sum of multiple sample yields\n", "\n", "$$\n", "\\lambda = \\sum_{\\mathrm{sample}\\ s} \\lambda_s\n", "$$\n", "\n", "A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n", "\n", "$$\n", "p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b) \\qquad \\lambda_b = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = [1, 2, 3]\n", "observed = [3, 4, 4]\n", "background = [3.0, 1.5, 1.0]\n", "signal = [0.7, 1.7, 1.5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", "ax.bar(\n", " bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n", ")\n", "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", "ax.set_ylim(0, 6)\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc...\n", "\n", "Suppose we wanted a $\\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. 
expands to be\n", "\n", "$$\n", "p(n|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n", "$$\n", "\n", "where the background normalization is fixed at $\mu_{\mathrm{background}} = 1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@widgets.interact(mu=(0, 5, 0.1))\n", "def draw_plot(mu=1):\n", "    bins = [1, 2, 3]\n", "    observed = [3, 4, 4]\n", "    background = [3.0, 1.5, 1.0]\n", "    signal = [i * mu for i in [0.7, 1.7, 1.5]]\n", "\n", "    print(f\"signal: {signal}\")\n", "    print(f\"background: {background}\")\n", "    print(f\"observed: {observed}\\n\")\n", "\n", "    fig, ax = plt.subplots()\n", "    ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", "    ax.bar(\n", "        bins,\n", "        signal,\n", "        1.0,\n", "        label=r\"signal\",\n", "        edgecolor=\"blue\",\n", "        bottom=background,\n", "        alpha=0.5,\n", "    )\n", "    ax.scatter(bins, observed, color=\"black\", label=\"observed\")\n", "    ax.set_ylim(0, 6)\n", "    ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One final ingredient in building up our simplified p.d.f. is **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constrain the parametrization (e.g. Jet Energy Scale), so we have stronger confidence that the true parameter lies within a certain range.\n", "\n", "For some parameters in a statistical model, all we have to infer their values is the given analysis. These parameters are **unconstrained** ($\eta$):\n", "\n", "$$\n", "p(n | \lambda(\theta))\n", "$$\n", "\n", "For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f. These parameters are **constrained** ($\chi$).\n", "\n", "$$\n", "p_\chi(a | \chi)\n", "$$\n", "\n", "where $\theta = \{\eta, \chi\}$. This constraint term can generally be anything, but most of the time in HistFactory it's a Gaussian or a Poisson. The p.d.f. expands to be\n", "\n", "$$\n", "p(n,a|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n", "$$\n", "\n", "For this simple example, let's consider a Gaussian centered at $\mu=0$ with $\sigma=1$ for constraining the normalization on the background, where an up-variation ($\mu_b = +1$) scales it by 1.3 and a down-variation ($\mu_b = -1$) scales it by 0.8."
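] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the interactive demo below, here is a minimal standalone sketch (plain `scipy` with a simple piecewise-linear interpolation, rather than pyhf's exact interpolation codes; the helper names are just for illustration) of what this constrained likelihood evaluates to for a given choice of $(\mu, \mu_b)$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a minimal standalone sketch (not pyhf): evaluate\n", "# p(n, a | lambda(theta)) = prod_b Pois(n_b | lambda_b(theta)) * Gauss(a=0 | mu_b, sigma=1)\n", "# for the toy signal + background model above\n", "import numpy as np\n", "from scipy.stats import norm, poisson\n", "\n", "\n", "def toy_interpolate(down, nom, up, alpha):\n", "    # simple piecewise-linear interpolation between down/nominal/up variations\n", "    return nom + (up - nom) * alpha if alpha >= 0 else nom + (nom - down) * alpha\n", "\n", "\n", "def toy_constrained_likelihood(mu=1.0, mu_b=0.0):\n", "    observed = [3, 4, 4]\n", "    background = [b * toy_interpolate(0.8, 1.0, 1.3, mu_b) for b in [3.0, 1.5, 1.0]]\n", "    signal = [s * mu for s in [0.7, 1.7, 1.5]]\n", "    expected = np.add(signal, background)\n", "    main = np.prod(poisson.pmf(observed, expected))  # product of per-bin Poissons\n", "    constraint = norm.pdf(mu_b, loc=0.0, scale=1.0)  # Gaussian auxiliary term\n", "    return main * constraint\n", "\n", "\n", "print(toy_constrained_likelihood(mu=1.0, mu_b=0.0))\n", "print(toy_constrained_likelihood(mu=1.0, mu_b=0.5))"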
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gaussian_constraint(mu_b=0.0):\n", " return norm.pdf(mu_b, loc=0.0, scale=1.0)\n", "\n", "\n", "# interpolating\n", "def interpolate(down, nom, up, alpha):\n", " if alpha >= 0:\n", " return (up - nom) * alpha + 1\n", " else:\n", " return 1 - (down - nom) * alpha\n", "\n", "\n", "@widgets.interact(mu=(0, 5, 0.1), mu_b=(-1, 1, 0.1))\n", "def draw_plot(mu=1, mu_b=0):\n", " bins = [1, 2, 3]\n", " observed = [3, 4, 4]\n", " background = [i * interpolate(0.8, 1.0, 1.3, mu_b) for i in [3.0, 1.5, 1.0]]\n", " signal = [i * mu for i in [0.7, 1.7, 1.5]]\n", "\n", " print(f\"signal: {signal}\")\n", " print(f\"background: {background}\")\n", " print(f\"observed: {observed}\")\n", " print(\n", " f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n", " )\n", "\n", " fig, ax = plt.subplots()\n", " ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", " ax.bar(\n", " bins,\n", " signal,\n", " 1.0,\n", " label=r\"signal\",\n", " edgecolor=\"blue\",\n", " bottom=background,\n", " alpha=0.5,\n", " )\n", " ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", " ax.set_ylim(0, 6)\n", " ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n", "\n", "$$\n", "p_\\text{main} = p_\\text{channel1} * p_\\text{channel2} * p_\\text{channel3} \\cdots\n", "$$\n", "\n", "So in fact, we can then expand out the likelihood definition further\n", "\n", "$$\n", "p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n", "$$\n", "\n", "As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n", "* our main model, which consists of\n", " * several channels (regions, histograms, etc), where\n", " * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n", " * the expected value is the sum of various samples, where\n", " * each samples expected value can be a function of parameters (or modifiers)\n", "* the constraint model, which consists of\n", " * several auxiliary measurements, where\n", " * each measurement comes with auxiliary data\n", " \n", "It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. 
Using HistFactory, we can\n", "* describe observed event rates and expected event rates\n", "* use well-defined **modifiers** to express parameterizations of the expected event rates\n", "* use well-defined **interpolation** mechanisms to derive expected event rates (if needed)\n", "* automatically handle auxiliary measurements / additional constraint terms\n", "\n", "*Note: if you're curious about interpolation and interesting challenges, see the next chapter.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pyhf\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until 2018, HistFactory was only implemented using ROOT, RooStats, and RooFit (+ MINUIT). pyhf provides two separate pieces:\n", "* a schema for serializing the HistFactory workspace in plain-text formats, such as JSON\n", "* a toolkit that interacts with and manipulates HistFactory workspaces\n", "\n", "Why is this crucial? HistFactory in ROOT is a combination of loosely-linked XML+ROOT files\n", "* XML for structure\n", "* ROOT for storing data\n", "\n", "These are then processed through the `hist2workspace` command to get the ROOT workspace that RooStats/RooFit use. As an example, let's look at the multichannel HistFactory XML+ROOT files provided as part of this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -lhR data/multichannel_histfactory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we have two folders:\n", "\n", "* `config` which provides\n", "  * the XML HistFactory schema [`HistFactorySchema.dtd`](./data/multichannel_histfactory/config/HistFactorySchema.dtd)\n", "  * a top-level [`example.xml`](./data/multichannel_histfactory/config/example.xml)\n", "  * signal region and control region structures\n", "* `data` which provides the stored histograms in [`data.root`](./data/multichannel_histfactory/data/data.root)\n", "\n", "Let's just look at the XML structure for now. What does the top-level look like?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat -n data/multichannel_histfactory/config/example.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This top-level specifies:\n", "* 15: the HistFactory XML schema\n", "* 17: the workspace definition\n", "* 18,19: channel definitions (links to other files)\n", "* 20: a measurement `GaussExample` with specifications for luminosity, the parameter of interest, and setting `lumi` constant\n", "\n", "What does the signal region look like?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat -n data/multichannel_histfactory/config/example_signal.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This signal region specifies:\n", "* 16: the HistFactory XML schema\n", "* 18: the channel definition along with the path to the file where the data for this channel is stored\n", "* 19: where the data histogram for this channel is located\n", "* 20, 23: sample definitions for `signal` and `bkg`, with each sample's expected event rate stored under `HistoName` in the corresponding ROOT file\n", "* 21: a parameter `SigXsecOverSM` which is an unconstrained normalization factor\n", "* 24: a parameter `uncorrshape_signal` which is a Poisson-constrained shape systematic, with the corresponding auxiliary data stored under `HistoName` in the corresponding ROOT file\n", "\n", "As you can see, this works fine. 
It's a little bulky, with a lot of loosely tied-together information, but it fully specifies the HistFactory model we've discussed so far.\n", "\n", "In the next chapter, we'll learn how to use `pyhf` to convert to the HistFactory JSON representation." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.7" } }, "nbformat": 4, "nbformat_minor": 4 }