{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HistFactory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pyhf` stands for **py**thon-based **H**ist**F**actory.\n", "\n", "It's a tool for statistical analysis of data in High Energy Physics.\n", "\n", "In this chapter, we will cover\n", "* What HistFactory is in general\n", "* What pyhf is specifically (and what it is not)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistical Analysis\n", "\n", "We divide analyses into the type of fit being performed:\n", "* unbinned analysis (based on individual observed events)\n", "* binned analyses (based on aggregation of events)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <img alt=\"WHgamgam unbinned distribution\" src=\"https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-23/fig_04d.png\" width=300 style=\"display: inline\" />\n", " <img alt=\"SUSY MBJ binned distribution\" src=\"https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CONFNOTES/ATLAS-CONF-2018-041/fig_08a.png\" width=400 style=\"display: inline\" />\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n", "\n", "So what uses HistFactory?\n", "* TRexFitter\n", "* WSMaker\n", "* HistFitter\n", "\n", "Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Binned?\n", "\n", "Most likely, one performs a binned analysis if no functional form of the p.d.f. is known. Instead, you make approximations (re: educated guesses) as to this functional form through histograms.\n", "\n", "What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n", "* binning\n", "* counts\n", "* errors\n", "\n", "Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are the ingredients?\n", "\n", "Once you have a model, you can perform inference such as\n", "* exclusion fit (upper limits)\n", "* discovery fit (lower limits)\n", "* measurement (two-sided intervals)\n", "* parameter scans\n", "* impact plots\n", "* pull plots\n", "* ...\n", "\n", "<img src=\"https://raw.githubusercontent.com/scikit-hep/pyhf/master/docs/_static/img/README_1bin_example.png\" alt=\"common operation - parameter scan\" width=400 />\n", "\n", "Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n", "\n", "$$\n", "p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b)\n", "$$\n", "\n", "Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = [1, 2, 3]\n", "observed = [3, 4, 4]\n", "expected_yields = [3.7, 3.2, 2.5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n", "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", "ax.set_ylim(0, 6)\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we don't always often have just a single (MC) sample, and $\\lambda$ is often the sum of multiple sample yields\n", "\n", "$$\n", "\\lambda = \\sum_{\\mathrm{sample}\\ s} \\lambda_s\n", "$$\n", "\n", "A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n", "\n", "$$\n", "p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b) \\qquad \\lambda_b = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = [1, 2, 3]\n", "observed = [3, 4, 4]\n", "background = [3.0, 1.5, 1.0]\n", "signal = [0.7, 1.7, 1.5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", "ax.bar(\n", " bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n", ")\n", "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", "ax.set_ylim(0, 6)\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc...\n", "\n", "Suppose we wanted a $\\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. 
expands to be\n", "\n", "$$\n", "p(n|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n", "$$\n", "\n", "where the background normalization is fixed at $\mu_{\mathrm{background}} = 1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@widgets.interact(mu=(0, 5, 0.1))\n", "def draw_plot(mu=1):\n", "    bins = [1, 2, 3]\n", "    observed = [3, 4, 4]\n", "    background = [3.0, 1.5, 1.0]\n", "    signal = [i * mu for i in [0.7, 1.7, 1.5]]\n", "\n", "    print(f\"signal: {signal}\")\n", "    print(f\"background: {background}\")\n", "    print(f\"observed: {observed}\\n\")\n", "\n", "    fig, ax = plt.subplots()\n", "    ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", "    ax.bar(\n", "        bins,\n", "        signal,\n", "        1.0,\n", "        label=r\"signal\",\n", "        edgecolor=\"blue\",\n", "        bottom=background,\n", "        alpha=0.5,\n", "    )\n", "    ax.scatter(bins, observed, color=\"black\", label=\"observed\")\n", "    ax.set_ylim(0, 6)\n", "    ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One final ingredient in building up our simplified p.d.f. is **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constrain the parametrization (e.g. Jet Energy Scale), so we have stronger confidence that the true parameter lies within a certain range.\n", "\n", "For some parameters in a statistical model, all we have to infer their values is the given analysis. These parameters are **unconstrained** ($\eta$):\n", "\n", "$$\n", "p(n | \lambda(\theta))\n", "$$\n", "\n", "For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f. These parameters are **constrained** ($\chi$).\n", "\n", "$$\n", "p_\chi(a | \chi)\n", "$$\n", "\n", "where $\theta = \{\eta, \chi\}$. This constraint term can generally be anything, but most of the time in HistFactory it's a Gaussian or a Poisson. The p.d.f. expands to be\n", "\n", "$$\n", "p(n,a|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n", "$$\n", "\n", "For this simple example, let's consider a Gaussian centered at $\mu=0$ with $\sigma=1$ for constraining the normalization on the background, where an up-variation ($\mu_b = +1$) scales it by 1.3 and a down-variation ($\mu_b = -1$) scales it by 0.8."
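] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the interactive demo below, here is a minimal standalone sketch (plain `scipy` with a simple piecewise-linear interpolation, rather than pyhf's exact interpolation codes; the helper names are just for illustration) of what this constrained likelihood evaluates to for a given choice of $(\mu, \mu_b)$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a minimal standalone sketch (not pyhf): evaluate\n", "# p(n, a | lambda(theta)) = prod_b Pois(n_b | lambda_b(theta)) * Gauss(a=0 | mu_b, sigma=1)\n", "# for the toy signal + background model above\n", "import numpy as np\n", "from scipy.stats import norm, poisson\n", "\n", "\n", "def toy_interpolate(down, nom, up, alpha):\n", "    # simple piecewise-linear interpolation between down/nominal/up variations\n", "    return nom + (up - nom) * alpha if alpha >= 0 else nom + (nom - down) * alpha\n", "\n", "\n", "def toy_constrained_likelihood(mu=1.0, mu_b=0.0):\n", "    observed = [3, 4, 4]\n", "    background = [b * toy_interpolate(0.8, 1.0, 1.3, mu_b) for b in [3.0, 1.5, 1.0]]\n", "    signal = [s * mu for s in [0.7, 1.7, 1.5]]\n", "    expected = np.add(signal, background)\n", "    main = np.prod(poisson.pmf(observed, expected))  # product of per-bin Poissons\n", "    constraint = norm.pdf(mu_b, loc=0.0, scale=1.0)  # Gaussian auxiliary term\n", "    return main * constraint\n", "\n", "\n", "print(toy_constrained_likelihood(mu=1.0, mu_b=0.0))\n", "print(toy_constrained_likelihood(mu=1.0, mu_b=0.5))"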
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gaussian_constraint(mu_b=0.0):\n", " return norm.pdf(mu_b, loc=0.0, scale=1.0)\n", "\n", "\n", "# interpolating\n", "def interpolate(down, nom, up, alpha):\n", " if alpha >= 0:\n", " return (up - nom) * alpha + 1\n", " else:\n", " return 1 - (down - nom) * alpha\n", "\n", "\n", "@widgets.interact(mu=(0, 5, 0.1), mu_b=(-1, 1, 0.1))\n", "def draw_plot(mu=1, mu_b=0):\n", " bins = [1, 2, 3]\n", " observed = [3, 4, 4]\n", " background = [i * interpolate(0.8, 1.0, 1.3, mu_b) for i in [3.0, 1.5, 1.0]]\n", " signal = [i * mu for i in [0.7, 1.7, 1.5]]\n", "\n", " print(f\"signal: {signal}\")\n", " print(f\"background: {background}\")\n", " print(f\"observed: {observed}\")\n", " print(\n", " f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n", " )\n", "\n", " fig, ax = plt.subplots()\n", " ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n", " ax.bar(\n", " bins,\n", " signal,\n", " 1.0,\n", " label=r\"signal\",\n", " edgecolor=\"blue\",\n", " bottom=background,\n", " alpha=0.5,\n", " )\n", " ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n", " ax.set_ylim(0, 6)\n", " ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n", "\n", "$$\n", "p_\\text{main} = p_\\text{channel1} * p_\\text{channel2} * p_\\text{channel3} \\cdots\n", "$$\n", "\n", "So in fact, we can then expand out the likelihood definition further\n", "\n", "$$\n", "p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n", "$$\n", "\n", "As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n", "* our main model, which consists of\n", " * several channels (regions, histograms, etc), where\n", " * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n", " * the expected value is the sum of various samples, where\n", " * each samples expected value can be a function of parameters (or modifiers)\n", "* the constraint model, which consists of\n", " * several auxiliary measurements, where\n", " * each measurement comes with auxiliary data\n", " \n", "It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. 
Using HistFactory, we can\n", "* describe observed event rates and expected event rates\n", "* use well-defined **modifiers** to express parameterizations of the expected event rates\n", "* use well-defined **interpolation** mechanisms to derive expected event rates (if needed)\n", "* automatically handle auxiliary measurements / additional constraint terms\n", "\n", "*Note: if you're curious about interpolation and interesting challenges, see the next chapter.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pyhf\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until 2018, HistFactory was only implemented using ROOT, RooStats, and RooFit (+ MINUIT). pyhf provides two separate pieces:\n", "* a schema for serializing the HistFactory workspace in plain-text formats, such as JSON\n", "* a toolkit that interacts with and manipulates HistFactory workspaces\n", "\n", "Why is this crucial? HistFactory in ROOT is a combination of loosely-linked XML+ROOT files\n", "* XML for structure\n", "* ROOT for storing data\n", "\n", "These are then processed through the `hist2workspace` command to get the ROOT workspace that RooStats/RooFit use. As an example, let's look at the multichannel HistFactory XML+ROOT files provided as part of this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -lhR data/multichannel_histfactory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we have two folders:\n", "\n", "* `config` which provides\n", "  * the XML HistFactory schema [`HistFactorySchema.dtd`](./data/multichannel_histfactory/config/HistFactorySchema.dtd)\n", "  * a top-level [`example.xml`](./data/multichannel_histfactory/config/example.xml)\n", "  * signal region and control region structures\n", "* `data` which provides the stored histograms in [`data.root`](./data/multichannel_histfactory/data/data.root)\n", "\n", "Let's just look at the XML structure for now. What does the top-level look like?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat -n data/multichannel_histfactory/config/example.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This top-level specifies:\n", "* 15: the HistFactory XML schema\n", "* 17: the workspace definition\n", "* 18,19: channel definitions (links to other files)\n", "* 20: a measurement `GaussExample` with specifications for luminosity, the parameter of interest, and setting `lumi` constant\n", "\n", "What does the signal region look like?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat -n data/multichannel_histfactory/config/example_signal.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This signal region specifies:\n", "* 16: the HistFactory XML schema\n", "* 18: the channel definition along with the path to the file where the data for this channel is stored\n", "* 19: where the data histogram for this channel is located\n", "* 20, 23: sample definitions for `signal` and `bkg`, with each sample's expected event rate stored under `HistoName` in the corresponding ROOT file\n", "* 21: a parameter `SigXsecOverSM` which is an unconstrained normalization factor\n", "* 24: a parameter `uncorrshape_signal` which is a Poisson-constrained shape systematic, with the corresponding auxiliary data stored under `HistoName` in the corresponding ROOT file\n", "\n", "As you can see, this works fine. 
It's a little bulky, with a lot of loosely tied-together information, but it fully specifies the HistFactory model we've discussed so far.\n", "\n", "In the next chapter, we'll learn how to use `pyhf` to convert to the HistFactory JSON representation." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.7" } }, "nbformat": 4, "nbformat_minor": 4 }