{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# zfit binned\n", "\n", "There are two main ways of looking at \"binned fits\"\n", "- Either an analytic shape that could be fit unbinned but is fit to binned data *because of the datasize* (typical LHCb, Belle II,...)\n", "- stacking template histograms from simulation to provide the shape and fit to binned data (typical done in CMS, ATLAS, some LHCb,...)\n", "\n", "Some templated fits with uniform binning, no analytic components and specific morphing and constraints fit into the HistFactory model, implemented in [pyhf](https://github.com/scikit-hep/pyhf).\n", "These fits make a large portion of CMS and ATLAS analyses.\n", "\n", "zfit can, in principle, reproduce them too. However, it's comparably inefficient, a lot of code and slow. The main purpose is to support *anything that is beyond HistFactory*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mplhep\n", "import numpy as np\n", "import zfit\n", "import zfit.z.numpy as znp # numpy-like backend interface\n", "from zfit import z # backend, special functions\n", "\n", "zfit.settings.set_seed(43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binned parts\n", "\n", "zfit introduces binned equivalents to unbinned components and transformations from one to the other.\n", "For example:\n", "- `SumPDF` -> `BinnedSumPDF`\n", "- `Data` -> `BinnedData`\n", "- `UnbinnedNLL` -> `BinnedNLL`\n", "\n", "There are converters and new, histogram specific PDFs and methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## From unbinned to binned\n", "\n", "Let's start with an example, namely a simple, unbinned fit that we want to perform binned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "normal_np = np.random.normal(loc=2., scale=1.3, size=10000)\n", "\n", "obs = zfit.Space(\"x\", -10, 10)\n", "\n", "mu = zfit.Parameter(\"mu\", 1., -4, 6)\n", "sigma = zfit.Parameter(\"sigma\", 1., 0.1, 10)\n", "model_nobin = zfit.pdf.Gauss(mu, sigma, obs)\n", "\n", "data_nobin = zfit.Data.from_numpy(obs, normal_np)\n", "\n", "loss_nobin = zfit.loss.UnbinnedNLL(model_nobin, data_nobin)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minimizer = zfit.minimize.Minuit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make binned\n", "nbins = 50\n", "data = data_nobin.to_binned(nbins)\n", "model = model_nobin.to_binned(data.space)\n", "loss = zfit.loss.BinnedNLL(model, data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = minimizer.minimize(loss)\n", "print(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.hesse(name=\"hesse\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.errors(name=\"errors\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binned parts in detail\n", "\n", "`to_binned` creates a binned (and `to_unbinned` an unbinned) version of objects. It takes a binned Space, a binning or (as above), an integer (in which case a uniform binning is created).\n", "\n", "This creates implicitly a new, binned space." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs_binned_auto = data.space\n", "print(obs_binned_auto)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"is_binned: {obs_binned_auto.is_binned}, binned obs binning: {obs_binned_auto.binning}\")\n", "print(f\"is_binned: {obs.is_binned}, unbinned obs binning:{obs.binning}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explicit conversion\n", "\n", "We can explicitly convert spaces, data and models to binned parts.\n", "\n", "Either number of bins for uniform binning or explicit binning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs_binned = obs.with_binning(nbins)\n", "print(obs_binned)\n", "\n", "# or we can create binnings\n", "binning_regular = zfit.binned.RegularBinning(nbins, -10, 10, name='x')\n", "binning_variable = zfit.binned.VariableBinning([-10, -6, -1, -0.1, 0.4, 3, 10], name='x')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since a binning contains all the information needed to create a Space, a binning can be used to define a space directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs_binned_variable = zfit.Space(binning=binning_variable)\n", "print(obs_binned_variable, obs_binned_variable.binning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting data, models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_nobin.to_binned(obs_binned_variable)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_nobin.to_binned(obs_binned_variable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compatibility with UHI\n", "\n", "zfit keeps compatibility with Universal Histogram Interface (UHI) and libraries that implement it (boost-histogram, hist).\n", "- `BinnedData` directly adheres to UHI (and has a `to_hist` attribute)\n", "- `BinnedPDF` has a `to_binneddata` and `to_hist` attribute\n", "\n", "Where a `BinnedData ` object is expected, a (named) UHI object is also possible. Same goes for the binning axes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h = model.to_hist()\n", "h_scaled = h * [10_000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binneddata\n", "\n", "Binned data has `counts`, `values` and `variances` attributes, it has a `binning` (aliased with axes)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BinnedPDF\n", "\n", "A binned PDF has the same methods as the unbinned counterparts, namely `pdf`, `integrate` (and their `ext_*` parts) and `sample` that can respond to binned as well as unbinned data.\n", "\n", "Additionally, there are two more methods, namely\n", "- `counts` returns the absolute counts as for a histogram. Equivalent to `ext_pdf`, `ext_integrate`, this only works if the PDF is extended.\n", "- `rel_counts` relative counts, like a histogram, but the sum is normalized to 1\n", "\n", "\n", "### Note on Counts vs Density\n", "\n", "Counts are the *integrated* density, i.e. they differ by a factor `bin_width`. For regular binning, this is \"just\" a constant factor, as it's the same for all bins,\n", "but for Variable binning, this changes \"the shape\" of the histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "binned_sample = model.sample(n=1_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting made easy\n", "\n", "This allows plotting to become a lot easier using `mplhep`, also for unbinned models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.title(\"Counts plot\")\n", "mplhep.histplot(data, label=\"data\")\n", "mplhep.histplot(model.to_hist() * [data.nevents],\n", " label=\"model\") # scaling up since model is not extended, i.e. has no yield\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.title(\"Counts plot\")\n", "mplhep.histplot(binned_sample, label=\"sampled data\")\n", "mplhep.histplot(model.to_hist() * [binned_sample.nevents],\n", " label=\"model\") # scaling up since model is not extended, i.e. has no yield\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# or using unbinned data points, we can do a density plot\n", "plt.title(\"Density plot\")\n", "mplhep.histplot(data.to_hist(), density=True, label=\"data\")\n", "x = znp.linspace(-10, 10, 200)\n", "plt.plot(x, model.pdf(x), label=\"model\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binned loss functions\n", "\n", "We used above the `BinnedNLL`, but zfit offers more, namely an extended version and a BinnedChi2 (or least-square)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(zfit.loss.__all__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fitting using histograms\n", "\n", "There are a few new PDFs that are specific to histogram-like shapes, such as morphing interpolation and shape variations.\n", "\n", "Most simple a HistogramPDF wraps a histogram and acts as a PDF.\n", "\n", "By default, these histograms are extended automatically (which can be overruled using the `extended` argument)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histpdf = zfit.pdf.HistogramPDF(h_scaled) # fixed yield\n", "print(np.sum(histpdf.counts()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sig_yield = zfit.Parameter('sig_yield', 4_000, 0, 100_000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histpdf = zfit.pdf.HistogramPDF(h, extended=sig_yield)\n", "print(np.sum(histpdf.counts()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifiers\n", "\n", "We may want to add modifiers, i.e. scale each bin by a value. `BinwiseScaleModifier` offers this functionality.\n", "\n", "Note however that these are *just free parameters* and not in any way constraint. This needs to be done manually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histpdf.space.binning.size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sig_pdf = zfit.pdf.BinwiseScaleModifier(histpdf,\n", " modifiers=True) # or we could give a list of parameters matching each bin\n", "modifiers = sig_pdf.params" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# modifiers = {f'modifier_{i}': zfit.Parameter(f'modifier_{i}', 1, 0, 10) for i in range(histpdf.space.binning.size[0])}\n", "# histpdf_scaled = zfit.pdf.BinwiseScaleModifier(histpdf, modifiers=modifiers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "modifiers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sig_pdf.get_yield()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Morphing\n", "\n", "Let's create a background from simulation. Let's assume, we have a parameter in the simulation that we're unsure about.\n", "\n", "A common used technique is to use \"morphing\": creating multiple templates and interpolating between them. Typically, they are created at +1 and -1 sigma of the\n", "nuisance parameter (however, zfit allows arbitrary values and as many as wanted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bkg_hist = zfit.Data(np.random.exponential(scale=20, size=100_000) - 10, obs=obs_binned)\n", "bkg_hist_m1 = zfit.Data.from_numpy(obs=obs,\n", " array=np.random.exponential(scale=35, size=100_000) - 10).to_binned(\n", " obs_binned)\n", "bkg_hist_m05 = zfit.Data.from_numpy(obs=obs,\n", " array=np.random.exponential(scale=26, size=100_000) - 10).to_binned(\n", " obs_binned)\n", "bkg_hist_p1 = zfit.Data.from_numpy(obs=obs,\n", " array=np.random.exponential(scale=17, size=100_000) - 10).to_binned(\n", " obs_binned)\n", "bkg_hists = {-1: bkg_hist_m1, -0.5: bkg_hist_m05, 0: bkg_hist, 1: bkg_hist_p1}\n", "bkg_histpdfs = {k: zfit.pdf.HistogramPDF(v) for k, v in bkg_hists.items()}\n", "mplhep.histplot(list(bkg_hists.values()), label=list(bkg_hists.keys()))\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alpha = zfit.Parameter(\"alpha\", 0, -3, 3)\n", "bkg_yield = zfit.Parameter(\"bkg_yield\", 15_000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bkg_pdf = zfit.pdf.SplineMorphingPDF(alpha, bkg_histpdfs, extended=bkg_yield)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with alpha.set_value(-0.6): # we can change this value to play around\n", " mplhep.histplot(bkg_pdf.to_hist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bkg_pdf = zfit.pdf.HistogramPDF(bkg_hist, extended=bkg_yield) # we don't use the spline for simplicity" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = zfit.pdf.BinnedSumPDF([sig_pdf, bkg_pdf])\n", "model.to_hist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with zfit.param.set_values([alpha] + list(modifiers.values()),\n", " [0.] + list(np.random.normal(1.0, scale=0.14, size=len(modifiers)))):\n", " data = bkg_pdf.sample(n=10_000).to_hist() + sig_pdf.sample(1000).to_hist()\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "modifier_constraints = zfit.constraint.GaussianConstraint(params=list(modifiers.values()), observation=np.ones(len(modifiers)),\n", " uncertainty=0.1 * np.ones(len(modifiers)))\n", "alpha_constraint = zfit.constraint.GaussianConstraint(alpha, 0, sigma=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loss_binned = zfit.loss.ExtendedBinnedNLL(model, data, constraints=[modifier_constraints, alpha_constraint])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = minimizer.minimize(loss_binned)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mplhep.histplot(model.to_hist(), label='model')\n", "mplhep.histplot(data, label='data')\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sig_pdf.get_yield())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binned to unbinned\n", "\n", "We can convert a histogram directly to an unbinned PDF with `to_unbinned` or smooth it out by interpolating with splines.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unbinned_spline = zfit.pdf.SplinePDF(sig_pdf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(x, unbinned_spline.pdf(x))\n", "mplhep.histplot(sig_pdf.to_hist(), density=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# hepstats\n", "\n", "As before, we can now use hepstats to do further statistical treatment (supports binned PDFs).\n", "\n", "More tutorials on hepstats can be found [in the zfit guides](https://zfit-tutorials.readthedocs.io/en/latest/tutorials/guides/README.html) or in the [hepstats tutorials](https://mybinder.org/v2/gh/scikit-hep/hepstats/master)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "vscode": { "interpreter": { "hash": "35582a6a3ca7193893daa07e79c86f9b031e623ca33bcb273b52ae17295e8545" } } }, "nbformat": 4, "nbformat_minor": 4 }