{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"# Using boost-histogram\n",
"\n",
"### Henry Schreiner, Hans Dembenski\n",
"\n",
"Setup (done for you on binder):\n",
"\n",
"```bash\n",
"conda env create\n",
"conda activate bh-talk\n",
"jupyter lab\n",
"```\n",
"\n",
"The interesting parts of the environment:\n",
"\n",
"* `boost-histogram 0.10.0`: This is the current boost-histogram.\n",
"* `hist 2.0.0a5`: You have to have a recent alpha release of Hist.\n",
"* `uproot4`: The new version that will become default by the end of the year.\n",
"* `mplhep`: The matplotlib HEP helper library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> #### Note:\n",
">\n",
"> boost-histogram has minimal, lightweight requirements and works almost anywhere. Anything beyond NumPy and boost-histogram is just for our demos later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boost_histogram as bh\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import mplhep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> #### IPython setup\n",
"> \n",
"> For the demos, let's see the last expression so it's easier to follow what's happening, even if it is an assignment. The default here is 'last_expr', but 'last_expr_or_assign' is better for demos. Also see 'last', 'all', and 'none'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%config InteractiveShell.ast_node_interactivity=\"last_expr_or_assign\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### See the SciPy 2020 talk on YouTube for general overview!\n",
"\n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# What is boost-histogram?\n",
"\n",
"The Python ecosystem is missing a good Histogram object. NumPy can perform a histogram operation, but it does not produce an object, and there are limitations and performance issues. The closest thing we have to a histogram is in ROOT."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1: Basic histograms\n",
"\n",
"Let's start with the basics. We will create a histogram using boost-histogram and fill it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1: Data\n",
"\n",
"Let's make a 1D dataset to run on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data1 = np.random.normal(3.5, 2.5, size=1_000_000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2: Histogram definition\n",
"\n",
"Now, let's make a 1D histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist1 = bh.Histogram(bh.axis.Regular(40, -2, 10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Histogram fill\n",
"\n",
"Like ROOT, we can fill _after_ we make a histogram, as many times as we want. You can fill single values, but to take advantage of the performance, you should fill with arrays."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist1.fill(data1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that the histogram has been filled. Let's explicitly check to see how many entries are in the histogram:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist1.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happened to the missing items? They are in the underflow and overflow bins:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist1.sum(flow=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like ROOT, we have overflow bins by default. We can turn them off, but they enable some powerful things like projections.\n",
"\n",
"Let's plot this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.bar(hist1.axes[0].centers, hist1.view(), width=hist1.axes[0].widths);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: You can leave off the `.view()` if you want to - histograms conform to the buffer protocol. Also, you can select the axes before or after calling `.centers`; this is very useful for ND histograms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From now on, let's be lazy, and use mplhep, which natively supports boost-histogram now."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.histplot(hist1);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should think of boost-histogram like NumPy: No plotting is built in, but the data is easy to access."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike ROOT, Histograms are built of of basic building blocks: 1 or more **axes**, and a **storage**. The storage holds **accumulators**, which can be simple doubles or ints, or more complex things that hold extra information about the operation (which might not even be a sum! (generalized histograms). Here's what it looks like for 1D:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we go to 2D, this generalizes - we have the _same_ API for ND!\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2: Drop-in replacement for NumPy\n",
"\n",
"To start using this yourself, you don't even need to change your code. Let's try the NumPy adapters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins2, edges2 = bh.numpy.histogram(data1, bins=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"b2, e2 = np.histogram(data1, bins=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins2 - b2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e2 - edges2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not bad! Let's start moving to the boost-histogram API, so we can use the plotting function we just learned about:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist2 = bh.numpy.histogram(data1, bins=\"auto\", histogram=bh.Histogram)\n",
"mplhep.histplot(hist2);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can move over to boost-histogram one step at a time! Just to be complete, we can also go back to a Numpy tuple from a Histogram object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"b2p, e2p = bh.numpy.histogram(data1, bins=10, histogram=bh.Histogram).to_numpy()\n",
"b2p == b2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And, while \"recently\" NumPy optimized 1D regular binning, we still beat optimized NumPy:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit\n",
"bh.numpy.histogram(data1, bins=100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit\n",
"np.histogram(data1, bins=100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3: More dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same API works for multiple dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist3 = bh.Histogram(bh.axis.Regular(150, -1.5, 1.5), bh.axis.Regular(100, -1, 1))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data2d = [np.random.normal(size=1_000_000) for _ in range(2)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist3.fill(*data2d)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.pcolormesh(*hist3.axes.edges.T, hist3.view().T);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is transposed because of differing indexing conventions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try a 3D histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data3d = [np.random.normal(size=1_000_000) for _ in range(3)]\n",
"\n",
"hist3d = bh.Histogram(\n",
" bh.axis.Regular(150, -5, 5),\n",
" bh.axis.Regular(100, -5, 5),\n",
" bh.axis.Regular(100, -5, 5),\n",
")\n",
"\n",
"hist3d.fill(*data3d)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's project to the first two axes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.hist2dplot(hist3d[:, :, sum]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4: UHI\n",
"\n",
" \n",
"\n",
"Let's explore the boost-histogram UHI syntax. We will start with a 1D histogram:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"h = bh.Histogram(bh.axis.Regular(100, -3.5, 3.5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fill it with some mildly interesting data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = np.concatenate([\n",
" np.random.normal(-.75,.3, 1_000_000),\n",
" np.random.normal(.75,.3, 750_000),\n",
" np.random.normal(-1.5,.2, 200_000),\n",
"])\n",
"\n",
"h.fill(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.histplot(h);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I can see that I want x from -2 to 0, in *data coordinates*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.histplot(h[bh.loc(-2):bh.loc(0)]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the contents of a bin?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"h[bh.loc(-.75)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the coordinates manually, you could do (not UHI):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"h.axes[0].index(.75)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about reducing a histogram? Let's try the previous 2D Histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.histplot(hist3[:, sum])\n",
"mplhep.histplot(hist3[sum, :]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at one part and rebin:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mplhep.hist2dplot(hist3[: 50 : bh.rebin(2), 50 :: bh.rebin(2)]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the value at `(-.75, .5)`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist3[bh.loc(-0.75), bh.loc(0.5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Distribution is state-of-the-art Scikit-HEP style\n",
"\n",
"The [Scikit-HEP developer pages](https://scikit-hep.org/developer) were heavily influenced by boost-histogram.\n",
"\n",
"\n",
" \n",
"\n",
"We now support ARM and PowerPC wheels!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:bh-talk]",
"language": "python",
"name": "conda-env-bh-talk-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}