{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Binned Data\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Scipp supports features for *binning* scattered data.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Terminology**\n",
    "    \n",
    "Scipp distinguishes **histogrammed** data from **binned** data:\n",
    "\n",
    "- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.\n",
    "- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a \"list\" of contributing events or values.\n",
    "  Binned data can be converted into a histogram by computing the sum or mean over all events or values in a bin.\n",
    "    \n",
    "</div>\n",
    "\n",
    "Scattered data in the context of binning refers to data values irregularly placed in, e.g., space or time.\n",
    "Binning lets us:\n",
    "\n",
    "- Map a table of position-based data to an X-Y-Z grid.\n",
    "- Map a table of position-based data to an angle such as $\\theta$.\n",
    "- Map event time stamps to time bins.\n",
    "\n",
    "The key feature here is that *binning does not actually histogram or resample data*.\n",
    "Data is kept in its original form.\n",
    "Binning provides a wrapper with a coordinate system more adequate for working with the scientific data.\n",
    "\n",
    "## From scattered to binned data\n",
    "\n",
    "We outline the underlying concepts based on a simple example.\n",
    "The scattered raw data is represented as a table with meta data for every data point (event):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipp as sc\n",
    "\n",
    "np.random.seed(1)  # Fixed for reproducibility"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider a list of measurements at various \"points\" in space.\n",
    "Here we restrict ourselves to the X-Y plane for visualization purposes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 50\n",
    "values = 10 * np.random.rand(N)\n",
    "table = sc.DataArray(\n",
    "    data=sc.array(dims=['row'], unit=sc.units.counts, values=values, variances=values),\n",
    "    coords={\n",
    "        'position': sc.array(\n",
    "            dims=['row'], values=[f'site-{i}' for i in range(N)]\n",
    "        ),\n",
    "        'x': sc.array(dims=['row'], unit='m', values=np.random.rand(N)),\n",
    "        'y': sc.array(dims=['row'], unit='m', values=np.random.rand(N)),\n",
    "    },\n",
    ")\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For every point we measured at the auxiliary coordinates `'x'` and `'y'` give the position in the X-Y plane.\n",
    "These are *not* dimension-coordinates, since our measurements are *not* on a 2-D grid, but rather points with an irregular distribution.\n",
    "`data` is literally a 1-D table of measurements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.table(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can plot this data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `'row'` dimension is not a continuous dimension but essentially just a row in our table.\n",
    "In practice, such a figure and this representation of data in general may therefore not be very useful.\n",
    "\n",
    "As an alternative view of our data we can create a scatter plot.\n",
    "We do this explicitly here to demonstrate how the content of `data` is connected to elements of the figure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(dpi=96)\n",
    "scatter = ax.scatter(\n",
    "    x=table.coords['x'].values, y=table.coords['y'].values, c=table.values\n",
    ")\n",
    "ax.set_xlabel('x [{}]'.format(table.coords['x'].unit))\n",
    "ax.set_ylabel('y [{}]'.format(table.coords['y'].unit))\n",
    "cbar = plt.colorbar(scatter)\n",
    "cbar.set_label(f\"[{table.unit}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows the distribution in space, but for real datasets with millions of points this is not practical.\n",
    "Furthermore, operating with scattered data is often inconvenient and may require knowledge of the underlying representation.\n",
    "\n",
    "We can now use [scipp.bin](../../generated/functions/scipp.bin.rst) to provide a more accessible wrapper for our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned = table.bin(y=4, x=2)\n",
    "binned"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`binned` is a 2-D data array, but it contains (a reordered copy) the original table of \"unaligned\" data.\n",
    "Each element of `binned` is a view of a section of that table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned.values[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The binning procedure based on bin edges for `'x'` and `'y'` is *not* performing the actual histogramming step.\n",
    "However, since its dimensions are defined by the bin-edge coordinates for `'x'` and `'y'`, we will see below that it behaves much like normal dense data for operations such as slicing.\n",
    "\n",
    "We create another figure to better illustrate the structure of `binned`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(dpi=96)\n",
    "buffer = binned.bins.constituents['data']\n",
    "scatter = ax.scatter(\n",
    "    x=buffer.coords['x'].values, y=buffer.coords['y'].values, c=buffer.values\n",
    ")\n",
    "ax.set_xlabel('x [{}]'.format(binned.coords['x'].unit))\n",
    "ax.set_ylabel('y [{}]'.format(binned.coords['y'].unit))\n",
    "ax.set_xticks(binned.coords['x'].values)\n",
    "ax.set_yticks(binned.coords['y'].values)\n",
    "ax.grid()\n",
    "cbar = fig.colorbar(scatter)\n",
    "cbar.set_label(f\"[{table.unit}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is essentially the same figure as the scatter plot for the original `data`.\n",
    "The differences are:\n",
    "\n",
    "- A \"grid\" (the bin edges) that is stored alongside the data.\n",
    "- All points outside the limits of the specified bin edges have been dropped\n",
    "\n",
    "`binned` can now directly be histogrammed, without the need for specifying bin boundaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned.hist().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Note that calling `binned.hist()` is equivalent to `binned.bins.sum()`, summing the data array within each bin.\n",
    "Here `sum` performs histogramming for all \"binned\" dimensions, in this case `x` and `y`.\n",
    "The resulting values in the X-Y bins are the counts accumulated from measurements at all points falling in a given bin.\n",
    "\n",
    "We can explicitly histogram with a different bin count the binned data for a higher-resolution plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "binned.hist(y=10, x=10).plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note**\n",
    "\n",
    "Unless [Plopp](https://scipp.github.io/plopp/) is used for plotting (as in the Scipp documentation, but not enabled by default yet), `plot()` will automatically histogram and resample data.\n",
    "Automatic resampling by `plot()` will be removed in the near future, due to the inconsistent results.\n",
    "In the future explicit binning will be required, and we recommend explicit histogramming already now, to easy the transition in the future.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with binned data\n",
    "\n",
    "### Slicing\n",
    "\n",
    "Binned data can be sliced as usual, e.g., to create plots of subregions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned['x', 0].hist(y=10).plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just like slicing dense variables, a slice of binned data \"drops\" all scattered data falling into areas outside the slice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s0 = binned['x', 0]\n",
    "s1 = binned['x', 1]\n",
    "print(f'total events: {binned.sum().value}')\n",
    "print(f'events x=0:   {s0.sum().value}')\n",
    "print(f'events x=1:   {s1.sum().value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can provide an intuitive way of \"filtering\" lists of data based on some property of the list items."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Masking\n",
    "\n",
    "Masks can be defined for the scattered data array, as well as the bin wrapper.\n",
    "This gives fine-grained and intuitive control, for e.g., masking invalid list entries on the one hand, and excluding regions in space on the other hand, without the need of manually determining which list entries fall into the exclusion zone.\n",
    "\n",
    "We define two masks one in the X-Y plane and one for positions.\n",
    "The position mask is added to the `masks` dict of the `bins` property:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned = binned.bin(y=10, x=10)\n",
    "x_y_mask = (binned.coords['x'][1:] < 0.5 * sc.Unit('m')) & (\n",
    "    binned.coords['y'][1:] < 0.5 * sc.Unit('m')\n",
    ")\n",
    "binned.masks['exclude'] = x_y_mask\n",
    "binned.bins.masks['broken_sensor'] = binned.bins.coords['y'] > 0.6 * sc.Unit('m')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual, more masks can be added if required, and masks can be removed as long as no reduction operation such as summing or histogramming took place.\n",
    "\n",
    "We can then plot the result.\n",
    "The mask of the underlying scattered data is applied during the histogram step, i.e., masked positions are excluded.\n",
    "The mask of the binned wrapper is indicated in the plot and carried through the histogram step, provided that the binning is not changed.\n",
    "Make sure to compare this figure with the one we obtained earlier, before masking, and note how the values of the un-masked X-Y bins have changed due to masked positions of the underlying unaligned data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "binned.hist().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Arithmetic operations\n",
    "\n",
    "A number of [arithmetic operations and other operations](computation.ipynb) for binned data arrays are supported."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Manipulating bin-based and event-based metadata\n",
    "\n",
    "#### Convert bin-based coordinate into event-based coordinate\n",
    "\n",
    "Consider binned data as above, but with a coordinate that has no corresponding event-coord.\n",
    "This could be the case, e.g., with a `time` dimension that corresponds to groups rather than bins:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "da = table.bin(y=4, x=2)\n",
    "del da.coords['y']\n",
    "del da.bins.coords['y']\n",
    "da = da.rename_dims({'y': 'time'})\n",
    "da.coords['time'] = sc.arange('time', 4, unit='s')\n",
    "da"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we, e.g., intend to erase the grouping dimension, we may nevertheless want to preserve the `time` information.\n",
    "This can be achieved using [bins_like](../../generated/functions/scipp.bins_like.rst#scipp.bins_like) to broadcast, for each bin, a single value to a \"list\" of the required size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "da.bins.coords['time'] = sc.bins_like(da, da.coords['time'])\n",
    "sc.table(da.values[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the different values for the `time` coordinate above and below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.table(da.values[2])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}