{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GroupBy\n",
    "\n",
    "\"Group by\" refers to an implementation of the \"split-apply-combine\" approach known from [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) and [xarray](https://xarray.pydata.org/en/stable/groupby.html).\n",
    "Scipp currently supports only a limited number of operations that can be applied.\n",
    "\n",
    "## Grouping based on label values\n",
    "\n",
    "Suppose we have measured data for a number of parameter values, potentially repeating measurements with the same parameter multiple times:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipp as sc\n",
    "\n",
    "np.random.seed(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param = sc.Variable(dims=['x'], values=[1, 3, 1, 1, 5, 3])\n",
    "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(6, 16))\n",
    "values += 1.0 + param"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we store this data as a data array we obtain the following plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = sc.DataArray(\n",
    "    values,\n",
    "    coords={\n",
    "        'x': sc.Variable(dims=['x'], values=np.arange(6)),\n",
    "        'y': sc.Variable(dims=['y'], values=np.arange(16)),\n",
    "    },\n",
    ")\n",
    "sc.plot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we chose the \"measured\" values such that the three distinct values of the underlying parameter are visible.\n",
    "We can now use the split-apply-combine mechanism to transform our data into a more useful representation.\n",
    "We start by storing the parameter values (or any value to be used for grouping) as a non-dimension coordinate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.coords['param'] = param"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we call `scipp.groupby` to split the data and call `mean` on each of the groups:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = sc.groupby(data, group='param').mean('x')\n",
    "sc.plot(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apart from `mean`, `groupby` also supports `sum`, `concat`, and more. See [GroupByDataArray](../generated/classes/scipp.GroupByDataArray.rst) and [GroupByDataset](../generated/classes/scipp.GroupByDataset.rst) for a full list.\n",
    "\n",
    "## Grouping based on binned label values\n",
    "\n",
    "Grouping based on non-dimension coordinate values (also known as labels) is most useful when labels are strings or integers.\n",
    "If labels are floating-point values or cover a wide range, it is more convenient to group values into bins, i.e., all values within certain bounds are mapped into the same group.\n",
    "We modify the above example to use a contiuously-valued parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param = sc.Variable(dims=['x'], values=np.random.rand(16))\n",
    "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(16, 16))\n",
    "values += 1.0 + 5.0 * param"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = sc.DataArray(\n",
    "    values,\n",
    "    coords={\n",
    "        'x': sc.Variable(dims=['x'], values=np.arange(16)),\n",
    "        'y': sc.Variable(dims=['y'], values=np.arange(16)),\n",
    "    },\n",
    ")\n",
    "sc.plot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a variable defining the desired binning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bins = sc.Variable(dims=[\"z\"], values=np.linspace(0.0, 1.0, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we can now use `groupby` and `mean` to transform the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.coords['param'] = param\n",
    "grouped = sc.groupby(data, group='param', bins=bins).mean('x')\n",
    "sc.plot(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values in the white rows are `NaN`.\n",
    "This is the result of empty bins, which do not have a meaningful mean value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, grouping can be done based on groups defined as Variables rather than strings. This, however, requires bins to be specified, since bins define the new dimension label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = sc.groupby(data, group=param, bins=bins).mean(\n",
    "    'x'\n",
    ")  # note the lack of quotes around param!\n",
    "sc.plot(grouped)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}