{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GroupBy\n", "\n", "\"Group by\" refers to an implementation of the \"split-apply-combine\" approach known from [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) and [xarray](http://xarray.pydata.org/en/stable/groupby.html).\n", "Scipp currently supports only a limited number of operations that can be applied to the groups.\n", "\n", "## Grouping based on label values\n", "\n", "Suppose we have measured data for a number of parameter values, potentially repeating measurements with the same parameter multiple times:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import scipp as sc\n", "\n", "np.random.seed(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = sc.Variable(dims=['x'], values=[1, 3, 1, 1, 5, 3])\n", "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(6, 16))\n", "values += 1.0 + param" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we store this data as a data array, we obtain the following plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = sc.DataArray(\n", "    values,\n", "    coords={\n", "        'x': sc.Variable(dims=['x'], values=np.arange(6)),\n", "        'y': sc.Variable(dims=['y'], values=np.arange(16)),\n", "    },\n", ")\n", "sc.plot(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we chose the \"measured\" values such that the three distinct values of the underlying parameter are visible.\n", "We can now use the split-apply-combine mechanism to transform our data into a more useful representation.\n", "We start by storing the parameter values (or any value to be used for grouping) as a non-dimension coordinate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"data.coords['param'] = param" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we call `scipp.groupby` to split the data and call `mean` on each of the groups:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = sc.groupby(data, group='param').mean('x')\n", "sc.plot(grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from `mean`, `groupby` also supports `sum`, `concat`, and more. See [GroupByDataArray](../generated/classes/scipp.GroupByDataArray.rst) and [GroupByDataset](../generated/classes/scipp.GroupByDataset.rst) for a full list.\n", "\n", "## Grouping based on binned label values\n", "\n", "Grouping based on non-dimension coordinate values (also known as labels) is most useful when labels are strings or integers.\n", "If labels are floating-point values or cover a wide range, it is more convenient to group values into bins, i.e., all values within certain bounds are mapped into the same group.\n", "We modify the above example to use a continuously-valued parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = sc.Variable(dims=['x'], values=np.random.rand(16))\n", "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(16, 16))\n", "values += 1.0 + 5.0 * param" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = sc.DataArray(\n", "    values,\n", "    coords={\n", "        'x': sc.Variable(dims=['x'], values=np.arange(16)),\n", "        'y': sc.Variable(dims=['y'], values=np.arange(16)),\n", "    },\n", ")\n", "sc.plot(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a variable defining the desired binning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = sc.Variable(dims=[\"z\"], values=np.linspace(0.0, 1.0, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before,
we can now use `groupby` and `mean` to transform the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.coords['param'] = param\n", "grouped = sc.groupby(data, group='param', bins=bins).mean('x')\n", "sc.plot(grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values in the white rows are `NaN`.\n", "This is the result of empty bins, which do not have a meaningful mean value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the group can be given as a variable rather than as the name of a coordinate. This, however, requires bins to be specified, since the bins define the new dimension label." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note the lack of quotes around param: the variable itself is passed, not a coordinate name.\n", "grouped = sc.groupby(data, group=param, bins=bins).mean('x')\n", "sc.plot(grouped)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 4 }