{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Indexing and Selecting\n", "\n", "Variables, data arrays, and datasets in Scipp can be \"sliced\" in several ways.\n", "The general way is [positional indexing](#Positional-indexing) using indices as in NumPy.\n", "A second approach is to use [label-based indexing](#Label-based-indexing), which uses actual coordinate values for selection.\n", "Positional and label-based indexing return a *view* of the indexed object and can be used to modify an object in place.\n", "\n", "In addition, [advanced indexing](#Advanced-indexing), which comprises [integer array indexing](#Integer-array-indexing) and [boolean variable indexing](#Boolean-variable-indexing), can be used for more complex selections.\n", "Unlike the aforementioned basic positional and label-based indexing, indexing with integer arrays or boolean variables returns a *copy* of the indexed object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Positional indexing\n", "\n", "### Overview\n", "\n", "Data in a [variable](../generated/classes/scipp.Variable.rst#scipp.Variable), [data array](../generated/classes/scipp.DataArray.rst#scipp.DataArray), or [dataset](../generated/classes/scipp.Dataset.rst#scipp.Dataset) can be indexed in a similar manner to NumPy and xarray.\n", "The dimension to be sliced is specified using a dimension label.\n", "In contrast to NumPy, positional dimension lookup is not available, unless the object being sliced is one-dimensional.\n", "Positional indexing with an integer or an integer range is done via `__getitem__` and `__setitem__` with a dimension label as first argument.\n", "This is available for variables, data arrays, and datasets.\n", "In all cases a *view* is returned, i.e., just as when slicing a [numpy.ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray), no copy is made.\n", "\n", "### Variables\n", "\n", "Consider the following variable:" ] 
}, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import scipp as sc\n", "\n", "var = sc.array(\n", "    dims=['z', 'y', 'x'],\n", "    values=np.random.rand(2, 3, 4),\n", "    variances=np.random.rand(2, 3, 4),\n", ")\n", "sc.show(var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As when slicing a `numpy.ndarray`, the dimension `'x'` is removed since a single index rather than a range is specified:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = var['x', 1]\n", "sc.show(s)\n", "print(s.dims, s.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a range is specified, the dimension is kept, even if it has extent 1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = var['x', 1:3]\n", "sc.show(s)\n", "print(s.dims, s.shape)\n", "\n", "s = var['x', 1:2]\n", "sc.show(s)\n", "print(s.dims, s.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slicing can be chained arbitrarily:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "s = var['x', 1:4]['y', 2]['x', 1]\n", "sc.show(s)\n", "print(s.dims, s.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `copy()` method turns a view obtained from a slice into an independent object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = var['x', 1:2].copy()\n", "s += 1000\n", "var" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid subtle and hard-to-spot bugs, positional indexing without a dimension label is in general not supported:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", "    var[1]\n", "except sc.DimensionError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scipp makes an 
exception to this rule in the unambiguous case of 1-D objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var1d = sc.linspace(dim='x', start=0.1, stop=0.2, num=5)\n", "var1d[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var1d[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Positional indexing also supports an optional stride (step):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var['x', 1:4:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Negative step sizes are currently not supported." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data arrays\n", "\n", "Slicing for data arrays works in the same way, but some additional rules apply.\n", "Consider:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "a = sc.DataArray(\n", "    data=sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),\n", "    coords={\n", "        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),\n", "        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),\n", "    },\n", "    masks={'mask': sc.array(dims=['x'], values=[True, False, False])},\n", ")\n", "sc.show(a)\n", "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As when slicing a variable, the sliced dimension is removed when slicing without a range, and kept when slicing with a range.\n", "\n", "When slicing a data array the following additional rules apply:\n", "\n", "- Meta data (coords, masks) that *do not depend on the slice dimension* are marked as *readonly*.\n", "- Slicing **without a range**:\n", "  - The *coordinates* for the sliced dimension become unaligned.\n", "- Slicing **with a range**:\n", "  - The *coordinates* for the sliced dimension keep their alignment.\n", "\n", "The rationale behind this mechanism is as follows.\n", "Meta data is often of a lower 
dimensionality than data, such as in this example where coords and masks are 1-D whereas data is 2-D.\n", "Elements of meta data entries are thus shared by many data elements, and we must be careful not to apply operations to subsets of data while unintentionally modifying meta data for other unrelated data elements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a['x', 0:1].coords['x'] *= 2  # ok, modifies only coord value \"private\" to this x-slice\n", "try:\n", "    # not ok, would modify coord value \"shared\" by all x-slices\n", "    a['x', 0:1].coords['y'] *= 2\n", "except sc.VariableError as e:\n", "    print(\n", "        f'\\'y\\' is shared with other \\'x\\'-slices and should not be modified by the slice, so we get an error:\\n{e}'\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, a much more dangerous issue this mechanism protects against is unintentional changes to masks.\n", "Consider:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val = a['x', 1]['y', 1].copy()\n", "val" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we now assign this scalar `val` to a slice at `y=0` using `=`, we need to update the mask.\n", "However, the mask in this example depends only on `x`, so it also applies to the slice `y=1`.\n", "If we allowed updating the mask, the following would *unmask data for all* `y`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", "    a['y', 0] = val\n", "except sc.DimensionError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we cannot update the mask in a consistent manner, the entire operation fails.\n", "Data is not modified.\n", "The same mechanism is applied for binary arithmetic operations such as `+=` where the masks would be updated using a logical OR operation.\n", "\n", "The purpose of making coords 
unaligned when slicing *without* a range is to support useful operations such as:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "a - a['x', 1]  # compute difference compared to data at x=1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the `x` coord of `a['x', 1]` were aligned, this would fail due to a coord mismatch.\n", "If coord checking is required, use a range-slice such as `a['x', 1:2]`. Compare the two cases shown below and make sure to inspect the `dims` and `shape` of all variables (data and coordinates) of the resulting slices (note that the tooltip shown when moving the mouse over the name also contains this information):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "sc.show(a['y', 1:2])  # Range of length 1\n", "a['y', 1:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.show(a['y', 1])  # No range\n", "a['y', 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Datasets\n", "\n", "Slicing for datasets works just like for data arrays.\n", "In addition to making certain coords unaligned and marking certain meta data entries as readonly, slicing a dataset also marks lower-dimensional *data entries* as readonly.\n", "Consider a dataset `d`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "d = sc.Dataset(\n", "    data={\n", "        'a': sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),\n", "        'b': sc.array(dims=['x', 'y'], values=np.random.rand(3, 2)),\n", "    },\n", "    coords={\n", "        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),\n", "        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),\n", "    },\n", ")\n", "sc.show(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and a slice of `d`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"tags": [] }, "outputs": [], "source": [ "sc.show(d['y', 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slicing a data item of a dataset should not bring any surprises.\n", "Essentially this behaves like slicing a data array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "sc.show(d['a']['x', 1:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slicing and item access can be done in arbitrary order with identical results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "assert sc.identical(d['x', 1:2]['a'], d['a']['x', 1:2])\n", "assert sc.identical(d['x', 1:2]['a'].coords['x'], d.coords['x']['x', 1:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Label-based indexing\n", "\n", "### Overview\n", "\n", "Data in a [dataset](../generated/classes/scipp.Dataset.rst#scipp.Dataset) or [data array](../generated/classes/scipp.DataArray.rst#scipp.DataArray) can be selected by the coordinate value.\n", "This is similar to pandas [pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).\n", "Scipp leverages its ubiquitous support for physical units to provide label-based indexing in an intuitive manner, using the same syntax as [positional indexing](#Positional-indexing).\n", "For example:\n", "\n", "- `array['x', 0:3]` selects positionally, i.e., returns the first three element along `'x'`.\n", "- `array['x', 1.2*sc.Unit('m'):1.3*sc.Unit('m')]` selects by label, i.e., returns the elements along `'x'` falling between `1.2 m` and `1.3 m`.\n", "\n", "That is, label-based indexing is made via `__getitem__` and `__setitem__` with a dimension label as first argument and a scalar [variable](../generated/classes/scipp.Variable.rst#scipp.Variable) or a Python `slice()` as created by the colon operator `:` from two scalar variables.\n", "In all cases a *view* 
is returned, i.e., just as when slicing a [numpy.ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray), no copy is made.\n", "\n", "Consider:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da = sc.DataArray(\n", "    data=sc.array(dims=['year', 'x'], values=np.random.random((3, 7))),\n", "    coords={\n", "        'x': sc.array(dims=['x'], values=np.linspace(0.1, 0.9, num=7), unit='m'),\n", "        'year': sc.array(dims=['year'], values=[2020, 2023, 2027]),\n", "    },\n", ")\n", "sc.show(da)\n", "da" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select a slice of `da` based on the `'year'` labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year = sc.scalar(2023)\n", "da['year', year]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case `2023` is the second element of the coordinate, so this is equivalent to positionally slicing `da['year', 1]`, and [the usual rules](#Positional-indexing) regarding dropping dimensions and making dimension coordinates unaligned apply:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert sc.identical(da['year', year], da['year', 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "