{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Pandas and Plotting\n", "\n", "> Abstract: This is a short tutorial to motivate usage of Pandas and demonstrates some basic functionality for working with time series. Previous knowledge of [Numpy](https://docs.scipy.org/doc/numpy/user/quickstart.html) and [Matplotlib](https://matplotlib.org/tutorials/index.html) is assumed. See also: https://pandas.pydata.org/docs/getting_started/10min.html We show various plots, including basic time series statistics and error intervals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -i -v -p pandas,matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Organising data\n", "\n", "Suppose we run an experiment with two variables, x and y. We could store these data as two numpy arrays assigned to variables `x` and `y`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate a sinusoidal signal with random noise\n", "x = np.linspace(-2*np.pi, 2*np.pi, 744)\n", "y = np.sin(x) + 0.5*np.random.rand(len(x))\n", "print(x[:3], \"...\")\n", "print(y[:3], \"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To manipulate our data (e.g plotting), we must pass both of these variables around" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_data_1(x, y):\n", " fig, ax = plt.subplots()\n", " ax.plot(x, y)\n", " ax.set_xlabel(\"x\")\n", " ax.set_ylabel(\"y\")\n", " return fig, ax\n", "\n", "plot_data_1(x,y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is manageable as there are only two variables, but if we were to add more variables to our collection (e.g. other concurrent measurements), or if we wanted to store extra information about each variable, this would rapidly become annoying to handle.\n", "\n", "To organise the variables better, we could assign them to a dictionary, `d`. Now we only have to pass one object to our plotting function. We also use the key names in the dictionary to describe our data (in this case, just x and y)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = {\"x\": x, \"y\": y}\n", "print(\"x:\", d[\"x\"][:3], \"...\")\n", "print(\"y:\", d[\"y\"][:3], \"...\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_data_2(d, xvar=\"x\", yvar=\"y\"):\n", " fig, ax = plt.subplots()\n", " ax.plot(d[xvar], d[yvar])\n", " ax.set_xlabel(xvar)\n", " ax.set_ylabel(yvar)\n", " return fig, ax\n", "\n", "plot_data_2(d);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could go on like this, adding new measurements to this dictionary, and defining different datasets (including their various measurements) as different dictionaries. But there is a better way.\n", "\n", "## Working with dataframes\n", "\n", "The `pandas.DataFrame` organises tabular data and provides convenient tools for computation and visualisation. Dataframes act much like a spreadsheet (or a SQL database) and are inspired partly by the R programming language. They consist of *columns* (here, we named them `x` and `y`), and *rows*. 
Each column should contain the same number of elements, and each row refers to some related measurements. Dataframes also have an `index` which identifies the rows - in our case they have just been labelled as integers that match the indexes in the original input arrays.\n", "\n", "Pandas comes with [many I/O tools](https://pandas.pydata.org/docs/reference/io.html) to load dataframes. We can also create one from a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame.from_dict(d)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is an underlying Numpy array that can be accessed through the `.values` property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.values[:5, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can still extract the separate arrays (x and y), just as when interacting with a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df[\"x\"].values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can append new data to the dataframe just like appending a variable to a dictionary. The new data should be the same length as the dataframe, but if a constant is supplied then that is used for every row:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"y_error\"] = 0.3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the real advantages of using Pandas come when we employ a more useful index. If our measurements are taken at different times, we can set the index as a time-aware object. This uses the [Pandas DatetimeIndex](https://pandas.pydata.org/docs/reference/indexing.html#datetimeindex), which is related to the [datetime standard library](https://docs.python.org/3/library/datetime.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate some sample times at hourly intervals over a month\n", "df[\"time\"] = pd.date_range(\"2020-01-01\", \"2020-02-01\", periods=745, inclusive=\"left\")\n", "df = df.set_index(\"time\")\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is reasonable to use dataframes to contain your data, to read and write files easily, and to apply Numpy-based transformations and other computation, together with plotting routines using Matplotlib. However, there are also [many ways to use Pandas for manipulating data](https://pandas.pydata.org/docs/user_guide/cookbook.html) which are not covered here.\n", "\n", "Below we show a few basic ways to plot time series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting with dataframes\n", "\n", "We can use the `.plot()` method to access the Pandas plotting API which itself creates Matplotlib objects. This mechanism is rather complex but enables [many convenient shortcuts](https://pandas.pydata.org/docs/reference/frame.html#plotting) to creating complex figures." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes some time to get familiar with this API but after that it becomes very useful for rapid feedback and iteration while playing with data, particularly in combination with Jupyter notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot(y=\"y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try some things to better visualise this time series.\n", "\n", "We can use the [resampling system](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-resampling) to change the data from hourly samples to daily samples based on the mean of the measurements taken each day. We can directly feed the derived dataframe into a plotting command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample(\"1d\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample(\"1d\").mean().plot(y=\"y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A related method is [rolling calculations](https://pandas.pydata.org/docs/user_guide/computation.html#stats-moments-ts-versus-resampling):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.rolling(24).mean().plot(y=\"y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more, see e.g. https://ourcodingclub.github.io/2019/01/07/pandas-time-series.html\n", "\n", "Rather than just creating the plot straight away as above, we can instead [instantiate a Matplotlib axes](https://matplotlib.org/tutorials/introductory/usage.html#coding-styles), and then direct Pandas to plot onto it. This enables some more flexible configuration, like plotting aspects from two dataframes onto one axes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "df.plot(y=\"y\", ax=ax)\n", "df.rolling(24, center=True).mean().plot(y=\"y\", ax=ax, label=\"y-smoothed\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example of the more configurable plotting options through `.plot()`, let's show the error bars on measurements. To make them visible, we first subselect down to only every 24th measurement [using .iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[::24].plot(x=\"x\", y=\"y\", kind=\"scatter\", yerr=\"y_error\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beware! If we had resampled the dataframe like in the previous steps, the resampling logic would also have applied to the `y_error` column. You would instead need to supply the correct error propagation mechanism yourself.\n", "\n", "Note that we had to resort to plotting against the \"x\" data instead of the index (time) because of a limitation in what the \"scatter\" option can do. Plotting against the index is not supported - this could be worked around by creating an extra column to use: `df[\"time\"] = df.index`. 
Another option is to create the plot ourselves [using the Matplotlib API](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.errorbar.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_df = df.iloc[::24]\n", "fig, ax = plt.subplots(figsize=(12, 3))\n", "ax.errorbar(_df.index, _df[\"y\"], yerr=_df[\"y_error\"], marker=\"o\")\n", "ax.set_ylabel(\"y\")\n", "ax.set_xlabel(\"time\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we subsampled the data in order to create a clear visualisation with vertical error bars. To show the error intervals on the original data it is better to shade the area with [fill_between](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.fill_between.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12, 3))\n", "x = df.index\n", "y = df[\"y\"]\n", "y1 = y - df[\"y_error\"]\n", "y2 = y + df[\"y_error\"]\n", "ax.plot(x, y)\n", "ax.fill_between(x, y1, y2, color=\"grey\")\n", "ax.set_ylabel(\"y\")\n", "ax.set_xlabel(\"time\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than using the provided error values, we might choose to plot the spread in the data itself. In the example below we create a new `df_daily` dataframe that contains the daily means and standard deviations of measurements. We use these to plot the mean and spread in the measurements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_daily_means(df, ax):\n", "    df_daily = (\n", "        df.resample(\"1d\").mean()\n", "        .drop(columns=[\"x\", \"y_error\"])\n", "        .rename(columns={\"y\": \"y_mean\"}))\n", "    df_daily[\"y_std\"] = df[\"y\"].resample(\"1d\").std()\n", "\n", "    ax.plot(df_daily.index, df_daily[\"y_mean\"])\n", "    ax.fill_between(\n", "        df_daily.index,\n", "        df_daily[\"y_mean\"] - df_daily[\"y_std\"],\n", "        df_daily[\"y_mean\"] + df_daily[\"y_std\"],\n", "        color=\"lightgrey\")\n", "    ax.set_ylabel(\"y\")\n", "\n", "fig, ax = plt.subplots(figsize=(12, 3))\n", "plot_daily_means(df, ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we create a figure like this, it is useful to define the plotting routine as a function that applies to a [Matplotlib Axes object](https://matplotlib.org/tutorials/introductory/usage.html#axes). This way we can control the figure setup (geometry, other plots etc.) separately from the detailed plotting commands. The figure can be manipulated more cleanly in this way. Other subplots can be easily combined into one figure, or other configurations applied: for example, we might later add on grid lines without modifying the original plotting code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax.grid(True)\n", "fig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customising figures\n", "\n", "We can continue modifying the figure above since we still have access to the `fig` and `ax` objects which contain the figure. 
Roughly speaking, the `fig` object (figure) can be considered as the page on which we are drawing, and the `ax` object (axes) refers to the particular subplot (though in this case, there is only one plot).\n", "\n", "Let's update the display of the plot and redraw it:\n", "- We can get matplotlib to directly render $\\LaTeX$ elements for us by using the `$...$` pattern\n", "- `set_xlabel`, `set_ylabel`, `set_title` let us update the labels in place\n", "- `set_ylim` controls the lower and upper limits on the y axis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax.set_xlabel(\"Time\")\n", "ax.set_ylabel(r\"$Mystery^2 + \\sqrt{Intrigue}$\")\n", "ax.set_title(\"Analysis #42\")\n", "ax.set_ylim((-1.5, 1.5))\n", "fig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To xarray\n", "\n", "There are some limitations with `pandas.DataFrame` that make it less suitable for the physical sciences. [**Xarray**](http://xarray.pydata.org/en/stable/why-xarray.html) fills some of these gaps and is mostly compatible with Pandas, providing a similar API. To learn more, please refer to [a full tutorial](https://xarray-contrib.github.io/xarray-tutorial/). Below is a little motivation as to why you might want to invest the time.\n", "\n", "We can transform a `pandas.DataFrame` into an `xarray.Dataset` with `.to_xarray()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = df.to_xarray()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar quick plotting can be performed, but the mechanism is different due to the greater complexity of the data structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds[\"y\"].plot.line()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The primary advantage of xarray is that it extends Pandas-like functionality to n-dimensional data. In Pandas, each column is limited to containing a 1-dimensional array (though this can be worked around by using a MultiIndex). In xarray, each \"data variable\" (itself an `xarray.DataArray`) can hold an n-dimensional array, with each dimension carrying a dimension name. To provide label-based access, dimensions can have associated coordinates. In our example, the data variables (`x`, `y`, `y_error`) have the `time` dimension which has datetime-based coordinates.\n", "\n", "We might add more complex data, `v`, which has a spatial component as well. We need to provide dimension names in order to do this (see also: [xarray.Dataset.assign](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.assign.html))." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v = np.random.rand(len(ds[\"time\"]), 3)\n", "ds[\"v\"] = ((\"time\", \"space\"), v)\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another advantage of xarray is support for metadata. For example, we can add units and a description by changing the `.attrs` (attributes) property of the `DataArray`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds[\"v\"].attrs = {\"units\": \"m/s\", \"description\": \"A velocity vector\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting commands can automatically handle the multi-dimensional aspect, as well as adding the provided units to the axis labels." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds[\"v\"].plot.line(x=\"time\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To-do: tutorial on indexing and other aspects - see http://xarray.pydata.org/en/stable/indexing.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 4 }