{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities = {'name': [\"Copenhagen\", \"London\"],\n", " 'population': [1.5, 11.2],\n", " 'dist_to_coast': [0.0, 2.3]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(cities)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.name=='London']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.population.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get row by number" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get row by name (named index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.set_index('name')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"London\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can transpose the dataframe (rows -> columns)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"London\"].population" ] }, { 
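"cell_type": "markdown", "metadata": {}, "source": [ "As a small illustrative sketch (an aside, not part of the original material): boolean masks can also combine several conditions with `&` (and) and `|` (or); each condition must be wrapped in parentheses. Here we reuse the `df` with `name` as its index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: cities with a large population that are close to the coast\n", "big_coastal = df[(df.population > 5) & (df.dist_to_coast < 5)]\n", "big_coastal" ] }, { 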
"cell_type": "markdown", "metadata": {}, "source": [ "## Delimited files\n", "\n", "Delimited files, in which values are separated by commas, semicolons, tabs, spaces, or any other special character, are a very common data format for tabular data. Comma-separated values (CSV) files can be read by the [pandas `read_csv` function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). It is a very powerful function with a lot of options, so it is very rare that you have to write your own Python function to parse CSV files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example of a CSV file:\n", "\n", "![](images/csv_example.png)\n", "\n", "\n", "* Header with comments\n", "* Columns are separated by semicolons (;)\n", "* The decimal separator is a comma (,)\n", "* Date and time are in separate columns\n", "* There is a redundant station column\n", "* The column names are not in English" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/boogus.csv\", \n", " comment=\"%\",\n", " sep=\";\",\n", " decimal=\",\",\n", " parse_dates=[[\"dato\",\"tid\"]],\n", " usecols=[1,2,3])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most pandas functions return a copy, so even though the line below looks like it renames the column (the renamed result is printed to the screen), the `df` variable itself is not changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.rename(columns={\"temperatur\": \"air_temperature\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df # not modified" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_new_name = df.rename(columns={\"temperatur\": \"air_temperature\"})\n", "df_new_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Inline exercise**\n", "\n", "1. 
Calculate the min and max temperature by selecting the `air_temperature` column from the `df_new_name` dataframe like this: `df_new_name.air_temperature`, and calling the `.min()` and `.max()` methods.\n", "2. Read the CSV file again, but this time include all columns.\n", "3. Read the CSV file once more, but don't use the `comment` argument; use another argument to skip the first four rows instead. Help: `pd.read_csv?`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timeseries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/oceandata.csv\", parse_dates=True, index_col='datetime')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df.waterlevel)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.waterlevel.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc['2015-02-18'].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc['2015-02-18'].interpolate().plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc['2015-02-18 14:00':'2015-02-18 15:20']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_interp = df.interpolate()" ] }, { 
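"cell_type": "markdown", "metadata": {}, "source": [ "An aside (a sketch, not part of the original notebook): `interpolate` defaults to `method='linear'`, which treats rows as equally spaced. With a datetime index, `method='time'` weights the interpolation by the actual time gaps, which matters when samples are irregularly spaced." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: time-weighted interpolation on the same df\n", "df.interpolate(method='time').loc['2015-02-18 14:00':'2015-02-18 15:20']" ] }, { 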
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_interp.loc['2015-02-18 14:00':'2015-02-18 15:20']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling\n", "\n", "Aggregate temporal data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample('H')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resampling requires an aggregation function, e.g. `sum`, `mean`, `median`, ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample('D').sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sum` function doesn't make sense in this example. It is better to use `mean`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample('H').mean().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample('H').first().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.resample('H').median().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_h = df.resample('H').interpolate().dropna()\n", "df_h.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: `resample` will use either the left or the right end-point of each interval, depending on the resampling frequency (e.g. for hours the beginning of the hour, but for months the end of the month). If you want to make sure you are resampling correctly, specify the `closed` argument." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inline exercise\n", "\n", "Please find the maximum value for every 6-hour period."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# insert your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extrapolation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# date_range already returns a DatetimeIndex\n", "rng = pd.date_range(\"2015-02-17\",\"2015-02-20\",freq='H')\n", "dfr = df_interp.reindex(rng)\n", "dfr.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfr.ffill().plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_extra = dfr.bfill().ffill()\n", "df_extra.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_extra" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import YouTubeVideo\n", "\n", "YouTubeVideo(\"8upGdZMlkYM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more tips and tricks on how to use Pandas for timeseries data, see this talk:\n", "[Ian Ozsvald: A gentle introduction to Pandas timeseries and Seaborn | PyData London 2019](https://www.youtube.com/watch?v=8upGdZMlkYM)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "7aec4f91c09090e98e6ae8203c38529831bb4a3ce54cd1b69639b53cb01a6aa9" } } }, "nbformat": 4, "nbformat_minor": 4 }