{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mean Absolute Difference in Pandas Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explains how to calculate the mean absolute difference of time series data in `pandas`.\n", "\n", "This notebook will use gold and silver price data from `rdatasets` for this tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documentation for each package used in this tutorial is linked below:\n", "* [pandas](https://pandas.pydata.org/docs/)\n", "* [statsmodels](https://www.statsmodels.org/stable/index.html)\n", " * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)\n", "* [numpy](https://numpy.org/doc/stable/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create initial dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is from `rdatasets` imported using the Python package `statsmodels`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 9132 entries, 0 to 9131\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 9132 non-null object \n", " 1 gold 9132 non-null float64\n", " 2 silver 9132 non-null float64\n", "dtypes: float64(2), object(1)\n", "memory usage: 214.2+ KB\n" ] } ], "source": [ "df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df['date'] = pd.to_datetime(df.date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time series aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` function `rolling` can be used to create aggregations on windows of specific lengths. Here, an aggregate of the daily gold and silver price data will be created covering the primary week. \n", "\n", "First, a datetime index needs to be created from the **date** column." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df.set_index('date', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas aggregation will require a function that can be called. The mean difference will be called using three functions from `numpy`: `mean`, `abs` and `diff`. `diff` calculates the difference between the current value and a prior value (by default the immediate prior value)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def mean_absolute_difference(series):\n", " return np.mean(np.abs(np.diff(series)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If, instead of an offset (**'7D'** representing 7 days), a number is used, it will just use the prior number of observations." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "weekly_resample = df.rolling('7D')\n", "aggregated_df = weekly_resample.agg(['mean', mean_absolute_difference])\n", "aggregated_df.columns = ['_'.join(col).strip() + '_week' for col in aggregated_df.columns.values]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gold_mean_weekgold_mean_absolute_difference_weeksilver_mean_weeksilver_mean_absolute_difference_week
date
1977-12-30100.000NaN223.420NaN
1978-01-02100.0000.0000223.4200.000000
1978-01-03100.0000.0000225.5603.210000
1978-01-04100.0000.0000225.3153.893333
1978-01-05100.0000.0000225.8503.772500
1978-01-06100.0000.0000226.6043.972500
1978-01-09100.2460.3075227.8442.975000
1978-01-10100.4360.3775227.6701.822500
1978-01-11100.8860.7025228.9981.532500
1978-01-12101.0621.0450228.9782.165000
1978-01-13101.5861.1725230.0502.722500
1978-01-16101.9581.2200230.6242.575000
1978-01-17102.1241.2225230.8662.590000
1978-01-18101.8481.1075230.2142.312500
1978-01-19102.0120.8800230.4121.377500
1978-01-20101.8320.7675229.6861.372500
1978-01-23101.9380.9150229.9882.065000
1978-01-24102.3000.6950230.6781.602500
1978-01-25102.8360.4975231.9861.590000
1978-01-26103.1520.5600232.5082.332500
\n", "
" ], "text/plain": [ " gold_mean_week gold_mean_absolute_difference_week \\\n", "date \n", "1977-12-30 100.000 NaN \n", "1978-01-02 100.000 0.0000 \n", "1978-01-03 100.000 0.0000 \n", "1978-01-04 100.000 0.0000 \n", "1978-01-05 100.000 0.0000 \n", "1978-01-06 100.000 0.0000 \n", "1978-01-09 100.246 0.3075 \n", "1978-01-10 100.436 0.3775 \n", "1978-01-11 100.886 0.7025 \n", "1978-01-12 101.062 1.0450 \n", "1978-01-13 101.586 1.1725 \n", "1978-01-16 101.958 1.2200 \n", "1978-01-17 102.124 1.2225 \n", "1978-01-18 101.848 1.1075 \n", "1978-01-19 102.012 0.8800 \n", "1978-01-20 101.832 0.7675 \n", "1978-01-23 101.938 0.9150 \n", "1978-01-24 102.300 0.6950 \n", "1978-01-25 102.836 0.4975 \n", "1978-01-26 103.152 0.5600 \n", "\n", " silver_mean_week silver_mean_absolute_difference_week \n", "date \n", "1977-12-30 223.420 NaN \n", "1978-01-02 223.420 0.000000 \n", "1978-01-03 225.560 3.210000 \n", "1978-01-04 225.315 3.893333 \n", "1978-01-05 225.850 3.772500 \n", "1978-01-06 226.604 3.972500 \n", "1978-01-09 227.844 2.975000 \n", "1978-01-10 227.670 1.822500 \n", "1978-01-11 228.998 1.532500 \n", "1978-01-12 228.978 2.165000 \n", "1978-01-13 230.050 2.722500 \n", "1978-01-16 230.624 2.575000 \n", "1978-01-17 230.866 2.590000 \n", "1978-01-18 230.214 2.312500 \n", "1978-01-19 230.412 1.377500 \n", "1978-01-20 229.686 1.372500 \n", "1978-01-23 229.988 2.065000 \n", "1978-01-24 230.678 1.602500 \n", "1978-01-25 231.986 1.590000 \n", "1978-01-26 232.508 2.332500 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aggregated_df.head(20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }