{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datetime Aggregates in Pandas Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explains how to aggregate time series data in `pandas`.\n", "\n", "This notebook will use gold and silver price data from `rdatasets` for this tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documentation for each package used in this tutorial is linked below:\n", "* [pandas](https://pandas.pydata.org/docs/)\n", "* [statsmodels](https://www.statsmodels.org/stable/index.html)\n", " * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create initial dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is from `rdatasets` imported using the Python package `statsmodels`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 9132 entries, 0 to 9131\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 9132 non-null object \n", " 1 gold 9132 non-null float64\n", " 2 silver 9132 non-null float64\n", "dtypes: float64(2), object(1)\n", "memory usage: 214.2+ KB\n" ] } ], "source": [ "df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df['date'] = pd.to_datetime(df.date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time series aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` function `resample` can be used to create aggregations on specified windows. Here, a weekly aggregate of the daily gold and silver price data will be created. \n", "\n", "First, a datetime index needs to be created from the **date** column." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df.set_index('date', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter **origin** is used to specify the day of week to start the aggregation on. January 1, 1978 is a Sunday, so each of these aggreagtions will start on Sunday and aggregate over the subsequent week." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "weekly_resample = df.resample('7D', origin=pd.to_datetime('1978-01-01'))\n", "aggregated_df = weekly_resample.agg(['min', 'mean', 'max', 'std'])\n", "aggregated_df.columns = ['_'.join(col).strip() + '_week' for col in aggregated_df.columns.values]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gold_min_weekgold_mean_weekgold_max_weekgold_std_weeksilver_min_weeksilver_mean_weeksilver_max_weeksilver_std_week
date
1977-12-25100.00100.000100.00NaN223.42223.420223.42NaN
1978-01-01100.00100.000100.000.000000223.42226.604229.842.596657
1978-01-08100.88101.586102.620.796825227.89230.050232.551.846334
1978-01-15100.87101.832103.090.796976227.96229.686232.491.755130
1978-01-22102.49103.306103.620.475742230.37232.798234.501.777630
1978-01-29101.96102.578103.490.636530226.91229.116231.611.936835
1978-02-05101.90102.572103.150.505193224.98227.506229.901.875361
1978-02-12103.92104.936105.870.850312228.16231.506234.012.412370
1978-02-19105.14105.956107.230.822028229.81232.392234.551.784550
1978-02-26106.20107.076107.600.538451229.83231.092232.221.043681
1978-03-05108.89109.916111.030.791789235.33244.862248.985.619695
1978-03-12106.05108.042110.391.638740248.09251.940254.952.875239
1978-03-19102.65104.458105.531.355865240.71243.672246.052.578317
1978-03-26104.66106.180107.291.119710246.81249.628253.552.764710
1978-04-02103.33103.968105.270.771894241.89243.780246.381.809627
1978-04-09102.99104.120105.070.806908241.23243.816247.272.410940
1978-04-1699.1199.996100.930.852397230.53232.784235.152.045527
1978-04-2397.4598.23499.410.856756225.99228.588230.491.848194
1978-04-3098.3899.778100.921.143119228.87232.358234.712.439912
1978-05-07100.11100.874101.510.644306232.08233.124234.550.905003
\n", "
" ], "text/plain": [ " gold_min_week gold_mean_week gold_max_week gold_std_week \\\n", "date \n", "1977-12-25 100.00 100.000 100.00 NaN \n", "1978-01-01 100.00 100.000 100.00 0.000000 \n", "1978-01-08 100.88 101.586 102.62 0.796825 \n", "1978-01-15 100.87 101.832 103.09 0.796976 \n", "1978-01-22 102.49 103.306 103.62 0.475742 \n", "1978-01-29 101.96 102.578 103.49 0.636530 \n", "1978-02-05 101.90 102.572 103.15 0.505193 \n", "1978-02-12 103.92 104.936 105.87 0.850312 \n", "1978-02-19 105.14 105.956 107.23 0.822028 \n", "1978-02-26 106.20 107.076 107.60 0.538451 \n", "1978-03-05 108.89 109.916 111.03 0.791789 \n", "1978-03-12 106.05 108.042 110.39 1.638740 \n", "1978-03-19 102.65 104.458 105.53 1.355865 \n", "1978-03-26 104.66 106.180 107.29 1.119710 \n", "1978-04-02 103.33 103.968 105.27 0.771894 \n", "1978-04-09 102.99 104.120 105.07 0.806908 \n", "1978-04-16 99.11 99.996 100.93 0.852397 \n", "1978-04-23 97.45 98.234 99.41 0.856756 \n", "1978-04-30 98.38 99.778 100.92 1.143119 \n", "1978-05-07 100.11 100.874 101.51 0.644306 \n", "\n", " silver_min_week silver_mean_week silver_max_week \\\n", "date \n", "1977-12-25 223.42 223.420 223.42 \n", "1978-01-01 223.42 226.604 229.84 \n", "1978-01-08 227.89 230.050 232.55 \n", "1978-01-15 227.96 229.686 232.49 \n", "1978-01-22 230.37 232.798 234.50 \n", "1978-01-29 226.91 229.116 231.61 \n", "1978-02-05 224.98 227.506 229.90 \n", "1978-02-12 228.16 231.506 234.01 \n", "1978-02-19 229.81 232.392 234.55 \n", "1978-02-26 229.83 231.092 232.22 \n", "1978-03-05 235.33 244.862 248.98 \n", "1978-03-12 248.09 251.940 254.95 \n", "1978-03-19 240.71 243.672 246.05 \n", "1978-03-26 246.81 249.628 253.55 \n", "1978-04-02 241.89 243.780 246.38 \n", "1978-04-09 241.23 243.816 247.27 \n", "1978-04-16 230.53 232.784 235.15 \n", "1978-04-23 225.99 228.588 230.49 \n", "1978-04-30 228.87 232.358 234.71 \n", "1978-05-07 232.08 233.124 234.55 \n", "\n", " silver_std_week \n", "date \n", "1977-12-25 NaN \n", "1978-01-01 2.596657 \n", "1978-01-08 1.846334 \n", "1978-01-15 1.755130 \n", "1978-01-22 1.777630 \n", "1978-01-29 1.936835 \n", "1978-02-05 1.875361 \n", "1978-02-12 2.412370 \n", "1978-02-19 1.784550 \n", "1978-02-26 1.043681 \n", "1978-03-05 5.619695 \n", "1978-03-12 2.875239 \n", "1978-03-19 2.578317 \n", "1978-03-26 2.764710 \n", "1978-04-02 1.809627 \n", "1978-04-09 2.410940 \n", "1978-04-16 2.045527 \n", "1978-04-23 1.848194 \n", "1978-04-30 2.439912 \n", "1978-05-07 0.905003 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aggregated_df.head(20)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }