{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 11: Time Series\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/11-time-series.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/11-time-series.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues from Part 10 (`10-combining-reshaping.ipynb`), which covered concatenation, merging, groupby, and pivoting. The student exam results dataset has no real dates in it, only an `establishment_year`, so this part introduces a second dataset built for the job: daily attendance records for 5 schools over a school term: the shape most time-indexed data takes in practice, one row per day per entity.\n", "\n", "Part 12 (`11-polars.ipynb`) continues with Polars, including its own (faster) take on time-indexed data.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **`Timestamp` and `to_datetime`** | pandas' building block for a single point in time, and how to parse one out of text |\n", "| **`DatetimeIndex`** | An index made of dates unlocks date-based slicing, not just position-based |\n", "| **Selecting by date** | Partial strings and date ranges, instead of exact label or integer position |\n", "| **`resample`** | Change the time granularity of a series: daily to weekly, weekly to monthly |\n", "| **Timezones** | Localize naive timestamps, convert between zones, and store everything in UTC |\n", "| **Lag, lead, and autocorrelation** | Create feature-engineered columns from past values; measure how much a series predicts itself |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 11 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Create and inspect `Timestamp` values, and parse dates out of text with `to_datetime` | Sec. 1 |\n", "| 2 | Build a `DatetimeIndex` and use it to slice a time series by date | Sec. 2 |\n", "| 3 | Select rows with a partial date string or a date range | Sec. 3 |\n", "| 4 | Change the time granularity of a series with `resample` | Sec. 4 |\n", "| 5 | Localize naive timestamps, convert between timezones, and store in UTC | Sec. 5 |\n", "| 6 | Create lag and lead features with `shift`, measure autocorrelation | Sec. 6 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "attendance = pd.read_csv(\"data/daily_attendance.csv\")\n", "attendance.dtypes" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Date and Time Data Types\n", "\n", "The `date` column above read in as plain text, `str` dtype, not a date pandas can do arithmetic on. `pd.to_datetime` converts it to pandas' dedicated datetime dtype, `datetime64`:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "attendance[\"date\"] = pd.to_datetime(attendance[\"date\"])\n", "attendance.dtypes" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "
datetime64[ns], nanosecond precision, whether the data needed it or not. Pandas 3's to_datetime infers a resolution from what is actually in the data: day-level strings like the ones here become datetime64[s] or coarser, not nanoseconds. attendance[\"date\"].dtype shows whichever resolution was inferred for this column.\n",
"[\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"], to Timestamp values with pd.to_datetime, then print the day name of each one.\n",
"dates = pd.to_datetime([\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"])\n",
"for d in dates:\n",
" print(d.day_name())\n",
"attendance.set_index(\"date\") on the full table produces a DatetimeIndex with the same date repeated once per school, since every school has a row for every day. Slicing that index by date then returns rows from every school mixed together for that date, not a clean single time series. Filter to one entity first, exactly as school_300 does above, then set the index.\n",
"attendance to school_id == 302, set date as the index, and confirm the result's index is a DatetimeIndex with isinstance(result.index, pd.DatetimeIndex).\n",
"school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
"isinstance(school_302.index, pd.DatetimeIndex)\n",
"len(school_300.loc[\"2025-01\"]) against len(school_300.loc[\"2025-02\"]) confirms the row count for each month matches its number of business days, the same `bdate_range` weekday-only pattern used to build this dataset in the first place.\n",
"school_300 from \"2025-03-15\" to \"2025-03-28\" inclusive, and print the mean attendance_rate over that range.\n",
"last_two_weeks = school_300.loc[\"2025-03-15\":\"2025-03-28\"]\n",
"last_two_weeks[\"attendance_rate\"].mean()\n",
"resample groups by time interval, groupby groups by valuedf.groupby(\"caste\") (Part 3) splits rows by whatever value is already in the caste column. series.resample(\"W\") splits rows by which week their DatetimeIndex label falls into, intervals that did not exist as a column at all until resample created them. Both still end with an aggregation like .mean() to combine each group into one number.\n",
"attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean() resamples each school's series independently in one call, instead of looping over schools and resampling each one by hand. groupby before resample is what keeps the schools from being averaged together.\n",
"date as the index on the full attendance table, group by school_id, and resample to monthly means in one chained call. Use \"ME\" (month-end): the pandas 3 replacement for the removed \"M\" alias.\n",
"attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"ME\").mean()\n", "
tz_localize stamps the existing values with a timezone label: 2025-01-06 00:00 becomes 2025-01-06 00:00+03:00. The number did not change. tz_convert shifts the values to represent the same instant elsewhere: 2025-01-06 00:00+03:00 becomes 2025-01-05 21:00+00:00. If you call tz_localize(\"UTC\") on data that is actually in EAT, you have mislabelled it. The timestamps will compare as if they are 3 hours earlier than they really are.\n",
"school_302 (built in Activity 2), localize its naive index to \"Africa/Nairobi\", then convert to \"Europe/London\". Print the first timestamp in both representations and confirm they are the same instant.\n",
"school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
"s302_eat = school_302.copy()\n",
"s302_eat.index = s302_eat.index.tz_localize(\"Africa/Nairobi\")\n",
"s302_london = s302_eat.copy()\n",
"s302_london.index = s302_london.index.tz_convert(\"Europe/London\")\n",
"print(s302_eat.index[0])\n",
"print(s302_london.index[0])\n",
"rate_lag1 column is how you communicate \"yesterday's value\" in a language the model understands. A model trained on [rate_lag1, rate_lag5, day_of_week] predicting attendance_rate is a supervised regression problem built from a single time series. The NaN rows produced by shift must be dropped or filled before training.\n",
"school_302, create a feature matrix with columns: the original attendance_rate, a lag-1 column, a lag-5 column, and a 3-day rolling mean. Drop NaN rows, then print the autocorrelation at lag 1 and lag 5. What do the values tell you about how predictable attendance is at school 302?\n",
"rate_302 = attendance[attendance[\"school_id\"]==302].set_index(\"date\")[\"attendance_rate\"]\n",
"fm_302 = pd.concat([\n",
" rate_302,\n",
" rate_302.shift(1).rename(\"lag1\"),\n",
" rate_302.shift(5).rename(\"lag5\"),\n",
" rate_302.rolling(3).mean().rename(\"rolling3\"),\n",
"], axis=1).dropna()\n",
"print(fm_302.head())\n",
"print(\"autocorr lag1:\", rate_302.autocorr(lag=1).round(3))\n",
"print(\"autocorr lag5:\", rate_302.autocorr(lag=5).round(3))\n",
"date as the index on the full attendance table (Sec. 2)school_id and resample to weekly means (Sec. 4)weekly = attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean()\n",
"\n",
"first_week = weekly.loc[:, \"2025-01-06\":\"2025-01-12\"]\n",
"last_week = weekly.loc[:, \"2025-03-24\":\"2025-03-28\"]\n",
"drop_per_school = first_week.groupby(\"school_id\").mean() - last_week.groupby(\"school_id\").mean()\n",
"drop_per_school.sort_values(ascending=False)\n",
"