{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 11: Time Series\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/11-time-series.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/11-time-series.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Data Analysis**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook continues from Part 10 (`10-combining-reshaping.ipynb`), which covered concatenation, merging, groupby, and pivoting. The student exam results dataset has no real dates in it, only an `establishment_year`, so this part introduces a second dataset built for the job: daily attendance records for 5 schools over a school term: the shape most time-indexed data takes in practice, one row per day per entity.\n",
    "\n",
    "Part 12 (`11-polars.ipynb`) continues with Polars, including its own (faster) take on time-indexed data.\n",
    "\n",
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Topics covered\n",
    "\n",
    "| Topic | Why it matters |\n",
    "|---|---|\n",
    "| **`Timestamp` and `to_datetime`** | pandas' building block for a single point in time, and how to parse one out of text |\n",
    "| **`DatetimeIndex`** | An index made of dates unlocks date-based slicing, not just position-based |\n",
    "| **Selecting by date** | Partial strings and date ranges, instead of exact label or integer position |\n",
    "| **`resample`** | Change the time granularity of a series: daily to weekly, weekly to monthly |\n",
    "| **Timezones** | Localize naive timestamps, convert between zones, and store everything in UTC |\n",
    "| **Lag, lead, and autocorrelation** | Create feature-engineered columns from past values; measure how much a series predicts itself |\n",
    ":::\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 11 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Create and inspect `Timestamp` values, and parse dates out of text with `to_datetime` | Sec. 1 |\n",
    "| 2 | Build a `DatetimeIndex` and use it to slice a time series by date | Sec. 2 |\n",
    "| 3 | Select rows with a partial date string or a date range | Sec. 3 |\n",
    "| 4 | Change the time granularity of a series with `resample` | Sec. 4 |\n",
    "| 5 | Localize naive timestamps, convert between timezones, and store in UTC | Sec. 5 |\n",
    "| 6 | Create lag and lead features with `shift`, measure autocorrelation | Sec. 6 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "attendance = pd.read_csv(\"data/daily_attendance.csv\")\n",
    "attendance.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## 1. Date and Time Data Types\n",
    "\n",
    "The `date` column above read in as plain text, `str` dtype, not a date pandas can do arithmetic on. `pd.to_datetime` converts it to pandas' dedicated datetime dtype, `datetime64`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "attendance[\"date\"] = pd.to_datetime(attendance[\"date\"])\n",
    "attendance.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Pandas 3 infers the resolution it needs, not always nanoseconds</span><br><br>\n",
    "Earlier pandas versions always stored datetimes as <code>datetime64[ns]</code>, nanosecond precision, whether the data needed it or not. Pandas 3's <code>to_datetime</code> infers a resolution from what is actually in the data: day-level strings like the ones here become <code>datetime64[s]</code> or coarser, not nanoseconds. <code>attendance[\"date\"].dtype</code> shows whichever resolution was inferred for this column.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "A single value out of a datetime column is a `Timestamp`, pandas' equivalent of Python's `datetime.datetime`, with the same year, month, day, and weekday attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "first_day = attendance[\"date\"].iloc[0]\n",
    "print(type(first_day))\n",
    "print(f\"year={first_day.year}, month={first_day.month}, day_name={first_day.day_name()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Parse and Inspect</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Convert a list of three date strings, <code>[\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"]</code>, to <code>Timestamp</code> values with <code>pd.to_datetime</code>, then print the day name of each one.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>dates = pd.to_datetime([\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"])\n",
    "for d in dates:\n",
    "    print(d.day_name())</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: convert the three date strings and print each day name\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "## 2. The `DatetimeIndex`\n",
    "\n",
    "Setting a datetime column as the index turns it into a `DatetimeIndex`, which unlocks date-based slicing instead of only position-based or exact-label lookups. Each school's rows are pulled out first, since a `DatetimeIndex` only makes sense for one time series at a time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "school_300 = attendance[attendance[\"school_id\"] == 300].set_index(\"date\")\n",
    "school_300.index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Setting the index before filtering to one entity</span><br><br>\n",
    "<code>attendance.set_index(\"date\")</code> on the full table produces a <code>DatetimeIndex</code> with the same date repeated once per school, since every school has a row for every day. Slicing that index by date then returns rows from every school mixed together for that date, not a clean single time series. Filter to one entity first, exactly as <code>school_300</code> does above, then set the index.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Build Another School's Series</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Filter <code>attendance</code> to <code>school_id == 302</code>, set <code>date</code> as the index, and confirm the result's index is a <code>DatetimeIndex</code> with <code>isinstance(result.index, pd.DatetimeIndex)</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
    "isinstance(school_302.index, pd.DatetimeIndex)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: filter to school_id 302, set date as index, confirm DatetimeIndex\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "## 3. Selecting Data from a Time Series\n",
    "\n",
    "A `DatetimeIndex` accepts a partial date string in `.loc`, matching every row that falls inside it. `\"2025-02\"` selects the whole month without spelling out the first and last day:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "school_300.loc[\"2025-02\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "A slice with two partial dates selects everything between them, inclusive of both ends:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "school_300.loc[\"2025-02-01\":\"2025-02-07\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "<div style='background:#EAF7F0;border-left:5px solid #059669;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#059669;font-weight:bold'><i class=\"bi bi-journal-code\"></i> Example: Comparing the size of two date ranges</span><br><br>\n",
    "<code>len(school_300.loc[\"2025-01\"])</code> against <code>len(school_300.loc[\"2025-02\"])</code> confirms the row count for each month matches its number of business days, the same `bdate_range` weekday-only pattern used to build this dataset in the first place.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"January business days  : {len(school_300.loc['2025-01'])}\")\n",
    "print(f\"February business days : {len(school_300.loc['2025-02'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Filter the Last Two Weeks of Term</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Select every row in <code>school_300</code> from <code>\"2025-03-15\"</code> to <code>\"2025-03-28\"</code> inclusive, and print the mean <code>attendance_rate</code> over that range.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>last_two_weeks = school_300.loc[\"2025-03-15\":\"2025-03-28\"]\n",
    "last_two_weeks[\"attendance_rate\"].mean()</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: select 2025-03-15 through 2025-03-28 and print the mean attendance_rate\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "## 4. The Power of Pandas: `resample`\n",
    "\n",
    "`resample` changes the time granularity of a series: daily data summarized into weekly or monthly figures, the same split-apply-combine idea from Part 3's `groupby`, except the groups are time intervals instead of category values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {},
   "outputs": [],
   "source": [
    "weekly_attendance = school_300[\"attendance_rate\"].resample(\"W\").mean()\n",
    "weekly_attendance.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: <code>resample</code> groups by time interval, <code>groupby</code> groups by value</span><br><br>\n",
    "<code>df.groupby(\"caste\")</code> (Part 3) splits rows by whatever value is already in the <code>caste</code> column. <code>series.resample(\"W\")</code> splits rows by which week their <code>DatetimeIndex</code> label falls into, intervals that did not exist as a column at all until <code>resample</code> created them. Both still end with an aggregation like <code>.mean()</code> to combine each group into one number.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "Monthly resampling on the same series needs only a different frequency string. In pandas 3 the old `\"M\"` alias was removed; the replacement is `\"ME\"` (month-end), which anchors each bucket to the last calendar day of the month:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "monthly_attendance = school_300[\"attendance_rate\"].resample(\"ME\").mean()\n",
    "monthly_attendance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Resampling the whole DataFrame keeps every entity separate, with care</span><br><br>\n",
    "<code>attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean()</code> resamples each school's series independently in one call, instead of looping over schools and resampling each one by hand. <code>groupby</code> before <code>resample</code> is what keeps the schools from being averaged together.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Monthly Comparison Across Schools</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Set <code>date</code> as the index on the full <code>attendance</code> table, group by <code>school_id</code>, and resample to monthly means in one chained call. Use <code>\"ME\"</code> (month-end): the pandas 3 replacement for the removed <code>\"M\"</code> alias.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"ME\").mean()</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: set date as index, group by school_id, resample monthly, take the mean\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "## 5. Timezones\n",
    "\n",
    "The timestamps in this dataset are naive: they carry no timezone information. That is fine for a single-country attendance dataset. It is not fine for anything that crosses system boundaries: API responses, cloud server logs, or sensor readings from devices in different cities. Naive timestamps from different sources silently offset against each other when you join them, with no error to warn you.\n",
    "\n",
    "Two operations handle timezones in pandas:\n",
    "\n",
    "- `tz_localize(tz)`: stamps a naive series with a timezone. The values do not change, only their meaning.\n",
    "- `tz_convert(tz)`: shifts a timezone-aware series to a different timezone. The values change to represent the same instant in the target zone.\n",
    "\n",
    "The standard practice for storage: localize to UTC, convert to local time only for display."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Localize the naive DatetimeIndex to East Africa Time (UTC+3, Tanzania/Kenya)\n",
    "school_300_eat = school_300.copy()\n",
    "school_300_eat.index = school_300_eat.index.tz_localize(\"Africa/Nairobi\")\n",
    "print(\"Localized (EAT):\", school_300_eat.index[:2])\n",
    "\n",
    "# Convert to UTC for storage\n",
    "school_300_utc = school_300_eat.copy()\n",
    "school_300_utc.index = school_300_utc.index.tz_convert(\"UTC\")\n",
    "print(\"UTC:            \", school_300_utc.index[:2])"
   ]
  },
  {
   "cell_type": "raw",
   "id": "35",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "> **Timezone flow: localize once, store as UTC, convert for display**\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    A[\"tz-naive Timestamp\n",
    "2024-01-15 09:00\n",
    "(no timezone info)\"] -->|\"tz_localize('US/Eastern')\"| B[\"tz-aware\n",
    "2024-01-15 09:00-05:00\"]\n",
    "    B -->|\"tz_convert('UTC')\"| C[\"UTC\n",
    "2024-01-15 14:00+00:00\"]\n",
    "    C -->|\"tz_convert('Africa/Dar_es_Salaam')\"| D[\"Local display\n",
    "2024-01-15 17:00+03:00\"]\n",
    "\n",
    "    A2[\"mixing naive + aware\n",
    "in the same column\"] -->|\"raises\"| E[\"TypeError\"]\n",
    "\n",
    "    style C fill:#EBF5F0,stroke:#059669,color:#065F46\n",
    "    style E fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n",
    "    style A style=dashed\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Store in UTC, display in local time</span><br><br>\n",
    "UTC has no daylight saving transitions and no ambiguous hours, so it is the only safe format for storing timestamps that will be compared, sorted, or joined across sources. Localize to UTC at ingestion; convert to the user's timezone only when formatting for display. Timestamps stored as naive local time break silently when DST shifts: two records that look 60 minutes apart may actually be 120 minutes apart, or the same instant twice.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "Mixing naive and aware timestamps raises a `TypeError`, which is actually helpful: it prevents silent wrong answers.\n",
    "\n",
    "```python\n",
    "school_300_utc.index[0] > school_300.index[0]\n",
    "# TypeError: Cannot compare tz-naive and tz-aware datetime-like objects\n",
    "```\n",
    "\n",
    "The fix is always to localize the naive series before comparing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare two timezone-aware timestamps safely\n",
    "school_300_eat_end = school_300_eat.index[-1]\n",
    "school_300_utc_end = school_300_utc.index[-1]\n",
    "\n",
    "# Same instant, different representation\n",
    "print(\"EAT end:\", school_300_eat_end)\n",
    "print(\"UTC end:\", school_300_utc_end)\n",
    "print(\"Same instant:\", school_300_eat_end == school_300_utc_end)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Localizing when you should be converting</span><br><br>\n",
    "<code>tz_localize</code> stamps the existing values with a timezone label: <code>2025-01-06 00:00</code> becomes <code>2025-01-06 00:00+03:00</code>. The number did not change. <code>tz_convert</code> shifts the values to represent the same instant elsewhere: <code>2025-01-06 00:00+03:00</code> becomes <code>2025-01-05 21:00+00:00</code>. If you call <code>tz_localize(\"UTC\")</code> on data that is actually in EAT, you have mislabelled it. The timestamps will compare as if they are 3 hours earlier than they really are.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Localize and Convert</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Starting from <code>school_302</code> (built in Activity 2), localize its naive index to <code>\"Africa/Nairobi\"</code>, then convert to <code>\"Europe/London\"</code>. Print the first timestamp in both representations and confirm they are the same instant.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
    "s302_eat = school_302.copy()\n",
    "s302_eat.index = s302_eat.index.tz_localize(\"Africa/Nairobi\")\n",
    "s302_london = s302_eat.copy()\n",
    "s302_london.index = s302_london.index.tz_convert(\"Europe/London\")\n",
    "print(s302_eat.index[0])\n",
    "print(s302_london.index[0])</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: localize school_302 to Africa/Nairobi then convert to Europe/London\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "## 6. Lag, Lead, and Autocorrelation\n",
    "\n",
    "A lag feature answers the question: \"what was the value yesterday?\" It is the most common feature engineering step for time series in ML. If Monday's attendance rate predicts Tuesday's, a lag-1 feature captures that relationship in a column a model can consume.\n",
    "\n",
    "`shift(n)` moves values forward by `n` periods (positive = lag, negative = lead). The first `n` rows get `NaN` because there is no previous value to fill them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43",
   "metadata": {},
   "outputs": [],
   "source": [
    "rate = school_300[\"attendance_rate\"].copy()\n",
    "\n",
    "# lag-1: yesterday's attendance rate\n",
    "lag1 = rate.shift(1)\n",
    "lag1.name = \"rate_lag1\"\n",
    "\n",
    "# lag-5: one school week ago\n",
    "lag5 = rate.shift(5)\n",
    "lag5.name = \"rate_lag5\"\n",
    "\n",
    "features = pd.concat([rate, lag1, lag5], axis=1)\n",
    "features.head(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Lag features turn time series forecasting into supervised learning</span><br><br>\n",
    "A gradient boosting model does not know what \"time\" means. It knows column values. Giving it a <code>rate_lag1</code> column is how you communicate \"yesterday's value\" in a language the model understands. A model trained on <code>[rate_lag1, rate_lag5, day_of_week]</code> predicting <code>attendance_rate</code> is a supervised regression problem built from a single time series. The NaN rows produced by <code>shift</code> must be dropped or filled before training.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rolling 5-day average: smooth out day-of-week noise\n",
    "rolling_mean = rate.rolling(window=5).mean()\n",
    "rolling_mean.name = \"rate_rolling5\"\n",
    "\n",
    "# Rolling 5-day std: flags volatile weeks\n",
    "rolling_std = rate.rolling(window=5).std()\n",
    "rolling_std.name = \"rate_rolling5_std\"\n",
    "\n",
    "feature_matrix = pd.concat([rate, lag1, lag5, rolling_mean, rolling_std], axis=1)\n",
    "feature_matrix.dropna().head()  # drop NaN rows before modelling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "`autocorr(lag)` measures how strongly a series correlates with a delayed copy of itself. An autocorrelation close to 1 means yesterday's value is a strong predictor of today's: a useful sanity check before building a lag-based model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47",
   "metadata": {},
   "outputs": [],
   "source": [
    "for lag in [1, 5, 10]:\n",
    "    ac = rate.autocorr(lag=lag)\n",
    "    print(f\"autocorr(lag={lag:>2}): {ac:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48",
   "metadata": {},
   "source": [
    "<div style='background:#EAF7F0;border-left:5px solid #059669;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#059669;font-weight:bold'><i class=\"bi bi-journal-code\"></i> Example: Reading autocorrelation results</span><br><br>\n",
    "A lag-1 autocorrelation of ~0.4 to 0.7 is typical for daily attendance data: yesterday's rate is a moderate predictor of today's but not a perfect one. A value near 0 means the series is essentially random from day to day. A value near 1 means it barely changes at all. The lag-5 value reflects the weekly pattern: a Monday is most similar to the previous Monday, not to last Friday.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 6 - Build a Lag Feature Matrix</span><br><br>\n",
    "\n",
    "<b>Goal:</b> For <code>school_302</code>, create a feature matrix with columns: the original <code>attendance_rate</code>, a lag-1 column, a lag-5 column, and a 3-day rolling mean. Drop NaN rows, then print the autocorrelation at lag 1 and lag 5. What do the values tell you about how predictable attendance is at school 302?\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>rate_302 = attendance[attendance[\"school_id\"]==302].set_index(\"date\")[\"attendance_rate\"]\n",
    "fm_302 = pd.concat([\n",
    "    rate_302,\n",
    "    rate_302.shift(1).rename(\"lag1\"),\n",
    "    rate_302.shift(5).rename(\"lag5\"),\n",
    "    rate_302.rolling(3).mean().rename(\"rolling3\"),\n",
    "], axis=1).dropna()\n",
    "print(fm_302.head())\n",
    "print(\"autocorr lag1:\", rate_302.autocorr(lag=1).round(3))\n",
    "print(\"autocorr lag5:\", rate_302.autocorr(lag=5).round(3))</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build lag feature matrix for school_302 and check autocorrelations\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51",
   "metadata": {},
   "source": [
    "## Capstone: Term-End Attendance Report\n",
    "\n",
    "Combine every operation from this notebook: parsing dates, building a per-school time series, slicing by date, and resampling, into one short report comparing the start and end of term."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone Exercise - Term-End Attendance Report</span><br><br>\n",
    "\n",
    "<b>Goal:</b>\n",
    "<ol>\n",
    "<li>Set <code>date</code> as the index on the full <code>attendance</code> table (Sec. 2)</li>\n",
    "<li>Group by <code>school_id</code> and resample to weekly means (Sec. 4)</li>\n",
    "<li>From the result, select the first week of January and the last week of March for every school, using a partial date string (Sec. 3)</li>\n",
    "<li>Report which school had the largest drop in attendance between those two weeks</li>\n",
    "</ol>\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>weekly = attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean()\n",
    "\n",
    "first_week = weekly.loc[:, \"2025-01-06\":\"2025-01-12\"]\n",
    "last_week = weekly.loc[:, \"2025-03-24\":\"2025-03-28\"]\n",
    "drop_per_school = first_week.groupby(\"school_id\").mean() - last_week.groupby(\"school_id\").mean()\n",
    "drop_per_school.sort_values(ascending=False)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build the term-end attendance report described above\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| McKinney, W. (2022). *Python for Data Analysis*, 3rd ed. O'Reilly. | Chapter 11 (time series) is the most complete treatment of `DatetimeIndex`, `resample`, and `rolling` with pandas |\n",
    "| [pandas documentation: Time series / date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html) | The authoritative reference for `pd.date_range`, offset aliases, time zones, and `Period` |\n",
    "| [pandas documentation: Time zone handling](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-zone-handling) | `tz_localize`, `tz_convert`, and the ambiguous-times edge cases |\n",
    "| McKinney, W. (2011). [Time series analysis in Python with pandas](https://conference.scipy.org/proceedings/scipy2011/pdfs/wes_mckinney.pdf). *SciPy 2011*. | The original paper describing pandas' time series design; short and still accurate |\n",
    "| Hyndman, R.J. & Athanasopoulos, G. (2021). *Forecasting: Principles and Practice*, 3rd ed. | Free at [otexts.com/fpp3](https://otexts.com/fpp3): the next step after understanding how to *store* time series data is learning how to *model* it |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "|---|---|\n",
    "| `pd.to_datetime` | Parses text into pandas' `datetime64` dtype; pandas 3 infers the resolution instead of always using nanoseconds |\n",
    "| `Timestamp` | A single datetime value, with `.year`, `.month`, `.day_name()`, and similar attributes |\n",
    "| `DatetimeIndex` | Set a datetime column as the index to unlock date-based slicing |\n",
    "| Filter before indexing | Set a `DatetimeIndex` on one entity's rows, not a table mixing several entities at the same dates |\n",
    "| `.loc[\"2025-02\"]` | A partial date string selects every row inside that period |\n",
    "| `.loc[start:end]` | A date range slice is inclusive of both ends |\n",
    "| `.resample(\"ME\")` | Groups rows by month-end interval; `\"ME\"` is the pandas 3 replacement for the removed `\"M\"` alias |\n",
    "| `groupby(...).resample(...)` | Resample each entity's series independently in one chained call |\n",
    "| `tz_localize(tz)` | Stamps a naive timestamp with a timezone; values do not shift |\n",
    "| `tz_convert(tz)` | Shifts a timezone-aware timestamp to a different zone; values change |\n",
    "| Store UTC | Localize to UTC at ingestion, convert to local time only for display |\n",
    "| `shift(n)` | Creates lag (positive n) or lead (negative n) features; first n rows become NaN |\n",
    "| `rolling(n).mean()` | Rolling window average; smooths noise and is a useful feature in its own right |\n",
    "| `autocorr(lag=n)` | Measures how strongly a series predicts itself n steps ahead; guides lag feature selection |\n",
    "\n",
    "**Next:** `11-polars.ipynb`, covering Polars' DataFrame, expression API, lazy evaluation, and streaming."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}