{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 10: Combining, Reshaping & Time Series\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/10-combining-reshaping.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/10-combining-reshaping.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues from Part 9 (`09-pandas-operations.ipynb`). Real analyses rarely live in one table: a second file with reference data shows up, a result needs splitting by category, and a wide summary table is easier to read than a long one. This part covers four multi-table operations and adds time-indexed data as a fifth theme: concatenation, merging, `groupby`, pivoting, and working with dates and `resample`.\n", "\n", "Alongside the student exam results, we use a small synthetic table mapping each `school_id` to a region, and a second dataset of daily attendance records covering five schools over a full school term.\n", "\n", "Part 11 (`11-polars.ipynb`) continues with Polars and the expression API.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **`pd.concat`** | Stack DataFrames on top of each other, or side by side |\n", "| **`pd.merge`** | Combine two tables on a shared key, the same idea as a SQL join |\n", "| **`groupby`** | Split a dataset into groups, apply a function to each, combine the results |\n", "| **`pivot_table`** | Reshape a long result into a wide, readable summary table |\n", "| **`Timestamp` and `to_datetime`** | Parse text into dates; the building block for all time-indexed work |\n", "| **`DatetimeIndex`** | An index made of dates unlocks date-based slicing instead of position-only |\n", "| **`resample`** | Change the time granularity of a series: daily to weekly, weekly to monthly |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 10 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Stack DataFrames with `pd.concat`, by row and by column | Sec. 1 |\n", "| 2 | Combine two tables on a shared key with `pd.merge`, and choose the right `how` | Sec. 2 |\n", "| 2b | Predict row counts after a merge and diagnose unexpected fan-out | Sec. 2 |\n", "| 3 | Split a dataset into groups and aggregate each one with `groupby` | Sec. 3 |\n", "| 4 | Reshape a grouped result into a wide summary with `pivot_table` | Sec. 4 |\n", "| 5 | Parse text into datetime values with `to_datetime` and extract date components | Sec. 5 |\n", "| 6 | Build a `DatetimeIndex` and use it to slice a time series by date | Sec. 6 |\n", "| 7 | Select rows with a partial date string or a date range | Sec. 7 |\n", "| 8 | Change the time granularity of a series with `resample` | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "\n", "# courses.csv is a thin reference table — one row per course — useful for joins\n", "courses = pd.read_csv(\"data/courses.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Concatenating DataFrames with `pd.concat`\n", "\n", "`pd.concat` stacks DataFrames together. By default it stacks rows on top of each other (`axis=0`), the operation you need when the same columns show up in two separate files or two separate batches:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "male_students = df[df[\"gender\"] == \"M\"]\n", "female_students = df[df[\"gender\"] == \"F\"]\n", "\n", "recombined = pd.concat([male_students, female_students], axis=0)\n", "print(f\"male: {len(male_students)}, female: {len(female_students)}, combined: {len(recombined)}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Passing `axis=1` stacks columns side by side instead, matching rows up by index. This is the shape you get when a calculation is done separately and needs joining back onto the original table:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "pass_fail = (df[\"average_marks\"] >= 0.6).rename(\"passed\")\n", "with_pass_fail = pd.concat([df[[\"student_id\", \"average_marks\"]], pass_fail], axis=1)\n", "with_pass_fail.head()" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
\n", " Common Mistake: Concatenating by column without matching indexes first

\n", "pd.concat([df_a, df_b], axis=1) lines rows up by index position, not by any shared key. If df_a and df_b were filtered, sorted, or reset differently beforehand, row 0 in one might not be row 0 in the other, and the result silently combines the wrong rows. pd.merge (Sec. 2) is the safer choice whenever there is an actual key column to join on.\n", "
" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "
\n", " Activity 1 - Split and Recombine

\n", "\n", "Goal: Split df into two DataFrames by caste: one for \"BC\" and one for everything else, using .isin() or a boolean mask. Concatenate them back with pd.concat and confirm the row count matches the original.\n", "
bc_only = df[df[\"program\"] == \"BC\"]\n",
    "not_bc = df[df[\"program\"] != \"BC\"]\n",
    "recombined = pd.concat([bc_only, not_bc], axis=0)\n",
    "len(recombined) == len(df)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "# TODO: split df by program == \"Data Science\", concat back, and confirm the row count matches\n", "..." ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "## 2. Joining Data with `pd.merge`\n", "\n", "`pd.merge` combines two tables on a shared key, the same idea as a SQL join. The `courses` table loaded above has one row per `course_id`; merging it onto `df` attaches metadata (credits, department, instructor) to every enrollment row:" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "df_with_meta = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n", "df_with_meta[[\"student_id\", \"course_id\", \"course_x\", \"department\", \"credits\"]].head()" ] }, { "cell_type": "raw", "id": "14", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Join types: which rows survive each merge strategy**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " subgraph LEFT [\"Left: students\"]\n", " LR1[\"S001 | Alice\"]\n", " LR2[\"S002 | Bob\"]\n", " LR3[\"S003 | Carol\"]\n", " end\n", " subgraph RIGHT [\"Right: grades\"]\n", " RR1[\"S001 | 85\"]\n", " RR2[\"S002 | 72\"]\n", " RR4[\"S004 | 90\"]\n", " end\n", " LR1 & RR1 --> INN[\"inner: S001 (match only)\"]\n", " LR2 & RR2 --> INN2[\"inner: S002 (match only)\"]\n", " LR1 & LR2 & LR3 --> LEFT2[\"left: S001, S002, S003\n", "(Carol grade = NaN)\"]\n", " RR1 & RR2 & RR4 --> RIGHT2[\"right: S001, S002, S004\n", "(Alice/Bob keep grade, S004 name = NaN)\"]\n", " LR1 & LR2 & LR3 & RR4 --> OUTER[\"outer: all 4 students, NaN where no match\"]\n", "\n", " style INN fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style INN2 fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style LEFT2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", " style RIGHT2 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n", " style OUTER fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n", "```\n" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "The `how` argument decides which rows survive when a key on one side has no match on the other. `courses` here has a row for every `course_id` in `df`, so every `how` gives the same result; the difference only shows up once the two tables disagree:" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "
\n", " Example: When how actually changes the result

\n", "Merging df against only half of courses with how=\"inner\" keeps only enrollments whose course is in that half. The same merge with how=\"left\" keeps every student, with region set to NaN wherever the school was missing from the smaller table.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "half_courses = courses[courses[\"department\"] == \"Computing\"]\n", "\n", "inner_result = pd.merge(df, half_courses, on=\"course_id\", how=\"inner\")\n", "left_result = pd.merge(df, half_courses, on=\"course_id\", how=\"left\")\n", "missing = left_result[\"department\"].isna().sum()\n", "print(f\"inner: {len(inner_result)} rows, left: {len(left_result)} rows, unmatched dept: {missing}\")" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "The four join types differ only in which rows they keep, never in how rows are matched:" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "from ark.plot.diagrams import merge_join_types_diagram\n", "\n", "merge_join_types_diagram();" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "
\n", " Common Mistake: Forgetting that an unmatched key produces NaN, not a dropped row

\n", "With how=\"left\", an enrollment whose course_id has no match in courses still gets a row, just with NaN in every column that came from courses. Code that assumes every row has a real region after a left merge will silently miscount or mis-group those rows later, usually in a groupby a few cells down.\n", "
" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "### Duplicate keys: when merge produces more rows than you expect\n", "\n", "Every example so far used a key where the right table had at most one match per left row (many-to-one). When the right table has more than one match, the result has more rows than either input, and this is correct but often surprising.\n", "\n", "Concrete example: if a student is enrolled in three courses, a left merge of the enrollment table against a course details table produces three rows for that student, one per course. That is the right answer, but code that assumed row counts stay the same will silently operate on a bigger table." ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# Create a small example: two students, one appearing in two courses\n", "students_small = pd.DataFrame(\n", " {\n", " \"student_id\": [\"S0001\", \"S0001\", \"S0002\"],\n", " \"course_id\": [\"C01\", \"C02\", \"C01\"],\n", " \"score\": [85, 72, 91],\n", " }\n", ")\n", "\n", "course_info = pd.DataFrame(\n", " {\n", " \"course_id\": [\"C01\", \"C02\"],\n", " \"course_name\": [\"Statistics\", \"Algorithms\"],\n", " \"credits\": [3, 4],\n", " }\n", ")\n", "\n", "merged_dup = pd.merge(students_small, course_info, on=\"course_id\", how=\"left\")\n", "print(f\"students_small rows: {len(students_small)}\")\n", "print(f\"after merge rows : {len(merged_dup)}\")\n", "merged_dup" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "
\n", " Key Concept: A many-to-one merge preserves row count; a one-to-many merge multiplies it

\n", "Many-to-one (the common case): each row on the left matches at most one row on the right. Row count stays the same.

\n", "One-to-many: one row on the left matches several rows on the right. Each match becomes its own output row. A student enrolled in 3 courses becomes 3 rows after merging on a course-detail table.

\n", "Many-to-many: both sides have duplicates on the key. Every left row pairs with every matching right row: 3 left rows matching 4 right rows produces 12 output rows. This is almost never what you want and is usually a sign of a design error in the data or the query.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "# Verify: check how many rows the real df gains after merging on course_id\n", "before = len(df)\n", "after = len(pd.merge(df, courses, on=\"course_id\", how=\"left\"))\n", "print(f\"Before merge: {before} rows\")\n", "print(f\"After merge: {after} rows\")\n", "print(f\"Difference : {after - before} extra rows (students with multiple course matches)\")\n", "\n", "# Diagnose: how many course_ids appear more than once in courses?\n", "dupe_courses = courses[courses.duplicated(\"course_id\", keep=False)]\n", "print(f\"\\nDuplicate course_ids in courses.csv: {dupe_courses['course_id'].nunique()}\")" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "
\n", " Common Mistake: Summing a column after an unexpected fan-out

\n", "If df has one row per student and groupby(\"student_id\")[\"score\"].sum() gives the right answer before a merge, but the merge doubles some rows, the same sum() after the merge silently counts those scores twice. The most reliable diagnostic is to compare len(df) before and after any merge. A row count increase without a planned one-to-many join is a bug, not a feature.\n", "
# Always verify after merging\n",
    "before = len(df)\n",
    "merged = pd.merge(df, other_table, on=\"key\", how=\"left\")\n",
    "assert len(merged) == before, f\"Unexpected fan-out: {before} -> {len(merged)} rows\"
\n", "
" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "
\n", " Activity 2 - Count Enrollments Per Department

\n", "\n", "Goal: Merge df with courses using how=\"left\", then use .value_counts() on the resulting department column.egion column to see how many student rows fall in each region.\n", "
merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
    "merged[\"department\"].value_counts()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "# TODO: merge df with courses (how=\"left\"), then value_counts on department\n", "..." ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "## 3. Group By Operations\n", "\n", "`groupby` splits a dataset into groups by a key, applies a function to each group independently, and combines the results back into one table, one row per group:" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "from ark.plot.diagrams import groupby_split_apply_combine_diagram\n", "\n", "groupby_split_apply_combine_diagram();" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "df.groupby(\"program\")[\"average_marks\"].mean()" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Key Concept: groupby computes nothing until you aggregate

\n", "df.groupby(\"program\") on its own returns a DataFrameGroupBy object, a plan for splitting the data, not a result. Nothing is actually computed until a method like .mean(), .sum(), or .agg() is called on it, the same lazy-then-compute pattern you will see again in Part 11's Polars notebook.\n", "
" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "`.agg()` works after `groupby` exactly as it did in Part 2, computing several statistics per group in one call:" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "df.groupby(\"program\")[\"average_marks\"].agg([\"mean\", \"std\", \"count\"])" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "Grouping by more than one column splits into one group per combination, here gender within caste:" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "df.groupby([\"program\", \"gender\"])[\"average_marks\"].mean()" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "
\n", " Pro Tip: observed=True when grouping a category column

\n", "Grouping a column with category dtype (Part 2, Sec. 2) by default includes every category the dtype knows about, even ones with zero rows in the current data, padding the result with empty groups. Passing observed=True to groupby keeps only the categories that actually appear.\n", "
" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "
\n", " Activity 3 - Dropout Rate by Caste

\n", "\n", "Goal: Group df by caste and compute, for each group, what fraction of students have continue_drop == \"drop\". (series == \"drop\").mean() gives a fraction directly, no separate count and divide needed.\n", "
df.groupby(\"program\")[\"continue_drop\"].apply(lambda s: (s == \"drop\").mean())
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "# TODO: group by program, compute the pass fraction per program\n", "..." ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "## 4. Pivoting Data\n", "\n", "`pivot_table` is `groupby` plus a reshape, in one call: it groups by one column, splits further by another, and lays the second grouping out as columns instead of stacking it into more rows, much easier to scan as a summary:" ] }, { "cell_type": "code", "execution_count": null, "id": "40", "metadata": {}, "outputs": [], "source": [ "pd.pivot_table(df, index=\"program\", columns=\"gender\", values=\"average_marks\", aggfunc=\"mean\")" ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "
\n", " Key Concept: pivot_table is groupby with a different output shape

\n", "df.groupby([\"program\", \"gender\"])[\"average_marks\"].mean() from Sec. 3 and the pivot_table call above compute the exact same numbers. groupby returns them stacked into a long Series with a two-level index; pivot_table lays the second key out across columns instead. Reach for pivot_table specifically when the result is meant to be read by a person, not processed further by code.\n", "
" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "`aggfunc` accepts a list too, giving more than one statistic per cell:" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "pd.pivot_table(df, index=\"program\", columns=\"gender\", values=\"average_marks\", aggfunc=[\"mean\", \"count\"])" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "
\n", " Activity 4 - Region by Program Pivot

\n", "\n", "Goal: Merge df with courses (Sec. 2), then build a pivot table with region as the index, program as the columns, and the mean average_marks in each cell.\n", "
merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
    "pd.pivot_table(merged, index=\"region\", columns=\"program\", values=\"average_marks\", aggfunc=\"mean\")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "# TODO: merge with courses, then pivot region x program on mean final_score\n", "..." ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "## Capstone: Regional Performance Report\n", "\n", "Combine every operation from this notebook into one report: a merge to bring in region, a groupby to summarize, and a pivot to lay the summary out for reading." ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Regional Performance Report

\n", "\n", "Goal:\n", "
    \n", "
  1. Merge df with courses on course_id, keeping every student (Sec. 2)
  2. \n", "
  3. Group the merged table by region and compute the mean average_marks and the drop fraction, using the function from Activity 3 (Sec. 3)
  4. \n", "
  5. Build a pivot table with region as rows and caste as columns, mean average_marks in each cell (Sec. 4)
  6. \n", "
\n", "
merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
    "\n",
    "regional_summary = merged.groupby(\"region\").agg(\n",
    "    mean_marks=(\"average_marks\", \"mean\"),\n",
    "    drop_fraction=(\"continue_drop\", lambda s: (s == \"drop\").mean()),\n",
    ")\n",
    "\n",
    "region_caste_pivot = pd.pivot_table(merged, index=\"region\", columns=\"program\", values=\"average_marks\", aggfunc=\"mean\")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [], "source": [ "# TODO: build the regional performance report described above\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "## 5. Time Series Data\n", "\n", "The student exam results dataset has no date column. The attendance dataset does: daily records for five schools across a full school term. Loading it alongside the exam data gives Part 10 two tables to work with and shows how the same pandas idioms scale to a new problem shape." ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "attendance = pd.read_csv(\"data/daily_attendance.csv\")\n", "attendance.dtypes" ] }, { "attachments": {}, "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "## 5. Date and Time Data Types\n", "\n", "The `date` column above read in as plain text, `str` dtype, not a date pandas can do arithmetic on. `pd.to_datetime` converts it to pandas' dedicated datetime dtype, `datetime64`:" ] }, { "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [], "source": [ "attendance[\"date\"] = pd.to_datetime(attendance[\"date\"])\n", "attendance.dtypes" ] }, { "attachments": {}, "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "
\n", " Key Concept: Pandas 3 infers the resolution it needs, not always nanoseconds

\n", "Earlier pandas versions always stored datetimes as datetime64[ns], nanosecond precision, whether the data needed it or not. Pandas 3's to_datetime infers a resolution from what is actually in the data: day-level strings like the ones here become datetime64[s] or coarser, not nanoseconds. attendance[\"date\"].dtype shows whichever resolution was inferred for this column.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "54", "metadata": {}, "source": [ "A single value out of a datetime column is a `Timestamp`, pandas' equivalent of Python's `datetime.datetime`, with the same year, month, day, and weekday attributes:" ] }, { "cell_type": "code", "execution_count": null, "id": "55", "metadata": {}, "outputs": [], "source": [ "first_day = attendance[\"date\"].iloc[0]\n", "print(type(first_day))\n", "print(f\"year={first_day.year}, month={first_day.month}, day_name={first_day.day_name()}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "
\n", " Activity 5 - Parse and Inspect

\n", "\n", "Goal: Convert a list of three date strings, [\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"], to Timestamp values with pd.to_datetime, then print the day name of each one.\n", "
dates = pd.to_datetime([\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"])\n",
    "for d in dates:\n",
    "    print(d.day_name())
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "57", "metadata": {}, "outputs": [], "source": [ "# TODO: convert the three date strings and print each day name\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "58", "metadata": {}, "source": [ "## 6. The `DatetimeIndex`\n", "\n", "Setting a datetime column as the index turns it into a `DatetimeIndex`, which unlocks date-based slicing instead of only position-based or exact-label lookups. Each school's rows are pulled out first, since a `DatetimeIndex` only makes sense for one time series at a time:" ] }, { "cell_type": "code", "execution_count": null, "id": "59", "metadata": {}, "outputs": [], "source": [ "school_300 = attendance[attendance[\"school_id\"] == 300].set_index(\"date\")\n", "school_300.index" ] }, { "attachments": {}, "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "
\n", " Common Mistake: Setting the index before filtering to one entity

\n", "attendance.set_index(\"date\") on the full table produces a DatetimeIndex with the same date repeated once per school, since every school has a row for every day. Slicing that index by date then returns rows from every school mixed together for that date, not a clean single time series. Filter to one entity first, exactly as school_300 does above, then set the index.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "61", "metadata": {}, "source": [ "
\n", " Activity 6 - Build Another School's Series

\n", "\n", "Goal: Filter attendance to school_id == 302, set date as the index, and confirm the result's index is a DatetimeIndex with isinstance(result.index, pd.DatetimeIndex).\n", "
school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
    "isinstance(school_302.index, pd.DatetimeIndex)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "62", "metadata": {}, "outputs": [], "source": [ "# TODO: filter to school_id 302, set date as index, confirm DatetimeIndex\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "## 7. Selecting Data from a Time Series\n", "\n", "A `DatetimeIndex` accepts a partial date string in `.loc`, matching every row that falls inside it. `\"2025-02\"` selects the whole month without spelling out the first and last day:" ] }, { "cell_type": "code", "execution_count": null, "id": "64", "metadata": {}, "outputs": [], "source": [ "school_300.loc[\"2025-02\"].head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "65", "metadata": {}, "source": [ "A slice with two partial dates selects everything between them, inclusive of both ends:" ] }, { "cell_type": "code", "execution_count": null, "id": "66", "metadata": {}, "outputs": [], "source": [ "school_300.loc[\"2025-02-01\":\"2025-02-07\"]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "67", "metadata": {}, "source": [ "
\n", " Example: Comparing the size of two date ranges

\n", "len(school_300.loc[\"2025-01\"]) against len(school_300.loc[\"2025-02\"]) confirms the row count for each month matches its number of business days, the same `bdate_range` weekday-only pattern used to build this dataset in the first place.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "68", "metadata": {}, "outputs": [], "source": [ "print(f\"January business days : {len(school_300.loc['2025-01'])}\")\n", "print(f\"February business days : {len(school_300.loc['2025-02'])}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "
\n", " Activity 7 - Filter the Last Two Weeks of Term

\n", "\n", "Goal: Select every row in school_300 from \"2025-03-15\" to \"2025-03-28\" inclusive, and print the mean attendance_rate over that range.\n", "
last_two_weeks = school_300.loc[\"2025-03-15\":\"2025-03-28\"]\n",
    "last_two_weeks[\"attendance_rate\"].mean()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "70", "metadata": {}, "outputs": [], "source": [ "# TODO: select 2025-03-15 through 2025-03-28 and print the mean attendance_rate\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "71", "metadata": {}, "source": [ "## 8. The Power of Pandas: `resample`\n", "\n", "`resample` changes the time granularity of a series: daily data summarized into weekly or monthly figures, the same split-apply-combine idea from Part 3's `groupby`, except the groups are time intervals instead of category values:" ] }, { "cell_type": "code", "execution_count": null, "id": "72", "metadata": {}, "outputs": [], "source": [ "weekly_attendance = school_300[\"attendance_rate\"].resample(\"W\").mean()\n", "weekly_attendance.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "73", "metadata": {}, "source": [ "
\n", " Key Concept: resample groups by time interval, groupby groups by value

\n", "df.groupby(\"program\") (Part 3) splits rows by whatever value is already in the caste column. series.resample(\"W\") splits rows by which week their DatetimeIndex label falls into, intervals that did not exist as a column at all until resample created them. Both still end with an aggregation like .mean() to combine each group into one number.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "74", "metadata": {}, "source": [ "Monthly resampling on the same series needs only a different frequency string. In pandas 3 the old `\"M\"` alias was removed; the replacement is `\"ME\"` (month-end), which anchors each bucket to the last calendar day of the month:" ] }, { "cell_type": "code", "execution_count": null, "id": "75", "metadata": {}, "outputs": [], "source": [ "monthly_attendance = school_300[\"attendance_rate\"].resample(\"ME\").mean()\n", "monthly_attendance" ] }, { "attachments": {}, "cell_type": "markdown", "id": "76", "metadata": {}, "source": [ "
\n", " Pro Tip: Resampling the whole DataFrame keeps every entity separate, with care

\n", "attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean() resamples each school's series independently in one call, instead of looping over schools and resampling each one by hand. groupby before resample is what keeps the schools from being averaged together.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "77", "metadata": {}, "source": [ "
\n", " Activity 8 - Monthly Comparison Across Schools

\n", "\n", "Goal: Set date as the index on the full attendance table, group by school_id, and resample to monthly means in one chained call. Use \"ME\" (month-end) — the pandas 3 replacement for the old \"M\" alias.\n", "
attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"ME\").mean()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "78", "metadata": {}, "outputs": [], "source": [ "# TODO: set date as index, group by school_id, resample monthly, take the mean\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "79", "metadata": {}, "source": [ "## Capstone: Term-End Attendance Report\n", "\n", "Combine every operation from this notebook: parsing dates, building a per-school time series, slicing by date, and resampling, into one short report comparing the start and end of term." ] }, { "attachments": {}, "cell_type": "markdown", "id": "80", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Term-End Attendance Report

\n", "\n", "Goal:\n", "
    \n", "
  1. Set date as the index on the full attendance table (Sec. 2)
  2. \n", "
  3. Group by school_id and resample to weekly means (Sec. 4)
  4. \n", "
  5. From the result, select the first week of January and the last week of March for every school, using a partial date string (Sec. 3)
  6. \n", "
  7. Report which school had the largest drop in attendance between those two weeks
  8. \n", "
\n", "
weekly = attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean()\n",
    "\n",
    "first_week = weekly.loc[:, \"2025-01-06\":\"2025-01-12\"]\n",
    "last_week = weekly.loc[:, \"2025-03-24\":\"2025-03-28\"]\n",
    "drop_per_school = first_week.groupby(\"school_id\").mean() - last_week.groupby(\"school_id\").mean()\n",
    "drop_per_school.sort_values(ascending=False)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "81", "metadata": {}, "outputs": [], "source": [ "# TODO: build the term-end attendance report described above\n", "..." ] }, { "cell_type": "markdown", "id": "82", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| McKinney, W. (2022). *Python for Data Analysis*, 3rd ed. O'Reilly. | Chapter 8 (data wrangling) and Chapter 10 (aggregation) are the canonical references for `concat`, `merge`, and `groupby` |\n", "| [pandas documentation — Merge, join, concatenate and compare](https://pandas.pydata.org/docs/user_guide/merging.html) | Full API reference with worked examples for every `how=` variant and join edge-case |\n", "| Wickham, H. (2014). [Tidy data](https://doi.org/10.18637/jss.v059.i10). *Journal of Statistical Software* 59(10). | Free PDF — defines when data is \"tidy\" and explains `.melt()` / `.pivot()` as transformations toward tidy form |\n", "| [pandas documentation — Reshaping and pivot tables](https://pandas.pydata.org/docs/user_guide/reshaping.html) | `.pivot_table()`, `.stack()`, `.unstack()`, and `.crosstab()` in one place |\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "83", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| `pd.concat(axis=0)` | Stack rows; use when the same columns appear in separate batches |\n", "| `pd.concat(axis=1)` | Stack columns by index position, not by key; prefer `merge` when there is a real key |\n", "| `pd.merge(..., on=key, how=...)` | Combine two tables on a shared key, the same idea as a SQL join |\n", "| `how=\"inner\"` | Keep only rows with a match on both sides |\n", "| `how=\"left\"` | Keep every row from the left table; unmatched rows get `NaN` |\n", "| Many-to-one merge | Row count stays the same; each left row matches at most one right row |\n", "| One-to-many merge | Row count grows; each left row can match multiple right rows |\n", "| Fan-out check | Always compare `len(before)` vs `len(after)` when a row count increase is not expected |\n", "| `groupby(...)` | Returns a plan, not a result; nothing computes until you aggregate |\n", "| `.agg([...])` after `groupby` | Several statistics per group, in one call |\n", "| `pivot_table` | `groupby` with the second key laid out as columns instead of stacked as rows |\n", "| `pd.to_datetime` | Parses text into pandas' `datetime64` dtype; pandas 3 infers the resolution |\n", "| `Timestamp` | A single datetime value, with `.year`, `.month`, `.day_name()`, and similar attributes |\n", "| `DatetimeIndex` | Set a datetime column as the index to unlock date-based slicing |\n", "| Filter before indexing | Set a `DatetimeIndex` on one entity's rows, not a table mixing several entities |\n", "| `.loc[\"2025-02\"]` | A partial date string selects every row inside that period |\n", "| `.loc[start:end]` | A date range slice is inclusive of both ends |\n", "| `.resample(freq)` | Groups rows by time interval instead of by value, then needs an aggregation like `.mean()` |\n", "| `groupby(...).resample(...)` | Resample each entity's series independently in one chained call |\n", "\n", "**Next:** `11-polars.ipynb`, covering Polars' DataFrame, expression API, and lazy evaluation." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }