{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 10: Combining, Reshaping & Time Series\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/10-combining-reshaping.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/10-combining-reshaping.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues from Part 9 (`09-pandas-operations.ipynb`). Real analyses rarely live in one table: a second file with reference data shows up, a result needs splitting by category, and a wide summary table is easier to read than a long one. This part covers four multi-table operations and adds time-indexed data as a fifth theme: concatenation, merging, `groupby`, pivoting, and working with dates and `resample`.\n", "\n", "Alongside the student exam results, we use a small synthetic table mapping each `school_id` to a region, and a second dataset of daily attendance records covering five schools over a full school term.\n", "\n", "Part 11 (`11-polars.ipynb`) continues with Polars and the expression API.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **`pd.concat`** | Stack DataFrames on top of each other, or side by side |\n", "| **`pd.merge`** | Combine two tables on a shared key, the same idea as a SQL join |\n", "| **`groupby`** | Split a dataset into groups, apply a function to each, combine the results |\n", "| **`pivot_table`** | Reshape a long result into a wide, readable summary table |\n", "| **`Timestamp` and `to_datetime`** | Parse text into dates; the building block for all time-indexed work |\n", "| **`DatetimeIndex`** | An index made of dates unlocks date-based slicing instead of position-only |\n", "| **`resample`** | Change the time granularity of a series: daily to weekly, weekly to monthly |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 10 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Stack DataFrames with `pd.concat`, by row and by column | Sec. 1 |\n", "| 2 | Combine two tables on a shared key with `pd.merge`, and choose the right `how` | Sec. 2 |\n", "| 2b | Predict row counts after a merge and diagnose unexpected fan-out | Sec. 2 |\n", "| 3 | Split a dataset into groups and aggregate each one with `groupby` | Sec. 3 |\n", "| 4 | Reshape a grouped result into a wide summary with `pivot_table` | Sec. 4 |\n", "| 5 | Parse text into datetime values with `to_datetime` and extract date components | Sec. 5 |\n", "| 6 | Build a `DatetimeIndex` and use it to slice a time series by date | Sec. 6 |\n", "| 7 | Select rows with a partial date string or a date range | Sec. 7 |\n", "| 8 | Change the time granularity of a series with `resample` | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "\n", "# courses.csv is a thin reference table — one row per course — useful for joins\n", "courses = pd.read_csv(\"data/courses.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Concatenating DataFrames with `pd.concat`\n", "\n", "`pd.concat` stacks DataFrames together. By default it stacks rows on top of each other (`axis=0`), the operation you need when the same columns show up in two separate files or two separate batches:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "male_students = df[df[\"gender\"] == \"M\"]\n", "female_students = df[df[\"gender\"] == \"F\"]\n", "\n", "recombined = pd.concat([male_students, female_students], axis=0)\n", "print(f\"male: {len(male_students)}, female: {len(female_students)}, combined: {len(recombined)}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Passing `axis=1` stacks columns side by side instead, matching rows up by index. This is the shape you get when a calculation is done separately and needs joining back onto the original table:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "pass_fail = (df[\"average_marks\"] >= 0.6).rename(\"passed\")\n", "with_pass_fail = pd.concat([df[[\"student_id\", \"average_marks\"]], pass_fail], axis=1)\n", "with_pass_fail.head()" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
pd.concat([df_a, df_b], axis=1) lines rows up by index position, not by any shared key. If df_a and df_b were filtered, sorted, or reset differently beforehand, row 0 in one might not be row 0 in the other, and the result silently combines the wrong rows. pd.merge (Sec. 2) is the safer choice whenever there is an actual key column to join on.\n",
"df into two DataFrames by caste: one for \"BC\" and one for everything else, using .isin() or a boolean mask. Concatenate them back with pd.concat and confirm the row count matches the original.\n",
"bc_only = df[df[\"program\"] == \"BC\"]\n",
"not_bc = df[df[\"program\"] != \"BC\"]\n",
"recombined = pd.concat([bc_only, not_bc], axis=0)\n",
"len(recombined) == len(df)\n",
"how actually changes the resultdf against only half of courses with how=\"inner\" keeps only enrollments whose course is in that half. The same merge with how=\"left\" keeps every student, with region set to NaN wherever the school was missing from the smaller table.\n",
"NaN, not a dropped rowhow=\"left\", an enrollment whose course_id has no match in courses still gets a row, just with NaN in every column that came from courses. Code that assumes every row has a real region after a left merge will silently miscount or mis-group those rows later, usually in a groupby a few cells down.\n",
"df has one row per student and groupby(\"student_id\")[\"score\"].sum() gives the right answer before a merge, but the merge doubles some rows, the same sum() after the merge silently counts those scores twice. The most reliable diagnostic is to compare len(df) before and after any merge. A row count increase without a planned one-to-many join is a bug, not a feature.\n",
"# Always verify after merging\n",
"before = len(df)\n",
"merged = pd.merge(df, other_table, on=\"key\", how=\"left\")\n",
"assert len(merged) == before, f\"Unexpected fan-out: {before} -> {len(merged)} rows\"\n",
"df with courses using how=\"left\", then use .value_counts() on the resulting department column.egion column to see how many student rows fall in each region.\n",
"merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
"merged[\"department\"].value_counts()\n",
"groupby computes nothing until you aggregatedf.groupby(\"program\") on its own returns a DataFrameGroupBy object, a plan for splitting the data, not a result. Nothing is actually computed until a method like .mean(), .sum(), or .agg() is called on it, the same lazy-then-compute pattern you will see again in Part 11's Polars notebook.\n",
"observed=True when grouping a category columncategory dtype (Part 2, Sec. 2) by default includes every category the dtype knows about, even ones with zero rows in the current data, padding the result with empty groups. Passing observed=True to groupby keeps only the categories that actually appear.\n",
"df by caste and compute, for each group, what fraction of students have continue_drop == \"drop\". (series == \"drop\").mean() gives a fraction directly, no separate count and divide needed.\n",
"df.groupby(\"program\")[\"continue_drop\"].apply(lambda s: (s == \"drop\").mean())\n", "
pivot_table is groupby with a different output shapedf.groupby([\"program\", \"gender\"])[\"average_marks\"].mean() from Sec. 3 and the pivot_table call above compute the exact same numbers. groupby returns them stacked into a long Series with a two-level index; pivot_table lays the second key out across columns instead. Reach for pivot_table specifically when the result is meant to be read by a person, not processed further by code.\n",
"df with courses (Sec. 2), then build a pivot table with region as the index, program as the columns, and the mean average_marks in each cell.\n",
"merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
"pd.pivot_table(merged, index=\"region\", columns=\"program\", values=\"average_marks\", aggfunc=\"mean\")\n",
"df with courses on course_id, keeping every student (Sec. 2)region and compute the mean average_marks and the drop fraction, using the function from Activity 3 (Sec. 3)region as rows and caste as columns, mean average_marks in each cell (Sec. 4)merged = pd.merge(df, courses, on=\"course_id\", how=\"left\")\n",
"\n",
"regional_summary = merged.groupby(\"region\").agg(\n",
" mean_marks=(\"average_marks\", \"mean\"),\n",
" drop_fraction=(\"continue_drop\", lambda s: (s == \"drop\").mean()),\n",
")\n",
"\n",
"region_caste_pivot = pd.pivot_table(merged, index=\"region\", columns=\"program\", values=\"average_marks\", aggfunc=\"mean\")\n",
"datetime64[ns], nanosecond precision, whether the data needed it or not. Pandas 3's to_datetime infers a resolution from what is actually in the data: day-level strings like the ones here become datetime64[s] or coarser, not nanoseconds. attendance[\"date\"].dtype shows whichever resolution was inferred for this column.\n",
"[\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"], to Timestamp values with pd.to_datetime, then print the day name of each one.\n",
"dates = pd.to_datetime([\"2025-01-06\", \"2025-02-14\", \"2025-03-28\"])\n",
"for d in dates:\n",
" print(d.day_name())\n",
"attendance.set_index(\"date\") on the full table produces a DatetimeIndex with the same date repeated once per school, since every school has a row for every day. Slicing that index by date then returns rows from every school mixed together for that date, not a clean single time series. Filter to one entity first, exactly as school_300 does above, then set the index.\n",
"attendance to school_id == 302, set date as the index, and confirm the result's index is a DatetimeIndex with isinstance(result.index, pd.DatetimeIndex).\n",
"school_302 = attendance[attendance[\"school_id\"] == 302].set_index(\"date\")\n",
"isinstance(school_302.index, pd.DatetimeIndex)\n",
"len(school_300.loc[\"2025-01\"]) against len(school_300.loc[\"2025-02\"]) confirms the row count for each month matches its number of business days, the same `bdate_range` weekday-only pattern used to build this dataset in the first place.\n",
"school_300 from \"2025-03-15\" to \"2025-03-28\" inclusive, and print the mean attendance_rate over that range.\n",
"last_two_weeks = school_300.loc[\"2025-03-15\":\"2025-03-28\"]\n",
"last_two_weeks[\"attendance_rate\"].mean()\n",
"resample groups by time interval, groupby groups by valuedf.groupby(\"program\") (Part 3) splits rows by whatever value is already in the caste column. series.resample(\"W\") splits rows by which week their DatetimeIndex label falls into, intervals that did not exist as a column at all until resample created them. Both still end with an aggregation like .mean() to combine each group into one number.\n",
"attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean() resamples each school's series independently in one call, instead of looping over schools and resampling each one by hand. groupby before resample is what keeps the schools from being averaged together.\n",
"date as the index on the full attendance table, group by school_id, and resample to monthly means in one chained call. Use \"ME\" (month-end) — the pandas 3 replacement for the old \"M\" alias.\n",
"attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"ME\").mean()\n", "
date as the index on the full attendance table (Sec. 2)school_id and resample to weekly means (Sec. 4)weekly = attendance.set_index(\"date\").groupby(\"school_id\")[\"attendance_rate\"].resample(\"W\").mean()\n",
"\n",
"first_week = weekly.loc[:, \"2025-01-06\":\"2025-01-12\"]\n",
"last_week = weekly.loc[:, \"2025-03-24\":\"2025-03-28\"]\n",
"drop_per_school = first_week.groupby(\"school_id\").mean() - last_week.groupby(\"school_id\").mean()\n",
"drop_per_school.sort_values(ascending=False)\n",
"