{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 8: Pandas Core\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/08-pandas-core.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/08-pandas-core.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-7 (the Python Basics section), especially the NumPy notebook (Part 4): pandas builds directly on NumPy arrays, and a lot of what looks new here is really a NumPy idea wearing a label.\n", "\n", "Every example in this part uses a real dataset: exam results for 2,400 students across several schools, with columns covering marks, teacher counts, school size, and whether each student continued or dropped out. Real data means real messiness, missing values included, which is exactly what makes it worth learning on.\n", "\n", "Part 9 (`09-pandas-operations.ipynb`) continues with row-wise transformations, string methods, and descriptive statistics on this same dataset.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Series and DataFrame** | The two objects every pandas operation works on |\n", "| **Reading data** | `pd.read_csv` and the first checks every dataset deserves |\n", "| **Selecting rows and columns** | `loc` vs `iloc`, and why mixing them up is the most common pandas bug |\n", "| **Boolean filtering** | The same masking idea from NumPy, applied to a labelled table |\n", "| **Missing data** | Real datasets have gaps; pandas gives you tools to see and handle them, not ignore them |\n", "| **Memory and categorical dtype** | Profile what a DataFrame costs in RAM; reduce it for columns with few distinct values |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 8 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain the relationship between a Series, a DataFrame, and a NumPy array | Sec. 1 |\n", "| 2 | Load a CSV and run the first checks any new dataset deserves | Sec. 2 |\n", "| 3 | Select columns and rows correctly with `loc` and `iloc` | Sec. 3 |\n", "| 4 | Filter rows with boolean conditions | Sec. 4 |\n", "| 5 | Add, modify, and drop columns without the inplace trap | Sec. 5 |\n", "| 6 | Sort a DataFrame by one or more columns | Sec. 6 |\n", "| 7 | Find, count, and handle missing values | Sec. 7 |\n", "| 8 | Profile memory usage and reduce it with the categorical dtype | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. The Tool That Thinks in Tables\n", "\n", "You have a CSV. It has 2,400 rows, 10 columns, and three questions from your manager: which schools have the highest pass rate? Is there a relationship between teacher count and average score? How did results change between semesters?\n", "\n", "NumPy can store the numbers. A plain Python dict can hold the column names. But neither of them has a concept of a *labelled row*, a *named column*, or a *group-by operation*. You would spend more time writing plumbing than answering the questions.\n", "\n", "**pandas** ([pandas.pydata.org](https://pandas.pydata.org)) was created at AQR Capital in 2008 by Wes McKinney to bring the R data frame into Python. The idea was to give Python scientists a two-dimensional, labelled data structure with built-in operations for filtering, grouping, joining, and reshaping — the things you actually do with tabular data, every day. It became the de-facto standard for data analysis in Python and the input format expected by every ML library in the ecosystem.\n", "\n", "### How it compares to alternatives\n", "\n", "| Library | Strengths | When to reach for it instead |\n", "| --- | --- | --- |\n", "| **pandas** ([pandas.pydata.org](https://pandas.pydata.org)) | Mature, enormous ecosystem, readable API | Your default for anything up to a few million rows |\n", "| **Polars** ([pola.rs](https://pola.rs)) | Rust core, multi-threaded, lazy evaluation | Datasets too large for pandas to fit comfortably in RAM; see Part 11 |\n", "| **Dask** ([dask.org](https://dask.org)) | Distributed pandas on a cluster | Multi-machine computation, out-of-core data |\n", "| **Modin** ([modin.readthedocs.io](https://modin.readthedocs.io)) | pandas API, parallel execution | Parallelism without changing your pandas code |\n", "| **cuDF** ([docs.rapids.ai](https://docs.rapids.ai/api/cudf/stable)) | GPU-accelerated pandas | Large datasets on NVIDIA GPUs |\n", "\n", "For most data science and ML work pandas is the right starting point. You will meet Polars in Part 11, when the scenario specifically calls for speed at scale.\n", "\n", "### Already in your environment\n", "\n", "pandas is in `pyproject.toml`. For a standalone project:\n", "\n", "```bash\n", "uv add pandas # or: pip install pandas\n", "```\n", "\n", "Official docs: [pandas.pydata.org/docs](https://pandas.pydata.org/docs/)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Series and DataFrame\n", "\n", "pandas gives you two data structures, and almost everything else in the library is a method on one of them. A **Series** is a one-dimensional, labelled array: a NumPy array with an index attached. A **DataFrame** is a two-dimensional table: a collection of Series that all share the same index, one Series per column." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# A Series is a NumPy array plus a label for every position.\n", "marks = pd.Series([62, 78, 91], index=[\"s001\", \"s002\", \"s003\"])\n", "print(marks)\n", "print(f\"underlying NumPy array: {marks.to_numpy()}\")" ] }, { "cell_type": "raw", "id": "7", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Relationship: Series are columns, DataFrame shares one Index**\n", "\n", "```{mermaid}\n", "flowchart TB\n", " subgraph S [\"Series (1-D)\"]\n", " I1[\"Index: 0\n", "1\n", "2\"] --- V1[\"Values: 85.0\n", "72.0\n", "91.0\"]\n", " LB[\"name: midterm_score\"]\n", " end\n", " subgraph DF [\"DataFrame (2-D)\"]\n", " direction LR\n", " COL1[\"Series: midterm_score\"]\n", " COL2[\"Series: final_score\"]\n", " COL3[\"Series: program\"]\n", " IDX[\"shared Index: S0001, S0002, S0003\"]\n", " end\n", " S -->|\"pd.DataFrame({'mid': s1, 'fin': s2})\"| DF\n", "\n", " style DF fill:#EAF3FA,stroke:#0369A1\n", " style S fill:#EBF5F0,stroke:#059669\n", "```\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "A DataFrame is what you get when several Series that share an index sit next to each other as columns. Building one from a dict of equal-length arrays, the same pattern used to build the feature matrices in the NumPy notebook, makes this explicit:" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame(\n", " {\n", " \"student_id\": [\"s001\", \"s002\", \"s003\"],\n", " \"midterm_score\": [62, 78, 91],\n", " \"gender\": [\"F\", \"M\", \"F\"],\n", " }\n", ")\n", "data" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "
\n", " Key Concept: A DataFrame is a dict of same-length Series

\n", "Every column of a DataFrame is a Series sharing the DataFrame's index. data[\"midterm_score\"] returns that column as a standalone Series, with the same index, the same values, and every NumPy method still available on it through .to_numpy(). This is why the array skills from the NumPy notebook keep paying off here: a pandas column is a NumPy array with row labels attached, not a different kind of thing.\n", "
" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## 2. Reading and Inspecting Data\n", "\n", "`pd.read_csv` is almost always the first line of a real analysis. The first thing to do with whatever it returns is never to start analysing it: it is to check what you actually got." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/university_analytics.csv\")\n", "print(f\"shape : {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "`.dtypes` shows what type pandas inferred for every column, `.info()` adds memory usage and non-null counts in one call, and `.describe()` summarises every numeric column at once:" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "df.describe().T" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "
\n", " Pro Tip: Run these three checks on every new dataset, before anything else

\n", ".shape, .dtypes, and .describe() take seconds and catch most of the embarrassing mistakes early: a column pandas read as text when it should be numeric, a date stored as a string, a column that is almost entirely one value. Cheaper to catch here than three analysis steps later.\n", "
" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "
\n", " Activity 1 - First Look

\n", "\n", "Goal: Print the column names with df.columns, then print how many unique values the caste\n", "column has using .nunique().\n", "
df.columns\n",
    "df[\"program\"].nunique()\n",
    "# -> 4
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "# TODO: print df.columns, then the number of unique values in \"program\"\n", "..." ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "## 3. Selecting Rows and Columns\n", "\n", "`df[\"col\"]` selects one column as a Series; `df[[\"col1\", \"col2\"]]` selects several columns as a DataFrame, note the double brackets. Selecting rows by label or by position needs `.loc` or `.iloc`, and the two are not interchangeable." ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "one_column = df[\"midterm_score\"]\n", "several_columns = df[[\"student_id\", \"midterm_score\", \"final_score\"]]\n", "print(type(one_column), type(several_columns))" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "`.loc` selects by **label**: the value in the index, or a column name. `.iloc` selects by **integer position**, exactly like a NumPy array, regardless of what the labels actually are:" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "# .loc: by label. Row label 0, column label \"midterm_score\".\n", "print(df.loc[0, \"midterm_score\"])\n", "\n", "# .iloc: by position. First row, second column, whatever they are labelled.\n", "print(df.iloc[0, 1])\n", "\n", "# A label-based row slice with .loc is inclusive of both ends.\n", "df.loc[0:2, [\"student_id\", \"midterm_score\"]]" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "
\n", " Common Mistake: Treating .loc and .iloc as interchangeable

\n", "df.loc[0:2] and df.iloc[0:2] can return a different number of rows. .iloc slicing behaves exactly like a Python list: the end is excluded. .loc slicing is inclusive of the end label, because that label might not even be an integer; it could be a date or a name. The two only happen to look similar when the index is the default 0, 1, 2, ... range, which is exactly when this mistake hides best.\n", "
" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "
\n", " Activity 2 - Select a Slice

\n", "\n", "Goal: Using .iloc, select rows 10 through 14 (5 rows) and only the student_id and\n", "project_score columns.\n", "
df.iloc[10:15, [?, ?]]\n",
    "# Hint: find the integer position of each column with df.columns.get_loc(...)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "# TODO: select rows 10-14 and the student_id/project_score columns using .iloc\n", "..." ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "## 4. Boolean Filtering\n", "\n", "This is the same trick from the NumPy notebook's boolean masking section, applied to a DataFrame instead of an array: a comparison produces a Series of `True`/`False`, and indexing with that Series keeps only the `True` rows." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "high_scorers = df[\"midterm_score\"] > 90\n", "print(f\"high scorers: {high_scorers.sum()} of {len(df)} students\")\n", "\n", "df[high_scorers].head()" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "Combine conditions with `&` and `|`, the same operators and the same parentheses requirement as NumPy, not Python's `and`/`or`:" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "struggling_without_internet = (df[\"final_grade\"] == \"F\") & (~df[\"has_internet\"])\n", "print(f\"failed students without internet: {struggling_without_internet.sum()}\")\n", "\n", "df.loc[struggling_without_internet, [\"student_id\", \"final_grade\", \"has_internet\"]].head()" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Activity 3 - Female High Scorers

\n", "\n", "Goal: Count how many female students (gender == \"F\") scored above 80 in final_score.\n", "
condition = (df[\"gender\"] == \"F\") & (df[\"final_score\"] > 80)\n",
    "condition.sum()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "# TODO: count female students with final_score > 80\n", "..." ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "## 5. Adding, Modifying, and Dropping Columns\n", "\n", "Assigning to a new column name creates it; assigning to an existing one overwrites it. Both work the same way and both operate on the whole column at once, no loop required:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "df[[\"student_id\", \"average_marks\"]].head()" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "`.drop()` removes rows or columns. `axis=1` means columns, `axis=0` (the default) means rows, the same convention as NumPy's `axis` parameter:" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "# axis=1: drop a column. Returns a NEW DataFrame; df itself is unchanged\n", "# unless you reassign it.\n", "without_internet_col = df.drop(\"has_internet\", axis=1)\n", "print(f\"original columns : {len(df.columns)}\")\n", "print(f\"after drop : {len(without_internet_col.columns)}\")" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "
\n", " Common Mistake: Reaching for inplace=True by default

\n", "df.drop(\"has_internet\", axis=1, inplace=True) looks like it saves memory by not creating a copy. It does not, in most pandas versions it builds the new data and discards the old either way, and it makes code harder to debug: a function that mutates its input instead of returning a new one is a common source of \"why did my DataFrame change\" bugs. Prefer df = df.drop(...), explicit and easy to trace.\n", "
" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "
\n", " Key Concept: Copy-on-Write is the default in pandas 3

\n", "Pandas 3 enables Copy-on-Write (CoW) by default. Every operation on a subset of a DataFrame returns a new independent object instead of a view into the original. Two rules follow directly:

\n", "
    \n", "
  1. Direct column assignment works fine: df[\"average_marks\"] = ... modifies df through __setitem__, not through a copy, so it still does what you expect.
  2. \n", "
  3. Chained assignment is silently dropped: df[df[\"gender\"] == \"F\"][\"passed\"] = True looks like it updates df, but the inner df[mask] now returns a copy. The assignment goes to that temporary copy and is immediately lost. The fix is one step: df.loc[df[\"gender\"] == \"F\", \"passed\"] = True.
  4. \n", "
\n", "CoW is why inplace=True (above) is being phased out — explicit reassignment (df = df.drop(...)) is both clearer and CoW-compatible. Part 9 shows the same rule applied to groupby and transform.\n", "
" ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "
\n", " Activity 4 - Add a Pass/Fail Column

\n", "\n", "Goal: Add a new column passed that is True when average_marks is at least 0.5,\n", "using np.where the same way it was used in the NumPy notebook.\n", "
df[\"passed\"] = np.where(df[\"average_marks\"] >= 0.5, True, False)\n",
    "df[\"passed\"].value_counts()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "40", "metadata": {}, "outputs": [], "source": [ "# TODO: add the \"passed\" column\n", "..." ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "## 6. Sorting\n", "\n", "`.sort_values()` sorts by one or more columns; `.sort_index()` sorts by the index instead. Neither changes `df` unless you reassign it, same rule as every method so far:" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "top_5 = df.sort_values(\"average_marks\", ascending=False).head(5)\n", "top_5[[\"student_id\", \"average_marks\"]]" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "# Sort by multiple columns: gender first, then average_marks descending within each\n", "by_gender_then_marks = df.sort_values([\"gender\", \"average_marks\"], ascending=[True, False])\n", "by_gender_then_marks[[\"student_id\", \"gender\", \"average_marks\"]].head(6)" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "## 7. Handling Missing Data\n", "\n", "`total_toilets` and `establishment_year` are missing for some schools in this dataset, encoded as `NaN`. `.isna()` finds them, `.dropna()` removes rows that have them, and `.fillna()` replaces them, three different decisions with three different consequences for an analysis." ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "missing_per_column = df.isna().sum()\n", "print(missing_per_column[missing_per_column > 0])" ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "Dropping every row with any missing value is the simplest option and often the wrong one: it can throw away far more data than the missing values themselves justify:" ] }, { "cell_type": "code", "execution_count": null, "id": "47", "metadata": {}, "outputs": [], "source": [ "print(f\"rows before dropna : {len(df)}\")\n", "print(f\"rows after dropna : {len(df.dropna())}\")\n", "print(f\"rows lost : {len(df) - len(df.dropna())}\")" ] }, { "cell_type": "markdown", "id": "48", "metadata": {}, "source": [ "
\n", " Pro Tip: Decide what missing means before you decide how to handle it

\n", "A missing total_toilets value might mean the school never reported it, which is different from meaning zero. Filling with the column mean (df[\"total_toilets\"].fillna(df[\"total_toilets\"].mean())) assumes the missing schools look like the average school. Dropping those rows assumes they are not worth analysing at all. Both are real decisions with real consequences, not a technicality to get past quickly.\n", "
" ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "
\n", " Activity 5 - Fill With the Median

\n", "\n", "Goal: Fill the missing total_toilets values with the column's median instead of dropping those rows,\n", "and confirm there are no missing values left in that column afterward.\n", "
median_toilets = df[\"total_toilets\"].median()\n",
    "df[\"total_toilets\"] = df[\"total_toilets\"].fillna(median_toilets)\n",
    "df[\"total_toilets\"].isna().sum()  # -> 0
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "# TODO: fill missing total_toilets with the column median\n", "..." ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "## 8. Memory Usage and the Categorical Dtype\n", "\n", "A 2,400-row dataset fits in memory with ease, but the patterns here scale to millions of rows where they matter. Understanding what pandas keeps in memory and how to reduce it is a practical skill, not an optimization curiosity.\n", "\n", "`df.memory_usage(deep=True)` shows how many bytes each column occupies. `deep=True` is required for string columns: without it, pandas reports only the object pointer size, not the actual string bytes." ] }, { "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [], "source": [ "mem = df.memory_usage(deep=True)\n", "mem_mb = (mem / 1024**2).round(3)\n", "print(mem_mb.sort_values(ascending=False))\n", "print(f\"\\nTotal: {mem_mb.sum():.3f} MB\")" ] }, { "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "
\n", " Key Concept: String columns dominate memory in most DataFrames

\n", "An object or str column stores a Python string object per row: the string plus Python object overhead. A numeric column stores a flat NumPy array: 8 bytes per row for float64, 4 for float32. A column like program with only 4 distinct values across 2,400 rows wastes most of that memory repeating the same strings. The categorical dtype solves this by storing each unique string once and keeping only a compact integer code per row.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "54", "metadata": {}, "outputs": [], "source": [ "# Compare memory: str dtype vs category dtype\n", "before = df[\"program\"].memory_usage(deep=True)\n", "df[\"program\"] = df[\"program\"].astype(\"category\")\n", "after = df[\"program\"].memory_usage(deep=True)\n", "\n", "print(f\"str dtype : {before:,} bytes\")\n", "print(f\"category : {after:,} bytes\")\n", "print(f\"reduction : {(1 - after / before) * 100:.1f}%\")" ] }, { "cell_type": "code", "execution_count": null, "id": "55", "metadata": {}, "outputs": [], "source": [ "# Any low-cardinality string column is a candidate for category dtype\n", "low_cardinality = [c for c in df.select_dtypes(\"object\").columns if df[c].nunique() < 50]\n", "print(\"Low-cardinality columns:\", low_cardinality)\n", "\n", "for col in low_cardinality:\n", " df[col] = df[col].astype(\"category\")\n", "\n", "print(f\"Memory after converting all: {df.memory_usage(deep=True).sum() / 1024**2:.3f} MB\")" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "
\n", " Pro Tip: Use category dtype early; undo it before one-hot encoding

\n", "Convert low-cardinality string columns to category dtype right after loading, before any analysis. It speeds up groupby and value_counts and reduces memory. The one time to undo it: pd.get_dummies() (Part 9) works on category columns but produces the same output, so it is not strictly required, just worth knowing the pattern.\n", "
" ] }, { "cell_type": "markdown", "id": "57", "metadata": {}, "source": [ "
\n", " Activity 6 - Measure and Reduce Memory

\n", "\n", "Goal: Load the dataset fresh, print its total memory usage in MB with memory_usage(deep=True).sum(), convert the gender and guardian columns to category dtype, and print the new total. What percentage of memory did you save?\n", "
df_fresh = pd.read_csv(\"data/university_analytics.csv\")\n",
    "before_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2\n",
    "df_fresh[\"gender\"] = df_fresh[\"gender\"].astype(\"category\")\n",
    "df_fresh[\"guardian\"] = df_fresh[\"guardian\"].astype(\"category\")\n",
    "after_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2\n",
    "print(f\"Before: {before_mb:.3f} MB, After: {after_mb:.3f} MB\")\n",
    "print(f\"Saving: {(1 - after_mb/before_mb)*100:.1f}%\")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "58", "metadata": {}, "outputs": [], "source": [ "# TODO: measure and reduce memory as described above\n", "..." ] }, { "cell_type": "markdown", "id": "59", "metadata": {}, "source": [ "## Capstone: Top Schools Report\n", "\n", "Combine everything from this notebook: select, filter, add a column, sort, and handle missing data, to answer one question. Which schools have the highest average student performance, among schools with reasonable internet access?" ] }, { "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Top Schools by Performance

\n", "\n", "Goal:\n", "
    \n", "
  1. Fill missing total_toilets values with the column median (Sec. 7)
  2. \n", "
  3. Filter to students with internet == 1
  4. \n", "
  5. Group by school_id and compute the mean average_marks per school (a preview of Part 3's\n", "groupby, used here in its simplest one-line form)
  6. \n", "
  7. Sort the result descending and show the top 5 school IDs
  8. \n", "
\n", "
with_internet = df[df[\"has_internet\"] == 1]\n",
    "school_means = with_internet.groupby(\"school_id\")[\"average_marks\"].mean()\n",
    "school_means.sort_values(ascending=False).head(5)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "61", "metadata": {}, "outputs": [], "source": [ "# TODO: build the top-5-schools report described above\n", "..." ] }, { "cell_type": "markdown", "id": "62", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| McKinney, W. (2022). *Python for Data Analysis*, 3rd ed. O'Reilly. | The book by pandas' creator; free online at [wesmckinney.com/book](https://wesmckinney.com/book) — Chapters 5 and 6 map directly to this notebook |\n", "| [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html) | The official fast-start guide; read it after this notebook to see everything in one place |\n", "| [pandas 3.0 migration guide](https://pandas.pydata.org/docs/whatsnew/v3.0.0.html) | Covers Copy-on-Write and the new `str` dtype — the two biggest breaking changes from pandas 2 |\n", "| Harris, C.R. et al. (2020). [Array programming with NumPy](https://doi.org/10.1038/s41586-020-2649-2). *Nature* 585, 357–362. | pandas DataFrames are built on NumPy arrays; understanding arrays makes every pandas index operation predictable |\n" ] }, { "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| Series | A NumPy array with a label for every position |\n", "| DataFrame | A collection of same-length Series sharing one index, one Series per column |\n", "| `pd.read_csv` | Always follow with `.shape`, `.dtypes`, `.describe()` before analysing anything |\n", "| `df[\"col\"]` vs `df[[\"col\"]]` | Single brackets return a Series, double brackets return a DataFrame |\n", "| `.loc` | Selects by label; slices are inclusive of the end label |\n", "| `.iloc` | Selects by integer position; slices exclude the end, like a Python list |\n", "| Boolean filtering | `(cond1) & (cond2)`, parentheses required, same as NumPy |\n", "| Adding columns | `df[\"new\"] = ...` operates on the whole column, no loop needed |\n", "| `inplace=True` | Avoid it; reassign with `df = df.method(...)` instead |\n", "| `.sort_values()` | Pass a list of columns to sort by more than one, with a matching list for `ascending` |\n", "| Missing data | `.isna()` to find it, `.dropna()`/`.fillna()` to handle it, decide what missing means first |\n", "| `memory_usage(deep=True)` | Shows true memory cost per column, including string bytes; use before optimising |\n", "| `.astype(\"category\")` | Stores each unique value once + an integer code per row; cuts memory for low-cardinality columns |\n", "| Low-cardinality threshold | A rule of thumb: columns with fewer than 50 distinct values are worth converting |\n", "\n", "**Next:** `09-pandas-operations.ipynb`, covering row-wise transformations with `apply`, string methods, and descriptive statistics on this same dataset." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }