{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 8: Pandas Core\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/08-pandas-core.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/08-pandas-core.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-7 (the Python Basics section), especially the NumPy notebook (Part 4): pandas builds directly on NumPy arrays, and a lot of what looks new here is really a NumPy idea wearing a label.\n", "\n", "Every example in this part uses a real dataset: exam results for 2,400 students across several schools, with columns covering marks, teacher counts, school size, and whether each student continued or dropped out. Real data means real messiness, missing values included, which is exactly what makes it worth learning on.\n", "\n", "Part 9 (`09-pandas-operations.ipynb`) continues with row-wise transformations, string methods, and descriptive statistics on this same dataset.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Series and DataFrame** | The two objects every pandas operation works on |\n", "| **Reading data** | `pd.read_csv` and the first checks every dataset deserves |\n", "| **Selecting rows and columns** | `loc` vs `iloc`, and why mixing them up is the most common pandas bug |\n", "| **Boolean filtering** | The same masking idea from NumPy, applied to a labelled table |\n", "| **Missing data** | Real datasets have gaps; pandas gives you tools to see and handle them, not ignore them |\n", "| **Memory and categorical dtype** | Profile what a DataFrame costs in RAM; reduce it for columns with few distinct values |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 8 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain the relationship between a Series, a DataFrame, and a NumPy array | Sec. 1 |\n", "| 2 | Load a CSV and run the first checks any new dataset deserves | Sec. 2 |\n", "| 3 | Select columns and rows correctly with `loc` and `iloc` | Sec. 3 |\n", "| 4 | Filter rows with boolean conditions | Sec. 4 |\n", "| 5 | Add, modify, and drop columns without the inplace trap | Sec. 5 |\n", "| 6 | Sort a DataFrame by one or more columns | Sec. 6 |\n", "| 7 | Find, count, and handle missing values | Sec. 7 |\n", "| 8 | Profile memory usage and reduce it with the categorical dtype | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. The Tool That Thinks in Tables\n", "\n", "You have a CSV. It has 2,400 rows, 10 columns, and three questions from your manager: which schools have the highest pass rate? Is there a relationship between teacher count and average score? How did results change between semesters?\n", "\n", "NumPy can store the numbers. A plain Python dict can hold the column names. But neither of them has a concept of a *labelled row*, a *named column*, or a *group-by operation*. You would spend more time writing plumbing than answering the questions.\n", "\n", "**pandas** ([pandas.pydata.org](https://pandas.pydata.org)) was created at AQR Capital in 2008 by Wes McKinney to bring the R data frame into Python. The idea was to give Python scientists a two-dimensional, labelled data structure with built-in operations for filtering, grouping, joining, and reshaping — the things you actually do with tabular data, every day. It became the de-facto standard for data analysis in Python and the input format expected by every ML library in the ecosystem.\n", "\n", "### How it compares to alternatives\n", "\n", "| Library | Strengths | When to reach for it instead |\n", "| --- | --- | --- |\n", "| **pandas** ([pandas.pydata.org](https://pandas.pydata.org)) | Mature, enormous ecosystem, readable API | Your default for anything up to a few million rows |\n", "| **Polars** ([pola.rs](https://pola.rs)) | Rust core, multi-threaded, lazy evaluation | Datasets too large for pandas to fit comfortably in RAM; see Part 11 |\n", "| **Dask** ([dask.org](https://dask.org)) | Distributed pandas on a cluster | Multi-machine computation, out-of-core data |\n", "| **Modin** ([modin.readthedocs.io](https://modin.readthedocs.io)) | pandas API, parallel execution | Parallelism without changing your pandas code |\n", "| **cuDF** ([docs.rapids.ai](https://docs.rapids.ai/api/cudf/stable)) | GPU-accelerated pandas | Large datasets on NVIDIA GPUs |\n", "\n", "For most data science and ML work pandas is the right starting point. You will meet Polars in Part 11, when the scenario specifically calls for speed at scale.\n", "\n", "### Already in your environment\n", "\n", "pandas is in `pyproject.toml`. For a standalone project:\n", "\n", "```bash\n", "uv add pandas # or: pip install pandas\n", "```\n", "\n", "Official docs: [pandas.pydata.org/docs](https://pandas.pydata.org/docs/)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Series and DataFrame\n", "\n", "pandas gives you two data structures, and almost everything else in the library is a method on one of them. A **Series** is a one-dimensional, labelled array: a NumPy array with an index attached. A **DataFrame** is a two-dimensional table: a collection of Series that all share the same index, one Series per column." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# A Series is a NumPy array plus a label for every position.\n", "marks = pd.Series([62, 78, 91], index=[\"s001\", \"s002\", \"s003\"])\n", "print(marks)\n", "print(f\"underlying NumPy array: {marks.to_numpy()}\")" ] }, { "cell_type": "raw", "id": "7", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Relationship: Series are columns, DataFrame shares one Index**\n", "\n", "```{mermaid}\n", "flowchart TB\n", " subgraph S [\"Series (1-D)\"]\n", " I1[\"Index: 0\n", "1\n", "2\"] --- V1[\"Values: 85.0\n", "72.0\n", "91.0\"]\n", " LB[\"name: midterm_score\"]\n", " end\n", " subgraph DF [\"DataFrame (2-D)\"]\n", " direction LR\n", " COL1[\"Series: midterm_score\"]\n", " COL2[\"Series: final_score\"]\n", " COL3[\"Series: program\"]\n", " IDX[\"shared Index: S0001, S0002, S0003\"]\n", " end\n", " S -->|\"pd.DataFrame({'mid': s1, 'fin': s2})\"| DF\n", "\n", " style DF fill:#EAF3FA,stroke:#0369A1\n", " style S fill:#EBF5F0,stroke:#059669\n", "```\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "A DataFrame is what you get when several Series that share an index sit next to each other as columns. Building one from a dict of equal-length arrays, the same pattern used to build the feature matrices in the NumPy notebook, makes this explicit:" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame(\n", " {\n", " \"student_id\": [\"s001\", \"s002\", \"s003\"],\n", " \"midterm_score\": [62, 78, 91],\n", " \"gender\": [\"F\", \"M\", \"F\"],\n", " }\n", ")\n", "data" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "
Series sharing the DataFrame's index. data[\"midterm_score\"] returns that column as a standalone Series, with the same index, the same values, and every NumPy method still available on it through .to_numpy(). This is why the array skills from the NumPy notebook keep paying off here: a pandas column is a NumPy array with row labels attached, not a different kind of thing.\n",
".shape, .dtypes, and .describe() take seconds and catch most of the embarrassing mistakes early: a column pandas read as text when it should be numeric, a date stored as a string, a column that is almost entirely one value. Cheaper to catch here than three analysis steps later.\n",
"df.columns, then print how many unique values the caste\n",
"column has using .nunique().\n",
"df.columns\n",
"df[\"program\"].nunique()\n",
"# -> 4\n",
".loc and .iloc as interchangeabledf.loc[0:2] and df.iloc[0:2] can return a different number of rows. .iloc slicing behaves exactly like a Python list: the end is excluded. .loc slicing is inclusive of the end label, because that label might not even be an integer; it could be a date or a name. The two only happen to look similar when the index is the default 0, 1, 2, ... range, which is exactly when this mistake hides best.\n",
".iloc, select rows 10 through 14 (5 rows) and only the student_id and\n",
"project_score columns.\n",
"df.iloc[10:15, [?, ?]]\n",
"# Hint: find the integer position of each column with df.columns.get_loc(...)\n",
"gender == \"F\") scored above 80 in final_score.\n",
"condition = (df[\"gender\"] == \"F\") & (df[\"final_score\"] > 80)\n",
"condition.sum()\n",
"inplace=True by defaultdf.drop(\"has_internet\", axis=1, inplace=True) looks like it saves memory by not creating a copy. It does not, in most pandas versions it builds the new data and discards the old either way, and it makes code harder to debug: a function that mutates its input instead of returning a new one is a common source of \"why did my DataFrame change\" bugs. Prefer df = df.drop(...), explicit and easy to trace.\n",
"df[\"average_marks\"] = ... modifies df through __setitem__, not through a copy, so it still does what you expect.df[df[\"gender\"] == \"F\"][\"passed\"] = True looks like it updates df, but the inner df[mask] now returns a copy. The assignment goes to that temporary copy and is immediately lost. The fix is one step: df.loc[df[\"gender\"] == \"F\", \"passed\"] = True.inplace=True (above) is being phased out — explicit reassignment (df = df.drop(...)) is both clearer and CoW-compatible. Part 9 shows the same rule applied to groupby and transform.\n",
"passed that is True when average_marks is at least 0.5,\n",
"using np.where the same way it was used in the NumPy notebook.\n",
"df[\"passed\"] = np.where(df[\"average_marks\"] >= 0.5, True, False)\n",
"df[\"passed\"].value_counts()\n",
"total_toilets value might mean the school never reported it, which is different from meaning zero. Filling with the column mean (df[\"total_toilets\"].fillna(df[\"total_toilets\"].mean())) assumes the missing schools look like the average school. Dropping those rows assumes they are not worth analysing at all. Both are real decisions with real consequences, not a technicality to get past quickly.\n",
"total_toilets values with the column's median instead of dropping those rows,\n",
"and confirm there are no missing values left in that column afterward.\n",
"median_toilets = df[\"total_toilets\"].median()\n",
"df[\"total_toilets\"] = df[\"total_toilets\"].fillna(median_toilets)\n",
"df[\"total_toilets\"].isna().sum() # -> 0\n",
"object or str column stores a Python string object per row: the string plus Python object overhead. A numeric column stores a flat NumPy array: 8 bytes per row for float64, 4 for float32. A column like program with only 4 distinct values across 2,400 rows wastes most of that memory repeating the same strings. The categorical dtype solves this by storing each unique string once and keeping only a compact integer code per row.\n",
"category dtype right after loading, before any analysis. It speeds up groupby and value_counts and reduces memory. The one time to undo it: pd.get_dummies() (Part 9) works on category columns but produces the same output, so it is not strictly required, just worth knowing the pattern.\n",
"memory_usage(deep=True).sum(), convert the gender and guardian columns to category dtype, and print the new total. What percentage of memory did you save?\n",
"df_fresh = pd.read_csv(\"data/university_analytics.csv\")\n",
"before_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2\n",
"df_fresh[\"gender\"] = df_fresh[\"gender\"].astype(\"category\")\n",
"df_fresh[\"guardian\"] = df_fresh[\"guardian\"].astype(\"category\")\n",
"after_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2\n",
"print(f\"Before: {before_mb:.3f} MB, After: {after_mb:.3f} MB\")\n",
"print(f\"Saving: {(1 - after_mb/before_mb)*100:.1f}%\")\n",
"total_toilets values with the column median (Sec. 7)internet == 1school_id and compute the mean average_marks per school (a preview of Part 3's\n",
"groupby, used here in its simplest one-line form)with_internet = df[df[\"has_internet\"] == 1]\n",
"school_means = with_internet.groupby(\"school_id\")[\"average_marks\"].mean()\n",
"school_means.sort_values(ascending=False).head(5)\n",
"