{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "b221a566-8a04-4689-8eb1-c266ede5a264", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "import numpy as np\n", "import pandas as pd\n", "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "id": "186ba947", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "aa35406e-c73d-49f1-aa84-5cc5ced6c294", "metadata": {}, "source": [ "# Advanced Pandas Techniques" ] }, { "cell_type": "markdown", "id": "64ccbdd4-f4c6-4893-80e9-7d87595020d2", "metadata": {}, "source": [ "## Overview\n", "\n", "In this section, we'll continue to introduce combining datasets: concat, merge and join along with data aggregation and grouping." ] }, { "cell_type": "markdown", "id": "9edbb0f3", "metadata": {}, "source": [ "## Combining datasets: concat, merge and join\n", "\n", "### concat\n", "\n", "- Concatenate Pandas objects along a particular axis.\n", "\n", "- Allows optional set logic along the other axes.\n", "\n", "- Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.\n", "\n", "For example:\n", "\n", "Combine two `Series`." 
] }, { "cell_type": "code", "execution_count": 2, "id": "b08dcc94", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "0 c\n", "1 d\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(['a', 'b'])\n", "s2 = pd.Series(['c', 'd'])\n", "pd.concat([s1, s2])" ] }, { "cell_type": "markdown", "id": "b1c47e7c", "metadata": {}, "source": [ "Clear the existing index and reset it in the result by setting the `ignore_index` option to `True`." ] }, { "cell_type": "code", "execution_count": 3, "id": "32049abb", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 d\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "31f73f90", "metadata": {}, "source": [ "Add a hierarchical index at the outermost level of the data with the `keys` option." ] }, { "cell_type": "code", "execution_count": 4, "id": "d5b95507", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "s1 0 a\n", " 1 b\n", "s2 0 c\n", " 1 d\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], keys=['s1', 's2'])" ] }, { "cell_type": "markdown", "id": "9c618012", "metadata": {}, "source": [ "Label the index keys you create with the `names` option." 
] }, { "cell_type": "code", "execution_count": 5, "id": "6d54830d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "Series name Row ID\n", "s1 0 a\n", " 1 b\n", "s2 0 c\n", " 1 d\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], keys=['s1', 's2'],\n", " names=['Series name', 'Row ID'])" ] }, { "cell_type": "markdown", "id": "31fac69f", "metadata": {}, "source": [ "Combine two `DataFrame` objects with identical columns." ] }, { "cell_type": "code", "execution_count": 6, "id": "fec72294", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame([['a', 1], ['b', 2]],\n", " columns=['letter', 'number'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 7, "id": "80a1f5b0", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame([['c', 3], ['d', 4]],\n", " columns=['letter', 'number'])\n", "df2" ] }, { "cell_type": "code", "execution_count": 8, "id": "4e9e65f6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2])" ] }, { "cell_type": "markdown", "id": "49d878b5", "metadata": {}, "source": [ "Combine `DataFrame` objects with overlapping columns and return everything. Columns outside the intersection will be filled with `NaN` values." ] }, { "cell_type": "code", "execution_count": 9, "id": "f50e8ede", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimal
0c3cat
1d4dog
\n", "
" ], "text/plain": [ " letter number animal\n", "0 c 3 cat\n", "1 d 4 dog" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],\n", " columns=['letter', 'number', 'animal'])\n", "df3" ] }, { "cell_type": "code", "execution_count": 10, "id": "9def1cdd", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimal
0a1NaN
1b2NaN
0c3cat
1d4dog
\n", "
" ], "text/plain": [ " letter number animal\n", "0 a 1 NaN\n", "1 b 2 NaN\n", "0 c 3 cat\n", "1 d 4 dog" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df3], sort=False)" ] }, { "cell_type": "markdown", "id": "6f2fcb0c", "metadata": {}, "source": [ "Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument." ] }, { "cell_type": "code", "execution_count": 11, "id": "ef69d51c", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df3], join=\"inner\")" ] }, { "cell_type": "markdown", "id": "0fda5cf5", "metadata": {}, "source": [ "Combine `DataFrame` objects horizontally along the x-axis by passing in `axis=1`." ] }, { "cell_type": "code", "execution_count": 12, "id": "2159161d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimalname
0a1birdpolly
1b2monkeygeorge
\n", "
" ], "text/plain": [ " letter number animal name\n", "0 a 1 bird polly\n", "1 b 2 monkey george" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],\n", " columns=['animal', 'name'])\n", "pd.concat([df1, df4], axis=1)" ] }, { "cell_type": "markdown", "id": "adb11ea6", "metadata": {}, "source": [ "Prevent the result from including duplicate index values with the `verify_integrity` option." ] }, { "cell_type": "code", "execution_count": 13, "id": "45bea28a", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
a1
\n", "
" ], "text/plain": [ " 0\n", "a 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame([1], index=['a'])\n", "df5" ] }, { "cell_type": "code", "execution_count": 14, "id": "db871526", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
a2
\n", "
" ], "text/plain": [ " 0\n", "a 2" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6 = pd.DataFrame([2], index=['a'])\n", "df6" ] }, { "cell_type": "code", "execution_count": null, "id": "1ab6b3b0", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "pd.concat([df5, df6], verify_integrity=True)" ] }, { "cell_type": "markdown", "id": "90fc36d4", "metadata": {}, "source": [ "Append a single row to the end of a `DataFrame` object." ] }, { "cell_type": "code", "execution_count": null, "id": "007c1ed6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012
\n", "
" ], "text/plain": [ " a b\n", "0 1 2" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])\n", "df7" ] }, { "cell_type": "code", "execution_count": null, "id": "9dbaddff", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "a 3\n", "b 4\n", "dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_row = pd.Series({'a': 3, 'b': 4})\n", "new_row" ] }, { "cell_type": "code", "execution_count": null, "id": "ad2d1313", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012
134
\n", "
" ], "text/plain": [ " a b\n", "0 1 2\n", "1 3 4" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df7, new_row.to_frame().T], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "39223d1c", "metadata": {}, "source": [ ":::{note}\n", "`append()` has been deprecated since version 1.4.0: Use `concat()` instead. \n", ":::\n", "\n", "### merge\n", "\n", "- Merge DataFrame or named Series objects with a database-style join.\n", "\n", "- A named Series object is treated as a DataFrame with a single named column.\n", "\n", "- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed." ] }, { "cell_type": "markdown", "id": "c1afc536-2209-4fa1-8d63-0b19c18c66c6", "metadata": { "attributes": { "classes": [ "warning" ], "id": "" } }, "source": [ "If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results." ] }, { "cell_type": "markdown", "id": "0f2ffec1", "metadata": {}, "source": [ "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "e223179b", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n", " 'value': [1, 2, 3, 5]})\n", "df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n", " 'value': [5, 6, 7, 8]})" ] }, { "cell_type": "markdown", "id": "ee9441ec", "metadata": {}, "source": [ "Merge DataFrames `df1` and `df2` with specified left and right suffixes appended to any overlapping columns." ] }, { "cell_type": "code", "execution_count": null, "id": "e22da8fc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lkeyvalue_leftrkeyvalue_right
0foo1foo5
1foo1foo8
2foo5foo5
3foo5foo8
4bar2bar6
5baz3baz7
\n", "
" ], "text/plain": [ " lkey value_left rkey value_right\n", "0 foo 1 foo 5\n", "1 foo 1 foo 8\n", "2 foo 5 foo 5\n", "3 foo 5 foo 8\n", "4 bar 2 bar 6\n", "5 baz 3 baz 7" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))" ] }, { "cell_type": "code", "execution_count": null, "id": "6147bab8-4644-4a23-ba71-205573a1c3f9", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "5112fc3a", "metadata": {}, "source": [ "\n", "Merge DataFrames `df1` and `df2`, but raise an exception if the DataFrames have any overlapping columns." ] }, { "cell_type": "code", "execution_count": null, "id": "3dea68f6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))" ] }, { "cell_type": "markdown", "id": "86efca65", "metadata": {}, "source": [ "Using `how` parameter decide the type of merge to be performed." ] }, { "cell_type": "code", "execution_count": null, "id": "1026fc27", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n", "df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})" ] }, { "cell_type": "code", "execution_count": null, "id": "b4379cb1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0foo13
\n", "
" ], "text/plain": [ " a b c\n", "0 foo 1 3" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='inner', on='a')" ] }, { "cell_type": "code", "execution_count": null, "id": "90916930-6a8e-40e3-871e-d0043aae93d8", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2a8bb3d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0foo13.0
1bar2NaN
\n", "
" ], "text/plain": [ " a b c\n", "0 foo 1 3.0\n", "1 bar 2 NaN" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='left', on='a')" ] }, { "cell_type": "code", "execution_count": null, "id": "467da7f9-a710-442e-9fcf-afb4990ea3b0", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8951b7b9", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'left': ['foo', 'bar']})\n", "df2 = pd.DataFrame({'right': [7, 8]})" ] }, { "cell_type": "code", "execution_count": null, "id": "93051401", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
0foo7
1foo8
2bar7
3bar8
\n", "
" ], "text/plain": [ " left right\n", "0 foo 7\n", "1 foo 8\n", "2 bar 7\n", "3 bar 8" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='cross')" ] }, { "cell_type": "code", "execution_count": null, "id": "bc243059-83f7-485c-bcd0-453d611c3d1f", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "b58237c9", "metadata": {}, "source": [ "\n", "### join\n", "\n", "- Join columns of another DataFrame.\n", "\n", "- Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "5ad178d6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],\n", " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})" ] }, { "cell_type": "code", "execution_count": null, "id": "ff1aa936", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],\n", " 'B': ['B0', 'B1', 'B2']}) " ] }, { "cell_type": "markdown", "id": "3278bb56", "metadata": {}, "source": [ "Join DataFrames using their indexes." ] }, { "cell_type": "code", "execution_count": null, "id": "a2517b83", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key_callerAkey_otherB
0K0A0K0B0
1K1A1K1B1
2K2A2K2B2
3K3A3NaNNaN
4K4A4NaNNaN
5K5A5NaNNaN
\n", "
" ], "text/plain": [ " key_caller A key_other B\n", "0 K0 A0 K0 B0\n", "1 K1 A1 K1 B1\n", "2 K2 A2 K2 B2\n", "3 K3 A3 NaN NaN\n", "4 K4 A4 NaN NaN\n", "5 K5 A5 NaN NaN" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other, lsuffix='_caller', rsuffix='_other')" ] }, { "cell_type": "code", "execution_count": null, "id": "81738ab5-bc94-4264-bb43-8c64c041c332", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "59935609", "metadata": {}, "source": [ "\n", "If we want to join using the `key` columns, we need to set `key` to be the index in both `df` and `other`. The joined DataFrame will have `key` as its index." ] }, { "cell_type": "code", "execution_count": null, "id": "91c6f0f0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
key
K0A0B0
K1A1B1
K2A2B2
K3A3NaN
K4A4NaN
K5A5NaN
\n", "
" ], "text/plain": [ " A B\n", "key \n", "K0 A0 B0\n", "K1 A1 B1\n", "K2 A2 B2\n", "K3 A3 NaN\n", "K4 A4 NaN\n", "K5 A5 NaN" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.set_index('key').join(other.set_index('key'))" ] }, { "cell_type": "code", "execution_count": null, "id": "f942120e-c151-473d-aa0a-3ed6b0679204", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "1483f153", "metadata": {}, "source": [ "\n", "Another option to join using the key columns is to use the `on` parameter. `DataFrame.join` always uses `other`'s index but we can use any column in `df`. This method preserves the original DataFrame's index in the result." ] }, { "cell_type": "code", "execution_count": null, "id": "d8fbb1f7", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3NaN
4K4A4NaN
5K5A5NaN
\n", "
" ], "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K2 A2 B2\n", "3 K3 A3 NaN\n", "4 K4 A4 NaN\n", "5 K5 A5 NaN" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other.set_index('key'), on='key')" ] }, { "cell_type": "markdown", "id": "0ed06755", "metadata": {}, "source": [ "Using non-unique key values shows how they are matched." ] }, { "cell_type": "code", "execution_count": null, "id": "b4d1eb0d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyA
0K0A0
1K1A1
2K1A2
3K3A3
4K0A4
5K1A5
\n", "
" ], "text/plain": [ " key A\n", "0 K0 A0\n", "1 K1 A1\n", "2 K1 A2\n", "3 K3 A3\n", "4 K0 A4\n", "5 K1 A5" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],\n", " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})\n", "df " ] }, { "cell_type": "code", "execution_count": null, "id": "7f6bc83d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyAB
0K0A0B0
1K1A1B1
2K1A2B1
3K3A3NaN
4K0A4B0
5K1A5B1
\n", "
" ], "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K1 A2 B1\n", "3 K3 A3 NaN\n", "4 K0 A4 B0\n", "5 K1 A5 B1" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other.set_index('key'), on='key', validate='m:1')" ] }, { "cell_type": "markdown", "id": "61fb9627", "metadata": {}, "source": [ "## Aggregation and grouping\n", "\n", "Group `DataFrame` using a mapper or by a `Series` of columns.\n", "\n", "A `groupby` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "38adb2b7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Animal
Falcon375.0
Parrot25.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Animal \n", "Falcon 375.0\n", "Parrot 25.0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", " 'Parrot', 'Parrot'],\n", " 'Max Speed': [380., 370., 24., 26.]})\n", "df\n", "df.groupby(['Animal']).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "917ba231-1ee4-4f2c-bcb9-4262d7eba119", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "84fe11db", "metadata": {}, "source": [ "\n", "### Hierarchical Indexes\n", "\n", "We can `groupby` different levels of a hierarchical index using the `level` parameter:" ] }, { "cell_type": "code", "execution_count": null, "id": "5e84fd8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Animal
Falcon370.0
Parrot25.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Animal \n", "Falcon 370.0\n", "Parrot 25.0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n", " ['Captive', 'Wild', 'Captive', 'Wild']]\n", "index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n", "df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n", " index=index)\n", "df.groupby(level=0).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "8d6ff678-1c1e-4629-9e06-1874511ecdf0", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5a7a2d6a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Type
Captive210.0
Wild185.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Type \n", "Captive 210.0\n", "Wild 185.0" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(level=\"Type\").mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "31f4c668-6a8b-4dba-a6db-29673e7fbdba", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "fe08b062", "metadata": {}, "source": [ "\n", "We can also choose to include NA in group keys or not by setting `dropna` parameter, the default setting is `True`." ] }, { "cell_type": "code", "execution_count": null, "id": "f27b6536", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ac
b
1.023
2.025
\n", "
" ], "text/plain": [ " a c\n", "b \n", "1.0 2 3\n", "2.0 2 5" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]\n", "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "df.groupby(by=[\"b\"]).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "47261c15-1d74-4a39-a7bb-073f6835cbf8", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "815ba4c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ac
b
1.023
2.025
NaN14
\n", "
" ], "text/plain": [ " a c\n", "b \n", "1.0 2 3\n", "2.0 2 5\n", "NaN 1 4" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(by=[\"b\"], dropna=False).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "17c93213-8bcf-4ac8-a30d-09df48b9ca71", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "719dc004", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
a
a13.013.0
b12.3123.0
\n", "
" ], "text/plain": [ " b c\n", "a \n", "a 13.0 13.0\n", "b 12.3 123.0" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = [[\"a\", 12, 12], [None, 12.3, 33.], [\"b\", 12.3, 123], [\"a\", 1, 1]]\n", "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "df.groupby(by=\"a\").sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "ba2d22de-ed75-4d52-a6d8-badf4791429f", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cce87c6a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
a
a13.013.0
b12.3123.0
NaN12.333.0
\n", "
" ], "text/plain": [ " b c\n", "a \n", "a 13.0 13.0\n", "b 12.3 123.0\n", "NaN 12.3 33.0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(by=\"a\", dropna=False).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "70cc2217-577e-4b8c-8fc2-ce02f036622b", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "6988f12c", "metadata": {}, "source": [ "\n", "When using `.apply()`, use `group_keys` to include or exclude the group keys. The `group_keys` argument defaults to `True` (include)." ] }, { "cell_type": "code", "execution_count": null, "id": "1fa5930a", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimalMax Speed
Animal
Falcon0Falcon380.0
1Falcon370.0
Parrot2Parrot24.0
3Parrot26.0
\n", "
" ], "text/plain": [ " Animal Max Speed\n", "Animal \n", "Falcon 0 Falcon 380.0\n", " 1 Falcon 370.0\n", "Parrot 2 Parrot 24.0\n", " 3 Parrot 26.0" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", " 'Parrot', 'Parrot'],\n", " 'Max Speed': [380., 370., 24., 26.]})\n", "df.groupby(\"Animal\", group_keys=True).apply(lambda x: x)" ] }, { "cell_type": "code", "execution_count": null, "id": "67e4668e", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimalMax Speed
0Falcon380.0
1Falcon370.0
2Parrot24.0
3Parrot26.0
\n", "
" ], "text/plain": [ " Animal Max Speed\n", "0 Falcon 380.0\n", "1 Falcon 370.0\n", "2 Parrot 24.0\n", "3 Parrot 26.0" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Animal\", group_keys=False).apply(lambda x: x)" ] }, { "cell_type": "markdown", "id": "c8777695", "metadata": {}, "source": [ "## Pivot table\n", "\n", "Create a spreadsheet-style pivot table as a DataFrame.\n", "\n", "The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "c8e1b317", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
0fooonesmall12
1fooonelarge24
2fooonelarge25
3footwosmall35
4footwosmall36
5baronelarge46
6baronesmall58
7bartwosmall69
8bartwolarge79
\n", "
" ], "text/plain": [ " A B C D E\n", "0 foo one small 1 2\n", "1 foo one large 2 4\n", "2 foo one large 2 5\n", "3 foo two small 3 5\n", "4 foo two small 3 6\n", "5 bar one large 4 6\n", "6 bar one small 5 8\n", "7 bar two small 6 9\n", "8 bar two large 7 9" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n", " \"bar\", \"bar\", \"bar\", \"bar\"],\n", " \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n", " \"one\", \"one\", \"two\", \"two\"],\n", " \"C\": [\"small\", \"large\", \"large\", \"small\",\n", " \"small\", \"large\", \"small\", \"small\",\n", " \"large\"],\n", " \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n", " \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})\n", "df" ] }, { "cell_type": "markdown", "id": "ef96918e", "metadata": {}, "source": [ "This first example aggregates values by taking the sum." ] }, { "cell_type": "code", "execution_count": null, "id": "7206f156", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clargesmall
AB
barone4.05.0
two7.06.0
fooone4.01.0
twoNaN6.0
\n", "
" ], "text/plain": [ "C large small\n", "A B \n", "bar one 4.0 5.0\n", " two 7.0 6.0\n", "foo one 4.0 1.0\n", " two NaN 6.0" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", " columns=['C'], aggfunc=np.sum)\n", "table" ] }, { "cell_type": "markdown", "id": "e0df6460", "metadata": {}, "source": [ "We can also fill in missing values using the `fill_value` parameter." ] }, { "cell_type": "code", "execution_count": null, "id": "6cfd03f9", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clargesmall
AB
barone45
two76
fooone41
two06
\n", "
" ], "text/plain": [ "C large small\n", "A B \n", "bar one 4 5\n", " two 7 6\n", "foo one 4 1\n", " two 0 6" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", " columns=['C'], aggfunc=np.sum, fill_value=0)\n", "table" ] }, { "cell_type": "markdown", "id": "bf713c57", "metadata": {}, "source": [ "The next example aggregates by taking the mean across multiple columns." ] }, { "cell_type": "code", "execution_count": null, "id": "900dc876", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DE
AC
barlarge5.5000007.500000
small5.5000008.500000
foolarge2.0000004.500000
small2.3333334.333333
\n", "
" ], "text/plain": [ " D E\n", "A C \n", "bar large 5.500000 7.500000\n", " small 5.500000 8.500000\n", "foo large 2.000000 4.500000\n", " small 2.333333 4.333333" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", " aggfunc={'D': np.mean,\n", " 'E': np.mean})\n", "table" ] }, { "cell_type": "markdown", "id": "6a428fdc", "metadata": {}, "source": [ "We can also calculate multiple types of aggregations for any given value column." ] }, { "cell_type": "code", "execution_count": null, "id": "36ccdfaf", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DE
meanmaxmeanmin
AC
barlarge5.50000097.5000006
small5.50000098.5000008
foolarge2.00000054.5000004
small2.33333364.3333332
\n", "
" ], "text/plain": [ " D E \n", " mean max mean min\n", "A C \n", "bar large 5.500000 9 7.500000 6\n", " small 5.500000 9 8.500000 8\n", "foo large 2.000000 5 4.500000 4\n", " small 2.333333 6 4.333333 2" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", " aggfunc={'D': np.mean,\n", " 'E': [min, max, np.mean]})\n", "table" ] }, { "cell_type": "markdown", "id": "19eeb851", "metadata": {}, "source": [ "## High-performance Pandas: eval() and query()\n", "\n", "### eval()\n", "\n", "Evaluate a string describing operations on DataFrame columns.\n", "\n", "Operates on columns only, not specific rows or elements. This allows `eval` to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "db6fdd36", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0110
128
236
344
452
\n", "
" ], "text/plain": [ " A B\n", "0 1 10\n", "1 2 8\n", "2 3 6\n", "3 4 4\n", "4 5 2" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "92e71f86", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 11\n", "1 10\n", "2 9\n", "3 8\n", "4 7\n", "dtype: int64" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('A + B')" ] }, { "cell_type": "markdown", "id": "e5f51480", "metadata": {}, "source": [ "The assignment is allowed though by default the original `DataFrame` is not modified." ] }, { "cell_type": "code", "execution_count": null, "id": "b6387047", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
011011
12810
2369
3448
4527
\n", "
" ], "text/plain": [ " A B C\n", "0 1 10 11\n", "1 2 8 10\n", "2 3 6 9\n", "3 4 4 8\n", "4 5 2 7" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('C = A + B')" ] }, { "cell_type": "code", "execution_count": null, "id": "a5322c51", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0110
128
236
344
452
\n", "
" ], "text/plain": [ " A B\n", "0 1 10\n", "1 2 8\n", "2 3 6\n", "3 4 4\n", "4 5 2" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "9a0a5d4d", "metadata": {}, "source": [ "Use `inplace=True` to modify the original DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "13d2dffa", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
011011
12810
2369
3448
4527
\n", "
" ], "text/plain": [ " A B C\n", "0 1 10 11\n", "1 2 8 10\n", "2 3 6 9\n", "3 4 4 8\n", "4 5 2 7" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('C = A + B', inplace=True)\n", "df" ] }, { "cell_type": "markdown", "id": "e9c14654", "metadata": {}, "source": [ "Multiple columns can be assigned using multi-line expressions:" ] }, { "cell_type": "code", "execution_count": null, "id": "8ee5ceea", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
011011-9
12810-6
2369-3
34480
45273
\n", "
" ], "text/plain": [ " A B C D\n", "0 1 10 11 -9\n", "1 2 8 10 -6\n", "2 3 6 9 -3\n", "3 4 4 8 0\n", "4 5 2 7 3" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval(\n", " '''\n", " C = A + B\n", " D = A - B\n", " '''\n", ")" ] }, { "cell_type": "markdown", "id": "9c052b27", "metadata": {}, "source": [ "### query()\n", "\n", "Query the columns of a DataFrame with a boolean expression.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "d99bb798", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
1289
2368
3447
4526
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10\n", "1 2 8 9\n", "2 3 6 8\n", "3 4 4 7\n", "4 5 2 6" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\n", " 'A': range(1, 6),\n", " 'B': range(10, 0, -2),\n", " 'C C': range(10, 5, -1)\n", "})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "c228b08b", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
4526
\n", "
" ], "text/plain": [ " A B C C\n", "4 5 2 6" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('A > B')" ] }, { "cell_type": "markdown", "id": "e90ed305", "metadata": {}, "source": [ "The previous expression is equivalent to" ] }, { "cell_type": "code", "execution_count": null, "id": "28a30c04", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
4526
\n", "
" ], "text/plain": [ " A B C C\n", "4 5 2 6" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.A > df.B]" ] }, { "cell_type": "markdown", "id": "454bb2b9", "metadata": {}, "source": [ "For columns with spaces in their name, you can use backtick quoting." ] }, { "cell_type": "code", "execution_count": null, "id": "4d06bb30", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('B == `C C`')" ] }, { "cell_type": "markdown", "id": "2ac03c29", "metadata": {}, "source": [ "The previous expression is equivalent to" ] }, { "cell_type": "code", "execution_count": null, "id": "f8dacc1f", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.B == df['C C']]" ] }, { "cell_type": "code", "execution_count": null, "id": "6ec1ded1-6f8a-46ca-b304-25621fe08677", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "bc6c4cd4", "metadata": {}, "source": [ "\n", "## Your turn! 🚀\n", "\n", "### Processing image data\n", "\n", "Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:\n", "\n", "- **Image Classification**, can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum)\n", "- **Object Detection** to detect different objects in the image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect a number of common objects, and you can train [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) model to detect some specific objects of interest.\n", "- **Face Detection**, including Age, Gender and Emotion detection. This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).\n", "\n", "All those cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), and thus can be easily incorporated into your data exploration workflow.\n", "\n", "Here are some examples of exploring data from Image data sources:\n", "\n", "- In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. 
We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.\n", "- In the [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to extract emotions from people in photographs from events, in order to understand what makes people happy.\n", "\n", "### Assignment\n", "\n", "[Perform a more detailed data study for the challenges above](../../../assignments/data-science/data-processing-in-python.md)\n", "\n", "## Self study\n", "\n", "In this chapter, we've covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, we recommend the following resources:\n", "\n", "- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the descriptions of the options are complete and generally very useful for understanding the use of various functions.\n", "\n", "- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. 
Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (newer editions have since been published that address them).\n", "\n", "- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming across; more than likely you'll find your answer on a Stack Overflow page.\n", "\n", "- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n", "\n", "Using these resources, combined with the walk-through given in this chapter, my hope is that you'll be poised to use Pandas to tackle any data analysis problem you come across!\n", "\n", "## Acknowledgments\n", "\n", "Thanks to the [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html), which contributes the majority of the content in this chapter." ] } ], "metadata": { "kernelspec": { "display_name": "py39", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }