{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "b221a566-8a04-4689-8eb1-c266ede5a264", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "import numpy as np\n", "import pandas as pd\n", "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "id": "186ba947", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "aa35406e-c73d-49f1-aa84-5cc5ced6c294", "metadata": {}, "source": [ "# Advanced Pandas Techniques" ] }, { "cell_type": "markdown", "id": "64ccbdd4-f4c6-4893-80e9-7d87595020d2", "metadata": {}, "source": [ "## Overview\n", "\n", "In this section, we'll continue to introduce combining datasets: concat, merge and join along with data aggregation and grouping." ] }, { "cell_type": "markdown", "id": "9edbb0f3", "metadata": {}, "source": [ "## Combining datasets: concat, merge and join\n", "\n", "### concat\n", "\n", "- Concatenate Pandas objects along a particular axis.\n", "\n", "- Allows optional set logic along the other axes.\n", "\n", "- Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.\n", "\n", "For example:\n", "\n", "Combine two `Series`." 
] }, { "cell_type": "code", "execution_count": 2, "id": "b08dcc94", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "0 c\n", "1 d\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(['a', 'b'])\n", "s2 = pd.Series(['c', 'd'])\n", "pd.concat([s1, s2])" ] }, { "cell_type": "markdown", "id": "b1c47e7c", "metadata": {}, "source": [ "Clear the existing index and reset it in the result by setting the `ignore_index` option to `True`." ] }, { "cell_type": "code", "execution_count": 3, "id": "32049abb", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 d\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "31f73f90", "metadata": {}, "source": [ "Add a hierarchical index at the outermost level of the data with the `keys` option." ] }, { "cell_type": "code", "execution_count": 4, "id": "d5b95507", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "s1 0 a\n", " 1 b\n", "s2 0 c\n", " 1 d\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], keys=['s1', 's2'])" ] }, { "cell_type": "markdown", "id": "9c618012", "metadata": {}, "source": [ "Label the index keys you create with the `names` option." 
] }, { "cell_type": "code", "execution_count": 5, "id": "6d54830d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "Series name Row ID\n", "s1 0 a\n", " 1 b\n", "s2 0 c\n", " 1 d\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2], keys=['s1', 's2'],\n", " names=['Series name', 'Row ID'])" ] }, { "cell_type": "markdown", "id": "31fac69f", "metadata": {}, "source": [ "Combine two `DataFrame` objects with identical columns." ] }, { "cell_type": "code", "execution_count": 6, "id": "fec72294", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame([['a', 1], ['b', 2]],\n", " columns=['letter', 'number'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 7, "id": "80a1f5b0", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame([['c', 3], ['d', 4]],\n", " columns=['letter', 'number'])\n", "df2" ] }, { "cell_type": "code", "execution_count": 8, "id": "4e9e65f6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2])" ] }, { "cell_type": "markdown", "id": "49d878b5", "metadata": {}, "source": [ "Combine `DataFrame` objects with overlapping columns and return everything. Columns outside the intersection will be filled with `NaN` values." ] }, { "cell_type": "code", "execution_count": 9, "id": "f50e8ede", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimal
0c3cat
1d4dog
\n", "
" ], "text/plain": [ " letter number animal\n", "0 c 3 cat\n", "1 d 4 dog" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],\n", " columns=['letter', 'number', 'animal'])\n", "df3" ] }, { "cell_type": "code", "execution_count": 10, "id": "9def1cdd", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimal
0a1NaN
1b2NaN
0c3cat
1d4dog
\n", "
" ], "text/plain": [ " letter number animal\n", "0 a 1 NaN\n", "1 b 2 NaN\n", "0 c 3 cat\n", "1 d 4 dog" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df3], sort=False)" ] }, { "cell_type": "markdown", "id": "6f2fcb0c", "metadata": {}, "source": [ "Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument." ] }, { "cell_type": "code", "execution_count": 11, "id": "ef69d51c", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df3], join=\"inner\")" ] }, { "cell_type": "markdown", "id": "0fda5cf5", "metadata": {}, "source": [ "Combine `DataFrame` objects horizontally along the x-axis by passing in `axis=1`." ] }, { "cell_type": "code", "execution_count": 12, "id": "2159161d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberanimalname
0a1birdpolly
1b2monkeygeorge
\n", "
" ], "text/plain": [ " letter number animal name\n", "0 a 1 bird polly\n", "1 b 2 monkey george" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],\n", " columns=['animal', 'name'])\n", "pd.concat([df1, df4], axis=1)" ] }, { "cell_type": "markdown", "id": "adb11ea6", "metadata": {}, "source": [ "Prevent the result from including duplicate index values with the `verify_integrity` option." ] }, { "cell_type": "code", "execution_count": 13, "id": "45bea28a", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
a1
\n", "
" ], "text/plain": [ " 0\n", "a 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame([1], index=['a'])\n", "df5" ] }, { "cell_type": "code", "execution_count": 14, "id": "db871526", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
a2
\n", "
" ], "text/plain": [ " 0\n", "a 2" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6 = pd.DataFrame([2], index=['a'])\n", "df6" ] }, { "cell_type": "code", "execution_count": null, "id": "1ab6b3b0", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "pd.concat([df5, df6], verify_integrity=True)" ] }, { "cell_type": "markdown", "id": "90fc36d4", "metadata": {}, "source": [ "Append a single row to the end of a `DataFrame` object." ] }, { "cell_type": "code", "execution_count": null, "id": "007c1ed6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012
\n", "
" ], "text/plain": [ " a b\n", "0 1 2" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])\n", "df7" ] }, { "cell_type": "code", "execution_count": null, "id": "9dbaddff", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "a 3\n", "b 4\n", "dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_row = pd.Series({'a': 3, 'b': 4})\n", "new_row" ] }, { "cell_type": "code", "execution_count": null, "id": "ad2d1313", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012
134
\n", "
" ], "text/plain": [ " a b\n", "0 1 2\n", "1 3 4" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df7, new_row.to_frame().T], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "39223d1c", "metadata": {}, "source": [ ":::{note}\n", "`append()` has been deprecated since version 1.4.0: Use `concat()` instead. \n", ":::\n", "\n", "### merge\n", "\n", "- Merge DataFrame or named Series objects with a database-style join.\n", "\n", "- A named Series object is treated as a DataFrame with a single named column.\n", "\n", "- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed." ] }, { "cell_type": "markdown", "id": "c1afc536-2209-4fa1-8d63-0b19c18c66c6", "metadata": { "attributes": { "classes": [ "warning" ], "id": "" } }, "source": [ "If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results." ] }, { "cell_type": "markdown", "id": "0f2ffec1", "metadata": {}, "source": [ "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "e223179b", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n", " 'value': [1, 2, 3, 5]})\n", "df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n", " 'value': [5, 6, 7, 8]})" ] }, { "cell_type": "markdown", "id": "ee9441ec", "metadata": {}, "source": [ "Merge DataFrames `df1` and `df2` with specified left and right suffixes appended to any overlapping columns." ] }, { "cell_type": "code", "execution_count": null, "id": "e22da8fc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lkeyvalue_leftrkeyvalue_right
0foo1foo5
1foo1foo8
2foo5foo5
3foo5foo8
4bar2bar6
5baz3baz7
\n", "
" ], "text/plain": [ " lkey value_left rkey value_right\n", "0 foo 1 foo 5\n", "1 foo 1 foo 8\n", "2 foo 5 foo 5\n", "3 foo 5 foo 8\n", "4 bar 2 bar 6\n", "5 baz 3 baz 7" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))" ] }, { "cell_type": "code", "execution_count": null, "id": "6147bab8-4644-4a23-ba71-205573a1c3f9", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "5112fc3a", "metadata": {}, "source": [ "\n", "Merge DataFrames `df1` and `df2`, but raise an exception if the DataFrames have any overlapping columns." ] }, { "cell_type": "code", "execution_count": null, "id": "3dea68f6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))" ] }, { "cell_type": "markdown", "id": "86efca65", "metadata": {}, "source": [ "Using `how` parameter decide the type of merge to be performed." ] }, { "cell_type": "code", "execution_count": null, "id": "1026fc27", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n", "df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})" ] }, { "cell_type": "code", "execution_count": null, "id": "b4379cb1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0foo13
\n", "
" ], "text/plain": [ " a b c\n", "0 foo 1 3" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='inner', on='a')" ] }, { "cell_type": "code", "execution_count": null, "id": "90916930-6a8e-40e3-871e-d0043aae93d8", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2a8bb3d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0foo13.0
1bar2NaN
\n", "
" ], "text/plain": [ " a b c\n", "0 foo 1 3.0\n", "1 bar 2 NaN" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='left', on='a')" ] }, { "cell_type": "code", "execution_count": null, "id": "467da7f9-a710-442e-9fcf-afb4990ea3b0", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8951b7b9", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'left': ['foo', 'bar']})\n", "df2 = pd.DataFrame({'right': [7, 8]})" ] }, { "cell_type": "code", "execution_count": null, "id": "93051401", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
0foo7
1foo8
2bar7
3bar8
\n", "
" ], "text/plain": [ " left right\n", "0 foo 7\n", "1 foo 8\n", "2 bar 7\n", "3 bar 8" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how='cross')" ] }, { "cell_type": "code", "execution_count": null, "id": "bc243059-83f7-485c-bcd0-453d611c3d1f", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "b58237c9", "metadata": {}, "source": [ "\n", "### join\n", "\n", "- Join columns of another DataFrame.\n", "\n", "- Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "5ad178d6", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],\n", " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})" ] }, { "cell_type": "code", "execution_count": null, "id": "ff1aa936", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],\n", " 'B': ['B0', 'B1', 'B2']}) " ] }, { "cell_type": "markdown", "id": "3278bb56", "metadata": {}, "source": [ "Join DataFrames using their indexes." ] }, { "cell_type": "code", "execution_count": null, "id": "a2517b83", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key_callerAkey_otherB
0K0A0K0B0
1K1A1K1B1
2K2A2K2B2
3K3A3NaNNaN
4K4A4NaNNaN
5K5A5NaNNaN
\n", "
" ], "text/plain": [ " key_caller A key_other B\n", "0 K0 A0 K0 B0\n", "1 K1 A1 K1 B1\n", "2 K2 A2 K2 B2\n", "3 K3 A3 NaN NaN\n", "4 K4 A4 NaN NaN\n", "5 K5 A5 NaN NaN" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other, lsuffix='_caller', rsuffix='_other')" ] }, { "cell_type": "code", "execution_count": null, "id": "81738ab5-bc94-4264-bb43-8c64c041c332", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "59935609", "metadata": {}, "source": [ "\n", "If we want to join using the `key` columns, we need to set `key` to be the index in both `df` and `other`. The joined DataFrame will have `key` as its index." ] }, { "cell_type": "code", "execution_count": null, "id": "91c6f0f0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
key
K0A0B0
K1A1B1
K2A2B2
K3A3NaN
K4A4NaN
K5A5NaN
\n", "
" ], "text/plain": [ " A B\n", "key \n", "K0 A0 B0\n", "K1 A1 B1\n", "K2 A2 B2\n", "K3 A3 NaN\n", "K4 A4 NaN\n", "K5 A5 NaN" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.set_index('key').join(other.set_index('key'))" ] }, { "cell_type": "code", "execution_count": null, "id": "f942120e-c151-473d-aa0a-3ed6b0679204", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "1483f153", "metadata": {}, "source": [ "\n", "Another option to join using the key columns is to use the `on` parameter. `DataFrame.join` always uses `other`'s index but we can use any column in `df`. This method preserves the original DataFrame's index in the result." ] }, { "cell_type": "code", "execution_count": null, "id": "d8fbb1f7", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3NaN
4K4A4NaN
5K5A5NaN
\n", "
" ], "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K2 A2 B2\n", "3 K3 A3 NaN\n", "4 K4 A4 NaN\n", "5 K5 A5 NaN" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other.set_index('key'), on='key')" ] }, { "cell_type": "markdown", "id": "0ed06755", "metadata": {}, "source": [ "Using non-unique key values shows how they are matched." ] }, { "cell_type": "code", "execution_count": null, "id": "b4d1eb0d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyA
0K0A0
1K1A1
2K1A2
3K3A3
4K0A4
5K1A5
\n", "
" ], "text/plain": [ " key A\n", "0 K0 A0\n", "1 K1 A1\n", "2 K1 A2\n", "3 K3 A3\n", "4 K0 A4\n", "5 K1 A5" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],\n", " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})\n", "df " ] }, { "cell_type": "code", "execution_count": null, "id": "7f6bc83d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyAB
0K0A0B0
1K1A1B1
2K1A2B1
3K3A3NaN
4K0A4B0
5K1A5B1
\n", "
" ], "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K1 A2 B1\n", "3 K3 A3 NaN\n", "4 K0 A4 B0\n", "5 K1 A5 B1" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.join(other.set_index('key'), on='key', validate='m:1')" ] }, { "cell_type": "markdown", "id": "61fb9627", "metadata": {}, "source": [ "## Aggregation and grouping\n", "\n", "Group `DataFrame` using a mapper or by a `Series` of columns.\n", "\n", "A `groupby` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "38adb2b7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Animal
Falcon375.0
Parrot25.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Animal \n", "Falcon 375.0\n", "Parrot 25.0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", " 'Parrot', 'Parrot'],\n", " 'Max Speed': [380., 370., 24., 26.]})\n", "df\n", "df.groupby(['Animal']).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "917ba231-1ee4-4f2c-bcb9-4262d7eba119", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "84fe11db", "metadata": {}, "source": [ "\n", "### Hierarchical Indexes\n", "\n", "We can `groupby` different levels of a hierarchical index using the `level` parameter:" ] }, { "cell_type": "code", "execution_count": null, "id": "5e84fd8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Animal
Falcon370.0
Parrot25.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Animal \n", "Falcon 370.0\n", "Parrot 25.0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n", " ['Captive', 'Wild', 'Captive', 'Wild']]\n", "index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n", "df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n", " index=index)\n", "df.groupby(level=0).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "8d6ff678-1c1e-4629-9e06-1874511ecdf0", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5a7a2d6a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Max Speed
Type
Captive210.0
Wild185.0
\n", "
" ], "text/plain": [ " Max Speed\n", "Type \n", "Captive 210.0\n", "Wild 185.0" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(level=\"Type\").mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "31f4c668-6a8b-4dba-a6db-29673e7fbdba", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "fe08b062", "metadata": {}, "source": [ "\n", "We can also choose to include NA in group keys or not by setting `dropna` parameter, the default setting is `True`." ] }, { "cell_type": "code", "execution_count": null, "id": "f27b6536", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ac
b
1.023
2.025
\n", "
" ], "text/plain": [ " a c\n", "b \n", "1.0 2 3\n", "2.0 2 5" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]\n", "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "df.groupby(by=[\"b\"]).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "47261c15-1d74-4a39-a7bb-073f6835cbf8", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "815ba4c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ac
b
1.023
2.025
NaN14
\n", "
" ], "text/plain": [ " a c\n", "b \n", "1.0 2 3\n", "2.0 2 5\n", "NaN 1 4" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(by=[\"b\"], dropna=False).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "17c93213-8bcf-4ac8-a30d-09df48b9ca71", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "719dc004", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
a
a13.013.0
b12.3123.0
\n", "
" ], "text/plain": [ " b c\n", "a \n", "a 13.0 13.0\n", "b 12.3 123.0" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = [[\"a\", 12, 12], [None, 12.3, 33.], [\"b\", 12.3, 123], [\"a\", 1, 1]]\n", "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "df.groupby(by=\"a\").sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "ba2d22de-ed75-4d52-a6d8-badf4791429f", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cce87c6a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
a
a13.013.0
b12.3123.0
NaN12.333.0
\n", "
" ], "text/plain": [ " b c\n", "a \n", "a 13.0 13.0\n", "b 12.3 123.0\n", "NaN 12.3 33.0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(by=\"a\", dropna=False).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "70cc2217-577e-4b8c-8fc2-ce02f036622b", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "6988f12c", "metadata": {}, "source": [ "\n", "When using `.apply()`, use `group_keys` to include or exclude the group keys. The `group_keys` argument defaults to `True` (include)." ] }, { "cell_type": "code", "execution_count": null, "id": "1fa5930a", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimalMax Speed
Animal
Falcon0Falcon380.0
1Falcon370.0
Parrot2Parrot24.0
3Parrot26.0
\n", "
" ], "text/plain": [ " Animal Max Speed\n", "Animal \n", "Falcon 0 Falcon 380.0\n", " 1 Falcon 370.0\n", "Parrot 2 Parrot 24.0\n", " 3 Parrot 26.0" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", " 'Parrot', 'Parrot'],\n", " 'Max Speed': [380., 370., 24., 26.]})\n", "df.groupby(\"Animal\", group_keys=True).apply(lambda x: x)" ] }, { "cell_type": "code", "execution_count": null, "id": "67e4668e", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimalMax Speed
0Falcon380.0
1Falcon370.0
2Parrot24.0
3Parrot26.0
\n", "
" ], "text/plain": [ " Animal Max Speed\n", "0 Falcon 380.0\n", "1 Falcon 370.0\n", "2 Parrot 24.0\n", "3 Parrot 26.0" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Animal\", group_keys=False).apply(lambda x: x)" ] }, { "cell_type": "markdown", "id": "c8777695", "metadata": {}, "source": [ "## Pivot table\n", "\n", "Create a spreadsheet-style pivot table as a DataFrame.\n", "\n", "The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "c8e1b317", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
0fooonesmall12
1fooonelarge24
2fooonelarge25
3footwosmall35
4footwosmall36
5baronelarge46
6baronesmall58
7bartwosmall69
8bartwolarge79
\n", "
" ], "text/plain": [ " A B C D E\n", "0 foo one small 1 2\n", "1 foo one large 2 4\n", "2 foo one large 2 5\n", "3 foo two small 3 5\n", "4 foo two small 3 6\n", "5 bar one large 4 6\n", "6 bar one small 5 8\n", "7 bar two small 6 9\n", "8 bar two large 7 9" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n", " \"bar\", \"bar\", \"bar\", \"bar\"],\n", " \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n", " \"one\", \"one\", \"two\", \"two\"],\n", " \"C\": [\"small\", \"large\", \"large\", \"small\",\n", " \"small\", \"large\", \"small\", \"small\",\n", " \"large\"],\n", " \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n", " \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})\n", "df" ] }, { "cell_type": "markdown", "id": "ef96918e", "metadata": {}, "source": [ "This first example aggregates values by taking the sum." ] }, { "cell_type": "code", "execution_count": null, "id": "7206f156", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clargesmall
AB
barone4.05.0
two7.06.0
fooone4.01.0
twoNaN6.0
\n", "
" ], "text/plain": [ "C large small\n", "A B \n", "bar one 4.0 5.0\n", " two 7.0 6.0\n", "foo one 4.0 1.0\n", " two NaN 6.0" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", " columns=['C'], aggfunc=np.sum)\n", "table" ] }, { "cell_type": "markdown", "id": "e0df6460", "metadata": {}, "source": [ "We can also fill in missing values using the `fill_value` parameter." ] }, { "cell_type": "code", "execution_count": null, "id": "6cfd03f9", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clargesmall
AB
barone45
two76
fooone41
two06
\n", "
" ], "text/plain": [ "C large small\n", "A B \n", "bar one 4 5\n", " two 7 6\n", "foo one 4 1\n", " two 0 6" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", " columns=['C'], aggfunc=np.sum, fill_value=0)\n", "table" ] }, { "cell_type": "markdown", "id": "bf713c57", "metadata": {}, "source": [ "The next example aggregates by taking the mean across multiple columns." ] }, { "cell_type": "code", "execution_count": null, "id": "900dc876", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DE
AC
barlarge5.5000007.500000
small5.5000008.500000
foolarge2.0000004.500000
small2.3333334.333333
\n", "
" ], "text/plain": [ " D E\n", "A C \n", "bar large 5.500000 7.500000\n", " small 5.500000 8.500000\n", "foo large 2.000000 4.500000\n", " small 2.333333 4.333333" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", " aggfunc={'D': np.mean,\n", " 'E': np.mean})\n", "table" ] }, { "cell_type": "markdown", "id": "6a428fdc", "metadata": {}, "source": [ "We can also calculate multiple types of aggregations for any given value column." ] }, { "cell_type": "code", "execution_count": null, "id": "36ccdfaf", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DE
meanmaxmeanmin
AC
barlarge5.50000097.5000006
small5.50000098.5000008
foolarge2.00000054.5000004
small2.33333364.3333332
\n", "
" ], "text/plain": [ " D E \n", " mean max mean min\n", "A C \n", "bar large 5.500000 9 7.500000 6\n", " small 5.500000 9 8.500000 8\n", "foo large 2.000000 5 4.500000 4\n", " small 2.333333 6 4.333333 2" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", " aggfunc={'D': np.mean,\n", " 'E': [min, max, np.mean]})\n", "table" ] }, { "cell_type": "markdown", "id": "19eeb851", "metadata": {}, "source": [ "## High-performance Pandas: eval() and query()\n", "\n", "### eval()\n", "\n", "Evaluate a string describing operations on DataFrame columns.\n", "\n", "Operates on columns only, not specific rows or elements. This allows `eval` to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "db6fdd36", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0110
128
236
344
452
\n", "
" ], "text/plain": [ " A B\n", "0 1 10\n", "1 2 8\n", "2 3 6\n", "3 4 4\n", "4 5 2" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "92e71f86", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/plain": [ "0 11\n", "1 10\n", "2 9\n", "3 8\n", "4 7\n", "dtype: int64" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('A + B')" ] }, { "cell_type": "markdown", "id": "e5f51480", "metadata": {}, "source": [ "The assignment is allowed though by default the original `DataFrame` is not modified." ] }, { "cell_type": "code", "execution_count": null, "id": "b6387047", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
011011
12810
2369
3448
4527
\n", "
" ], "text/plain": [ " A B C\n", "0 1 10 11\n", "1 2 8 10\n", "2 3 6 9\n", "3 4 4 8\n", "4 5 2 7" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('C = A + B')" ] }, { "cell_type": "code", "execution_count": null, "id": "a5322c51", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0110
128
236
344
452
\n", "
" ], "text/plain": [ " A B\n", "0 1 10\n", "1 2 8\n", "2 3 6\n", "3 4 4\n", "4 5 2" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "9a0a5d4d", "metadata": {}, "source": [ "Use `inplace=True` to modify the original DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "13d2dffa", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
011011
12810
2369
3448
4527
\n", "
" ], "text/plain": [ " A B C\n", "0 1 10 11\n", "1 2 8 10\n", "2 3 6 9\n", "3 4 4 8\n", "4 5 2 7" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval('C = A + B', inplace=True)\n", "df" ] }, { "cell_type": "markdown", "id": "e9c14654", "metadata": {}, "source": [ "Multiple columns can be assigned using multi-line expressions:" ] }, { "cell_type": "code", "execution_count": null, "id": "8ee5ceea", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
011011-9
12810-6
2369-3
34480
45273
\n", "
" ], "text/plain": [ " A B C D\n", "0 1 10 11 -9\n", "1 2 8 10 -6\n", "2 3 6 9 -3\n", "3 4 4 8 0\n", "4 5 2 7 3" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.eval(\n", " '''\n", " C = A + B\n", " D = A - B\n", " '''\n", ")" ] }, { "cell_type": "markdown", "id": "9c052b27", "metadata": {}, "source": [ "### query()\n", "\n", "Query the columns of a DataFrame with a boolean expression.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "d99bb798", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
1289
2368
3447
4526
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10\n", "1 2 8 9\n", "2 3 6 8\n", "3 4 4 7\n", "4 5 2 6" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\n", " 'A': range(1, 6),\n", " 'B': range(10, 0, -2),\n", " 'C C': range(10, 5, -1)\n", "})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "c228b08b", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
4526
\n", "
" ], "text/plain": [ " A B C C\n", "4 5 2 6" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('A > B')" ] }, { "cell_type": "markdown", "id": "e90ed305", "metadata": {}, "source": [ "The previous expression is equivalent to" ] }, { "cell_type": "code", "execution_count": null, "id": "28a30c04", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
4526
\n", "
" ], "text/plain": [ " A B C C\n", "4 5 2 6" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.A > df.B]" ] }, { "cell_type": "markdown", "id": "454bb2b9", "metadata": {}, "source": [ "For columns with spaces in their name, you can use backtick quoting." ] }, { "cell_type": "code", "execution_count": null, "id": "4d06bb30", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('B == `C C`')" ] }, { "cell_type": "markdown", "id": "2ac03c29", "metadata": {}, "source": [ "The previous expression is equivalent to" ] }, { "cell_type": "code", "execution_count": null, "id": "f8dacc1f", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC C
011010
\n", "
" ], "text/plain": [ " A B C C\n", "0 1 10 10" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.B == df['C C']]" ] }, { "cell_type": "code", "execution_count": null, "id": "6ec1ded1-6f8a-46ca-b304-25621fe08677", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "
\n", "
\n", "

Let's visualize it! 🎥

\n", "
\n", " \n", "
\n", "
\n", "
\n", "\n", "\n", "\"\"\"\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "bc6c4cd4", "metadata": {}, "source": [ "\n", "## Your turn! 🚀\n", "\n", "### Processing image data\n", "\n", "Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:\n", "\n", "- **Image Classification**, can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum)\n", "- **Object Detection** to detect different objects in the image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect a number of common objects, and you can train [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) model to detect some specific objects of interest.\n", "- **Face Detection**, including Age, Gender and Emotion detection. This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).\n", "\n", "All those cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), and thus can be easily incorporated into your data exploration workflow.\n", "\n", "Here are some examples of exploring data from Image data sources:\n", "\n", "- In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. 
We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.\n", "- In the [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to extract emotions from people in photographs from events, in order to understand what makes people happy.\n", "\n", "### Assignment\n", "\n", "[Perform a more detailed data study for the challenges above](../../../assignments/data-science/data-processing-in-python.md)\n", "\n", "## Self study\n", "\n", "In this chapter, we've covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, we recommend the following resources:\n", "\n", "- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the descriptions of the options are complete and generally very useful for understanding the use of various functions.\n", "\n", "- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. 
Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (newer editions have since been published that address them).\n", "\n", "- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming across; more than likely you'll find your answer on a Stack Overflow page.\n", "\n", "- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n", "\n", "Using these resources, combined with the walk-through given in this chapter, my hope is that you'll be poised to use Pandas to tackle any data analysis problem you come across!\n", "\n", "## Acknowledgments\n", "\n", "Thanks to the [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html), which contributes the majority of the content in this chapter." ] } ], "metadata": { "kernelspec": { "display_name": "py39", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }