{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "09f51c61",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/cwf2/style_2025/blob/main/Tabular%20data%20primer.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd2dec5e-c01f-4e38-9ca9-9cf9542f1d20",
   "metadata": {
    "id": "dd2dec5e-c01f-4e38-9ca9-9cf9542f1d20"
   },
   "source": [
    "# Tabular data primer\n",
    "\n",
    "This notebook demonstrates downloading speeches from DICES, creating tables of speech metadata, and using grouping and aggregation to create summary statistics.\n",
    "\n",
    "## Preliminaries\n",
    "\n",
    "### Install DICES client software\n",
    "\n",
    "Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fe3b7c7f-7275-473c-b4ce-103691e832f2",
   "metadata": {
    "id": "fe3b7c7f-7275-473c-b4ce-103691e832f2",
    "outputId": "bd70aad7-7e87-4dfd-c33b-0add01a6c254"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# install DICES client\n",
    "!pip install -q git+https://github.com/cwf2/dices-client"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ab01393-bcba-4615-b6f4-43872129ce5f",
   "metadata": {
    "id": "9ab01393-bcba-4615-b6f4-43872129ce5f"
   },
   "source": [
    "### Import statements\n",
    "\n",
    "Here we tell Python which ancillary packages we want to make accessible. In this case, the DICES client and Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "61c1ffed-271e-4135-8124-321fe1bd677c",
   "metadata": {
    "id": "61c1ffed-271e-4135-8124-321fe1bd677c"
   },
   "outputs": [],
   "source": [
    "# for talking to DICES\n",
    "from dicesapi import DicesAPI\n",
    "\n",
    "# for creating data tables\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24029a68-8dd4-4971-91b0-288b41358e6a",
   "metadata": {
    "id": "24029a68-8dd4-4971-91b0-288b41358e6a"
   },
   "source": [
    "### Connect to DICES\n",
    "Here we instantiate a connection to the DICES database. As we continue to work on it, the DICES client spits out a lot of debugging information. It's convenient to divert those messages to a separate file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ac5cf7ce-1617-4889-a7b3-4f177571f613",
   "metadata": {
    "id": "ac5cf7ce-1617-4889-a7b3-4f177571f613"
   },
   "outputs": [],
   "source": [
    "# create a new DICES connection\n",
    "api = DicesAPI(logfile=\"dices.log\", logdetail=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fd565bc-4579-45b0-a3c1-d17d66b7ccf2",
   "metadata": {
    "id": "4fd565bc-4579-45b0-a3c1-d17d66b7ccf2"
   },
   "source": [
    "## Download speech metadata\n",
    "The `getSpeeches()` method is our main tool for downloading speech metadata. If you provide no other criteria, you will get all the speeches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ec1043ce-bd07-4b13-aa2f-5879b7c48a54",
   "metadata": {
    "id": "ec1043ce-bd07-4b13-aa2f-5879b7c48a54",
    "outputId": "0af5a141-0f56-407f-e21a-fc697431b670"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4696 speeches.\n"
     ]
    }
   ],
   "source": [
    "# download all speeches\n",
    "speeches = api.getSpeeches()\n",
    "\n",
    "# check that we got them\n",
    "print (len(speeches), \"speeches.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e01595f-500f-4d5e-94c1-eab8cdce3ea1",
   "metadata": {
    "id": "4e01595f-500f-4d5e-94c1-eab8cdce3ea1"
   },
   "source": [
    "## Create a table\n",
    "One convenient way to manipulate the speech metadata from DICES is in tabular format, like a spreadsheet. The ancillary package Pandas is designed to create and analyze tabular data.\n",
    "\n",
    "### A table with one row per speech\n",
    "Here, we create a table in which each row represents one speech. Using a `for` loop over the results from DICES, we take each speech in turn. After making a few calculations, we create a row record for that speech. We choose which values we want to keep and giving each a label. Then we add the record to a growing list of rows. In the final step, we create a **DataFrame** (i.e., a table) from the list of individual row records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a702a5cd-9240-4b3a-9ba4-bc74f90b75e6",
   "metadata": {
    "id": "a702a5cd-9240-4b3a-9ba4-bc74f90b75e6",
    "outputId": "d2ad2ff1-41bf-4587-9a38-e3d49826e2fe"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "skipping <Speech 890: Odyssey 9.2-11.332>\n",
      "skipping <Speech 957: Odyssey 11.378-12.453>\n",
      "skipping <Speech 3950: Paraphrase 14.89-16.54>\n",
      "skipping <Speech 1552: Aeneid 2.3-3.715>\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>speech_id</th>\n",
       "      <th>language</th>\n",
       "      <th>author</th>\n",
       "      <th>work</th>\n",
       "      <th>prefix</th>\n",
       "      <th>first_line</th>\n",
       "      <th>last_line</th>\n",
       "      <th>num_lines</th>\n",
       "      <th>speaker</th>\n",
       "      <th>gender</th>\n",
       "      <th>addressee</th>\n",
       "      <th>turn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4230</td>\n",
       "      <td>greek</td>\n",
       "      <td>Anonymous</td>\n",
       "      <td>Orphic Argonautica</td>\n",
       "      <td></td>\n",
       "      <td>77</td>\n",
       "      <td>96</td>\n",
       "      <td>20</td>\n",
       "      <td>Jason</td>\n",
       "      <td>male</td>\n",
       "      <td>Orpheus</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4231</td>\n",
       "      <td>greek</td>\n",
       "      <td>Anonymous</td>\n",
       "      <td>Orphic Argonautica</td>\n",
       "      <td></td>\n",
       "      <td>98</td>\n",
       "      <td>113</td>\n",
       "      <td>16</td>\n",
       "      <td>Orpheus</td>\n",
       "      <td>male</td>\n",
       "      <td>Jason</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4232</td>\n",
       "      <td>greek</td>\n",
       "      <td>Anonymous</td>\n",
       "      <td>Orphic Argonautica</td>\n",
       "      <td></td>\n",
       "      <td>256</td>\n",
       "      <td>267</td>\n",
       "      <td>12</td>\n",
       "      <td>Orpheus</td>\n",
       "      <td>male</td>\n",
       "      <td>Argo, Argonauts</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4233</td>\n",
       "      <td>greek</td>\n",
       "      <td>Anonymous</td>\n",
       "      <td>Orphic Argonautica</td>\n",
       "      <td></td>\n",
       "      <td>283</td>\n",
       "      <td>294</td>\n",
       "      <td>12</td>\n",
       "      <td>Jason</td>\n",
       "      <td>male</td>\n",
       "      <td>Argonauts</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4234</td>\n",
       "      <td>greek</td>\n",
       "      <td>Anonymous</td>\n",
       "      <td>Orphic Argonautica</td>\n",
       "      <td></td>\n",
       "      <td>335</td>\n",
       "      <td>354</td>\n",
       "      <td>20</td>\n",
       "      <td>Orpheus</td>\n",
       "      <td>male</td>\n",
       "      <td>gods of the sea</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4687</th>\n",
       "      <td>1864</td>\n",
       "      <td>latin</td>\n",
       "      <td>Virgil</td>\n",
       "      <td>Aeneid</td>\n",
       "      <td>12</td>\n",
       "      <td>872</td>\n",
       "      <td>884</td>\n",
       "      <td>13</td>\n",
       "      <td>Juturna</td>\n",
       "      <td>female</td>\n",
       "      <td>Juturna, Turnus</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4688</th>\n",
       "      <td>1865</td>\n",
       "      <td>latin</td>\n",
       "      <td>Virgil</td>\n",
       "      <td>Aeneid</td>\n",
       "      <td>12</td>\n",
       "      <td>889</td>\n",
       "      <td>893</td>\n",
       "      <td>5</td>\n",
       "      <td>Aeneas</td>\n",
       "      <td>male</td>\n",
       "      <td>Turnus</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4689</th>\n",
       "      <td>1866</td>\n",
       "      <td>latin</td>\n",
       "      <td>Virgil</td>\n",
       "      <td>Aeneid</td>\n",
       "      <td>12</td>\n",
       "      <td>894</td>\n",
       "      <td>895</td>\n",
       "      <td>2</td>\n",
       "      <td>Turnus</td>\n",
       "      <td>male</td>\n",
       "      <td>Aeneas</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4690</th>\n",
       "      <td>1867</td>\n",
       "      <td>latin</td>\n",
       "      <td>Virgil</td>\n",
       "      <td>Aeneid</td>\n",
       "      <td>12</td>\n",
       "      <td>931</td>\n",
       "      <td>938</td>\n",
       "      <td>8</td>\n",
       "      <td>Turnus</td>\n",
       "      <td>male</td>\n",
       "      <td>Aeneas</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4691</th>\n",
       "      <td>1868</td>\n",
       "      <td>latin</td>\n",
       "      <td>Virgil</td>\n",
       "      <td>Aeneid</td>\n",
       "      <td>12</td>\n",
       "      <td>947</td>\n",
       "      <td>949</td>\n",
       "      <td>3</td>\n",
       "      <td>Aeneas</td>\n",
       "      <td>male</td>\n",
       "      <td>Turnus</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4692 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      speech_id language     author                work prefix first_line  \\\n",
       "0          4230    greek  Anonymous  Orphic Argonautica                77   \n",
       "1          4231    greek  Anonymous  Orphic Argonautica                98   \n",
       "2          4232    greek  Anonymous  Orphic Argonautica               256   \n",
       "3          4233    greek  Anonymous  Orphic Argonautica               283   \n",
       "4          4234    greek  Anonymous  Orphic Argonautica               335   \n",
       "...         ...      ...        ...                 ...    ...        ...   \n",
       "4687       1864    latin     Virgil              Aeneid     12        872   \n",
       "4688       1865    latin     Virgil              Aeneid     12        889   \n",
       "4689       1866    latin     Virgil              Aeneid     12        894   \n",
       "4690       1867    latin     Virgil              Aeneid     12        931   \n",
       "4691       1868    latin     Virgil              Aeneid     12        947   \n",
       "\n",
       "     last_line  num_lines  speaker  gender        addressee  turn  \n",
       "0           96         20    Jason    male          Orpheus     1  \n",
       "1          113         16  Orpheus    male            Jason     2  \n",
       "2          267         12  Orpheus    male  Argo, Argonauts     1  \n",
       "3          294         12    Jason    male        Argonauts     1  \n",
       "4          354         20  Orpheus    male  gods of the sea     1  \n",
       "...        ...        ...      ...     ...              ...   ...  \n",
       "4687       884         13  Juturna  female  Juturna, Turnus     1  \n",
       "4688       893          5   Aeneas    male           Turnus     1  \n",
       "4689       895          2   Turnus    male           Aeneas     2  \n",
       "4690       938          8   Turnus    male           Aeneas     1  \n",
       "4691       949          3   Aeneas    male           Turnus     2  \n",
       "\n",
       "[4692 rows x 12 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# an empty list to hold the rows\n",
    "rows = list()\n",
    "\n",
    "# iterate over the speeches\n",
    "for speech in speeches:\n",
    "\n",
    "    # skip speeches that span more than one book/poem\n",
    "    if speech.isMultiPrefix():\n",
    "        print(\"skipping\", speech)\n",
    "        continue\n",
    "\n",
    "    # rough estimate of number of lines based on subtraction\n",
    "    # - get line numbers, strip alphabetic suffixes like \"101a\"\n",
    "    # - subtract first line from last line and add one\n",
    "    first_line = speech.getLineNo(\"first\", alpha=False)\n",
    "    last_line = speech.getLineNo(\"last\", alpha=False)\n",
    "    nlines = int(last_line) - int(first_line) + 1\n",
    "\n",
    "    # get first speaker gender\n",
    "    spkr_gender = speech.spkr[0].gender\n",
    "\n",
    "    # create a new row, labelling all the data values\n",
    "    row = {\n",
    "        \"speech_id\": speech.id,\n",
    "        \"language\": speech.lang,\n",
    "        \"author\": speech.author.name,\n",
    "        \"work\": speech.work.title,\n",
    "        \"prefix\": speech.getPrefix(),\n",
    "        \"first_line\": speech.getLineNo(\"first\"),\n",
    "        \"last_line\": speech.getLineNo(\"last\"),\n",
    "        \"num_lines\": nlines,\n",
    "        \"speaker\": speech.getSpkrString(),\n",
    "        \"gender\": spkr_gender,\n",
    "        \"addressee\": speech.getAddrString(),\n",
    "        \"turn\": speech.part,\n",
    "    }\n",
    "\n",
    "    # add the row to the list\n",
    "    rows.append(row)\n",
    "\n",
    "# make the table\n",
    "table = pd.DataFrame(rows)\n",
    "\n",
    "# display the table\n",
    "display(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9aa03746-697b-44eb-9673-49a3e752228b",
   "metadata": {
    "id": "9aa03746-697b-44eb-9673-49a3e752228b"
   },
   "source": [
    "#### Export the table as a CSV file\n",
    "I like to save the output to a CSV file right away. This can be imported into Excel to check the data or to continue your work outside of Python. If you're using Google Colab, you might have to click a \"files\" icon on the left in order to see and download the new file.\n",
    "\n",
    "💁🏻‍♂️ *I like tab-separated files better than comma-separated files; I think they're a little easier to read. Excel will import either format without a problem. If you prefer to use commas instead of tabs, just remove the option `sep=\"\\t\"` below.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ec32cc13-9786-4827-b153-56ac877d054c",
   "metadata": {
    "id": "ec32cc13-9786-4827-b153-56ac877d054c"
   },
   "outputs": [],
   "source": [
    "# write the table to a file for import to Excel\n",
    "table.to_csv(\"speeches.tsv\", sep=\"\\t\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73ed8396-5abf-4600-981f-21123697e949",
   "metadata": {
    "id": "73ed8396-5abf-4600-981f-21123697e949"
   },
   "source": [
    "## Manipulating tabular data\n",
    "Now that we have a table with one row per speech, we probably want to look for different ways to group and summarize the collection. In this workshop we introduce three Pandas methods that aggregate records in ways that might be familiar from Excel."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d27186f1-d07f-4239-9e39-672893573bbe",
   "metadata": {
    "id": "d27186f1-d07f-4239-9e39-672893573bbe"
   },
   "source": [
    "### Grouping and aggregating\n",
    "We can use `groupby()` to group the speeches according to some factor and `aggregate()` to summarize the new groups. The values of our grouping column become the row names of the new table.\n",
    "\n",
    "#### Example\n",
    "Let's find out how many speeches there are in each language. Here's how columns in the original table map onto the rows, columns, and values of the new summary table:\n",
    "\n",
    "- We're grouping by column **language**. That means *rows* in the new table will correspond to the values of **language** in the original table\n",
    "- The column we're summarizing is **speech_id**, because we're counting individual speeches. Let's call the corresponding *column* in our new summary table **speeches**.\n",
    "- The values of the new column should be how many speeches are in each group. So our *aggregation function* is just `\"count\"`.\n",
    "\n",
    "So, our new table will have rows taken from **language**, a single column called **speeches**, in which each value will be the `count` of column **speech_id** in the original table.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7afdc954-98ec-4efa-b076-faf280569993",
   "metadata": {
    "id": "7afdc954-98ec-4efa-b076-faf280569993",
    "outputId": "911800df-bb36-4bde-dc0d-896016cda1bd"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>speeches</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>2778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>1914</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          speeches\n",
       "language          \n",
       "greek         2778\n",
       "latin         1914"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# group by column \"language\"\n",
    "grouped_data = table.groupby(\"language\")\n",
    "\n",
    "# aggregate column \"speech_id\" using aggregation function \"count\"\n",
    "language_summary = grouped_data.aggregate(speeches=(\"speech_id\", \"count\"))\n",
    "\n",
    "# show the results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a33c319-9830-4893-bdce-d95c18a78cf0",
   "metadata": {
    "id": "2a33c319-9830-4893-bdce-d95c18a78cf0"
   },
   "source": [
    "#### Alternative style\n",
    "Here's a slightly different way of doing the same thing. It omits a step by chaining `aggregate()` directly to `groupby()` while adding some whitespace to make the layout of the new table more prominent. `agg()` is a synonym for `aggregate()`. The outer parentheses are necessary so that Python understands when the whole chain is finished."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "738a279d-7078-4ebb-8ed4-69e15a8cb834",
   "metadata": {
    "id": "738a279d-7078-4ebb-8ed4-69e15a8cb834",
    "outputId": "127c0f49-d450-4081-bf36-e0b13d1bad86"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>speeches</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>2778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>1914</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          speeches\n",
       "language          \n",
       "greek         2778\n",
       "latin         1914"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby(\"language\")                   # column to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - one new column\n",
    "    )\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "972d5945-4881-45be-8d3f-f40624f1b20d",
   "metadata": {
    "id": "972d5945-4881-45be-8d3f-f40624f1b20d"
   },
   "source": [
    "### More aggregation functions\n",
    "In this example, each row of the new table represents the group of speeches in one language. Besides just counting **speech_id**s, we can summarize other columns of the original data using more interesting functions.\n",
    "\n",
    "#### Example\n",
    "Let's add some new columns about speech length to our summary table. For each language, we'll give the total number of lines, the longest speech, and the average speech length.\n",
    "\n",
    "- All the new columns summarize the column **num_lines** in the original table.\n",
    "- New column **lines** will use aggregation function `\"sum\"`.\n",
    "- New column **max_lines** will use aggregation function `\"max\"`.\n",
    "- New column **avg_lines** will use aggregation function `\"mean\"`.\n",
    "\n",
    "We can add these column definitions into the code we used before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8214a06e-c769-476b-aa00-779a8719171a",
   "metadata": {
    "id": "8214a06e-c769-476b-aa00-779a8719171a",
    "outputId": "d6be2895-5f09-4918-e306-b0874aa07b84"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>speeches</th>\n",
       "      <th>lines</th>\n",
       "      <th>max_lines</th>\n",
       "      <th>avg_lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>2778</td>\n",
       "      <td>31886</td>\n",
       "      <td>478</td>\n",
       "      <td>11.478042</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>1914</td>\n",
       "      <td>27123</td>\n",
       "      <td>592</td>\n",
       "      <td>14.170846</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          speeches  lines  max_lines  avg_lines\n",
       "language                                       \n",
       "greek         2778  31886        478  11.478042\n",
       "latin         1914  27123        592  14.170846"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby(\"language\")                   # column to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "        lines = (\"num_lines\", \"sum\"),      #  - sum lines in all speeches\n",
    "        max_lines = (\"num_lines\", \"max\"),  #  - length of longest speech\n",
    "        avg_lines = (\"num_lines\", \"mean\"), #  - average speech length\n",
    "    )\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7567107f-ee0c-44b9-82dd-14f5d93de531",
   "metadata": {
    "id": "7567107f-ee0c-44b9-82dd-14f5d93de531"
   },
   "source": [
    "### Grouping by additional factors\n",
    "If we provide `groupby()` with a *list* of column names instead of just one, we get a multi-dimensional grouping in which each possible combination of values from the respective columns is taken in turn.\n",
    "\n",
    "#### Example\n",
    "How many speeches are delivered by speakers of each gender in each language? Our *aggregation function* will again be `\"count\"`, but now the rows (or *index*) of our summary table have two levels, corresponding to **language** and **gender**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ae2655f8-8e0c-4d14-8495-7bc4adaaa9c0",
   "metadata": {
    "id": "ae2655f8-8e0c-4d14-8495-7bc4adaaa9c0",
    "outputId": "9c5d8528-7f79-4d49-8dc0-2302b3198bc2"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>speeches</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">greek</th>\n",
       "      <th>female</th>\n",
       "      <td>575</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>2071</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>none</th>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">latin</th>\n",
       "      <th>female</th>\n",
       "      <td>546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>1334</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>none</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 speeches\n",
       "language gender          \n",
       "greek    female       575\n",
       "         male        2071\n",
       "         none          76\n",
       "         x             56\n",
       "latin    female       546\n",
       "         male        1334\n",
       "         none           9\n",
       "         x             25"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "    )\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e237d13-2986-446d-b7f7-7884cb9125a1",
   "metadata": {
    "id": "2e237d13-2986-446d-b7f7-7884cb9125a1"
   },
   "source": [
    "#### Alternative table orientation\n",
    "To make it easier to compare values across genders, we can *unstack* the table: this rotates the inner level of the *index* or row labels (i.e., gender) over to the column headings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "75ba9ce5-ed40-4236-ade4-bd5ac0a2e326",
   "metadata": {
    "id": "75ba9ce5-ed40-4236-ade4-bd5ac0a2e326",
    "outputId": "b458badd-f1d1-4622-f9c6-dc3f88ad03f8"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"4\" halign=\"left\">speeches</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>none</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>575</td>\n",
       "      <td>2071</td>\n",
       "      <td>76</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>546</td>\n",
       "      <td>1334</td>\n",
       "      <td>9</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         speeches               \n",
       "gender     female  male none   x\n",
       "language                        \n",
       "greek         575  2071   76  56\n",
       "latin         546  1334    9  25"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "    )\n",
    "    .unstack()                             # move second factor to columns\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccd380cb-3f78-4c82-94de-eab43b457743",
   "metadata": {
    "id": "ccd380cb-3f78-4c82-94de-eab43b457743"
   },
   "source": [
    "### Complex, multi-factor tables\n",
    "Let's put everything together and calculate multiple statistics for every combination of language and gender. The results are necessarily complex; you'll have to decide whether the stacked or unstacked version is easier to read.\n",
    "\n",
    "#### stacked version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7e81f6ca-4a15-4334-90c4-12c03055e8b6",
   "metadata": {
    "id": "7e81f6ca-4a15-4334-90c4-12c03055e8b6",
    "outputId": "caf962b7-814b-4508-9a53-bbb0adee55e4"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>speeches</th>\n",
       "      <th>lines</th>\n",
       "      <th>max_lines</th>\n",
       "      <th>avg_lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">greek</th>\n",
       "      <th>female</th>\n",
       "      <td>575</td>\n",
       "      <td>7053</td>\n",
       "      <td>100</td>\n",
       "      <td>12.266087</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>2071</td>\n",
       "      <td>24321</td>\n",
       "      <td>478</td>\n",
       "      <td>11.743602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>none</th>\n",
       "      <td>76</td>\n",
       "      <td>302</td>\n",
       "      <td>34</td>\n",
       "      <td>3.973684</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <td>56</td>\n",
       "      <td>210</td>\n",
       "      <td>12</td>\n",
       "      <td>3.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">latin</th>\n",
       "      <th>female</th>\n",
       "      <td>546</td>\n",
       "      <td>8913</td>\n",
       "      <td>450</td>\n",
       "      <td>16.324176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>1334</td>\n",
       "      <td>17922</td>\n",
       "      <td>592</td>\n",
       "      <td>13.434783</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>none</th>\n",
       "      <td>9</td>\n",
       "      <td>92</td>\n",
       "      <td>25</td>\n",
       "      <td>10.222222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <td>25</td>\n",
       "      <td>196</td>\n",
       "      <td>49</td>\n",
       "      <td>7.840000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 speeches  lines  max_lines  avg_lines\n",
       "language gender                                       \n",
       "greek    female       575   7053        100  12.266087\n",
       "         male        2071  24321        478  11.743602\n",
       "         none          76    302         34   3.973684\n",
       "         x             56    210         12   3.750000\n",
       "latin    female       546   8913        450  16.324176\n",
       "         male        1334  17922        592  13.434783\n",
       "         none           9     92         25  10.222222\n",
       "         x             25    196         49   7.840000"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "        lines = (\"num_lines\", \"sum\"),      #  - sum lines in all speeches\n",
    "        max_lines = (\"num_lines\", \"max\"),  #  - length of longest speech\n",
    "        avg_lines = (\"num_lines\", \"mean\"), #  - average speech length\n",
    "    )\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fa81c16-58b7-4951-969d-81a2c30ed380",
   "metadata": {
    "id": "9fa81c16-58b7-4951-969d-81a2c30ed380"
   },
   "source": [
    "#### unstacked version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "1d552308-3a92-4688-8d25-add5bc918c6a",
   "metadata": {
    "id": "1d552308-3a92-4688-8d25-add5bc918c6a",
    "outputId": "9969334c-f5df-41e0-9e83-c228a35bbabe"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"4\" halign=\"left\">speeches</th>\n",
       "      <th colspan=\"4\" halign=\"left\">lines</th>\n",
       "      <th colspan=\"4\" halign=\"left\">max_lines</th>\n",
       "      <th colspan=\"4\" halign=\"left\">avg_lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>none</th>\n",
       "      <th>x</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>none</th>\n",
       "      <th>x</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>none</th>\n",
       "      <th>x</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>none</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>575</td>\n",
       "      <td>2071</td>\n",
       "      <td>76</td>\n",
       "      <td>56</td>\n",
       "      <td>7053</td>\n",
       "      <td>24321</td>\n",
       "      <td>302</td>\n",
       "      <td>210</td>\n",
       "      <td>100</td>\n",
       "      <td>478</td>\n",
       "      <td>34</td>\n",
       "      <td>12</td>\n",
       "      <td>12.266087</td>\n",
       "      <td>11.743602</td>\n",
       "      <td>3.973684</td>\n",
       "      <td>3.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>546</td>\n",
       "      <td>1334</td>\n",
       "      <td>9</td>\n",
       "      <td>25</td>\n",
       "      <td>8913</td>\n",
       "      <td>17922</td>\n",
       "      <td>92</td>\n",
       "      <td>196</td>\n",
       "      <td>450</td>\n",
       "      <td>592</td>\n",
       "      <td>25</td>\n",
       "      <td>49</td>\n",
       "      <td>16.324176</td>\n",
       "      <td>13.434783</td>\n",
       "      <td>10.222222</td>\n",
       "      <td>7.84</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         speeches                 lines                  max_lines            \\\n",
       "gender     female  male none   x female   male none    x    female male none   \n",
       "language                                                                       \n",
       "greek         575  2071   76  56   7053  24321  302  210       100  478   34   \n",
       "latin         546  1334    9  25   8913  17922   92  196       450  592   25   \n",
       "\n",
       "              avg_lines                              \n",
       "gender     x     female       male       none     x  \n",
       "language                                             \n",
       "greek     12  12.266087  11.743602   3.973684  3.75  \n",
       "latin     49  16.324176  13.434783  10.222222  7.84  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "        lines = (\"num_lines\", \"sum\"),      #  - sum lines in all speeches\n",
    "        max_lines = (\"num_lines\", \"max\"),  #  - length of longest speech\n",
    "        avg_lines = (\"num_lines\", \"mean\"), #  - average speech length\n",
    "    )\n",
    "    .unstack()                             # move second factor to columns\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88e00860-375e-4fba-8b13-776d6ebe90bb",
   "metadata": {
    "id": "88e00860-375e-4fba-8b13-776d6ebe90bb"
   },
   "source": [
    "### Bonus: filtering\n",
    "We don't have time here to explore all the ways to select and filter data, but here's a quick example. The **gender** column includes some values that I'm not interested in at the moment, so I'm going to use the `query()` method to select only speeches where the gender is `'male'` or `'female'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8d2af79b-f87d-43cc-bfae-a9f350e0a74b",
   "metadata": {
    "id": "8d2af79b-f87d-43cc-bfae-a9f350e0a74b",
    "outputId": "1b7923cc-9cbe-405a-8fd5-649fef156372"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">speeches</th>\n",
       "      <th colspan=\"2\" halign=\"left\">lines</th>\n",
       "      <th colspan=\"2\" halign=\"left\">max_lines</th>\n",
       "      <th colspan=\"2\" halign=\"left\">avg_lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>575</td>\n",
       "      <td>2071</td>\n",
       "      <td>7053</td>\n",
       "      <td>24321</td>\n",
       "      <td>100</td>\n",
       "      <td>478</td>\n",
       "      <td>12.266087</td>\n",
       "      <td>11.743602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>546</td>\n",
       "      <td>1334</td>\n",
       "      <td>8913</td>\n",
       "      <td>17922</td>\n",
       "      <td>450</td>\n",
       "      <td>592</td>\n",
       "      <td>16.324176</td>\n",
       "      <td>13.434783</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         speeches        lines        max_lines       avg_lines           \n",
       "gender     female  male female   male    female male     female       male\n",
       "language                                                                  \n",
       "greek         575  2071   7053  24321       100  478  12.266087  11.743602\n",
       "latin         546  1334   8913  17922       450  592  16.324176  13.434783"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "language_summary = (\n",
    "    table                                  # original data\n",
    "    .query(\"gender in ['male', 'female']\") # filter\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        speeches = (\"speech_id\", \"count\"), #  - count of speech ids\n",
    "        lines = (\"num_lines\", \"sum\"),      #  - sum lines in all speeches\n",
    "        max_lines = (\"num_lines\", \"max\"),  #  - length of longest speech\n",
    "        avg_lines = (\"num_lines\", \"mean\"), #  - average speech length\n",
    "    )\n",
    "    .unstack()                             # move second factor to columns\n",
    ")\n",
    "\n",
    "# show results\n",
    "display(language_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d83ff41-6a31-4917-9527-558388bc1b91",
   "metadata": {},
   "source": [
    "## Further reading\n",
    "Pandas provides complete documentation for the methods we use here. They also have some helpful tutorials.\n",
    "\n",
    "- [Guide to grouping and aggregating](https://pandas.pydata.org/docs/user_guide/groupby.html)\n",
    "- [Complete user guide to Pandas](https://pandas.pydata.org/docs/user_guide/index.html)\n",
    "- [Manual page for `groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)\n",
    "- [Manual page for `aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html)\n",
    "- [Manual page for `query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ce12337e-234b-41ac-9605-605c6be13a15",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">sum</th>\n",
       "      <th colspan=\"2\" halign=\"left\">max</th>\n",
       "      <th colspan=\"2\" halign=\"left\">mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>7053</td>\n",
       "      <td>24321</td>\n",
       "      <td>100</td>\n",
       "      <td>478</td>\n",
       "      <td>12.266087</td>\n",
       "      <td>11.743602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>8913</td>\n",
       "      <td>17922</td>\n",
       "      <td>450</td>\n",
       "      <td>592</td>\n",
       "      <td>16.324176</td>\n",
       "      <td>13.434783</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            sum           max            mean           \n",
       "gender   female   male female male     female       male\n",
       "language                                                \n",
       "greek      7053  24321    100  478  12.266087  11.743602\n",
       "latin      8913  17922    450  592  16.324176  13.434783"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    table\n",
    "    .query(\"gender in ['male', 'female']\")\n",
    "    .pivot_table(\n",
    "        index = \"language\",\n",
    "        columns = \"gender\",\n",
    "        values = \"num_lines\",\n",
    "        aggfunc = [\"sum\", \"max\", \"mean\"],\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "92dbe7ad-677a-4969-a26b-dc32bcb3455c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">lines</th>\n",
       "      <th colspan=\"2\" halign=\"left\">max_lines</th>\n",
       "      <th colspan=\"2\" halign=\"left\">avg_lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>greek</th>\n",
       "      <td>7053</td>\n",
       "      <td>24321</td>\n",
       "      <td>100</td>\n",
       "      <td>478</td>\n",
       "      <td>12.266087</td>\n",
       "      <td>11.743602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>latin</th>\n",
       "      <td>8913</td>\n",
       "      <td>17922</td>\n",
       "      <td>450</td>\n",
       "      <td>592</td>\n",
       "      <td>16.324176</td>\n",
       "      <td>13.434783</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          lines        max_lines       avg_lines           \n",
       "gender   female   male    female male     female       male\n",
       "language                                                   \n",
       "greek      7053  24321       100  478  12.266087  11.743602\n",
       "latin      8913  17922       450  592  16.324176  13.434783"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    table                                  # original data\n",
    "    .query(\"gender in ['male', 'female']\") # filter\n",
    "    .groupby([\"language\", \"gender\"])       # columns to group by\n",
    "    .agg(                                  # define summary table\n",
    "        lines = (\"num_lines\", \"sum\"),      #  - sum lines in all speeches\n",
    "        max_lines = (\"num_lines\", \"max\"),  #  - length of longest speech\n",
    "        avg_lines = (\"num_lines\", \"mean\"), #  - average speech length\n",
    "    )\n",
    "    .unstack()                             # move second factor to columns\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}