{ "cells": [ { "cell_type": "markdown", "id": "09f51c61", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "dd2dec5e-c01f-4e38-9ca9-9cf9542f1d20", "metadata": { "id": "dd2dec5e-c01f-4e38-9ca9-9cf9542f1d20" }, "source": [ "# Tabular data primer\n", "\n", "This notebook demonstrates downloading speeches from DICES, creating tables of speech metadata, and using grouping and aggregation to create summary statistics.\n", "\n", "## Preliminaries\n", "\n", "### Install DICES client software\n", "\n", "Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step." ] }, { "cell_type": "code", "execution_count": 1, "id": "fe3b7c7f-7275-473c-b4ce-103691e832f2", "metadata": { "id": "fe3b7c7f-7275-473c-b4ce-103691e832f2", "outputId": "bd70aad7-7e87-4dfd-c33b-0add01a6c254" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "# install DICES client\n", "!pip install -q git+https://github.com/cwf2/dices-client" ] }, { "cell_type": "markdown", "id": "9ab01393-bcba-4615-b6f4-43872129ce5f", "metadata": { "id": "9ab01393-bcba-4615-b6f4-43872129ce5f" }, "source": [ "### Import statements\n", "\n", "Here we tell Python which ancillary packages we want to make accessible. In this case, the DICES client and Pandas." ] }, { "cell_type": "code", "execution_count": 2, "id": "61c1ffed-271e-4135-8124-321fe1bd677c", "metadata": { "id": "61c1ffed-271e-4135-8124-321fe1bd677c" }, "outputs": [], "source": [ "# for talking to DICES\n", "from dicesapi import DicesAPI\n", "\n", "# for creating data tables\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "24029a68-8dd4-4971-91b0-288b41358e6a", "metadata": { "id": "24029a68-8dd4-4971-91b0-288b41358e6a" }, "source": [ "### Connect to DICES\n", "Here we instantiate a connection to the DICES database. As we continue to work on it, the DICES client spits out a lot of debugging information. It's convenient to divert those messages to a separate file." ] }, { "cell_type": "code", "execution_count": 3, "id": "ac5cf7ce-1617-4889-a7b3-4f177571f613", "metadata": { "id": "ac5cf7ce-1617-4889-a7b3-4f177571f613" }, "outputs": [], "source": [ "# create a new DICES connection\n", "api = DicesAPI(logfile=\"dices.log\", logdetail=0)" ] }, { "cell_type": "markdown", "id": "4fd565bc-4579-45b0-a3c1-d17d66b7ccf2", "metadata": { "id": "4fd565bc-4579-45b0-a3c1-d17d66b7ccf2" }, "source": [ "## Download speech metadata\n", "The `getSpeeches()` method is our main tool for downloading speech metadata. If you provide no other criteria, you will get all the speeches." ] }, { "cell_type": "code", "execution_count": 4, "id": "ec1043ce-bd07-4b13-aa2f-5879b7c48a54", "metadata": { "id": "ec1043ce-bd07-4b13-aa2f-5879b7c48a54", "outputId": "0af5a141-0f56-407f-e21a-fc697431b670" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4696 speeches.\n" ] } ], "source": [ "# download all speeches\n", "speeches = api.getSpeeches()\n", "\n", "# check that we got them\n", "print (len(speeches), \"speeches.\")" ] }, { "cell_type": "markdown", "id": "4e01595f-500f-4d5e-94c1-eab8cdce3ea1", "metadata": { "id": "4e01595f-500f-4d5e-94c1-eab8cdce3ea1" }, "source": [ "## Create a table\n", "One convenient way to manipulate the speech metadata from DICES is in tabular format, like a spreadsheet. The ancillary package Pandas is designed to create and analyze tabular data.\n", "\n", "### A table with one row per speech\n", "Here, we create a table in which each row represents one speech. Using a `for` loop over the results from DICES, we take each speech in turn. After making a few calculations, we create a row record for that speech. We choose which values we want to keep and giving each a label. Then we add the record to a growing list of rows. In the final step, we create a **DataFrame** (i.e., a table) from the list of individual row records." ] }, { "cell_type": "code", "execution_count": 5, "id": "a702a5cd-9240-4b3a-9ba4-bc74f90b75e6", "metadata": { "id": "a702a5cd-9240-4b3a-9ba4-bc74f90b75e6", "outputId": "d2ad2ff1-41bf-4587-9a38-e3d49826e2fe" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "skipping \n", "skipping \n", "skipping \n", "skipping \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speech_idlanguageauthorworkprefixfirst_linelast_linenum_linesspeakergenderaddresseeturn
04230greekAnonymousOrphic Argonautica779620JasonmaleOrpheus1
14231greekAnonymousOrphic Argonautica9811316OrpheusmaleJason2
24232greekAnonymousOrphic Argonautica25626712OrpheusmaleArgo, Argonauts1
34233greekAnonymousOrphic Argonautica28329412JasonmaleArgonauts1
44234greekAnonymousOrphic Argonautica33535420Orpheusmalegods of the sea1
.......................................
46871864latinVirgilAeneid1287288413JuturnafemaleJuturna, Turnus1
46881865latinVirgilAeneid128898935AeneasmaleTurnus1
46891866latinVirgilAeneid128948952TurnusmaleAeneas2
46901867latinVirgilAeneid129319388TurnusmaleAeneas1
46911868latinVirgilAeneid129479493AeneasmaleTurnus2
\n", "

4692 rows × 12 columns

\n", "
" ], "text/plain": [ " speech_id language author work prefix first_line \\\n", "0 4230 greek Anonymous Orphic Argonautica 77 \n", "1 4231 greek Anonymous Orphic Argonautica 98 \n", "2 4232 greek Anonymous Orphic Argonautica 256 \n", "3 4233 greek Anonymous Orphic Argonautica 283 \n", "4 4234 greek Anonymous Orphic Argonautica 335 \n", "... ... ... ... ... ... ... \n", "4687 1864 latin Virgil Aeneid 12 872 \n", "4688 1865 latin Virgil Aeneid 12 889 \n", "4689 1866 latin Virgil Aeneid 12 894 \n", "4690 1867 latin Virgil Aeneid 12 931 \n", "4691 1868 latin Virgil Aeneid 12 947 \n", "\n", " last_line num_lines speaker gender addressee turn \n", "0 96 20 Jason male Orpheus 1 \n", "1 113 16 Orpheus male Jason 2 \n", "2 267 12 Orpheus male Argo, Argonauts 1 \n", "3 294 12 Jason male Argonauts 1 \n", "4 354 20 Orpheus male gods of the sea 1 \n", "... ... ... ... ... ... ... \n", "4687 884 13 Juturna female Juturna, Turnus 1 \n", "4688 893 5 Aeneas male Turnus 1 \n", "4689 895 2 Turnus male Aeneas 2 \n", "4690 938 8 Turnus male Aeneas 1 \n", "4691 949 3 Aeneas male Turnus 2 \n", "\n", "[4692 rows x 12 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# an empty list to hold the rows\n", "rows = list()\n", "\n", "# iterate over the speeches\n", "for speech in speeches:\n", "\n", " # skip speeches that span more than one book/poem\n", " if speech.isMultiPrefix():\n", " print(\"skipping\", speech)\n", " continue\n", "\n", " # rough estimate of number of lines based on subtraction\n", " # - get line numbers, strip alphabetic suffixes like \"101a\"\n", " # - subtract first line from last line and add one\n", " first_line = speech.getLineNo(\"first\", alpha=False)\n", " last_line = speech.getLineNo(\"last\", alpha=False)\n", " nlines = int(last_line) - int(first_line) + 1\n", "\n", " # get first speaker gender\n", " spkr_gender = speech.spkr[0].gender\n", "\n", " # create a new row, labelling all the data values\n", " row = {\n", " \"speech_id\": speech.id,\n", " \"language\": speech.lang,\n", " \"author\": speech.author.name,\n", " \"work\": speech.work.title,\n", " \"prefix\": speech.getPrefix(),\n", " \"first_line\": speech.getLineNo(\"first\"),\n", " \"last_line\": speech.getLineNo(\"last\"),\n", " \"num_lines\": nlines,\n", " \"speaker\": speech.getSpkrString(),\n", " \"gender\": spkr_gender,\n", " \"addressee\": speech.getAddrString(),\n", " \"turn\": speech.part,\n", " }\n", "\n", " # add the row to the list\n", " rows.append(row)\n", "\n", "# make the table\n", "table = pd.DataFrame(rows)\n", "\n", "# display the table\n", "display(table)" ] }, { "cell_type": "markdown", "id": "9aa03746-697b-44eb-9673-49a3e752228b", "metadata": { "id": "9aa03746-697b-44eb-9673-49a3e752228b" }, "source": [ "#### Export the table as a CSV file\n", "I like to save the output to a CSV file right away. This can be imported into Excel to check the data or to continue your work outside of Python. If you're using Google Colab, you might have to click a \"files\" icon on the left in order to see and download the new file.\n", "\n", "💁🏻‍♂️ *I like tab-separated files better than comma-separated files; I think they're a little easier to read. Excel will import either format without a problem. If you prefer to use commas instead of tabs, just remove the option `sep=\"\\t\"` below.*" ] }, { "cell_type": "code", "execution_count": 6, "id": "ec32cc13-9786-4827-b153-56ac877d054c", "metadata": { "id": "ec32cc13-9786-4827-b153-56ac877d054c" }, "outputs": [], "source": [ "# write the table to a file for import to Excel\n", "table.to_csv(\"speeches.tsv\", sep=\"\\t\", index=False)" ] }, { "cell_type": "markdown", "id": "73ed8396-5abf-4600-981f-21123697e949", "metadata": { "id": "73ed8396-5abf-4600-981f-21123697e949" }, "source": [ "## Manipulating tabular data\n", "Now that we have a table with one row per speech, we probably want to look for different ways to group and summarize the collection. In this workshop we introduce three Pandas methods that aggregate records in ways that might be familiar from Excel." ] }, { "cell_type": "markdown", "id": "d27186f1-d07f-4239-9e39-672893573bbe", "metadata": { "id": "d27186f1-d07f-4239-9e39-672893573bbe" }, "source": [ "### Grouping and aggregating\n", "We can use `groupby()` to group the speeches according to some factor and `aggregate()` to summarize the new groups. The values of our grouping column become the row names of the new table.\n", "\n", "#### Example\n", "Let's find out how many speeches there are in each language. Here's how columns in the original table map onto the rows, columns, and values of the new summary table:\n", "\n", "- We're grouping by column **language**. That means *rows* in the new table will correspond to the values of **language** in the original table\n", "- The column we're summarizing is **speech_id**, because we're counting individual speeches. Let's call the corresponding *column* in our new summary table **speeches**.\n", "- The values of the new column should be how many speeches are in each group. So our *aggregation function* is just `\"count\"`.\n", "\n", "So, our new table will have rows taken from **language**, a single column called **speeches**, in which each value will be the `count` of column **speech_id** in the original table.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "7afdc954-98ec-4efa-b076-faf280569993", "metadata": { "id": "7afdc954-98ec-4efa-b076-faf280569993", "outputId": "911800df-bb36-4bde-dc0d-896016cda1bd" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speeches
language
greek2778
latin1914
\n", "
" ], "text/plain": [ " speeches\n", "language \n", "greek 2778\n", "latin 1914" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# group by column \"language\"\n", "grouped_data = table.groupby(\"language\")\n", "\n", "# aggregate column \"speech_id\" using aggregation function \"count\"\n", "language_summary = grouped_data.aggregate(speeches=(\"speech_id\", \"count\"))\n", "\n", "# show the results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "2a33c319-9830-4893-bdce-d95c18a78cf0", "metadata": { "id": "2a33c319-9830-4893-bdce-d95c18a78cf0" }, "source": [ "#### Alternative style\n", "Here's a slightly different way of doing the same thing. It omits a step by chaining `aggregate()` directly to `groupby()` while adding some whitespace to make the layout of the new table more prominent. `agg()` is a synonym for `aggregate()`. The outer parentheses are necessary so that Python understands when the whole chain is finished." ] }, { "cell_type": "code", "execution_count": 8, "id": "738a279d-7078-4ebb-8ed4-69e15a8cb834", "metadata": { "id": "738a279d-7078-4ebb-8ed4-69e15a8cb834", "outputId": "127c0f49-d450-4081-bf36-e0b13d1bad86" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speeches
language
greek2778
latin1914
\n", "
" ], "text/plain": [ " speeches\n", "language \n", "greek 2778\n", "latin 1914" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby(\"language\") # column to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - one new column\n", " )\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "972d5945-4881-45be-8d3f-f40624f1b20d", "metadata": { "id": "972d5945-4881-45be-8d3f-f40624f1b20d" }, "source": [ "### More aggregation functions\n", "In this example, each row of the new table represents the group of speeches in one language. Besides just counting **speech_id**s, we can summarize other columns of the original data using more interesting functions.\n", "\n", "#### Example\n", "Let's add some new columns about speech length to our summary table. For each language, we'll give the total number of lines, the longest speech, and the average speech length.\n", "\n", "- All the new columns summarize the column **num_lines** in the original table.\n", "- New column **lines** will use aggregation function `\"sum\"`.\n", "- New column **max_lines** will use aggregation function `\"max\"`.\n", "- New column **avg_lines** will use aggregation function `\"mean\"`.\n", "\n", "We can add these column definitions into the code we used before." ] }, { "cell_type": "code", "execution_count": 9, "id": "8214a06e-c769-476b-aa00-779a8719171a", "metadata": { "id": "8214a06e-c769-476b-aa00-779a8719171a", "outputId": "d6be2895-5f09-4918-e306-b0874aa07b84" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speecheslinesmax_linesavg_lines
language
greek27783188647811.478042
latin19142712359214.170846
\n", "
" ], "text/plain": [ " speeches lines max_lines avg_lines\n", "language \n", "greek 2778 31886 478 11.478042\n", "latin 1914 27123 592 14.170846" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby(\"language\") # column to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " lines = (\"num_lines\", \"sum\"), # - sum lines in all speeches\n", " max_lines = (\"num_lines\", \"max\"), # - length of longest speech\n", " avg_lines = (\"num_lines\", \"mean\"), # - average speech length\n", " )\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "7567107f-ee0c-44b9-82dd-14f5d93de531", "metadata": { "id": "7567107f-ee0c-44b9-82dd-14f5d93de531" }, "source": [ "### Grouping by additional factors\n", "If we provide `groupby()` with a *list* of column names instead of just one, we get a multi-dimensional grouping in which each possible combination of values from the respective columns is taken in turn.\n", "\n", "#### Example\n", "How many speeches are delivered by speakers of each gender in each language? Our *aggregation function* will again be `\"count\"`, but now the rows (or *index*) of our summary table have two levels, corresponding to **language** and **gender**." ] }, { "cell_type": "code", "execution_count": 10, "id": "ae2655f8-8e0c-4d14-8495-7bc4adaaa9c0", "metadata": { "id": "ae2655f8-8e0c-4d14-8495-7bc4adaaa9c0", "outputId": "9c5d8528-7f79-4d49-8dc0-2302b3198bc2" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speeches
languagegender
greekfemale575
male2071
none76
x56
latinfemale546
male1334
none9
x25
\n", "
" ], "text/plain": [ " speeches\n", "language gender \n", "greek female 575\n", " male 2071\n", " none 76\n", " x 56\n", "latin female 546\n", " male 1334\n", " none 9\n", " x 25" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " )\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "2e237d13-2986-446d-b7f7-7884cb9125a1", "metadata": { "id": "2e237d13-2986-446d-b7f7-7884cb9125a1" }, "source": [ "#### Alternative table orientation\n", "To make it easier to compare values across genders, we can *unstack* the table: this rotates the inner level of the *index* or row labels (i.e., gender) over to the column headings." ] }, { "cell_type": "code", "execution_count": 11, "id": "75ba9ce5-ed40-4236-ade4-bd5ac0a2e326", "metadata": { "id": "75ba9ce5-ed40-4236-ade4-bd5ac0a2e326", "outputId": "b458badd-f1d1-4622-f9c6-dc3f88ad03f8" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speeches
genderfemalemalenonex
language
greek57520717656
latin5461334925
\n", "
" ], "text/plain": [ " speeches \n", "gender female male none x\n", "language \n", "greek 575 2071 76 56\n", "latin 546 1334 9 25" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " )\n", " .unstack() # move second factor to columns\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "ccd380cb-3f78-4c82-94de-eab43b457743", "metadata": { "id": "ccd380cb-3f78-4c82-94de-eab43b457743" }, "source": [ "### Complex, multi-factor tables\n", "Let's put everything together and calculate multiple statistics for every combination of language and gender. The results are necessarily complex; you'll have to decide whether the stacked or unstacked version is easier to read.\n", "\n", "#### stacked version" ] }, { "cell_type": "code", "execution_count": 12, "id": "7e81f6ca-4a15-4334-90c4-12c03055e8b6", "metadata": { "id": "7e81f6ca-4a15-4334-90c4-12c03055e8b6", "outputId": "caf962b7-814b-4508-9a53-bbb0adee55e4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speecheslinesmax_linesavg_lines
languagegender
greekfemale575705310012.266087
male20712432147811.743602
none76302343.973684
x56210123.750000
latinfemale546891345016.324176
male13341792259213.434783
none9922510.222222
x25196497.840000
\n", "
" ], "text/plain": [ " speeches lines max_lines avg_lines\n", "language gender \n", "greek female 575 7053 100 12.266087\n", " male 2071 24321 478 11.743602\n", " none 76 302 34 3.973684\n", " x 56 210 12 3.750000\n", "latin female 546 8913 450 16.324176\n", " male 1334 17922 592 13.434783\n", " none 9 92 25 10.222222\n", " x 25 196 49 7.840000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " lines = (\"num_lines\", \"sum\"), # - sum lines in all speeches\n", " max_lines = (\"num_lines\", \"max\"), # - length of longest speech\n", " avg_lines = (\"num_lines\", \"mean\"), # - average speech length\n", " )\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "9fa81c16-58b7-4951-969d-81a2c30ed380", "metadata": { "id": "9fa81c16-58b7-4951-969d-81a2c30ed380" }, "source": [ "#### unstacked version" ] }, { "cell_type": "code", "execution_count": 13, "id": "1d552308-3a92-4688-8d25-add5bc918c6a", "metadata": { "id": "1d552308-3a92-4688-8d25-add5bc918c6a", "outputId": "9969334c-f5df-41e0-9e83-c228a35bbabe" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speecheslinesmax_linesavg_lines
genderfemalemalenonexfemalemalenonexfemalemalenonexfemalemalenonex
language
greek57520717656705324321302210100478341212.26608711.7436023.9736843.75
latin546133492589131792292196450592254916.32417613.43478310.2222227.84
\n", "
" ], "text/plain": [ " speeches lines max_lines \\\n", "gender female male none x female male none x female male none \n", "language \n", "greek 575 2071 76 56 7053 24321 302 210 100 478 34 \n", "latin 546 1334 9 25 8913 17922 92 196 450 592 25 \n", "\n", " avg_lines \n", "gender x female male none x \n", "language \n", "greek 12 12.266087 11.743602 3.973684 3.75 \n", "latin 49 16.324176 13.434783 10.222222 7.84 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " lines = (\"num_lines\", \"sum\"), # - sum lines in all speeches\n", " max_lines = (\"num_lines\", \"max\"), # - length of longest speech\n", " avg_lines = (\"num_lines\", \"mean\"), # - average speech length\n", " )\n", " .unstack() # move second factor to columns\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "88e00860-375e-4fba-8b13-776d6ebe90bb", "metadata": { "id": "88e00860-375e-4fba-8b13-776d6ebe90bb" }, "source": [ "### Bonus: filtering\n", "We don't have time here to explore all the ways to select and filter data, but here's a quick example. The **gender** column includes some values that I'm not interested in at the moment, so I'm going to use the `query()` method to select only speeches where the gender is `'male'` or `'female'`." ] }, { "cell_type": "code", "execution_count": 14, "id": "8d2af79b-f87d-43cc-bfae-a9f350e0a74b", "metadata": { "id": "8d2af79b-f87d-43cc-bfae-a9f350e0a74b", "outputId": "1b7923cc-9cbe-405a-8fd5-649fef156372" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speecheslinesmax_linesavg_lines
genderfemalemalefemalemalefemalemalefemalemale
language
greek575207170532432110047812.26608711.743602
latin546133489131792245059216.32417613.434783
\n", "
" ], "text/plain": [ " speeches lines max_lines avg_lines \n", "gender female male female male female male female male\n", "language \n", "greek 575 2071 7053 24321 100 478 12.266087 11.743602\n", "latin 546 1334 8913 17922 450 592 16.324176 13.434783" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "language_summary = (\n", " table # original data\n", " .query(\"gender in ['male', 'female']\") # filter\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " speeches = (\"speech_id\", \"count\"), # - count of speech ids\n", " lines = (\"num_lines\", \"sum\"), # - sum lines in all speeches\n", " max_lines = (\"num_lines\", \"max\"), # - length of longest speech\n", " avg_lines = (\"num_lines\", \"mean\"), # - average speech length\n", " )\n", " .unstack() # move second factor to columns\n", ")\n", "\n", "# show results\n", "display(language_summary)" ] }, { "cell_type": "markdown", "id": "8d83ff41-6a31-4917-9527-558388bc1b91", "metadata": {}, "source": [ "## Further reading\n", "Pandas provides complete documentation for the methods we use here. They also have some helpful tutorials.\n", "\n", "- [Guide to grouping and aggregating](https://pandas.pydata.org/docs/user_guide/groupby.html)\n", "- [Complete user guide to Pandas](https://pandas.pydata.org/docs/user_guide/index.html)\n", "- [Manual page for `groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)\n", "- [Manual page for `aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html)\n", "- [Manual page for `query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)" ] }, { "cell_type": "code", "execution_count": 17, "id": "ce12337e-234b-41ac-9605-605c6be13a15", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
summaxmean
genderfemalemalefemalemalefemalemale
language
greek70532432110047812.26608711.743602
latin89131792245059216.32417613.434783
\n", "
" ], "text/plain": [ " sum max mean \n", "gender female male female male female male\n", "language \n", "greek 7053 24321 100 478 12.266087 11.743602\n", "latin 8913 17922 450 592 16.324176 13.434783" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " table\n", " .query(\"gender in ['male', 'female']\")\n", " .pivot_table(\n", " index = \"language\",\n", " columns = \"gender\",\n", " values = \"num_lines\",\n", " aggfunc = [\"sum\", \"max\", \"mean\"],\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "id": "92dbe7ad-677a-4969-a26b-dc32bcb3455c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
linesmax_linesavg_lines
genderfemalemalefemalemalefemalemale
language
greek70532432110047812.26608711.743602
latin89131792245059216.32417613.434783
\n", "
" ], "text/plain": [ " lines max_lines avg_lines \n", "gender female male female male female male\n", "language \n", "greek 7053 24321 100 478 12.266087 11.743602\n", "latin 8913 17922 450 592 16.324176 13.434783" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " table # original data\n", " .query(\"gender in ['male', 'female']\") # filter\n", " .groupby([\"language\", \"gender\"]) # columns to group by\n", " .agg( # define summary table\n", " lines = (\"num_lines\", \"sum\"), # - sum lines in all speeches\n", " max_lines = (\"num_lines\", \"max\"), # - length of longest speech\n", " avg_lines = (\"num_lines\", \"mean\"), # - average speech length\n", " )\n", " .unstack() # move second factor to columns\n", ")" ] } ], "metadata": { "colab": { "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 5 }