{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using data frame indexes\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Colab setup ------------------\n", "import os, sys, subprocess\n", "if \"google.colab\" in sys.modules:\n", " cmd = \"pip install --upgrade watermark\"\n", " process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " stdout, stderr = process.communicate()\n", " data_path = \"https://s3.amazonaws.com/bebi103.caltech.edu/data/\"\n", "else:\n", " data_path = \"../data/\"\n", "# ------------------------------\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "We continue to use the face matching data from the [Beattie, et al. paper](https://doi.org/10.1098/rsos.160321)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
08f39658072.591.090.093.083.593.090.09132
116m42909090.075.555.570.550.075.050.04117
218f31909592.589.590.086.081.089.088.01093
322f351007587.589.5NaN71.080.088.080.013820
427f74606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence incorrect hit confidence correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess \n", "0 90.0 9 13 2 \n", "1 50.0 4 11 7 \n", "2 88.0 10 9 3 \n", "3 80.0 13 8 20 \n", "4 49.0 13 9 12 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we have used Boolean indexing for extracting data out of data frames, and I advocate for taking primarily that approach. The logic and syntax are very clean. In this sense, the index of a data frame is disposable. In fact, Hadley Wickham [advocates for disposing of them completely](https://adv-r.hadley.nz/vectors-chap.html#rownames). We will mostly dispose of them.\n", "\n", "However, when wrangling, we often need to use indexes, so let's get more familiar with them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing index\n", "\n", "As I mentioned before, indexes are immutable. Let's try changing the index of our data frame." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Index does not support mutable operations", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'index 7'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36m__setitem__\u001b[0;34m(self, key, value)\u001b[0m\n\u001b[1;32m 4079\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4080\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__setitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4081\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Index does not support mutable operations\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4082\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4083\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: Index does not support mutable operations" ] } ], "source": [ "df.index[7] = 'index 7'" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "But we can change our index wholesale. That is, we can set `df.index` to a list and all indices in the index will be updated." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
18f39658072.591.090.093.083.593.090.09132
216m42909090.075.555.570.550.075.050.04117
318f31909592.589.590.086.081.089.088.01093
422f351007587.589.5NaN71.080.088.080.013820
527f74606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "1 8 f 39 65 \n", "2 16 m 42 90 \n", "3 18 f 31 90 \n", "4 22 f 35 100 \n", "5 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "1 80 72.5 91.0 \n", "2 90 90.0 75.5 \n", "3 95 92.5 89.5 \n", "4 75 87.5 89.5 \n", "5 65 62.5 68.5 \n", "\n", " confidence incorrect hit confidence correct reject \\\n", "1 90.0 93.0 \n", "2 55.5 70.5 \n", "3 90.0 86.0 \n", "4 NaN 71.0 \n", "5 49.0 61.0 \n", "\n", " confidence incorrect reject confidence when correct \\\n", "1 83.5 93.0 \n", "2 50.0 75.0 \n", "3 81.0 89.0 \n", "4 80.0 88.0 \n", "5 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess \n", "1 90.0 9 13 2 \n", "2 50.0 4 11 7 \n", "3 88.0 10 9 3 \n", "4 80.0 13 8 20 \n", "5 49.0 13 9 12 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Just to demonstrate, shift to 1-based indexing\n", "df.index = df.index + 1\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may instead wish to have one of the columns of the data frame serve as the index. It would make sense in this case to index by participant number. We can do that using the `set_index()` method of the data frame." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
participant number
8f39658072.591.090.093.083.593.090.09132
16m42909090.075.555.570.550.075.050.04117
18f31909592.589.590.086.081.089.088.01093
22f351007587.589.5NaN71.080.088.080.013820
27f74606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " gender age correct hit percentage \\\n", "participant number \n", "8 f 39 65 \n", "16 m 42 90 \n", "18 f 31 90 \n", "22 f 35 100 \n", "27 f 74 60 \n", "\n", " correct reject percentage percent correct \\\n", "participant number \n", "8 80 72.5 \n", "16 90 90.0 \n", "18 95 92.5 \n", "22 75 87.5 \n", "27 65 62.5 \n", "\n", " confidence when correct hit confidence incorrect hit \\\n", "participant number \n", "8 91.0 90.0 \n", "16 75.5 55.5 \n", "18 89.5 90.0 \n", "22 89.5 NaN \n", "27 68.5 49.0 \n", "\n", " confidence correct reject confidence incorrect reject \\\n", "participant number \n", "8 93.0 83.5 \n", "16 70.5 50.0 \n", "18 86.0 81.0 \n", "22 71.0 80.0 \n", "27 61.0 49.0 \n", "\n", " confidence when correct confidence when incorrect sci \\\n", "participant number \n", "8 93.0 90.0 9 \n", "16 75.0 50.0 4 \n", "18 89.0 88.0 10 \n", "22 88.0 80.0 13 \n", "27 65.0 49.0 13 \n", "\n", " psqi ess \n", "participant number \n", "8 13 2 \n", "16 11 7 \n", "18 9 3 \n", "22 8 20 \n", "27 9 12 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index(\"participant number\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice now that the index of the data frame has a **name**. We can also now index the records we want directly using the participant number." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender m\n", "age 42\n", "correct hit percentage 90\n", "correct reject percentage 90\n", "percent correct 90\n", "confidence when correct hit 75.5\n", "confidence incorrect hit 55.5\n", "confidence correct reject 70.5\n", "confidence incorrect reject 50\n", "confidence when correct 75\n", "confidence when incorrect 50\n", "sci 4\n", "psqi 11\n", "ess 7\n", "Name: 16, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[16]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that when we index this way, we get a series where the columns of the data frame now comprise the index of the series.\n", "\n", "If we wish to make the index into a column (or columns in the case of Multiindexes, which we will discuss next) of the data frame, we can use the `reset_index()` method." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
08f39658072.591.090.093.083.593.090.09132
116m42909090.075.555.570.550.075.050.04117
218f31909592.589.590.086.081.089.088.01093
322f351007587.589.5NaN71.080.088.080.013820
427f74606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence incorrect hit confidence correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess \n", "0 90.0 9 13 2 \n", "1 50.0 4 11 7 \n", "2 88.0 10 9 3 \n", "3 80.0 13 8 20 \n", "4 49.0 13 9 12 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.reset_index()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index was reset to a range index in the process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Data frames are not changed in place by default\n", "\n", "Note that when we set the index, we used\n", "\n", "```python\n", "df = df.set_index('participant number')\n", "```\n", "\n", "instead of\n", "\n", "```python\n", "df.set_index('participant number')\n", "```\n", "\n", "The latter would create a data frame indexed by participant number, but the value of the variable `df` would not be changed. Instead, you need to explicitly make the assignment as is done in the former. Pandas in general will be cowardly in changing your data frame, which is a good idea.\n", "\n", "Note that many methods have an `inplace` keyword argument, which will then allow the data frame to be changed in place. I generally avoid this because I find code where the assignment is explicit, right there at the front of the line, easier to read." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiindexes\n", "\n", "Let's say that we know we will be interested in pulling out results based on gender. For example, if we wanted all records for females, we could use Boolean indexing with the current data frame as\n", "\n", "```python\n", "df.loc[df['gender']=='f', :]\n", "```\n", "\n", "This is a perfectly good way of doing it. But we may want more speed, which we can get by indexing directly instead. So, we might want to index by gender." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
gender
f839658072.591.090.093.083.593.090.09132
m1642909090.075.555.570.550.075.050.04117
f1831909592.589.590.086.081.089.088.01093
f22351007587.589.5NaN71.080.088.080.013820
f2774606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " participant number age correct hit percentage \\\n", "gender \n", "f 8 39 65 \n", "m 16 42 90 \n", "f 18 31 90 \n", "f 22 35 100 \n", "f 27 74 60 \n", "\n", " correct reject percentage percent correct \\\n", "gender \n", "f 80 72.5 \n", "m 90 90.0 \n", "f 95 92.5 \n", "f 75 87.5 \n", "f 65 62.5 \n", "\n", " confidence when correct hit confidence incorrect hit \\\n", "gender \n", "f 91.0 90.0 \n", "m 75.5 55.5 \n", "f 89.5 90.0 \n", "f 89.5 NaN \n", "f 68.5 49.0 \n", "\n", " confidence correct reject confidence incorrect reject \\\n", "gender \n", "f 93.0 83.5 \n", "m 70.5 50.0 \n", "f 86.0 81.0 \n", "f 71.0 80.0 \n", "f 61.0 49.0 \n", "\n", " confidence when correct confidence when incorrect sci psqi ess \n", "gender \n", "f 93.0 90.0 9 13 2 \n", "m 75.0 50.0 4 11 7 \n", "f 89.0 88.0 10 9 3 \n", "f 88.0 80.0 13 8 20 \n", "f 65.0 49.0 13 9 12 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index('gender')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note now that we have **repeated indices**. This is totally legal. If we now want to take out all of the female entries, we can do so by direct indexing." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
gender
f839658072.591.090.093.083.593.090.09132
f1831909592.589.590.086.081.089.088.01093
f22351007587.589.5NaN71.080.088.080.013820
f2774606562.568.549.061.049.065.049.013912
f2861802050.071.063.031.072.564.570.515142
\n", "
" ], "text/plain": [ " participant number age correct hit percentage \\\n", "gender \n", "f 8 39 65 \n", "f 18 31 90 \n", "f 22 35 100 \n", "f 27 74 60 \n", "f 28 61 80 \n", "\n", " correct reject percentage percent correct \\\n", "gender \n", "f 80 72.5 \n", "f 95 92.5 \n", "f 75 87.5 \n", "f 65 62.5 \n", "f 20 50.0 \n", "\n", " confidence when correct hit confidence incorrect hit \\\n", "gender \n", "f 91.0 90.0 \n", "f 89.5 90.0 \n", "f 89.5 NaN \n", "f 68.5 49.0 \n", "f 71.0 63.0 \n", "\n", " confidence correct reject confidence incorrect reject \\\n", "gender \n", "f 93.0 83.5 \n", "f 86.0 81.0 \n", "f 71.0 80.0 \n", "f 61.0 49.0 \n", "f 31.0 72.5 \n", "\n", " confidence when correct confidence when incorrect sci psqi ess \n", "gender \n", "f 93.0 90.0 9 13 2 \n", "f 89.0 88.0 10 9 3 \n", "f 88.0 80.0 13 8 20 \n", "f 65.0 49.0 13 9 12 \n", "f 64.5 70.5 15 14 2 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['f'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the main reason you might do this is for speed. To check, we can measure the time it takes to pull the female records, first by direct indexing and then by Boolean indexing. Before the Boolean indexing, we'll reset the index so that we are back to dealing with the original data frame." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "150 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "307 µs ± 15.5 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit df.loc['f']\n", "\n", "df = df.reset_index()\n", "%timeit df.loc[df['gender']=='f']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this small data frame, direct indexing is about twice as fast, and can be even faster for larger data frames.\n", "\n", "If we do, in fact, want to use direct indexing, as opposed to Boolean indexing, for pulling rows out of a data frame, we should have unique indices. If we still wish to index by gender, this can be a problem. To address this, we can use a **multiindex**. To create a multiindex for a data frame, we can use `set_index()` with a list of column names to use as indexes, as opposed to a single column name." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
genderparticipant number
f839658072.591.090.093.083.593.090.09132
m1642909090.075.555.570.550.075.050.04117
f1831909592.589.590.086.081.089.088.01093
22351007587.589.5NaN71.080.088.080.013820
2774606562.568.549.061.049.065.049.013912
\n", "
" ], "text/plain": [ " age correct hit percentage \\\n", "gender participant number \n", "f 8 39 65 \n", "m 16 42 90 \n", "f 18 31 90 \n", " 22 35 100 \n", " 27 74 60 \n", "\n", " correct reject percentage percent correct \\\n", "gender participant number \n", "f 8 80 72.5 \n", "m 16 90 90.0 \n", "f 18 95 92.5 \n", " 22 75 87.5 \n", " 27 65 62.5 \n", "\n", " confidence when correct hit \\\n", "gender participant number \n", "f 8 91.0 \n", "m 16 75.5 \n", "f 18 89.5 \n", " 22 89.5 \n", " 27 68.5 \n", "\n", " confidence incorrect hit \\\n", "gender participant number \n", "f 8 90.0 \n", "m 16 55.5 \n", "f 18 90.0 \n", " 22 NaN \n", " 27 49.0 \n", "\n", " confidence correct reject \\\n", "gender participant number \n", "f 8 93.0 \n", "m 16 70.5 \n", "f 18 86.0 \n", " 22 71.0 \n", " 27 61.0 \n", "\n", " confidence incorrect reject \\\n", "gender participant number \n", "f 8 83.5 \n", "m 16 50.0 \n", "f 18 81.0 \n", " 22 80.0 \n", " 27 49.0 \n", "\n", " confidence when correct confidence when incorrect \\\n", "gender participant number \n", "f 8 93.0 90.0 \n", "m 16 75.0 50.0 \n", "f 18 89.0 88.0 \n", " 22 88.0 80.0 \n", " 27 65.0 49.0 \n", "\n", " sci psqi ess \n", "gender participant number \n", "f 8 9 13 2 \n", "m 16 4 11 7 \n", "f 18 10 9 3 \n", " 22 13 8 20 \n", " 27 13 9 12 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index(['gender', 'participant number'])\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice now that the index consists of two columns, both with names. To slice by a multiindex, we enter the indices as tuples. 
For example, to get the record for participant number 18, a female, we could do" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 31.0\n", "correct hit percentage 90.0\n", "correct reject percentage 95.0\n", "percent correct 92.5\n", "confidence when correct hit 89.5\n", "confidence incorrect hit 90.0\n", "confidence correct reject 86.0\n", "confidence incorrect reject 81.0\n", "confidence when correct 89.0\n", "confidence when incorrect 88.0\n", "sci 10.0\n", "psqi 9.0\n", "ess 3.0\n", "Name: (f, 18), dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[('f', 18)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted participants 8 and 18, both females, we would use a list for the second level of the index. We need to include the colon for the column location to get all columns for those rows." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
genderparticipant number
f839658072.591.090.093.083.593.090.09132
1831909592.589.590.086.081.089.088.01093
\n", "
" ], "text/plain": [ " age correct hit percentage \\\n", "gender participant number \n", "f 8 39 65 \n", " 18 31 90 \n", "\n", " correct reject percentage percent correct \\\n", "gender participant number \n", "f 8 80 72.5 \n", " 18 95 92.5 \n", "\n", " confidence when correct hit \\\n", "gender participant number \n", "f 8 91.0 \n", " 18 89.5 \n", "\n", " confidence incorrect hit \\\n", "gender participant number \n", "f 8 90.0 \n", " 18 90.0 \n", "\n", " confidence correct reject \\\n", "gender participant number \n", "f 8 93.0 \n", " 18 86.0 \n", "\n", " confidence incorrect reject \\\n", "gender participant number \n", "f 8 83.5 \n", " 18 81.0 \n", "\n", " confidence when correct confidence when incorrect \\\n", "gender participant number \n", "f 8 93.0 90.0 \n", " 18 89.0 88.0 \n", "\n", " sci psqi ess \n", "gender participant number \n", "f 8 9 13 2 \n", " 18 10 9 3 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[('f', [8, 18]), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wanted records for participants 8, 16, and 18? Participant 16 is a male, so we effectively want to ignore the first index. We can do that by inserting `slice(None)` for the first index." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence incorrect hitconfidence correct rejectconfidence incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
genderparticipant number
f839658072.591.090.093.083.593.090.09132
m1642909090.075.555.570.550.075.050.04117
f1831909592.589.590.086.081.089.088.01093
\n", "
" ], "text/plain": [ " age correct hit percentage \\\n", "gender participant number \n", "f 8 39 65 \n", "m 16 42 90 \n", "f 18 31 90 \n", "\n", " correct reject percentage percent correct \\\n", "gender participant number \n", "f 8 80 72.5 \n", "m 16 90 90.0 \n", "f 18 95 92.5 \n", "\n", " confidence when correct hit \\\n", "gender participant number \n", "f 8 91.0 \n", "m 16 75.5 \n", "f 18 89.5 \n", "\n", " confidence incorrect hit \\\n", "gender participant number \n", "f 8 90.0 \n", "m 16 55.5 \n", "f 18 90.0 \n", "\n", " confidence correct reject \\\n", "gender participant number \n", "f 8 93.0 \n", "m 16 70.5 \n", "f 18 86.0 \n", "\n", " confidence incorrect reject \\\n", "gender participant number \n", "f 8 83.5 \n", "m 16 50.0 \n", "f 18 81.0 \n", "\n", " confidence when correct confidence when incorrect \\\n", "gender participant number \n", "f 8 93.0 90.0 \n", "m 16 75.0 50.0 \n", "f 18 89.0 88.0 \n", "\n", " sci psqi ess \n", "gender participant number \n", "f 8 9 13 2 \n", "m 16 4 11 7 \n", "f 18 10 9 3 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(slice(None), [8, 16, 18]), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we left the female specification in there, number 16 is simply ignored." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th></th><th>age</th><th>correct hit percentage</th><th>correct reject percentage</th><th>percent correct</th><th>confidence when correct hit</th><th>confidence incorrect hit</th><th>confidence correct reject</th><th>confidence incorrect reject</th><th>confidence when correct</th><th>confidence when incorrect</th><th>sci</th><th>psqi</th><th>ess</th></tr>\n",
"    <tr><th>gender</th><th>participant number</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th rowspan=\"2\" valign=\"top\">f</th><th>8</th><td>39</td><td>65</td><td>80</td><td>72.5</td><td>91.0</td><td>90.0</td><td>93.0</td><td>83.5</td><td>93.0</td><td>90.0</td><td>9</td><td>13</td><td>2</td></tr>\n",
"    <tr><th>18</th><td>31</td><td>90</td><td>95</td><td>92.5</td><td>89.5</td><td>90.0</td><td>86.0</td><td>81.0</td><td>89.0</td><td>88.0</td><td>10</td><td>9</td><td>3</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " age correct hit percentage \\\n", "gender participant number \n", "f 8 39 65 \n", " 18 31 90 \n", "\n", " correct reject percentage percent correct \\\n", "gender participant number \n", "f 8 80 72.5 \n", " 18 95 92.5 \n", "\n", " confidence when correct hit \\\n", "gender participant number \n", "f 8 91.0 \n", " 18 89.5 \n", "\n", " confidence incorrect hit \\\n", "gender participant number \n", "f 8 90.0 \n", " 18 90.0 \n", "\n", " confidence correct reject \\\n", "gender participant number \n", "f 8 93.0 \n", " 18 86.0 \n", "\n", " confidence incorrect reject \\\n", "gender participant number \n", "f 8 83.5 \n", " 18 81.0 \n", "\n", " confidence when correct confidence when incorrect \\\n", "gender participant number \n", "f 8 93.0 90.0 \n", " 18 89.0 88.0 \n", "\n", " sci psqi ess \n", "gender participant number \n", "f 8 9 13 2 \n", " 18 10 9 3 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[('f', [8, 16, 18]), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Column names are also indexes\n", "\n", "The concepts we have laid out for row indexes also apply to columns. In its current state, our data frame has multiindexed rows and single-index column names. The column names represent the various aspects of the test (such as percent correct and sleep quality scores), and the row multiindex consists of gender and participant number. By doing a transpose operation, we can swap the columns and rows, giving a data frame where each _column_ represents a single experiment, indexed by gender and participant number. (Note that this is not a tidy data frame, since in tidy data each _row_ represents a single observation/experiment and each column represents an aspect of the observation. We are looking at an untidy data frame here to illustrate how indexes work.)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderfmfmf...mfmf
participant number8161822272830333435...9192949596979899102103
age39.042.031.035.074.061.032.062.033.053.0...62.022.041.046.056.023.070.024.040.033.0
correct hit percentage65.090.090.0100.060.080.090.045.080.0100.0...100.085.035.095.070.070.090.070.075.085.0
correct reject percentage80.090.095.075.065.020.075.090.0100.050.0...80.095.075.080.050.085.085.080.065.040.0
percent correct72.590.092.587.562.550.082.567.590.075.0...90.090.055.087.560.077.587.575.070.062.5
confidence when correct hit91.075.589.589.568.571.067.054.070.574.5...81.066.055.090.063.077.065.561.553.080.0
\n", "

5 rows × 102 columns

\n", "
" ], "text/plain": [ "gender f m f m \\\n", "participant number 8 16 18 22 27 28 30 33 \n", "age 39.0 42.0 31.0 35.0 74.0 61.0 32.0 62.0 \n", "correct hit percentage 65.0 90.0 90.0 100.0 60.0 80.0 90.0 45.0 \n", "correct reject percentage 80.0 90.0 95.0 75.0 65.0 20.0 75.0 90.0 \n", "percent correct 72.5 90.0 92.5 87.5 62.5 50.0 82.5 67.5 \n", "confidence when correct hit 91.0 75.5 89.5 89.5 68.5 71.0 67.0 54.0 \n", "\n", "gender f ... m f m f \\\n", "participant number 34 35 ... 91 92 94 95 96 \n", "age 33.0 53.0 ... 62.0 22.0 41.0 46.0 56.0 \n", "correct hit percentage 80.0 100.0 ... 100.0 85.0 35.0 95.0 70.0 \n", "correct reject percentage 100.0 50.0 ... 80.0 95.0 75.0 80.0 50.0 \n", "percent correct 90.0 75.0 ... 90.0 90.0 55.0 87.5 60.0 \n", "confidence when correct hit 70.5 74.5 ... 81.0 66.0 55.0 90.0 63.0 \n", "\n", "gender \n", "participant number 97 98 99 102 103 \n", "age 23.0 70.0 24.0 40.0 33.0 \n", "correct hit percentage 70.0 90.0 70.0 75.0 85.0 \n", "correct reject percentage 85.0 85.0 80.0 65.0 40.0 \n", "percent correct 77.5 87.5 75.0 70.0 62.5 \n", "confidence when correct hit 77.0 65.5 61.5 53.0 80.0 \n", "\n", "[5 rows x 102 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.transpose()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could sort the hierarchical index of the column to make things look a bit nicer (though sorting is unnecessary when working with the data frame)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderf...m
participant number123456781011...74788081878890919295
age42.045.016.021.018.028.038.039.025.022.0...21.031.028.041.026.066.045.062.022.046.0
correct hit percentage80.080.070.070.090.095.090.065.0100.080.0...40.0100.0100.090.095.060.0100.0100.085.095.0
correct reject percentage65.090.080.065.0100.080.095.080.0100.060.0...40.070.050.085.075.085.095.080.095.080.0
percent correct72.585.075.067.595.087.592.572.5100.070.0...40.085.075.087.585.072.597.590.090.087.5
confidence when correct hit51.575.070.063.576.5100.077.091.090.070.0...90.592.0100.080.085.067.5100.081.066.090.0
\n", "

5 rows × 102 columns

\n", "
" ], "text/plain": [ "gender f \\\n", "participant number 1 2 3 4 5 6 7 8 \n", "age 42.0 45.0 16.0 21.0 18.0 28.0 38.0 39.0 \n", "correct hit percentage 80.0 80.0 70.0 70.0 90.0 95.0 90.0 65.0 \n", "correct reject percentage 65.0 90.0 80.0 65.0 100.0 80.0 95.0 80.0 \n", "percent correct 72.5 85.0 75.0 67.5 95.0 87.5 92.5 72.5 \n", "confidence when correct hit 51.5 75.0 70.0 63.5 76.5 100.0 77.0 91.0 \n", "\n", "gender ... m \\\n", "participant number 10 11 ... 74 78 80 81 87 \n", "age 25.0 22.0 ... 21.0 31.0 28.0 41.0 26.0 \n", "correct hit percentage 100.0 80.0 ... 40.0 100.0 100.0 90.0 95.0 \n", "correct reject percentage 100.0 60.0 ... 40.0 70.0 50.0 85.0 75.0 \n", "percent correct 100.0 70.0 ... 40.0 85.0 75.0 87.5 85.0 \n", "confidence when correct hit 90.0 70.0 ... 90.5 92.0 100.0 80.0 85.0 \n", "\n", "gender \n", "participant number 88 90 91 92 95 \n", "age 66.0 45.0 62.0 22.0 46.0 \n", "correct hit percentage 60.0 100.0 100.0 85.0 95.0 \n", "correct reject percentage 85.0 95.0 80.0 95.0 80.0 \n", "percent correct 72.5 97.5 90.0 90.0 87.5 \n", "confidence when correct hit 67.5 100.0 81.0 66.0 90.0 \n", "\n", "[5 rows x 102 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.sort_index(axis='columns')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can index by gender and participant number for columns as for rows (though we do not need to use `.loc` for columns)." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 22.0\n", "correct hit percentage 80.0\n", "correct reject percentage 60.0\n", "percent correct 70.0\n", "confidence when correct hit 70.0\n", "confidence incorrect hit 70.0\n", "confidence correct reject 70.0\n", "confidence incorrect reject 65.0\n", "confidence when correct 70.0\n", "confidence when incorrect 70.0\n", "sci 22.0\n", "psqi 4.0\n", "ess 6.0\n", "Name: (f, 11), dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[('f', 11)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When to use direct vs Boolean indexing\n", "\n", "I generally use direct indexing only when I need the speed. As we will see, it is sometimes useful to set up multiindexes when wrangling en route to a tidy data frame that can be indexed with Boolean indexing. But aside from those two uses, I generally advocate using simple data frames with a range index for the rows (which is ignored) and a standard (not multi-) index for column names. Importantly, most high-level plotting libraries, including HoloViews, do not recognize indexes as data, and therefore the indexes cannot be conveniently used in making plots.\n", "\n", "Nonetheless, it is important to know how indexes work, since you will often encounter them while wrangling and reading documentation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.8.5\n", "IPython 7.18.1\n", "\n", "pandas 1.1.3\n", "jupyterlab 2.2.6\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p pandas,jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }