{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Structure of a data frame\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Colab setup ------------------\n", "import os, sys, subprocess\n", "if \"google.colab\" in sys.modules:\n", " cmd = \"pip install --upgrade watermark\"\n", " process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " stdout, stderr = process.communicate()\n", " data_path = \"https://s3.amazonaws.com/bebi103.caltech.edu/data/\"\n", "else:\n", " data_path = \"../data/\"\n", "# ------------------------------\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "So far, we have been working with easily readable, pre-tidied data frames. Having data frames in tidy format allows you to harness the power of split-apply-combine operations, whether you are computing with the data or plotting them. Furthermore, Boolean indexing allows for clean syntax in pulling out records of interest.\n", "\n", "However, data are often not stored in CSV files in tidy format. When this is the case, we have to manipulate and reshape data frames, or **wrangle** them, into tidy format. This lesson goes into more depth on data frame structure and capabilities.\n", "\n", "For this part of the lesson, we will continue using the data set we are already familiar with, the face matching data from the [Beattie, et al. paper](https://doi.org/10.1098/rsos.160321). We load it here to have it in hand. The data set is available here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/gfmt_sleep.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/gfmt_sleep.csv)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence incorrect hit confidence correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess \n", "0 90.0 9 13 2 \n", "1 50.0 4 11 7 \n", "2 88.0 10 9 3 \n", "3 80.0 13 8 20 \n", "4 49.0 13 9 12 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The components of a data frame\n", "\n", "Thus far, we have talked about Pandas data frames, and have not carefully explained what they are. To do so, it helps to start by thinking about a Pandas **series**. A series is a collection of data, with each datum having associated with it an **index**. This sounds an awful lot like a dictionary, where the indices are the keys and the data are the values. Like keys of a dictionary, the index of a series is immutable. Like the values of a dictionary, the data are mutable. A key difference, though, is that the indices do not have to be unique.\n", "\n", "A **data frame** is a collection of series that share the same index. For example, the participant number column of the facial matching data frame is a series." 
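, "

Before confirming that with `type()`, here is a minimal sketch of the dictionary analogy above. It uses small made-up series (the names `toy` and `dup` and all of the values are hypothetical, not taken from the data set) to illustrate label lookup, mutable data, an immutable index, and repeated index labels.

```python
import pandas as pd

# Hypothetical toy data; the labels play the role of dictionary keys
toy = pd.Series([39, 42, 31], index=['a', 'b', 'c'])

# Look up a value by its index label, much like a dictionary key
print(toy['b'])

# The data are mutable: assigning through a label changes the value
toy['b'] = 43

# The index is immutable: item assignment on it raises a TypeError
index_is_immutable = False
try:
    toy.index[0] = 'z'
except TypeError as err:
    index_is_immutable = True
    print('TypeError:', err)

# Unlike dictionary keys, index labels may repeat
dup = pd.Series([1, 2, 3], index=['x', 'x', 'y'])
print(dup['x'])
print(dup.index.is_unique)
```

Selecting the repeated label `'x'` should give back both matching values as a series instead of a single number, which is one practical consequence of allowing non-unique indices.
"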
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = df['participant number']\n", "\n", "type(df['participant number'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A note on the words \"index,\" \"indexes,\" and \"indices\"\n", "\n", "At this point, we should clarify some language. When we say \"the index of a series,\" we are referring to the set of \"keys\" for that series. For example, the index for the series given by the participant number column of the facial matching data frame is a range index, going from zero to 101, inclusive." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=102, step=1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we say an \"index of a datum\" or \"index of a row,\" we are referring to a single \"key.\" For example, if we wanted to pull out the value for index 8, we would do the following." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "34" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we say \"indices,\" we mean several of these individual \"keys.\" We can access the values at several indices as follows." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8 34\n", "19 80\n", "27 3\n", "Name: participant number, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[[8, 19, 27]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the indices come along for the ride; `8`, `19`, and `27` are still associated with their respective values.\n", "\n", "Finally, when we say \"indexes,\" we mean more than one of these *sets* of keys. For example, consider a second series taken from the same data frame." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "s2 = df['gender']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would say, \"`s` and `s2` have the same index,\" or \"The indexes of `s` and `s2` are the same.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Columns are indexes\n", "\n", "Internally to Pandas, the column names of a data frame collectively make up an index." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.indexes.base.Index" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To recap:\n", "\n", "- An `Index` is a set of labels for data points that can be thought of analogously to dictionary keys. An index is immutable.\n", "- A Pandas `Series` is an index paired with a data set, where the data set is one-dimensional.\n", "- A Pandas `DataFrame` is a collection of `Series`, all of which have the same index. Each of these series is a column of the data frame. The names of the columns themselves make up an `Index`." 
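, "

As a rough illustration of the recap, the sketch below builds a tiny data frame from two series that share an index. The names `shared_index`, `number`, `gender`, and `toy_df` and all of the values are hypothetical, and `pd.concat` with `axis=1` is just one convenient way to assemble series into columns, not necessarily how Pandas constructs a data frame internally.

```python
import pandas as pd

# Hypothetical toy values; two series built on the same index
shared_index = pd.Index([0, 1, 2])
number = pd.Series([8, 16, 18], index=shared_index, name='participant number')
gender = pd.Series(['f', 'm', 'f'], index=shared_index, name='gender')

# Assemble the series into a data frame, one column per series
toy_df = pd.concat([number, gender], axis=1)

# Each column is a Series that carries the shared row index
print(type(toy_df['gender']))
print(toy_df['gender'].index.equals(toy_df['participant number'].index))

# The column names are themselves stored in an Index
print(isinstance(toy_df.columns, pd.Index))
print(list(toy_df.columns))
```

The same `index.equals()` check could be applied to `s` and `s2` above to verify that their indexes are the same.
"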
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.8.5\n", "IPython 7.18.1\n", "\n", "pandas 1.1.3\n", "jupyterlab 2.2.6\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p pandas,jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }