{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Manipulation with Pandas\n", "\n", "- [Download the lecture notes](https://philchodrow.github.io/PIC16A/content/pd/pd_1.ipynb). \n", "\n", "The `pandas` package for Python offers a set of powerful tools for working with tabular data. The name `pandas` is originally derived from the common term \"panel data.\" While different programmers pronounce the package name in different ways, I prefer to pronounce it like the plural of \"panda [bears]\". I then imagine a small army of adorable bears performing my computations for me. \n", "\n", "
\n", " \"A\n", "
This panda is upside down because it is sorting data in reverse order.
\n", "
\n", "\n", "The first step when working with `pandas` is always to `import` it: " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CSV Data, Revisited\n", "\n", "A few weeks ago, we discussed some tools for reading CSV data from files using the `csv` module. `pandas` is usually a better choice for reading (and working with) CSV data. To read a CSV file from data, we use the function `pd.read_csv()`. First, the following code block will place a copy of our data into the current working directory. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import urllib\n", "\n", "url = \"https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv\"\n", "filedata = urllib.request.urlopen(url)\n", "to_write = filedata.read()\n", "\n", "with open(\"palmer_penguins.csv\", \"wb\") as f:\n", " f.write(to_write)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can read in the the file as a DataFrame:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins = pd.read_csv(\"palmer_penguins.csv\")\n", "type(penguins)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's inspect our new DataFrame:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studyNameSample NumberSpeciesRegionIslandStageIndividual IDClutch CompletionDate EggCulmen Length (mm)Culmen Depth (mm)Flipper Length (mm)Body Mass (g)SexDelta 15 N (o/oo)Delta 13 C (o/oo)Comments
0PAL07081Adelie Penguin (Pygoscelis adeliae)AnversTorgersenAdult, 1 Egg StageN1A1Yes11/11/0739.118.7181.03750.0MALENaNNaNNot enough blood for isotopes.
1PAL07082Adelie Penguin (Pygoscelis adeliae)AnversTorgersenAdult, 1 Egg StageN1A2Yes11/11/0739.517.4186.03800.0FEMALE8.94956-24.69454NaN
2PAL07083Adelie Penguin (Pygoscelis adeliae)AnversTorgersenAdult, 1 Egg StageN2A1Yes11/16/0740.318.0195.03250.0FEMALE8.36821-25.33302NaN
3PAL07084Adelie Penguin (Pygoscelis adeliae)AnversTorgersenAdult, 1 Egg StageN2A2Yes11/16/07NaNNaNNaNNaNNaNNaNNaNAdult not sampled.
4PAL07085Adelie Penguin (Pygoscelis adeliae)AnversTorgersenAdult, 1 Egg StageN3A1Yes11/16/0736.719.3193.03450.0FEMALE8.76651-25.32426NaN
......................................................
339PAL0910120Gentoo penguin (Pygoscelis papua)AnversBiscoeAdult, 1 Egg StageN38A2No12/1/09NaNNaNNaNNaNNaNNaNNaNNaN
340PAL0910121Gentoo penguin (Pygoscelis papua)AnversBiscoeAdult, 1 Egg StageN39A1Yes11/22/0946.814.3215.04850.0FEMALE8.41151-26.13832NaN
341PAL0910122Gentoo penguin (Pygoscelis papua)AnversBiscoeAdult, 1 Egg StageN39A2Yes11/22/0950.415.7222.05750.0MALE8.30166-26.04117NaN
342PAL0910123Gentoo penguin (Pygoscelis papua)AnversBiscoeAdult, 1 Egg StageN43A1Yes11/22/0945.214.8212.05200.0FEMALE8.24246-26.11969NaN
343PAL0910124Gentoo penguin (Pygoscelis papua)AnversBiscoeAdult, 1 Egg StageN43A2Yes11/22/0949.916.1213.05400.0MALE8.36390-26.15531NaN
\n", "

344 rows × 17 columns

\n", "
" ], "text/plain": [ " studyName Sample Number Species Region \\\n", "0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers \n", "1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers \n", "2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers \n", "3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers \n", "4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers \n", ".. ... ... ... ... \n", "339 PAL0910 120 Gentoo penguin (Pygoscelis papua) Anvers \n", "340 PAL0910 121 Gentoo penguin (Pygoscelis papua) Anvers \n", "341 PAL0910 122 Gentoo penguin (Pygoscelis papua) Anvers \n", "342 PAL0910 123 Gentoo penguin (Pygoscelis papua) Anvers \n", "343 PAL0910 124 Gentoo penguin (Pygoscelis papua) Anvers \n", "\n", " Island Stage Individual ID Clutch Completion Date Egg \\\n", "0 Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 \n", "1 Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 \n", "2 Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 \n", "3 Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 \n", "4 Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 \n", ".. ... ... ... ... ... \n", "339 Biscoe Adult, 1 Egg Stage N38A2 No 12/1/09 \n", "340 Biscoe Adult, 1 Egg Stage N39A1 Yes 11/22/09 \n", "341 Biscoe Adult, 1 Egg Stage N39A2 Yes 11/22/09 \n", "342 Biscoe Adult, 1 Egg Stage N43A1 Yes 11/22/09 \n", "343 Biscoe Adult, 1 Egg Stage N43A2 Yes 11/22/09 \n", "\n", " Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) \\\n", "0 39.1 18.7 181.0 \n", "1 39.5 17.4 186.0 \n", "2 40.3 18.0 195.0 \n", "3 NaN NaN NaN \n", "4 36.7 19.3 193.0 \n", ".. ... ... ... \n", "339 NaN NaN NaN \n", "340 46.8 14.3 215.0 \n", "341 50.4 15.7 222.0 \n", "342 45.2 14.8 212.0 \n", "343 49.9 16.1 213.0 \n", "\n", " Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) \\\n", "0 3750.0 MALE NaN NaN \n", "1 3800.0 FEMALE 8.94956 -24.69454 \n", "2 3250.0 FEMALE 8.36821 -25.33302 \n", "3 NaN NaN NaN NaN \n", "4 3450.0 FEMALE 8.76651 -25.32426 \n", ".. ... ... ... ... \n", "339 NaN NaN NaN NaN \n", "340 4850.0 FEMALE 8.41151 -26.13832 \n", "341 5750.0 MALE 8.30166 -26.04117 \n", "342 5200.0 FEMALE 8.24246 -26.11969 \n", "343 5400.0 MALE 8.36390 -26.15531 \n", "\n", " Comments \n", "0 Not enough blood for isotopes. \n", "1 NaN \n", "2 NaN \n", "3 Adult not sampled. \n", "4 NaN \n", ".. ... \n", "339 NaN \n", "340 NaN \n", "341 NaN \n", "342 NaN \n", "343 NaN \n", "\n", "[344 rows x 17 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(344, 17)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.shape # (rows, columns)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "studyName object\n", "Sample Number int64\n", "Species object\n", "Region object\n", "Island object\n", "Stage object\n", "Individual ID object\n", "Clutch Completion object\n", "Date Egg object\n", "Culmen Length (mm) float64\n", "Culmen Depth (mm) float64\n", "Flipper Length (mm) float64\n", "Body Mass (g) float64\n", "Sex object\n", "Delta 15 N (o/oo) float64\n", "Delta 13 C (o/oo) float64\n", "Comments object\n", "dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.dtypes # data type of each column. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A data type of `object` means that `pandas` isn't sure what kind of data are in the corresponding columns. This is very common when the columns contain strings. [It is possible](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) to give string columns a dedicated data type, but we won't focus on that here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pleasant way to get a quick overview of the numerical columns in your data set is the `describe` method. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sample NumberCulmen Length (mm)Culmen Depth (mm)Flipper Length (mm)Body Mass (g)Delta 15 N (o/oo)Delta 13 C (o/oo)
count344.000000342.000000342.000000342.000000342.000000330.000000331.000000
mean63.15116343.92193017.151170200.9152054201.7543868.733382-25.686292
std40.4301995.4595841.97479314.061714801.9545360.5517700.793961
min1.00000032.10000013.100000172.0000002700.0000007.632200-27.018540
25%29.00000039.22500015.600000190.0000003550.0000008.299890-26.320305
50%58.00000044.45000017.300000197.0000004050.0000008.652405-25.833520
75%95.25000048.50000018.700000213.0000004750.0000009.172123-25.062050
max152.00000059.60000021.500000231.0000006300.00000010.025440-23.787670
\n", "
" ], "text/plain": [ " Sample Number Culmen Length (mm) Culmen Depth (mm) \\\n", "count 344.000000 342.000000 342.000000 \n", "mean 63.151163 43.921930 17.151170 \n", "std 40.430199 5.459584 1.974793 \n", "min 1.000000 32.100000 13.100000 \n", "25% 29.000000 39.225000 15.600000 \n", "50% 58.000000 44.450000 17.300000 \n", "75% 95.250000 48.500000 18.700000 \n", "max 152.000000 59.600000 21.500000 \n", "\n", " Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) \\\n", "count 342.000000 342.000000 330.000000 \n", "mean 200.915205 4201.754386 8.733382 \n", "std 14.061714 801.954536 0.551770 \n", "min 172.000000 2700.000000 7.632200 \n", "25% 190.000000 3550.000000 8.299890 \n", "50% 197.000000 4050.000000 8.652405 \n", "75% 213.000000 4750.000000 9.172123 \n", "max 231.000000 6300.000000 10.025440 \n", "\n", " Delta 13 C (o/oo) \n", "count 331.000000 \n", "mean -25.686292 \n", "std 0.793961 \n", "min -27.018540 \n", "25% -26.320305 \n", "50% -25.833520 \n", "75% -25.062050 \n", "max -23.787670 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parts of a Data Frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with data frames, it's important to get comfortable with their different parts. These are: \n", "\n", "1. The index. The index is used to refer to **rows**. In many cases, you can think of the index as a unique numerical label for a row. \n", "2. The column names. These tell you what kinds of data appear in each row. It is important to be able to comfortably grab columns from the data frame for use in computations. \n", "3. The data itself. You can think of the data as a set of different arrays, one for each column name. Each array has the same length. Many of the methods of these arrays will be familiar from `np.array`s. \n", "\n", "
\n", " \"A\n", "
The parts of a data frame.
\n", "
\n", "\n", "Let's now begin to look at how to obtain different parts of a data frame.\n", "\n", "## Selecting Columns\n", "\n", "The easiest way to select a column of a data frame is to pass the name of the column to the DataFrame with `[]` brackets. In this way, you can think of a data frame as being similar to a dictionary whose keys are the column names. \n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Anvers\n", "1 Anvers\n", "2 Anvers\n", "3 Anvers\n", "4 Anvers\n", " ... \n", "339 Anvers\n", "340 Anvers\n", "341 Anvers\n", "342 Anvers\n", "343 Anvers\n", "Name: Region, Length: 344, dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins['Region']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is no longer a data frame, but rather a `pd.Series` object, which is similar to a `np.array`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select multiple columns, pass a list of column names: " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SpeciesRegionIsland
0Adelie Penguin (Pygoscelis adeliae)AnversTorgersen
1Adelie Penguin (Pygoscelis adeliae)AnversTorgersen
2Adelie Penguin (Pygoscelis adeliae)AnversTorgersen
3Adelie Penguin (Pygoscelis adeliae)AnversTorgersen
4Adelie Penguin (Pygoscelis adeliae)AnversTorgersen
............
339Gentoo penguin (Pygoscelis papua)AnversBiscoe
340Gentoo penguin (Pygoscelis papua)AnversBiscoe
341Gentoo penguin (Pygoscelis papua)AnversBiscoe
342Gentoo penguin (Pygoscelis papua)AnversBiscoe
343Gentoo penguin (Pygoscelis papua)AnversBiscoe
\n", "

344 rows × 3 columns

\n", "
" ], "text/plain": [ " Species Region Island\n", "0 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen\n", "1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen\n", "2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen\n", "3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen\n", "4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen\n", ".. ... ... ...\n", "339 Gentoo penguin (Pygoscelis papua) Anvers Biscoe\n", "340 Gentoo penguin (Pygoscelis papua) Anvers Biscoe\n", "341 Gentoo penguin (Pygoscelis papua) Anvers Biscoe\n", "342 Gentoo penguin (Pygoscelis papua) Anvers Biscoe\n", "343 Gentoo penguin (Pygoscelis papua) Anvers Biscoe\n", "\n", "[344 rows x 3 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins[['Species', 'Region', 'Island']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, the result is a data frame containing the specified columns. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# L1\n", "# intro, parts of a df, select\n", "# indexing and filter\n", "\n", "\n", "# L2\n", "# mutate\n", "# group_by, summarise\n", "\n", "# L3\n", "# code patterns with visualization " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }