{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Manipulation with Pandas\n", "\n", "- [Download the lecture notes](https://philchodrow.github.io/PIC16A/content/pd/pd_1.ipynb). \n", "\n", "The `pandas` package for Python offers a set of powerful tools for working with tabular data. The name `pandas` is originally derived from the common term \"panel data.\" While different programmers pronounce the package name in different ways, I prefer to pronounce it like the plural of \"panda [bears]\". I then imagine a small army of adorable bears performing my computations for me. \n", "\n", "\n", "\n", "The first step when working with `pandas` is always to `import` it: " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CSV Data, Revisited\n", "\n", "A few weeks ago, we discussed some tools for reading CSV data from files using the `csv` module. `pandas` is usually a better choice for reading (and working with) CSV data. To read a CSV file from data, we use the function `pd.read_csv()`. First, the following code block will place a copy of our data into the current working directory. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import urllib\n", "\n", "url = \"https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv\"\n", "filedata = urllib.request.urlopen(url)\n", "to_write = filedata.read()\n", "\n", "with open(\"palmer_penguins.csv\", \"wb\") as f:\n", " f.write(to_write)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can read in the the file as a DataFrame:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins = pd.read_csv(\"palmer_penguins.csv\")\n", "type(penguins)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's inspect our new DataFrame:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | studyName | \n", "Sample Number | \n", "Species | \n", "Region | \n", "Island | \n", "Stage | \n", "Individual ID | \n", "Clutch Completion | \n", "Date Egg | \n", "Culmen Length (mm) | \n", "Culmen Depth (mm) | \n", "Flipper Length (mm) | \n", "Body Mass (g) | \n", "Sex | \n", "Delta 15 N (o/oo) | \n", "Delta 13 C (o/oo) | \n", "Comments | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "PAL0708 | \n", "1 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "Adult, 1 Egg Stage | \n", "N1A1 | \n", "Yes | \n", "11/11/07 | \n", "39.1 | \n", "18.7 | \n", "181.0 | \n", "3750.0 | \n", "MALE | \n", "NaN | \n", "NaN | \n", "Not enough blood for isotopes. | \n", "
1 | \n", "PAL0708 | \n", "2 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "Adult, 1 Egg Stage | \n", "N1A2 | \n", "Yes | \n", "11/11/07 | \n", "39.5 | \n", "17.4 | \n", "186.0 | \n", "3800.0 | \n", "FEMALE | \n", "8.94956 | \n", "-24.69454 | \n", "NaN | \n", "
2 | \n", "PAL0708 | \n", "3 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "Adult, 1 Egg Stage | \n", "N2A1 | \n", "Yes | \n", "11/16/07 | \n", "40.3 | \n", "18.0 | \n", "195.0 | \n", "3250.0 | \n", "FEMALE | \n", "8.36821 | \n", "-25.33302 | \n", "NaN | \n", "
3 | \n", "PAL0708 | \n", "4 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "Adult, 1 Egg Stage | \n", "N2A2 | \n", "Yes | \n", "11/16/07 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "Adult not sampled. | \n", "
4 | \n", "PAL0708 | \n", "5 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "Adult, 1 Egg Stage | \n", "N3A1 | \n", "Yes | \n", "11/16/07 | \n", "36.7 | \n", "19.3 | \n", "193.0 | \n", "3450.0 | \n", "FEMALE | \n", "8.76651 | \n", "-25.32426 | \n", "NaN | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
339 | \n", "PAL0910 | \n", "120 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "Adult, 1 Egg Stage | \n", "N38A2 | \n", "No | \n", "12/1/09 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
340 | \n", "PAL0910 | \n", "121 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "Adult, 1 Egg Stage | \n", "N39A1 | \n", "Yes | \n", "11/22/09 | \n", "46.8 | \n", "14.3 | \n", "215.0 | \n", "4850.0 | \n", "FEMALE | \n", "8.41151 | \n", "-26.13832 | \n", "NaN | \n", "
341 | \n", "PAL0910 | \n", "122 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "Adult, 1 Egg Stage | \n", "N39A2 | \n", "Yes | \n", "11/22/09 | \n", "50.4 | \n", "15.7 | \n", "222.0 | \n", "5750.0 | \n", "MALE | \n", "8.30166 | \n", "-26.04117 | \n", "NaN | \n", "
342 | \n", "PAL0910 | \n", "123 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "Adult, 1 Egg Stage | \n", "N43A1 | \n", "Yes | \n", "11/22/09 | \n", "45.2 | \n", "14.8 | \n", "212.0 | \n", "5200.0 | \n", "FEMALE | \n", "8.24246 | \n", "-26.11969 | \n", "NaN | \n", "
343 | \n", "PAL0910 | \n", "124 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "Adult, 1 Egg Stage | \n", "N43A2 | \n", "Yes | \n", "11/22/09 | \n", "49.9 | \n", "16.1 | \n", "213.0 | \n", "5400.0 | \n", "MALE | \n", "8.36390 | \n", "-26.15531 | \n", "NaN | \n", "
344 rows × 17 columns
\n", "\n", " | Sample Number | \n", "Culmen Length (mm) | \n", "Culmen Depth (mm) | \n", "Flipper Length (mm) | \n", "Body Mass (g) | \n", "Delta 15 N (o/oo) | \n", "Delta 13 C (o/oo) | \n", "
---|---|---|---|---|---|---|---|
count | \n", "344.000000 | \n", "342.000000 | \n", "342.000000 | \n", "342.000000 | \n", "342.000000 | \n", "330.000000 | \n", "331.000000 | \n", "
mean | \n", "63.151163 | \n", "43.921930 | \n", "17.151170 | \n", "200.915205 | \n", "4201.754386 | \n", "8.733382 | \n", "-25.686292 | \n", "
std | \n", "40.430199 | \n", "5.459584 | \n", "1.974793 | \n", "14.061714 | \n", "801.954536 | \n", "0.551770 | \n", "0.793961 | \n", "
min | \n", "1.000000 | \n", "32.100000 | \n", "13.100000 | \n", "172.000000 | \n", "2700.000000 | \n", "7.632200 | \n", "-27.018540 | \n", "
25% | \n", "29.000000 | \n", "39.225000 | \n", "15.600000 | \n", "190.000000 | \n", "3550.000000 | \n", "8.299890 | \n", "-26.320305 | \n", "
50% | \n", "58.000000 | \n", "44.450000 | \n", "17.300000 | \n", "197.000000 | \n", "4050.000000 | \n", "8.652405 | \n", "-25.833520 | \n", "
75% | \n", "95.250000 | \n", "48.500000 | \n", "18.700000 | \n", "213.000000 | \n", "4750.000000 | \n", "9.172123 | \n", "-25.062050 | \n", "
max | \n", "152.000000 | \n", "59.600000 | \n", "21.500000 | \n", "231.000000 | \n", "6300.000000 | \n", "10.025440 | \n", "-23.787670 | \n", "
\n", " | Species | \n", "Region | \n", "Island | \n", "
---|---|---|---|
0 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "
1 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "
2 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "
3 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "
4 | \n", "Adelie Penguin (Pygoscelis adeliae) | \n", "Anvers | \n", "Torgersen | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
339 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "
340 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "
341 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "
342 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "
343 | \n", "Gentoo penguin (Pygoscelis papua) | \n", "Anvers | \n", "Biscoe | \n", "
344 rows × 3 columns
\n", "