{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# E2. To be completed after lesson 10\n", "\n", "**[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv)**\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "# Colab setup ------------------\n", "import os, sys, subprocess\n", "if \"google.colab\" in sys.modules:\n", " cmd = \"pip install --upgrade iqplot bebi103 watermark\"\n", " process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " stdout, stderr = process.communicate()\n", " data_path = \"https://s3.amazonaws.com/bebi103.caltech.edu/data/\"\n", "else:\n", " data_path = \"../data/\"\n", "# ------------------------------" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Exercise 2.1\n", "\n", "In this exercise, we will again work with a subset of the Palmer penguins data set. I load it and display its first few rows below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr><th></th><th colspan=\"4\" halign=\"left\">Gentoo</th><th colspan=\"4\" halign=\"left\">Adelie</th><th colspan=\"4\" halign=\"left\">Chinstrap</th></tr>\n",
 "    <tr><th></th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>16.3</td><td>48.4</td><td>220.0</td><td>5400.0</td><td>18.5</td><td>36.8</td><td>193.0</td><td>3500.0</td><td>18.3</td><td>47.6</td><td>195.0</td><td>3850.0</td></tr>\n",
 "    <tr><th>1</th><td>15.8</td><td>46.3</td><td>215.0</td><td>5050.0</td><td>16.9</td><td>37.0</td><td>185.0</td><td>3000.0</td><td>16.7</td><td>42.5</td><td>187.0</td><td>3350.0</td></tr>\n",
 "    <tr><th>2</th><td>14.2</td><td>47.5</td><td>209.0</td><td>4600.0</td><td>19.5</td><td>42.0</td><td>200.0</td><td>4050.0</td><td>16.6</td><td>40.9</td><td>187.0</td><td>3200.0</td></tr>\n",
 "    <tr><th>3</th><td>15.7</td><td>48.7</td><td>208.0</td><td>5350.0</td><td>18.3</td><td>42.7</td><td>196.0</td><td>4075.0</td><td>20.0</td><td>52.8</td><td>205.0</td><td>4550.0</td></tr>\n",
 "    <tr><th>4</th><td>14.1</td><td>48.7</td><td>210.0</td><td>4450.0</td><td>18.0</td><td>35.7</td><td>202.0</td><td>3550.0</td><td>18.7</td><td>45.4</td><td>188.0</td><td>3525.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " Gentoo Adelie \\\n", " bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm \n", "0 16.3 48.4 220.0 5400.0 18.5 \n", "1 15.8 46.3 215.0 5050.0 16.9 \n", "2 14.2 47.5 209.0 4600.0 19.5 \n", "3 15.7 48.7 208.0 5350.0 18.3 \n", "4 14.1 48.7 210.0 4450.0 18.0 \n", "\n", " Chinstrap \\\n", " bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm \n", "0 36.8 193.0 3500.0 18.3 47.6 \n", "1 37.0 185.0 3000.0 16.7 42.5 \n", "2 42.0 200.0 4050.0 16.6 40.9 \n", "3 42.7 196.0 4075.0 20.0 52.8 \n", "4 35.7 202.0 3550.0 18.7 45.4 \n", "\n", " \n", " flipper_length_mm body_mass_g \n", "0 195.0 3850.0 \n", "1 187.0 3350.0 \n", "2 187.0 3200.0 \n", "3 205.0 4550.0 \n", "4 188.0 3525.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(os.path.join(data_path, \"penguins_subset.csv\"), header=[0, 1])\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explain in words what each of the following code cells does as we work toward tidying this data frame. For each cell, I show the top of the data frame." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr><th>species</th><th colspan=\"4\" halign=\"left\">Gentoo</th><th colspan=\"4\" halign=\"left\">Adelie</th><th colspan=\"4\" halign=\"left\">Chinstrap</th></tr>\n",
 "    <tr><th>quantity</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>flipper_length_mm</th><th>body_mass_g</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>16.3</td><td>48.4</td><td>220.0</td><td>5400.0</td><td>18.5</td><td>36.8</td><td>193.0</td><td>3500.0</td><td>18.3</td><td>47.6</td><td>195.0</td><td>3850.0</td></tr>\n",
 "    <tr><th>1</th><td>15.8</td><td>46.3</td><td>215.0</td><td>5050.0</td><td>16.9</td><td>37.0</td><td>185.0</td><td>3000.0</td><td>16.7</td><td>42.5</td><td>187.0</td><td>3350.0</td></tr>\n",
 "    <tr><th>2</th><td>14.2</td><td>47.5</td><td>209.0</td><td>4600.0</td><td>19.5</td><td>42.0</td><td>200.0</td><td>4050.0</td><td>16.6</td><td>40.9</td><td>187.0</td><td>3200.0</td></tr>\n",
 "    <tr><th>3</th><td>15.7</td><td>48.7</td><td>208.0</td><td>5350.0</td><td>18.3</td><td>42.7</td><td>196.0</td><td>4075.0</td><td>20.0</td><td>52.8</td><td>205.0</td><td>4550.0</td></tr>\n",
 "    <tr><th>4</th><td>14.1</td><td>48.7</td><td>210.0</td><td>4450.0</td><td>18.0</td><td>35.7</td><td>202.0</td><td>3550.0</td><td>18.7</td><td>45.4</td><td>188.0</td><td>3525.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ "species Gentoo \\\n", "quantity bill_depth_mm bill_length_mm flipper_length_mm body_mass_g \n", "0 16.3 48.4 220.0 5400.0 \n", "1 15.8 46.3 215.0 5050.0 \n", "2 14.2 47.5 209.0 4600.0 \n", "3 15.7 48.7 208.0 5350.0 \n", "4 14.1 48.7 210.0 4450.0 \n", "\n", "species Adelie \\\n", "quantity bill_depth_mm bill_length_mm flipper_length_mm body_mass_g \n", "0 18.5 36.8 193.0 3500.0 \n", "1 16.9 37.0 185.0 3000.0 \n", "2 19.5 42.0 200.0 4050.0 \n", "3 18.3 42.7 196.0 4075.0 \n", "4 18.0 35.7 202.0 3550.0 \n", "\n", "species Chinstrap \n", "quantity bill_depth_mm bill_length_mm flipper_length_mm body_mass_g \n", "0 18.3 47.6 195.0 3850.0 \n", "1 16.7 42.5 187.0 3350.0 \n", "2 16.6 40.9 187.0 3200.0 \n", "3 20.0 52.8 205.0 4550.0 \n", "4 18.7 45.4 188.0 3525.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns.names = ['species', 'quantity']\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>quantity</th><th>bill_depth_mm</th><th>bill_length_mm</th><th>body_mass_g</th><th>flipper_length_mm</th></tr>\n",
 "    <tr><th></th><th>species</th><th></th><th></th><th></th><th></th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th rowspan=\"3\" valign=\"top\">0</th><th>Adelie</th><td>18.5</td><td>36.8</td><td>3500.0</td><td>193.0</td></tr>\n",
 "    <tr><th>Chinstrap</th><td>18.3</td><td>47.6</td><td>3850.0</td><td>195.0</td></tr>\n",
 "    <tr><th>Gentoo</th><td>16.3</td><td>48.4</td><td>5400.0</td><td>220.0</td></tr>\n",
 "    <tr><th rowspan=\"2\" valign=\"top\">1</th><th>Adelie</th><td>16.9</td><td>37.0</td><td>3000.0</td><td>185.0</td></tr>\n",
 "    <tr><th>Chinstrap</th><td>16.7</td><td>42.5</td><td>3350.0</td><td>187.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [
 "quantity     bill_depth_mm  bill_length_mm  body_mass_g  flipper_length_mm\n",
 "  species                                                                 \n",
 "0 Adelie              18.5            36.8       3500.0              193.0\n",
 "  Chinstrap           18.3            47.6       3850.0              195.0\n",
 "  Gentoo              16.3            48.4       5400.0              220.0\n",
 "1 Adelie              16.9            37.0       3000.0              185.0\n",
 "  Chinstrap           16.7            42.5       3350.0              187.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.stack(level='species')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index(level='species')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index(drop=True)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns.name = None\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.2\n", "\n", "What is the difference between merging and concatenating data frames?\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.3\n", "\n", "Describe the difference between categorical and quantitative variables. How are they fundamentally different in the way we plot them?\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.4\n", "\n", "Give pros and cons for using a histogram for display of repeated measurements. Then give pros and cons for using an ECDF.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.5\n", "\n", "Write down any questions or points of confusion that you have." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -v -p pandas,jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }