{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "fb84acb7-7340-4933-99ba-c5fb422cb07f",
      "metadata": {},
      "source": [
        "# Palmer penguins and split-apply-combine\n",
        "\n",
        "[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv)\n",
        "\n",
        "<hr />"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6cce2733-1b20-4b82-a1c7-e224b8c1efc5",
      "metadata": {},
      "source": [
        "The [Palmer penguins data set](https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95) is a nice data set with which to practice various data science skills. For this exercise, we will use as subset of it, which you can download here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv). The data set consists of measurements of three different species of penguins acquired at the [Palmer Station in Antarctica](https://en.wikipedia.org/wiki/Palmer_Station). The measurements were made between 2007 and 2009 by [Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php).\n",
        "\n",
        "**a)** Take a look at the CSV file containing the data set. Is it in tidy format? Why or why not?\n",
        "\n",
        "**b)** You can convert the CSV file to a \"tall\" format using the `bebi103.utils.unpivot_csv()` function. You can do that with the following function call, where `path_to_penguins` is a string containing the path to the penguin_subset.csv file.\n",
        "\n",
        "    bebi103.utils.unpivot_csv(\n",
        "        path_to_penguins,\n",
        "        \"penguins_tall.csv\",\n",
        "        n_header_rows=2,\n",
        "        header_names=[\"species\", \"quantity\"],\n",
        "        comment_prefix=\"#\",\n",
        "        retain_row_index=True,\n",
        "        row_index_name='penguin_id',\n",
        "    )    \n",
        "\n",
        "After running that function, load in the data set stored in the `penguins_tall.csv` file and store it in a variable named `df_tall`. Is this a tidy data set?\n",
        "\n",
        "**c)** Perform the following operations to make a new `DataFrame` from the one you loaded in to generate a new `DataFrame`. You do not need to worry about what these operations do (that is the topic of next week, just do them to answer this question): \n",
        "\n",
        "```python\n",
        "df = (\n",
        "    df_tall\n",
        "    .pivot(\n",
        "        index=['penguin_id', 'species'], on='quantity', values='value'\n",
        "    )\n",
        "    .select(pl.exclude('penguin_id'))\n",
        ")\n",
        "```\n",
        "\n",
        "Is the resulting data frame `df` tidy? Why or why not?\n",
        "\n",
        "**d)** Using the data frame you created in part (c), slice out all of the bill lengths for *Gentoo* penguins.\n",
        "\n",
        "**e)** Make a new data frame containing the mean measured bill depth, bill length, body mass in kg, and flipper length for each species. You can use millimeters for all length measurements.\n",
        "\n",
        "**f)** Make a scatter plot of bill length versus flipper length with the glyphs colored by species."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "default",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.13.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}