{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "22e4a877f4aad26d0adde61c18150499", "grade": false, "grade_id": "cell-1eb22459a7e45546", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Lecture Worksheet A-2: Data Wrangling with dplyr\n", "\n", "By the end of this worksheet, you will be able to: \n", "\n", "1. Use the five core dplyr verbs for data wrangling: `select()`, `filter()`, `arrange()`, `mutate()`, `summarise()`.\n", "2. Use piping when implementing function chains.\n", "3. Use `group_by()` to operate within groups (of rows) with `mutate()` and `summarise()`. \n", "4. Use `across()` to operate on multiple columns with `summarise()` and `mutate()`.\n", "\n", "## Instructions + Grading\n", "\n", "+ To get full marks for each participation worksheet, you must successfully answer at least 50% of all autograded questions: that's 10 for this worksheet. \n", "\n", "+ Autograded questions are easily identifiable through their labelling as **QUESTION**. Any other instructions that prompt the student to write code are activities, which are not graded and thus do not contribute to marks - but do contribute to the workflow of the worksheet!\n", "\n", "## Attribution\n", "\n", "Thanks to Icíar Fernández Boyano and Victor Yuan for their help in putting this worksheet together. \n", "\n", "The following resources were used as inspiration in the creation of this worksheet:\n", "\n", "+ [Swirl R Programming Tutorial](https://swirlstats.com/scn/rprog.html)\n", "+ [Palmer Penguins R Package](https://github.com/hadley/palmerpenguins)\n", "+ [R4DS Data Transformation](https://r4ds.had.co.nz/transform.html)\n", "\n", "\n", "## Five core dplyr verbs: an overview of this worksheet\n", "\n", "So far, we've **looked** at our dataset. It's time to **work with** it! 
Prior to creating any models, or using visualization to gain more insights about our data, it is common to tweak the data in some ways to make it a little easier to work with. For example, you may need to rename some variables, reorder observations, or even create some new variables from your existing ones!\n", "\n", "As explained in depth in the [R4DS Data Transformation chapter](https://r4ds.had.co.nz/transform.html), there are five key dplyr functions that allow you to solve the vast majority of data manipulation tasks:\n", "\n", "+ Pick variables by their names (`select()`)\n", "+ Pick observations by their values (`filter()`)\n", "+ Reorder the rows (`arrange()`)\n", "+ Create new variables with functions of existing variables (`mutate()`)\n", "+ Collapse many rows down to a single summary (`summarise()`)\n", "\n", "We can use these in conjunction with two other functions:\n", "\n", "- The `group_by()` function groups the rows of a tibble. Downstream calls to `mutate()` and `summarise()` operate independently on each group.\n", "- The `across()` function, when used within the `mutate()` and `summarise()` functions, operates on multiple columns.\n", "\n", "Because data wrangling involves calling several of these functions, we will also see the pipe operator `%>%` for putting these together in a single statement. 
\n", "\n", "## Getting Started\n", "\n", "Load the required packages for this worksheet:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "message": false, "name": "load packages", "nbgrader": { "cell_type": "code", "checksum": "7d31bfd50f2f5f9b1cc6c794a240e0ad", "grade": false, "grade_id": "cell-695291a493cfa66b", "locked": true, "schema_version": 3, "solution": false, "task": false }, "warning": false }, "outputs": [], "source": [ "suppressPackageStartupMessages(library(palmerpenguins))\n", "suppressPackageStartupMessages(library(tidyverse))\n", "suppressPackageStartupMessages(library(gapminder))\n", "suppressPackageStartupMessages(library(tsibble))\n", "suppressPackageStartupMessages(library(testthat))\n", "suppressPackageStartupMessages(library(digest))\n", "expect_sorted <- function(object) {\n", " act <- quasi_label(rlang::enquo(object), arg = \"object\")\n", " expect(\n", " !is.unsorted(act$val),\n", " sprintf(\"%s not sorted\", act$lab)\n", " )\n", " invisible(act$val)\n", "}" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "c3126536aa00165d81c6e45b6101d645", "grade": false, "grade_id": "cell-fc028aa4c2f43fe9", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The following code chunk has been unlocked, to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so don't forget to delete these lines if you ever need to run them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# An unlocked code chunk." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "0fcd04e25f2f1374a91dfd204d1a123c", "grade": false, "grade_id": "cell-155784dc582502bf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Part 1: The Five Verbs\n", "\n", "## Exploring your data\n", "\n", "What's the first thing that you should do when you're starting a project with a new dataset? Having a coffee is a reasonable answer, but before that, you should **look at the data**. This may sound obvious, but a common mistake is to dive into the analysis too early before being familiar with the data - only to have to go back to the start when something goes wrong and you can't quite figure out why. Some of the questions you may want to ask are:\n", "\n", "+ What is the format of the data?\n", "+ What are the dimensions?\n", "+ Are there missing data?\n", "\n", "You will learn how to answer these questions and more using dplyr." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "42028de18995667182b569d283efc201", "grade": false, "grade_id": "cell-84babc43995821a2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Penguins Data\n", "\n", "[Palmer penguins](https://github.com/allisonhorst/palmerpenguins) is an R data package created by Allison Horst. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The dataset that we will be using is stored in a variable called \"penguins\". It is a subset of the \"penguins_raw\" dataset, also included in this R package. Let's have a look at it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ca5ffa376958ccea3c3402f656ea24ff", "grade": false, "grade_id": "cell-83ef890605e2e7b3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "head(penguins)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1e0f7e3cb8539919a3429511ff151d06", "grade": false, "grade_id": "cell-c9194fdc50710021", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "`head()` returns the first 6 rows of a dataframe, instead of printing all the data to screen.\n", "\n", "## What is the format of the data?\n", "\n", "Let's begin by checking the class of the **penguins** variable. This will give us a clue about the overall structure of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8da56affc57412ea090f7d605acbefcb", "grade": false, "grade_id": "cell-48b2530804a909bf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "class(penguins)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "41d22407fc5c1ecbe52cc0268ee32db2", "grade": false, "grade_id": "cell-7808fbe04aa729dd", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, the function returns 3 classes: \"tbl_df\", \"tbl\", and \"data.frame\". A dataframe is the default class for data read into R. Tibbles (\"tbl\" and \"tbl_df\") are a modern take on data frames, but slightly tweaked to work better in the tidyverse. For now, you don’t need to worry about the differences; we’ll come back to tibbles later. 
The dataset that we are working with was originally a data.frame that has been coerced into a tibble, which is why multiple class names are returned by the `class()` function.\n", "\n", "## What are the dimensions?\n", "\n", "There are two functions that we can use to see exactly how many rows (observations) and columns (variables) we're dealing with. `dim()` is the base R option, and `glimpse()` is the dplyr flavour, which gives us some more information besides the row and column number. Give both a try!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "607fd2bb968c9081a0a75eb52cfb15c7", "grade": false, "grade_id": "cell-2fbdffdd374117c8", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "dim(penguins)\n", "glimpse(penguins)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ad84ea22132542e08411250fe31c2e70", "grade": false, "grade_id": "cell-d404e821859fa65a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "There are more functions that you can use to further explore the dimensions, such as `nrow()`, `ncol()`, `colnames()` or `rownames()`, but we won't be looking into those.\n", "\n", "## QUESTION 1.0\n", "\n", "In the `dim()` function, what is the first number that you see?\n", "\n", "Multiple choice!\n", "\n", "A) number of rows \n", "\n", "B) number of columns\n", "\n", "Put your selection (e.g. the letter corresponding to the correct option) into a variable named `answer1.0`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-0", "nbgrader": { "cell_type": "code", "checksum": "eb1c470514848102c386fb24516869e3", "grade": false, "grade_id": "cell-89d2909f9da03bb4", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.0 <- \"FILL_THIS_IN\"\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-0", "nbgrader": { "cell_type": "code", "checksum": "6e1468b48e78c9c62a4655700e016205", "grade": true, "grade_id": "cell-22216823e505e5f1", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.0\", {\n", " expect_equal(digest(as.character(toupper(answer1.0))), \"75f1160e72554f4270c809f041c7a776\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ea01fbe1c9ce3912b4d266f6b91df2c8", "grade": false, "grade_id": "cell-c08276291e700867", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## `select()` \n", "\n", "*A brief interlude on naming things:* Names are important. Jenny Bryan has some excellent [slides](https://speakerdeck.com/jennybc/how-to-name-files) for naming things in a way that is human readable *and* machine readable. Don't worry too much about it for this worksheet, but do keep it in mind as it helps with *reproducibility*. \n", "\n", "A quick tip that you can put into practice: you can use *Pascal case* - creating names by concatenating capitalized words, such as PenguinsSubset, or PenguinsTidy. If names get too long, remove vowels! For example, PngnSubset, or PngnTidy instead. 
Or, you can use snake_case!\n", "\n", "## QUESTION 1.1\n", "\n", "In the next few questions, you will practice using the dplyr verb `select()` to pick and modify variables by their names. Modify the penguins data so that it contains the columns `species`, `island`, `sex`, in that order.\n", "\n", "Assign your answer to a variable named `answer1.1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-1", "nbgrader": { "cell_type": "code", "checksum": "238d978b3661e7c392411c949b39369c", "grade": false, "grade_id": "cell-8a83be57e490be91", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.1 <- select(penguins, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "lines_to_next_cell": 2, "name": "test-1-1", "nbgrader": { "cell_type": "code", "checksum": "857eb003b627aa2de5f9ce3649f06668", "grade": true, "grade_id": "cell-d77a573dc4a26343", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.1\", {\n", " expect_equal(digest(as_tibble(answer1.1)), \"63491aa90dcb507c85810ba253a6a465\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b9fcdab0762f8b4cb243006444aa7699", "grade": false, "grade_id": "cell-4beb2aac482da4c3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.2\n", "\n", "Out of the following options, what would be the best name for the object that you just created above (currently stored in `answer1.1`)? 
Put your answer in a variable named `answer1.2`.\n", "\n", "A) _penguin_subset \n", "\n", "B) penguins \n", "\n", "C) 2penguin \n", "\n", "D) PngnSub " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-1a", "nbgrader": { "cell_type": "code", "checksum": "989540d0bbef2e1618eb77c45115e95c", "grade": false, "grade_id": "cell-dcb6cda112457d98", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.2 <- \"FILL_THIS_IN\"\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-1a", "nbgrader": { "cell_type": "code", "checksum": "6f30f3dede6d9afa505acccc6c505c54", "grade": true, "grade_id": "cell-92538bdd1bd40300", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.2\", {\n", " expect_equal(digest(as.character(toupper(answer1.2))), \"c1f86f7430df7ddb256980ea6a3b57a4\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ecef2aa0fb7063bce72aa51be5926b74", "grade": false, "grade_id": "cell-2e3273def66cbb67", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.3\n", "\n", "Select all variables, from `bill_length_mm` to `body_mass_g` (in that order). Of course, you could do it this way..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "92020782b46eca42fa78b0c1963da07e", "grade": false, "grade_id": "cell-5cb9e1971619fc8b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# This will work:\n", "select(penguins, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% \n", " print(n = 5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e9425bb04c69d16ca860d3d7208c4ffc", "grade": false, "grade_id": "cell-f5b45d38c135b356", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "But there is a better way to do it! Which do you think would work?\n", "\n", "A) `select(penguins, body_mass_g:bill_length_mm)` \n", "\n", "B) `select(penguins, c(body_mass_g::bill_length_mm))` \n", "\n", "C) `select(penguins, bill_length_mm:body_mass_g)` \n", "\n", "D) `select(penguins, bill_length_mm::body_mass_g)`\n", "\n", "Assign your answer to a variable called `answer1.3`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-2", "nbgrader": { "cell_type": "code", "checksum": "0df62b99abb510420f9d23e9c1a7662a", "grade": false, "grade_id": "cell-8b4a2b3bcfaa1afc", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.3 <- \"FILL_THIS_IN\"\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-2", "nbgrader": { "cell_type": "code", "checksum": "a22b7918b07024a9336bdae7ffffabc0", "grade": true, "grade_id": "cell-c526ad124009a3f7", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], 
"source": [ "test_that(\"Question 1.3\", {\n", " expect_equal(digest(as.character(toupper(answer1.3))), \"475bf9280aab63a82af60791302736f6\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "72520e66aa984f689691eb85d41665d9", "grade": false, "grade_id": "cell-b6281ca333158c02", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.4\n", "\n", "You're doing a great job. Keep it up! Now, select all variables, except `island`. How would you write this code?\n", "\n", "A) `select(penguins, \"-island\")` \n", "\n", "B) `select(penguins, -island)` \n", "\n", "C) `select(penguins, c(\"-island\"))` \n", "\n", "Put your answer in a variable named `answer1.4`. We encourage you to try executing these!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-3", "nbgrader": { "cell_type": "code", "checksum": "ba1a9117a4ea5c1d55d8dc591493c1b3", "grade": false, "grade_id": "cell-f245f071ce744359", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.4 <- \"FILL_THIS_IN\"\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-3", "nbgrader": { "cell_type": "code", "checksum": "12961731198c338e61539c2e698494eb", "grade": true, "grade_id": "cell-a51b6f36744882c1", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.4\", {\n", " expect_equal(digest(as.character(toupper(answer1.4))), \"3a5505c06543876fe45598b5e5e5195d\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", 
"checksum": "daabd1d41c849d3423a444127efe5d14", "grade": false, "grade_id": "cell-71e9d586a9716e39", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.5\n", "\n", "Output the `penguins` tibble so that `year` comes first. Hint: use the tidyselect `everything()` function. Store the result in a variable named `answer1.5`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-4", "nbgrader": { "cell_type": "code", "checksum": "6aa4e71b68cdc93b58e9d803b46abea1", "grade": false, "grade_id": "cell-ab4e3c79b2c8ab2b", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.5 <- select(penguins, FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-4", "nbgrader": { "cell_type": "code", "checksum": "af3caf6ac1db07a54ea81f98377e5971", "grade": true, "grade_id": "cell-0191ea7047f44d3c", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.5\", {\n", " expect_equal(digest(dim(answer1.5)), \"d095e682a86f7f16404b7f8dd5f3d676\")\n", " expect_equal(digest(answer1.5), \"a07a1cdcb64726866df3d525811a9bf6\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b394247048979d125dfb1989caa5471a", "grade": false, "grade_id": "cell-5ca019b951c06c27", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.6\n", "\n", "Rename `flipper_length_mm` to `length_flipper_mm`. 
Store the result in a variable named `answer1.6`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-5", "nbgrader": { "cell_type": "code", "checksum": "698181e0cff0dfb7acd88c762540a85a", "grade": false, "grade_id": "cell-92e4880ebb7d27e5", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.6 <- rename(FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-5", "nbgrader": { "cell_type": "code", "checksum": "622b665afb998f7e2d4b62e5d0a90fcc", "grade": true, "grade_id": "cell-7e940a5bc6938d87", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.6\", {\n", " expect_equal(digest(dim(answer1.6)), 'd095e682a86f7f16404b7f8dd5f3d676')\n", " expect_equal(digest(names(answer1.6)), 'ef6a2aaa40de41c0b11ad2f6888d5ce6')\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "af5b362292825d0edff8e0b70f517376", "grade": false, "grade_id": "cell-63fcfb647325d7f5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## `filter()` \n", "\n", "So far, we've practiced picking variables by their name with `select()`. But how about picking observations (rows)? This is where `filter()` comes in.\n", "\n", "## QUESTION 1.7\n", "\n", "Pick penguins with body mass greater than 3600 g. 
Store the resulting tibble in a variable named `answer1.7`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-6", "nbgrader": { "cell_type": "code", "checksum": "8021df7e6769cd9ef4e9e0f249e2fa14", "grade": false, "grade_id": "cell-426cf55e35870df1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.7 <- filter(FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-6", "nbgrader": { "cell_type": "code", "checksum": "4a2ac7786bb825a673a1579b46754e01", "grade": true, "grade_id": "cell-31eb0dad4d4e75a3", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.7\", {\n", " expect_equal(digest(dim(answer1.7)), '0f80c9cad929bf5de5ae34e0d50cb60d')\n", " expect_equal(sum(pull(answer1.7, body_mass_g) <= 3600), 0)\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b12d0b9bbe80d099176b37ccbe7ad905", "grade": false, "grade_id": "cell-fd6a0770b1db542e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Storing the subsetted penguins data\n", "\n", "In question 1.7 above, you've created a subset of the `penguins` dataset by filtering for those penguins that have a body mass greater than 3600 g. Let's do a quick check to see how many penguins meet that threshold by comparing the dimensions of the `penguins` dataset and your subset, `answer1.7`. There are two different ways to do this. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "362d75da6d8e88080ac279592ca0e98d", "grade": false, "grade_id": "cell-88d29ce015d55c3a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "dim(penguins)\n", "dim(answer1.7)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9e3025fc5d35fce3c6ff78950638731f", "grade": false, "grade_id": "cell-9ca14d97a1c89aad", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, in filtering down to penguins with a body mass greater than 3600 g, we have lost about 100 rows (observations). However, `answer1.7` doesn't seem like an informative name for this new dataset that you've created from `penguins`. Let's rename it to something else." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e5fbd79f96c2b887e67dd0936b02435f", "grade": false, "grade_id": "cell-ad22993e5cc2f509", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "penguins3600 <- answer1.7" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "aec3374b9b2fd8c941228d4b8e8a5a4b", "grade": false, "grade_id": "cell-051fdf27d9771f42", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.8\n", "\n", "From your \"new\" dataset `penguins3600`, take only data from penguins located on Biscoe island. Store the result in a variable named `answer1.8`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-7", "nbgrader": { "cell_type": "code", "checksum": "134928a758b84aa095df3daad1369cca", "grade": false, "grade_id": "cell-d7d71f8ef7623ddc", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.8 <- filter(FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-7", "nbgrader": { "cell_type": "code", "checksum": "737d50c8e29d098e331f95e65f9c0623", "grade": true, "grade_id": "cell-5e4156e2a3ad4d40", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.8\", {\n", " expect_equal(digest(dim(answer1.8)), \"92ac01cd2e8809faceb1f7a283cd935f\")\n", " a <- as.character(unique(pull(answer1.8, island)))\n", " expect_length(a, 1L)\n", " expect_equal(a, \"Biscoe\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e39496d3ad6ea0548300f9a4cde19621", "grade": false, "grade_id": "cell-541157a971196e6d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.9\n", "\n", "Repeat the task from Question 1.8, but take data from islands Torgersen and Dream. Now that you've practiced with dplyr verbs quite a bit, you don't need as many prompts to answer! Hint: When you want to select more than one island, you use `%in%` instead of `==`.\n", "\n", "Store your answer in a variable named `answer1.9`." 
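, "\n", "\n", "Why `%in%` rather than `==`? A minimal sketch on the `gapminder` data loaded at the top of this worksheet (the two countries are picked arbitrarily for illustration, and this is not the answer to the question): `==` recycles the right-hand vector along the rows, so it can silently miss matching rows, while `%in%` tests each row against every listed value.\n", "\n", "```r\n", "library(dplyr)\n", "library(gapminder)\n", "\n", "# %in% keeps every row whose country is one of the listed values\n", "nrow(filter(gapminder, country %in% c(\"Canada\", \"Chile\")))\n", "\n", "# == compares row by row against the recycled vector; compare the two row counts\n", "nrow(filter(gapminder, country == c(\"Canada\", \"Chile\")))\n", "```"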
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-8", "nbgrader": { "cell_type": "code", "checksum": "20d695f2efb664823ee669b7692233b2", "grade": false, "grade_id": "cell-30a32f98d928d5ef", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.9 <- FILL_THIS_IN(FILL_THIS_IN, island FILL_THIS_IN c(\"FILL_THIS_IN\", \"FILL_THIS_IN\"))\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-8", "nbgrader": { "cell_type": "code", "checksum": "c51da36986358f14234a1ac8967baeae", "grade": true, "grade_id": "cell-89c20b9e5605b929", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.9\", {\n", " expect_equal(digest(dim(answer1.9)), \"b207bbce54bb47be51e7ba7b56d24bc2\")\n", " expect_equal(sum(pull(answer1.9, island) == \"Torgersen\"), 28)\n", " expect_equal(sum(pull(answer1.9, island) == \"Dream\"), 69)\n", " expect_equal(sum(pull(answer1.9, island) == \"Biscoe\"), 0)\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7f1c5af822668e44902a51e61aef02a0", "grade": false, "grade_id": "cell-4a8582b78ed98d91", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## `arrange()` \n", "\n", "`arrange()` allows you to rearrange rows. Let's give it a try!\n", "\n", "## QUESTION 1.10\n", "\n", "Order `penguins` by year, in ascending order. Store the resulting tibble in a variable named `answer1.10`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-9", "nbgrader": { "cell_type": "code", "checksum": "3205429a08cf954ab77533e04ba5e055", "grade": false, "grade_id": "cell-328575d183628b50", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.10 <- arrange(FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-9", "nbgrader": { "cell_type": "code", "checksum": "6003adcf5179782d5a079adfec852e5e", "grade": true, "grade_id": "cell-f3eceec3a41ec267", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.10\", {\n", " expect_sorted(pull(answer1.10, year))\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2e17dc0e08e077638e6cd6d8ce593824", "grade": false, "grade_id": "cell-6a5c389fdb642b97", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.11\n", "\n", "Great work! Order `penguins` by year, in descending order. Hint: there is a function that allows you to order a variable in descending order called `desc()`.\n", "\n", "Store your tibble in a variable named `answer1.11`." 
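, "\n", "\n", "As a hedged illustration on a different dataset (`gapminder`, loaded at the top of this worksheet; the column choice is arbitrary), wrapping a column in `desc()` inside `arrange()` reverses its sort order:\n", "\n", "```r\n", "library(dplyr)\n", "library(gapminder)\n", "\n", "# Rows with the highest life expectancy come first\n", "arrange(gapminder, desc(lifeExp)) %>% head()\n", "```"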
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-10", "nbgrader": { "cell_type": "code", "checksum": "65c587b6498994f0d37c8ba2a717996d", "grade": false, "grade_id": "cell-5d16610ec2eb5dec", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.11 <- arrange(FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.11)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-10", "nbgrader": { "cell_type": "code", "checksum": "83c2ae4f6d1df94e7953226f5aa4c2a0", "grade": true, "grade_id": "cell-c17dc14cc8e6251d", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.11\", {\n", " expect_sorted(pull(answer1.11, year) %>% \n", " rev())\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "70e7d538da20c359dbd9d4e4bad37d79", "grade": false, "grade_id": "cell-83f549a48f378b7e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.12\n", "\n", "Order `penguins` by year, then by `body_mass_g`. 
Use ascending order in both cases.\n", "\n", "Store your answer in a variable named `answer1.12`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-11", "nbgrader": { "cell_type": "code", "checksum": "9ef8c16841f131e207231000a8713024", "grade": false, "grade_id": "cell-b0a6c4950411e1a0", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# answer1.12 <- arrange(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.12)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-11", "nbgrader": { "cell_type": "code", "checksum": "5d5dc33b13633940ff9c03c4ed8f99f6", "grade": true, "grade_id": "cell-942ca4a7dff525d4", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.12\", {\n", " expect_sorted(pull(answer1.12, year))\n", " answer1.12_list <- answer1.12 %>% \n", " group_by(year) %>% \n", " group_split()\n", " \n", " expect_length(answer1.12_list, 3)\n", " expect_sorted(answer1.12_list[[1]] %>% pull(body_mass_g) %>% na.omit())\n", " expect_sorted(answer1.12_list[[2]] %>% pull(body_mass_g) %>% na.omit())\n", " expect_sorted(answer1.12_list[[3]] %>% pull(body_mass_g) %>% na.omit())\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4fb8fd640a694f457683e4f5747940ca", "grade": false, "grade_id": "cell-8670bdd2a80dcc71", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Piping, `%>%` \n", "\n", "So far, we've been using dplyr verbs by passing the dataset that we want to work on as the first argument of the function (e.g. `select(penguins, year)`). 
This is fine when you're using a single verb, i.e. when you only want to filter observations or select variables. However, more often than not you will want to chain several steps, such as filtering penguins with a certain body mass and then ordering those penguins by year. This is where piping (`%>%`) comes in.\n", "\n", "Think of `%>%` as the word \"then\"! The pipe passes the result on its left as the first argument of the function on its right.\n", "\n", "Let's see an example. Here I want to combine `select()` with `arrange()`.\n", "\n", "This is how I could do it by *nesting* the two function calls. I am selecting the variables `year`, `species`, `island`, and `body_mass_g`, and then arranging the rows by `year`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "eval": false, "name": "nesting functions example", "nbgrader": { "cell_type": "code", "checksum": "b83bf34bafc68a12f039093c768339b4", "grade": false, "grade_id": "cell-8b04bc9446a8c4ac", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print(arrange(select(penguins, year, species, island, body_mass_g), year), n = 5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "437bd4cbe9e6e1c13853db7c2df9a4b4", "grade": false, "grade_id": "cell-1f52ee5af576a960", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "However, that seems a little hard to read. 
Now using pipes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "349fd7cae5ac4c1fafbc960196af6010", "grade": false, "grade_id": "cell-c822cab993153419", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "penguins %>%\n", " select(year, species, island, body_mass_g) %>%\n", " arrange(year) %>% \n", " print(n = 5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e3f13c87c36e7127fae8bbfd8c271d3f", "grade": false, "grade_id": "cell-98ac8fcee497ce0d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Creating tibbles\n", "\n", "Throughout Part A, we have been working with a tibble, `penguins`. Remember that when we ran `class()` on `penguins`, we could see that it was a dataframe that had been coerced to a tibble, which is a unifying feature of the tidyverse.\n", "\n", "Suppose that you have a dataframe that you want to coerce to a tibble. To do this, you can use `as_tibble()`. R comes with a few built-in datasets, one of which is `mtcars`. Let's check the class of `mtcars`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8f8285eeb4e6af3669bd69c927e8ff9c", "grade": false, "grade_id": "cell-c7af6b2c82165fa0", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "class(mtcars)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5381bd4f4bc01937cdee106c0dab3ade", "grade": false, "grade_id": "cell-7507b4dc963878ab", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, mtcars is a dataframe. 
Now, coerce it to a tibble with `as_tibble()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "a782b35e1d46a7b048af57eacc570061", "grade": false, "grade_id": "cell-7a8f6ffc86b48d2b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "as_tibble(mtcars) %>% \n", " print(n = 5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "db2f45b2c2c885d514fc0c2bdedd6745", "grade": false, "grade_id": "cell-8c956daac8fd8b0c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "You can read more about tibbles in the [R4DS Tibble Chapter](https://r4ds.had.co.nz/tibbles.html#creating-tibbles).\n", "\n", "\n", "## QUESTION 1.13\n", "\n", "At the start of this worksheet, we loaded a package called `gapminder`. This package comes with a dataset stored in the variable also named `gapminder`. Check the class of the `gapminder` dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "4765e2bb110d358c0a5870fc29a55347", "grade": false, "grade_id": "cell-992bc1625f2613d0", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "class(gapminder)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4030b2884f6ec9c54cf1c46512d8529e", "grade": false, "grade_id": "cell-35bd3fa6efef2b5c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, it is already a tibble.\n", "\n", "Take all countries in Europe that have a GDP per capita greater than 10000, and select all variables except `gdpPercap`, using pipes. 
(Hint: use `-`).\n", "\n", "Store your answer in a variable named `answer1.13`. Here is a code snippet that you can copy and paste into the solution cell below. \n", "\n", "```\n", "answer1.13 <- FILL_THIS_IN %>%\n", " filter(FILL_THIS_IN > 10000, FILL_THIS_IN == \"Europe\") %>%\n", " FILL_THIS_IN(-FILL_THIS_IN)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-14", "nbgrader": { "cell_type": "code", "checksum": "c6680597132ab0da22b3b6f2f029f38e", "grade": false, "grade_id": "cell-229c2279871804bd", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-14", "nbgrader": { "cell_type": "code", "checksum": "6f9491525635f048f6be77c624e578d2", "grade": true, "grade_id": "cell-d306a8e4f1863c71", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.13\", {\n", " expect_equal(digest(dim(answer1.13)), \"87d72f02bf15a0a29647db0c48c9a226\")\n", " expect_equal(digest(answer1.13), \"d0136991f3cfee4fcf896f677181c9c6\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "bc2df7c8c0afe909cf419b98e0911f32", "grade": false, "grade_id": "cell-3e7c74b2095d67ab", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 1.14\n", "\n", "Coerce the `mtcars` data frame to a tibble, and take all columns that start with the letter \"d\". 
\n", "*Hint: take a look at the \"Select helpers\" documentation by running the following code: `?tidyselect::select_helpers`.*\n", "\n", "Store your tibble in a variable named `answer1.14`\n", "\n", "```\n", "answer1.14 <- FILL_THIS_IN(FILL_THIS_IN) %>%\n", " FILL_THIS_IN(FILL_THIS_IN(\"d\"))\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-15", "nbgrader": { "cell_type": "code", "checksum": "0bd95606ffae4b0041e8ac313caa70ff", "grade": false, "grade_id": "cell-e326cd7a208090ad", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.14)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-15", "nbgrader": { "cell_type": "code", "checksum": "41a5782be8567396a9d6deb60c7f7788", "grade": true, "grade_id": "cell-cef50c68a7dc95c8", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.14\", {\n", " expect_equal(digest(dim(answer1.14)), \"ea1df69d6a59227894d1d4330f9bfab8\")\n", " expect_equal(digest(colnames(answer1.14)), \"0956954d01fe74c59c1f16850b7e874f\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "c637cbb8127c15cb1904198aa2534cdf", "grade": false, "grade_id": "cell-5f4fafc8a7e30702", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "This exercise is from [r-exercises](https://www.r-exercises.com/2017/10/19/dplyr-basic-functions-exercises/).\n", "\n", "## `mutate()`\n", "\n", "The `mutate()` function allows you to create new columns, possibly using existing columns. 
Like `select()`, `filter()`, and `arrange()`, the `mutate()` function also takes a tibble as its first argument, and returns a tibble. \n", "\n", "The general syntax is: `mutate(tibble, NEW_COLUMN_NAME = CALCULATION)`.\n", "\n", "## QUESTION 1.15\n", "\n", "Make a new column with body mass in kg, named `body_mass_kg`, *and* rearrange the tibble so that `body_mass_kg` goes after `body_mass_g` and before `sex`. Store the resulting tibble in a variable named `answer1.15`.\n", "\n", "\n", "*Hint*: within `select()`, use R's `:` operator to select all variables from `species` to `body_mass_g`.\n", "\n", "```\n", "answer1.15 <- penguins %>%\n", " mutate(FILL_THIS_IN = FILL_THIS_IN) %>%\n", " select(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-12", "nbgrader": { "cell_type": "code", "checksum": "7702c84540ddffebe97e47fc6a792106", "grade": false, "grade_id": "cell-d9e8ace4de3d5db5", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Your code here\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-12", "nbgrader": { "cell_type": "code", "checksum": "9c7c53cef74dae7ec6af64e4920b62c2", "grade": true, "grade_id": "cell-2aa7af4bee586e8b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.15\", {\n", " expect_equal(digest(dim(answer1.15)), \"9e9457527d068c2333ea8fd598e07f13\")\n", " expect_equal(digest(colnames(answer1.15)), \"d7121e41fe934232c1c45dc425365040\")\n", " expect_equal(na.omit(answer1.15$body_mass_kg / answer1.15$body_mass_g) %>% digest,\n", " \"cdfbfd4da65e3575a474558218939055\")\n", "})\n", "cat(\"success!\")" ] }, { 
"cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "775e141af5ce8163ee51ed0fb5ed9748", "grade": false, "grade_id": "cell-ddd7c27cced93e3e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Notice the backwards compatibility! No need for loops! By the way, if you'd like to simultaneously create columns _and_ delete other columns, use the `transmute` function." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "38123a24d635667c645e7ae81c3703e5", "grade": false, "grade_id": "cell-93fb11af8743ba66", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## `group_by()`\n", "\n", "The `group_by()` function groups the _rows_ in your tibble according to one or more categorical variables. Just specify the columns containing the grouping variables. `mutate()` (and others) will now operate on each chunk independently. \n", "\n", "## QUESTION 1.16\n", "\n", "Calculate the growth in population since the first year on record _for each country_, and name the column `rel_growth`. Do this by **rearranging the following lines**, and **filling in the `FILL_THIS_IN`**. 
Assign your answer to a variable named `answer1.16`\n", "\n", "*Hint*: Here's another convenience function for you: `dplyr::first()`.\n", "\n", "```\n", "answer1.16 <-\n", " mutate(rel_growth = FILL_THIS_IN) %>% \n", " arrange(FILL_THIS_IN) %>% \n", " gapminder %>% \n", " group_by(country) %>% \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f65d591e15108c03f9d889d1515cbf11", "grade": false, "grade_id": "cell-2ab3613bc7f96623", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Your code here\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.16)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "651d6f7c27b262eb0a215ec7f60d91ee", "grade": true, "grade_id": "cell-bc032212871c46c6", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Answer 1.16\", {\n", " expect_equal(nrow(answer1.16), 1704)\n", " c('country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap', 'rel_growth') %>% \n", " map_lgl(~ .x %in% names(answer1.16)) %>% \n", " all() %>% \n", " expect_true()\n", " expect_equal(digest(as.integer(answer1.16$rel_growth)), '26735e4b17481f965f9eb1d3b5de89ad')\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ca177f3fcd5e8fc6b218433e60ec3a12", "grade": false, "grade_id": "cell-7ccdc0e8d8d7cfa0", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## `summarise()`\n", "\n", "The last core dplyr verb is `summarise()`. 
It collapses a data frame to a single row:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f46e9a4ccc1c8c085838d667eb7ae817", "grade": false, "grade_id": "cell-6b1f5abffe346c3f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "summarise(penguins, body_mass_mean = mean(body_mass_g, na.rm = TRUE))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f37035a11976fe21525e15757c10884b", "grade": false, "grade_id": "cell-cb9e443cf78bab58", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "*From R4DS Data Transformation:* \n", "\n", "> `summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied \"by group\".\n", "\n", "For example, if we applied exactly the same code to a tibble grouped by island, we get the average body mass per island:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "240d075af880582466a6bb0f5bc098c6", "grade": false, "grade_id": "cell-833376d1ffa3e568", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "penguins %>%\n", " group_by(island) %>%\n", " summarise(body_mass_mean = mean(body_mass_g, na.rm = TRUE))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8c143b931e225a48634b312ff1927b50", "grade": false, "grade_id": "cell-199c47f59d82515d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 
1.17\n", "\n", "From the `penguins` tibble, calculate the mean penguin body mass per island by year, in a column named `body_mass_mean`. Your tibble should have the columns `year`, `island`, and `body_mass_mean` only (and in that order). Store the resulting tibble in a variable named `answer1.17`.\n", "\n", "```\n", "answer1.17 <- penguins %>%\n", " group_by(FILL_THIS_IN) %>%\n", " FILL_THIS_IN(body_mass_mean = mean(FILL_THIS_IN, na.rm = TRUE))\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "name": "question-1-13", "nbgrader": { "cell_type": "code", "checksum": "607f7b5d647d09b649b7009e8aa1690b", "grade": false, "grade_id": "cell-c7cb5063ed993313", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Your code here\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer1.17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "name": "test-1-13", "nbgrader": { "cell_type": "code", "checksum": "0cd643a193960008ad3e2c62cffe4917", "grade": true, "grade_id": "cell-787a56e4c37460c9", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Question 1.17\", {\n", " expect_equal(digest(dim(answer1.17)), \"f4885de1726d18557bd43d769cc0ae26\")\n", " expect_equal(digest(colnames(answer1.17)), \"ba0c85220a5fa5222cac937acb2f94c2\")\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1179c68d77b11fbac0213e631baf31e6", "grade": false, "grade_id": "cell-26d8a38d2ac2d31c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Part 2: Scoped variants with `across()`\n", "\n", "Sometimes we want to perform the same operation on many columns. 
We can achieve this by embedding the `across()` function within the `mutate()` or `summarise()` functions.\n", "\n", "## QUESTION 2.0\n", "\n", "In a single expression, make a tibble with the following columns *for each island* in the `penguins` dataset:\n", "\n", "+ The *mean* of each numeric variable in the `penguins` dataset on each island. Keep the column names the same.\n", "+ The number of penguins on each island, in a column named `n`.\n", "\n", "Assign your answer to a variable named `answer2.0`.\n", "\n", "```\n", "answer2.0 <- penguins %>% \n", " group_by(FILL_THIS_IN) %>% \n", " summarise(across(where(FILL_THIS_IN), FILL_THIS_IN, na.rm = TRUE), \n", " n = n())\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "bdba6f6c61b1fb775dfd845f40134f16", "grade": false, "grade_id": "cell-3f56050dd1cb58c1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Your code here\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer2.0) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "dfaa6bf7e10e1f2d379d5256a895f9a8", "grade": true, "grade_id": "cell-e8681c978652b8f9", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Answer 2.0\", {\n", " expect_equal(\n", " answer2.0 %>% \n", " mutate(across(where(is.numeric), round, digits = 0)) %>% \n", " unclass() %>% \n", " digest(),\n", " \"20e5f1c917fbe7b018a23182ba6702fa\"\n", " )\n", "})\n", "cat(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "83d951fef9c8757743488112a7d16eb3", "grade": false, "grade_id": "cell-ecc146b97be77724", 
"locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## QUESTION 2.1\n", "\n", "Using the `penguins` dataset, what is the mean bill length and depth of penguins on each island, by year? The resulting tibble should have columns named `island`, `year`, `bill_length_mm`, and `bill_depth_mm`, in that order. Store the result in a variable named `answer2.1`. Be sure to remove NA's when you are calculating the mean. \n", "\n", "*Hint*: Use `starts_with()` instead of `where()` in the `across()` function.\n", "\n", "```\n", "answer2.1 <- penguins %>%\n", " group_by(FILL_THIS_IN) %>%\n", " summarise(across(FILL_THIS_IN))\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "1371fe515850be553208d529a4ee867f", "grade": false, "grade_id": "cell-a53b4a22a4bdbd25", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Your code here\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(answer2.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "549e7c600cfe7c78e717643ae4dc9381", "grade": true, "grade_id": "cell-69502fa952a45d71", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_that(\"Answer 2.1\", {\n", " expect_equal(names(answer2.1), c(\"island\", \"year\", \"bill_length_mm\", \"bill_depth_mm\"))\n", " sorted <- answer2.1 %>%\n", " arrange(island, year)\n", " expect_identical(digest(round(sorted$bill_length_mm, 0)), \"f9f46fe0b2604eac7903505876e4b240\")\n", " expect_identical(digest(round(sorted$bill_depth_mm, 0)), \"d54992e0dbb34479e18f4f73ff1f16f4\")\n", "})\n", "cat(\"success!\")" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": 
"name,eval,message,fig.height,warning,fig.width,-all", "notebook_metadata_filter": "-all", "text_representation": { "extension": ".Rmd", "format_name": "rmarkdown" } }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }