{ "cells": [ { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-8b26929bab50eea3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Worksheet 01A: Intro to R, R Markdown, and Reproducibility\n", "**Version 2.0**\n", "\n", "\n", "## Welcome to STAT545!\n", "\n", "We hope that you are excited to become an R pro during the next few weeks! These in-class worksheets have been designed to help you navigate this R journey. We'll start easy, with some examples of R commands, and evolve to the more complex - and arguably cooler - syntax and structures of the R language.\n", "\n", "## An important note\n", "\n", "**Submission of this worksheet is optional**. Future worksheets **must** be submitted for participation marks. \n", "\n", "Lectures will mostly involve going through these worksheets and giving you the answers (yes, before the deadline). We suggest going through the worksheets before coming to class so that you can find out where you get stuck.\n", "\n", "## Attributions\n", "\n", "This document was primarily put together by Icíar Fernández Boyano. \n", "\n", "The following resources were used as inspiration in the creation of this worksheet:\n", "\n", "+ [Swirl R Programming Tutorial](https://swirlstats.com/scn/rprog.html)\n", "+ [A (very) short introduction to R](https://github.com/ClaudiaBrauer/A-very-short-introduction-to-R/blob/master/documents/A%20(very)%20short%20introduction%20to%20R.pdf)\n", "+ [Happy Git and GitHub for the useR](https://happygitwithr.com/)\n", "+ [2019 STAT545 Guidebook](https://stat545guidebook.netlify.app/index.html)\n", "+ [Jenny Bryan's STAT545 Guidebook](https://stat545.com/)\n", "\n", "## Getting Started\n", "\n", "Load the required add-on packages for this assignment by running the following code chunk (or _cell_). 
In Jupyter, you can load the chunk by clicking on the chunk, and clicking the \"Run\" button (keyboard shortcut: Command + Enter on a Mac, or Control + Enter on Windows). _If this fails, read on..._" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-e7cc0f456605c076", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "library(testthat)\n", "library(digest)" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-e4d5c23517e65a5a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Did that fail for you? It could be that you don't have those packages installed. The following code chunk has been unlocked for you (did you notice that you weren't able to edit the above cells?), so you can use it to install these packages, or to generally just give you the flexibility to start this worksheet with some of your own code. To install the \"testthat\" package, execute the command `install.packages(\"testthat\")`; what would you need to type to install the \"digest\" package?\n", "\n", "**Please be sure to remove any `install.packages` commands after you've run them**: once you've successfully _installed_ a package with `install.packages`, the package is permanently installed on your computer. \n", "\n", "To _load_ the packages for use in this R session, try executing the above code chunk again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# An unlocked code chunk." ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-0da508f15814e7c2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In Episode 01A of the [STAT 545 video series](https://www.youtube.com/channel/UCrB-uourf2vxGeBnGjQrA0w), RStudio was mentioned as being an IDE for R. 
You're probably viewing this worksheet in another IDE called **Jupyter**. We're using Jupyter for the STAT 545 worksheets because it works well with an autograder called nbgrader." ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-d11d3973ab3460a1", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Getting Familiar with R\n", "\n", "### 1.1 Calculator\n", "\n", "In its simplest form, R can be used as an interactive calculator. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "name": "example-1-0", "nbgrader": { "grade": false, "grade_id": "cell-02e613b3a71a2045", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "14" ], "text/latex": [ "14" ], "text/markdown": [ "14" ], "text/plain": [ "[1] 14" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "6" ], "text/latex": [ "6" ], "text/markdown": [ "6" ], "text/plain": [ "[1] 6" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "2" ], "text/latex": [ "2" ], "text/markdown": [ "2" ], "text/plain": [ "[1] 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "10" ], "text/latex": [ "10" ], "text/markdown": [ "10" ], "text/plain": [ "[1] 10" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "81" ], "text/latex": [ "81" ], "text/markdown": [ "81" ], "text/plain": [ "[1] 81" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "10 + 4 # you can add \n", "10 - 4 # subtract\n", "4 / 2 # divide\n", "2 * 5 # multiply\n", "3 ^ 4 # and exponentiate" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-ab5c3cec51c4ac26", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now, what if you need to compute a longer expression? 
Let's say that I want to find out the percentage of students in the STAT department that are taking STAT545A (note: these numbers are fictional!). I could compute this in several steps, or use a more complex expression. \n", "\n", "**Using multiple steps...**\n", "\n", "+ To calculate the number of students in the STAT department, I add 375 new students that have enrolled this year, to the 2000 that were already enrolled." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "name": "example-1-1a", "nbgrader": { "grade": false, "grade_id": "cell-3a342455e968efc2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "2375" ], "text/latex": [ "2375" ], "text/markdown": [ "2375" ], "text/plain": [ "[1] 2375" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "2000 + 375" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-ccd48654544fd353", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "+ There are 82 students taking STAT545A this year. Last year, there was the same number of students, but 3 dropped the course after the first two weeks. 
Let's hypothesise that only 1 will drop the course this year - although I hope the real number is 0 :)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "name": "example-1-1b", "nbgrader": { "grade": false, "grade_id": "cell-23dd9f564b57ea4f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "81" ], "text/latex": [ "81" ], "text/markdown": [ "81" ], "text/plain": [ "[1] 81" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "82 - 1" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-345a371ec8621190", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "+ With the number of students taking STAT545 this year (hypothetically), and the number of students currently in the STAT department, I can now calculate what percentage of students in the STAT department are taking this class." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "name": "example-1-1c", "nbgrader": { "grade": false, "grade_id": "cell-f85728ee7af8d6bf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "0.0341052631578947" ], "text/latex": [ "0.0341052631578947" ], "text/markdown": [ "0.0341052631578947" ], "text/plain": [ "[1] 0.03410526" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "81 / 2375" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "name": "example 1-1d", "nbgrader": { "grade": false, "grade_id": "cell-a9bfe516af8ca200", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "3.410526" ], "text/latex": [ "3.410526" ], "text/markdown": [ "3.410526" ], "text/plain": [ "[1] 3.410526" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "0.03410526 * 100" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, 
"grade_id": "cell-42ec2ad158bd9526", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**What if we use a single expression?**\n", "\n", "It seems that around 3% of students in the STAT department are taking STAT545A... but that took *a lot* of steps to calculate. We could also write it like this to save some time:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "name": "example-1-2", "nbgrader": { "grade": false, "grade_id": "cell-1998144f9345aaf9", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "3.41052631578947" ], "text/latex": [ "3.41052631578947" ], "text/markdown": [ "3.41052631578947" ], "text/plain": [ "[1] 3.410526" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "(82 - 1) / (2000 + 375) * 100 " ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-8aa79a95beade98c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, *taking care of precedence rules* (i.e. using brackets appropriately), we can save some time by writing a single expression.\n", "\n", "Your turn! Can you calculate the percentage of your life that you have spent in university? \n", "\n", "Compute the difference between 2020 and the year that you started university, and divide this by the difference between 2020 and the year that you were born. Multiply this with 100 to get the percentage of your life that you have spent in university. Your *challenge* here is to use a single expression." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "name": "activity-1-2" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-1d3d407b175d3580", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### 1.2 Variables\n", " \n", "Alright, R as a calculator works just fine... but you don't learn a programming language *only* to compute arithmetic expressions. What if you want to use your result from above in a second calculation? Instead of retyping your expression every time that you need it, or copying and pasting the result, you can simply create a new variable that stores it. \n", "\n", "Earlier, I figured out that I had spent 18% of my life at university. I want to assign this value to a variable called `life_university`, a name that will help me remember what the value means. You assign a value to a variable in R with the assignment operator, which is just a \"less than\" symbol followed by a minus sign. It looks like this:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "name": "example-1-3a", "nbgrader": { "grade": false, "grade_id": "cell-0f70f29b98cfc79a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "life_university <- 18" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-b8bde8ad059c03e2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now the variable `life_university` stores the value 18, which is the percentage of my life that I had spent at university. But before saving this into a variable, I had to calculate the value separately. What if I directly assigned the arithmetic expression that I used to compute my value to the variable?" 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "name": "example-1-3b", "nbgrader": { "grade": false, "grade_id": "cell-e943d9cf804fdf52", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "life_university <- (2020 - 2016) / (2020 - 1998) * 100" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-9eeb2f90831c3118", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Notice that R did not print the result of my expression this time. When you use the assignment operator, R assumes that you don't want to see the result immediately, but rather that you intend to use it for something else later on. \n", "\n", "To view the contents of the variable, you simply have to type the name of the variable (in this case, `life_university`) and press Enter. Try it below!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "name": "activity-1-3" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-dbf010f638a0b404", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**QUESTION 1.0**\n", "\n", "Now, it's your turn to store the percentage of time that **you** have spent at university in a variable. Try typing the arithmetic expression that you used to compute that value, rather than the value itself! Name this variable `my_life_university` in the first cell below, and check whether the answer is acceptable by running the second cell below. 
If the test cell gives you an error, try a different answer!\n", "\n", "```\n", "my_life_university <- FILL_THIS_IN / FILL_THIS_IN * 100\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "name": "question-1-0", "nbgrader": { "grade": false, "grade_id": "cell-04e950ccf3313117", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "### BEGIN SOLUTION\n", "my_life_university <- 12 / 33 * 100 # Any number between 0 and 100.\n", "### END SOLUTION" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "name": "test-1-0", "nbgrader": { "grade": true, "grade_id": "cell-cebd6f6df5bdb3cc", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"success!\"\n" ] } ], "source": [ "test_that(\"Question 1.0\", {\n", " expect_gte(my_life_university, 0)\n", " expect_lte(my_life_university, 100)\n", "})\n", "print(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-93af8d6b89c2ed33", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### 1.3 Data structures\n", "\n", "Any object that contains data is called a data structure.\n", "\n", "### 1.3.1 Vectors\n", "\n", "#### Numeric vectors\n", "\n", "So far, you've learned how to use R as a calculator, and how to use variables to store numeric values. But in reality, a \"variable\" in R is just a way to name your data so that R can recall it later. Think of it as a label that you put on a box, so that you remember the contents that are inside it. \n", "\n", "The variable that you created above, `my_life_university`, stores the most basic data structure in the R programming language: a vector. Even a single number is considered a vector of length one, which is the case with the vector that was assigned to `my_life_university`. 
Let's have a look again:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "name": "example-1-3", "nbgrader": { "grade": false, "grade_id": "cell-df8e5f769d89a75b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "36.3636363636364" ], "text/latex": [ "36.3636363636364" ], "text/markdown": [ "36.3636363636364" ], "text/plain": [ "[1] 36.36364" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_life_university" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-f825e25671f73f81", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In this way, you can think of the vector as the data structure, and the variable as a label. But what if you want a vector that's greater than length one, or in other words, that stores more than a single numeric value? The easiest way to create a vector is using `c()`, which stands for \"concatenate\", or \"combine\". \n", "\n", "**QUESTION 1.1**\n", "\n", "Let's give it a try. To create a vector containing the numbers 3.14, 2.71, and 6.28, type `c(3.14, 2.71, 6.28)`. Store the result in a variable called `x`. 
\n", "\n", "```\n", "x <- FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "name": "answer-1-1", "nbgrader": { "grade": false, "grade_id": "cell-0d87e7f0d19a4cda", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "### BEGIN SOLUTION\n", "x <- c(3.14, 2.71, 6.28)\n", "### END SOLUTION" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "lines_to_next_cell": 2, "name": "test-1-1", "nbgrader": { "grade": true, "grade_id": "cell-76c5ca05ef0dfe64", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"success!\"\n" ] } ], "source": [ "test_that(\"Question 1.1\", {\n", " expect_equal(digest(x), \"d696b13d28ab63409f1f528a2d37bb0e\")\n", "})\n", "print(\"success!\")" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-f9bab0fdef44f45e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now, type `x` and press Enter to view its contents. Notice that there are no commas separating the values in the output!" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "name": "activity-1-4" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-5200389128901c83", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "You can combine several vectors to make a new vector. And here is where things get fun! For the sake of seeing the result immediately, we won't store this combined vector in a new variable for now." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "name": "example-1-6", "nbgrader": { "grade": false, "grade_id": "cell-e71c214be4ee55a2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "\n", "
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |