{ "cells": [ { "cell_type": "markdown", "id": "4758928a", "metadata": {}, "source": [ "In this tutorial we are going to learn the library **dplyr**, which is really useful for transforming the data (e.g. create new variables, reorder observations, etc)." ] }, { "cell_type": "markdown", "id": "2bacfb97", "metadata": {}, "source": [ "# Loading dplyr" ] }, { "cell_type": "markdown", "id": "f3203cc6", "metadata": {}, "source": [ "Like **ggplot2**, dplyr is also part of **tidyverse**. " ] }, { "cell_type": "code", "execution_count": 1, "id": "2feeed5b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "── \u001b[1mAttaching packages\u001b[22m ─────────────────────────────────────── tidyverse 1.3.2 ──\n", "\u001b[32m✔\u001b[39m \u001b[34mggplot2\u001b[39m 3.4.0 \u001b[32m✔\u001b[39m \u001b[34mpurrr \u001b[39m 1.0.1 \n", "\u001b[32m✔\u001b[39m \u001b[34mtibble \u001b[39m 3.1.8 \u001b[32m✔\u001b[39m \u001b[34mdplyr \u001b[39m 1.0.10\n", "\u001b[32m✔\u001b[39m \u001b[34mtidyr \u001b[39m 1.2.1 \u001b[32m✔\u001b[39m \u001b[34mstringr\u001b[39m 1.5.0 \n", "\u001b[32m✔\u001b[39m \u001b[34mreadr \u001b[39m 2.1.3 \u001b[32m✔\u001b[39m \u001b[34mforcats\u001b[39m 0.5.2 \n", "── \u001b[1mConflicts\u001b[22m ────────────────────────────────────────── tidyverse_conflicts() ──\n", "\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mfilter()\u001b[39m masks \u001b[34mstats\u001b[39m::filter()\n", "\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n" ] } ], "source": [ "library(tidyverse)" ] }, { "cell_type": "code", "execution_count": 2, "id": "3c97df2a", "metadata": {}, "outputs": [], "source": [ "#sat.dat<-read.csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/psych/sat.act.csv\")\n", "sat.dat<-read.csv(\"../../datasets/sat.act.csv\")\n", "sat.dat$gender<- as.factor( sat.dat$gender )\n", "sat.dat$education<- as.factor( sat.dat$education )" ] }, { "cell_type": "code", "execution_count": 3, "id": "85722a96", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
  1. 700
  2. 7
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 700\n", "\\item 7\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 700\n", "2. 7\n", "\n", "\n" ], "text/plain": [ "[1] 700 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Like we learned, this gives the number of rows and columns of the data frame \n", "dim(sat.dat)" ] }, { "cell_type": "code", "execution_count": 4, "id": "e4b61558", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " X gender education age ACT \n", " Min. :29442 1:247 0: 57 Min. :13.00 Min. : 3.00 \n", " 1st Qu.:32117 2:453 1: 45 1st Qu.:19.00 1st Qu.:25.00 \n", " Median :34896 2: 44 Median :22.00 Median :29.00 \n", " Mean :34731 3:275 Mean :25.59 Mean :28.55 \n", " 3rd Qu.:37250 4:138 3rd Qu.:29.00 3rd Qu.:32.00 \n", " Max. :39985 5:141 Max. :65.00 Max. :36.00 \n", " \n", " SATV SATQ \n", " Min. :200.0 Min. :200.0 \n", " 1st Qu.:550.0 1st Qu.:530.0 \n", " Median :620.0 Median :620.0 \n", " Mean :612.2 Mean :610.2 \n", " 3rd Qu.:700.0 3rd Qu.:700.0 \n", " Max. :800.0 Max. :800.0 \n", " NA's :13 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# And this, a summary of our data frame\n", "summary(sat.dat)" ] }, { "cell_type": "markdown", "id": "76f452da", "metadata": {}, "source": [ "# `filter`: Pick observations by their values\n", "\n", "This function allows you to subset observations based on specific conditions. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame." ] }, { "cell_type": "code", "execution_count": 5, "id": "d708d9db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
  1. 247
  2. 7
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 247\n", "\\item 7\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 247\n", "2. 7\n", "\n", "\n" ], "text/plain": [ "[1] 247 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Filter data for gender = 1\n", "sat.dat.filter<-filter(sat.dat, gender == 1)\n", "\n", "dim(sat.dat.filter)" ] }, { "cell_type": "code", "execution_count": 6, "id": "5d72b712", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
  1. 146
  2. 7
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 146\n", "\\item 7\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 146\n", "2. 7\n", "\n", "\n" ], "text/plain": [ "[1] 146 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Filter data for gender = 1 and age less than 25\n", "sat.dat.filter<-filter(sat.dat, gender == 1, age<25)\n", "dim(sat.dat.filter)" ] }, { "cell_type": "code", "execution_count": 7, "id": "9dd91a01", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " X gender education age ACT \n", " Min. :29503 1: 53 0:16 Min. :15.00 Min. :26.00 \n", " 1st Qu.:31413 2:134 1:12 1st Qu.:19.00 1st Qu.:27.00 \n", " Median :34356 2:12 Median :23.00 Median :28.00 \n", " Mean :34323 3:67 Mean :25.11 Mean :27.55 \n", " 3rd Qu.:36680 4:40 3rd Qu.:28.00 3rd Qu.:28.00 \n", " Max. :39937 5:40 Max. :58.00 Max. :29.00 \n", " \n", " SATV SATQ \n", " Min. :400.0 Min. :430.0 \n", " 1st Qu.:550.0 1st Qu.:532.5 \n", " Median :600.0 Median :600.0 \n", " Mean :608.1 Mean :599.8 \n", " 3rd Qu.:670.0 3rd Qu.:660.0 \n", " Max. :800.0 Max. :800.0 \n", " NA's :5 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Filter observations within a range of ACT. Check that filtering worked by getting a summary of the filtered data frame. \n", "sat.dat.filter<-filter(sat.dat, (ACT > 25 & ACT < 30))\n", "summary(sat.dat.filter)" ] }, { "cell_type": "markdown", "id": "ce88ce1e", "metadata": {}, "source": [ "# `arrange`: Reorder the rows\n", "\n", "This function takes a data frame, and a set of column names (or more complicated expressions) to order by (in ascending order by default). If more than one column name is provided, each additional column is used to break ties in the values of preceding columns." ] }, { "cell_type": "code", "execution_count": 8, "id": "5c5ec278", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 7
XgendereducationageACTSATVSATQ
<int><fct><fct><int><int><int><int>
1360931222 3200400
230591222715375375
336906201815470500
437398111715600596
536763112315600600
634578231816430350
\n" ], "text/latex": [ "A data.frame: 6 × 7\n", "\\begin{tabular}{r|lllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t1 & 36093 & 1 & 2 & 22 & 3 & 200 & 400\\\\\n", "\t2 & 30591 & 2 & 2 & 27 & 15 & 375 & 375\\\\\n", "\t3 & 36906 & 2 & 0 & 18 & 15 & 470 & 500\\\\\n", "\t4 & 37398 & 1 & 1 & 17 & 15 & 600 & 596\\\\\n", "\t5 & 36763 & 1 & 1 & 23 & 15 & 600 & 600\\\\\n", "\t6 & 34578 & 2 & 3 & 18 & 16 & 430 & 350\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 7\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> |\n", "|---|---|---|---|---|---|---|---|\n", "| 1 | 36093 | 1 | 2 | 22 | 3 | 200 | 400 |\n", "| 2 | 30591 | 2 | 2 | 27 | 15 | 375 | 375 |\n", "| 3 | 36906 | 2 | 0 | 18 | 15 | 470 | 500 |\n", "| 4 | 37398 | 1 | 1 | 17 | 15 | 600 | 596 |\n", "| 5 | 36763 | 1 | 1 | 23 | 15 | 600 | 600 |\n", "| 6 | 34578 | 2 | 3 | 18 | 16 | 430 | 350 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ\n", "1 36093 1 2 22 3 200 400 \n", "2 30591 2 2 27 15 375 375 \n", "3 36906 2 0 18 15 470 500 \n", "4 37398 1 1 17 15 600 596 \n", "5 36763 1 1 23 15 600 600 \n", "6 34578 2 3 18 16 430 350 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This order first based in ACT, then SATQ and finally SATV\n", "sat.dat.arrange<-arrange(sat.dat, ACT, SATQ, SATV)\n", "head(sat.dat.arrange)" ] }, { "cell_type": "markdown", "id": "31fce8b6", "metadata": {}, "source": [ "If we want to reorder in descending order, we can use the function `desc`:" ] }, { "cell_type": "code", "execution_count": 9, "id": "78bb4fdb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 7
XgendereducationageACTSATVSATQ
<int><fct><fct><int><int><int><int>
131080243336500500
229527253036610500
333539233136700500
432697222636550555
530129253036660610
639651242836590620
\n" ], "text/latex": [ "A data.frame: 6 × 7\n", "\\begin{tabular}{r|lllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t1 & 31080 & 2 & 4 & 33 & 36 & 500 & 500\\\\\n", "\t2 & 29527 & 2 & 5 & 30 & 36 & 610 & 500\\\\\n", "\t3 & 33539 & 2 & 3 & 31 & 36 & 700 & 500\\\\\n", "\t4 & 32697 & 2 & 2 & 26 & 36 & 550 & 555\\\\\n", "\t5 & 30129 & 2 & 5 & 30 & 36 & 660 & 610\\\\\n", "\t6 & 39651 & 2 & 4 & 28 & 36 & 590 & 620\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 7\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> |\n", "|---|---|---|---|---|---|---|---|\n", "| 1 | 31080 | 2 | 4 | 33 | 36 | 500 | 500 |\n", "| 2 | 29527 | 2 | 5 | 30 | 36 | 610 | 500 |\n", "| 3 | 33539 | 2 | 3 | 31 | 36 | 700 | 500 |\n", "| 4 | 32697 | 2 | 2 | 26 | 36 | 550 | 555 |\n", "| 5 | 30129 | 2 | 5 | 30 | 36 | 660 | 610 |\n", "| 6 | 39651 | 2 | 4 | 28 | 36 | 590 | 620 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ\n", "1 31080 2 4 33 36 500 500 \n", "2 29527 2 5 30 36 610 500 \n", "3 33539 2 3 31 36 700 500 \n", "4 32697 2 2 26 36 550 555 \n", "5 30129 2 5 30 36 660 610 \n", "6 39651 2 4 28 36 590 620 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sat.dat.arrange<-arrange(sat.dat, desc(ACT), SATQ, SATV)\n", "head(sat.dat.arrange)" ] }, { "cell_type": "markdown", "id": "2fd0549c", "metadata": {}, "source": [ "# `select`: Pick variables by their names\n", "\n", "It allows you to pick a subset of variables." ] }, { "cell_type": "code", "execution_count": 10, "id": "111402d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 3
ACTSATQSATV
<int><int><int>
124500500
235500600
321470480
426520550
531550600
628640640
\n" ], "text/latex": [ "A data.frame: 6 × 3\n", "\\begin{tabular}{r|lll}\n", " & ACT & SATQ & SATV\\\\\n", " & & & \\\\\n", "\\hline\n", "\t1 & 24 & 500 & 500\\\\\n", "\t2 & 35 & 500 & 600\\\\\n", "\t3 & 21 & 470 & 480\\\\\n", "\t4 & 26 & 520 & 550\\\\\n", "\t5 & 31 & 550 & 600\\\\\n", "\t6 & 28 & 640 & 640\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 3\n", "\n", "| | ACT <int> | SATQ <int> | SATV <int> |\n", "|---|---|---|---|\n", "| 1 | 24 | 500 | 500 |\n", "| 2 | 35 | 500 | 600 |\n", "| 3 | 21 | 470 | 480 |\n", "| 4 | 26 | 520 | 550 |\n", "| 5 | 31 | 550 | 600 |\n", "| 6 | 28 | 640 | 640 |\n", "\n" ], "text/plain": [ " ACT SATQ SATV\n", "1 24 500 500 \n", "2 35 500 600 \n", "3 21 470 480 \n", "4 26 520 550 \n", "5 31 550 600 \n", "6 28 640 640 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This selects columns by the given names of the columns. Here we just select ACT, SATQ and SATV\n", "sat.dat.select<-select(sat.dat, ACT, SATQ, SATV)\n", "\n", "head(sat.dat.select)" ] }, { "cell_type": "markdown", "id": "e2e460f0", "metadata": {}, "source": [ "If you want to select consecutive columns, we can always use the operator `:`, like with vectors." ] }, { "cell_type": "code", "execution_count": 11, "id": "047832aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 4
gendereducationageACT
<fct><fct><int><int>
1231924
2232335
3232021
4142726
5123331
6152628
\n" ], "text/latex": [ "A data.frame: 6 × 4\n", "\\begin{tabular}{r|llll}\n", " & gender & education & age & ACT\\\\\n", " & & & & \\\\\n", "\\hline\n", "\t1 & 2 & 3 & 19 & 24\\\\\n", "\t2 & 2 & 3 & 23 & 35\\\\\n", "\t3 & 2 & 3 & 20 & 21\\\\\n", "\t4 & 1 & 4 & 27 & 26\\\\\n", "\t5 & 1 & 2 & 33 & 31\\\\\n", "\t6 & 1 & 5 & 26 & 28\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 4\n", "\n", "| | gender <fct> | education <fct> | age <int> | ACT <int> |\n", "|---|---|---|---|---|\n", "| 1 | 2 | 3 | 19 | 24 |\n", "| 2 | 2 | 3 | 23 | 35 |\n", "| 3 | 2 | 3 | 20 | 21 |\n", "| 4 | 1 | 4 | 27 | 26 |\n", "| 5 | 1 | 2 | 33 | 31 |\n", "| 6 | 1 | 5 | 26 | 28 |\n", "\n" ], "text/plain": [ " gender education age ACT\n", "1 2 3 19 24 \n", "2 2 3 23 35 \n", "3 2 3 20 21 \n", "4 1 4 27 26 \n", "5 1 2 33 31 \n", "6 1 5 26 28 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This selects all columns between gender and ACT (inclusive)\n", "sat.dat.select<-select(sat.dat, gender:ACT)\n", "\n", "head(sat.dat.select)" ] }, { "cell_type": "markdown", "id": "285a30ea", "metadata": {}, "source": [ "Instead of selecting columns, you can always use a minus (-) sign before the name of the columns to filter them out." ] }, { "cell_type": "code", "execution_count": 12, "id": "4f454cc7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 3
XSATVSATQ
<int><int><int>
129442500500
229457600500
329498480470
429503550520
529504600550
629518640640
\n" ], "text/latex": [ "A data.frame: 6 × 3\n", "\\begin{tabular}{r|lll}\n", " & X & SATV & SATQ\\\\\n", " & & & \\\\\n", "\\hline\n", "\t1 & 29442 & 500 & 500\\\\\n", "\t2 & 29457 & 600 & 500\\\\\n", "\t3 & 29498 & 480 & 470\\\\\n", "\t4 & 29503 & 550 & 520\\\\\n", "\t5 & 29504 & 600 & 550\\\\\n", "\t6 & 29518 & 640 & 640\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 3\n", "\n", "| | X <int> | SATV <int> | SATQ <int> |\n", "|---|---|---|---|\n", "| 1 | 29442 | 500 | 500 |\n", "| 2 | 29457 | 600 | 500 |\n", "| 3 | 29498 | 480 | 470 |\n", "| 4 | 29503 | 550 | 520 |\n", "| 5 | 29504 | 600 | 550 |\n", "| 6 | 29518 | 640 | 640 |\n", "\n" ], "text/plain": [ " X SATV SATQ\n", "1 29442 500 500 \n", "2 29457 600 500 \n", "3 29498 480 470 \n", "4 29503 550 520 \n", "5 29504 600 550 \n", "6 29518 640 640 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This selects all columns byt those betweena gender and ACT (inclusive)\n", "sat.dat.select<-select(sat.dat, -(gender:ACT))\n", "\n", "head(sat.dat.select)" ] }, { "cell_type": "markdown", "id": "9f5d8319", "metadata": {}, "source": [ "# `mutate`: Create new variables\n", "\n", "This function allows you to add new columns to data frames. These new columns can be set to any fixed value, a passed vector or, more interestingly and useful, from the existing columns in the data frame. New columns are always added at the end of the data frame." ] }, { "cell_type": "code", "execution_count": 13, "id": "23f841fa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 8
XgendereducationageACTSATVSATQqvdiff
<int><fct><fct><int><int><int><int><int>
129442231924500500 0
229457232335600500-100
329498232021480470 -10
429503142726550520 -30
529504123331600550 -50
629518152628640640 0
\n" ], "text/latex": [ "A data.frame: 6 × 8\n", "\\begin{tabular}{r|llllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ & qvdiff\\\\\n", " & & & & & & & & \\\\\n", "\\hline\n", "\t1 & 29442 & 2 & 3 & 19 & 24 & 500 & 500 & 0\\\\\n", "\t2 & 29457 & 2 & 3 & 23 & 35 & 600 & 500 & -100\\\\\n", "\t3 & 29498 & 2 & 3 & 20 & 21 & 480 & 470 & -10\\\\\n", "\t4 & 29503 & 1 & 4 & 27 & 26 & 550 & 520 & -30\\\\\n", "\t5 & 29504 & 1 & 2 & 33 & 31 & 600 & 550 & -50\\\\\n", "\t6 & 29518 & 1 & 5 & 26 & 28 & 640 & 640 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 8\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> | qvdiff <int> |\n", "|---|---|---|---|---|---|---|---|---|\n", "| 1 | 29442 | 2 | 3 | 19 | 24 | 500 | 500 | 0 |\n", "| 2 | 29457 | 2 | 3 | 23 | 35 | 600 | 500 | -100 |\n", "| 3 | 29498 | 2 | 3 | 20 | 21 | 480 | 470 | -10 |\n", "| 4 | 29503 | 1 | 4 | 27 | 26 | 550 | 520 | -30 |\n", "| 5 | 29504 | 1 | 2 | 33 | 31 | 600 | 550 | -50 |\n", "| 6 | 29518 | 1 | 5 | 26 | 28 | 640 | 640 | 0 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ qvdiff\n", "1 29442 2 3 19 24 500 500 0 \n", "2 29457 2 3 23 35 600 500 -100 \n", "3 29498 2 3 20 21 480 470 -10 \n", "4 29503 1 4 27 26 550 520 -30 \n", "5 29504 1 2 33 31 600 550 -50 \n", "6 29518 1 5 26 28 640 640 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This creates a new column, which we called qvdiff, with the difference between SATQ and SATV\n", "sat.dat.mutate<-mutate(sat.dat, qvdiff=SATQ-SATV)\n", "head(sat.dat.mutate)" ] }, { "cell_type": "markdown", "id": "2a108362", "metadata": {}, "source": [ "We can always use a preceding created column to define a new column." ] }, { "cell_type": "code", "execution_count": 14, "id": "02c20b0b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 9
XgendereducationageACTSATVSATQqvdiffabsqvdiff
<int><fct><fct><int><int><int><int><int><int>
129442231924500500 0 0
229457232335600500-100100
329498232021480470 -10 10
429503142726550520 -30 30
529504123331600550 -50 50
629518152628640640 0 0
\n" ], "text/latex": [ "A data.frame: 6 × 9\n", "\\begin{tabular}{r|lllllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ & qvdiff & absqvdiff\\\\\n", " & & & & & & & & & \\\\\n", "\\hline\n", "\t1 & 29442 & 2 & 3 & 19 & 24 & 500 & 500 & 0 & 0\\\\\n", "\t2 & 29457 & 2 & 3 & 23 & 35 & 600 & 500 & -100 & 100\\\\\n", "\t3 & 29498 & 2 & 3 & 20 & 21 & 480 & 470 & -10 & 10\\\\\n", "\t4 & 29503 & 1 & 4 & 27 & 26 & 550 & 520 & -30 & 30\\\\\n", "\t5 & 29504 & 1 & 2 & 33 & 31 & 600 & 550 & -50 & 50\\\\\n", "\t6 & 29518 & 1 & 5 & 26 & 28 & 640 & 640 & 0 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 9\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> | qvdiff <int> | absqvdiff <int> |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 1 | 29442 | 2 | 3 | 19 | 24 | 500 | 500 | 0 | 0 |\n", "| 2 | 29457 | 2 | 3 | 23 | 35 | 600 | 500 | -100 | 100 |\n", "| 3 | 29498 | 2 | 3 | 20 | 21 | 480 | 470 | -10 | 10 |\n", "| 4 | 29503 | 1 | 4 | 27 | 26 | 550 | 520 | -30 | 30 |\n", "| 5 | 29504 | 1 | 2 | 33 | 31 | 600 | 550 | -50 | 50 |\n", "| 6 | 29518 | 1 | 5 | 26 | 28 | 640 | 640 | 0 | 0 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ qvdiff absqvdiff\n", "1 29442 2 3 19 24 500 500 0 0 \n", "2 29457 2 3 23 35 600 500 -100 100 \n", "3 29498 2 3 20 21 480 470 -10 10 \n", "4 29503 1 4 27 26 550 520 -30 30 \n", "5 29504 1 2 33 31 600 550 -50 50 \n", "6 29518 1 5 26 28 640 640 0 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This defines a new column, which is just the absolute of a new column just created qvdiff\n", "sat.dat.mutate<-mutate(sat.dat, qvdiff=SATQ-SATV, absqvdiff=abs(qvdiff))\n", "head(sat.dat.mutate)" ] }, { "cell_type": "code", "execution_count": 15, "id": "9a293c3d", "metadata": {}, "outputs": [ { "ename": "ERROR", "evalue": "\u001b[1m\u001b[33mError\u001b[39m in `mutate()`:\u001b[22m\n\u001b[1m\u001b[22m\u001b[33m!\u001b[39m Problem while computing `absqvdiff = abs(qvdiff)`.\n\u001b[1mCaused by error in `mask$eval_all_mutate()`:\u001b[22m\n\u001b[33m!\u001b[39m object 'qvdiff' not found\n", "output_type": "error", "traceback": [ "\u001b[1m\u001b[33mError\u001b[39m in `mutate()`:\u001b[22m\n\u001b[1m\u001b[22m\u001b[33m!\u001b[39m Problem while computing `absqvdiff = abs(qvdiff)`.\n\u001b[1mCaused by error in `mask$eval_all_mutate()`:\u001b[22m\n\u001b[33m!\u001b[39m object 'qvdiff' not found\nTraceback:\n", "1. mutate(sat.dat, absqvdiff = abs(qvdiff), qvdiff = SATQ - SATV)", "2. mutate.data.frame(sat.dat, absqvdiff = abs(qvdiff), qvdiff = SATQ - \n . SATV)", "3. mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())", "4. withCallingHandlers({\n . for (i in seq_along(dots)) {\n . context_poke(\"column\", old_current_column)\n . quosures <- expand_across(dots[[i]])\n . quosures_results <- vector(mode = \"list\", length = length(quosures))\n . for (k in seq_along(quosures)) {\n . quo <- quosures[[k]]\n . quo_data <- attr(quo, \"dplyr:::data\")\n . if (!is.null(quo_data$column)) {\n . context_poke(\"column\", quo_data$column)\n . }\n . chunks <- NULL\n . result <- NULL\n . if (quo_is_symbol(quo)) {\n . name <- as_string(quo_get_expr(quo))\n . if (name %in% names(new_columns)) {\n . result <- new_columns[[name]]\n . chunks <- mask$resolve(name)\n . }\n . else if (name %in% names(.data)) {\n . result <- .data[[name]]\n . chunks <- mask$resolve(name)\n . }\n . if (inherits(.data, \"rowwise_df\") && vec_is_list(result)) {\n . sizes <- list_sizes(result)\n . wrong <- which(sizes != 1)\n . if (length(wrong)) {\n . group <- wrong[1L]\n . mask$set_current_group(group)\n . abort(class = c(\"dplyr:::mutate_incompatible_size\", \n . \"dplyr:::internal_error\"), dplyr_error_data = list(result_size = sizes[group], \n . expected_size = 1))\n . }\n . }\n . }\n . else if (!quo_is_symbolic(quo) && !is.null(quo_get_expr(quo))) {\n . result <- quo_get_expr(quo)\n . result <- withCallingHandlers(vec_recycle(result, \n . vec_size(.data)), error = function(cnd) {\n . abort(class = c(\"dplyr:::mutate_constant_recycle_error\", \n . \"dplyr:::internal_error\"), constant_size = vec_size(result), \n . data_size = vec_size(.data))\n . })\n . chunks <- vec_chop(result, rows)\n . }\n . if (is.null(chunks)) {\n . if (is.null(quo_data$column)) {\n . chunks <- mask$eval_all_mutate(quo)\n . }\n . else {\n . chunks <- withCallingHandlers(mask$eval_all_mutate(quo), \n . error = function(cnd) {\n . msg <- glue(\"Problem while computing column `{quo_data$name_auto}`.\")\n . abort(msg, call = call(\"across\"), parent = cnd)\n . })\n . }\n . }\n . if (is.null(chunks)) {\n . next\n . }\n . if (is.null(result)) {\n . if (length(rows) == 1) {\n . result <- chunks[[1]]\n . }\n . else {\n . chunks <- dplyr_vec_cast_common(chunks, quo_data$name_auto)\n . result <- vec_unchop(chunks, rows)\n . }\n . }\n . quosures_results[[k]] <- list(result = result, chunks = chunks)\n . }\n . for (k in seq_along(quosures)) {\n . quo <- quosures[[k]]\n . quo_data <- attr(quo, \"dplyr:::data\")\n . quo_result <- quosures_results[[k]]\n . if (is.null(quo_result)) {\n . if (quo_data$is_named) {\n . name <- quo_data$name_given\n . new_columns[[name]] <- zap()\n . mask$remove(name)\n . }\n . next\n . }\n . result <- quo_result$result\n . chunks <- quo_result$chunks\n . if (!quo_data$is_named && is.data.frame(result)) {\n . types <- vec_ptype(result)\n . types_names <- names(types)\n . chunks_extracted <- .Call(dplyr_extract_chunks, \n . chunks, types)\n . for (j in seq_along(types)) {\n . mask$add_one(types_names[j], chunks_extracted[[j]], \n . result = result[[j]])\n . }\n . new_columns[types_names] <- result\n . }\n . else {\n . name <- quo_data$name_auto\n . mask$add_one(name = name, chunks = chunks, result = result)\n . new_columns[[name]] <- result\n . }\n . }\n . }\n . }, error = function(e) {\n . local_error_context(dots = dots, .index = i, mask = mask)\n . bullets <- c(cnd_bullet_header(\"computing\"), mutate_bullets(e))\n . abort(bullets, class = \"dplyr:::mutate_error\", parent = skip_internal_condition(e), \n . bullets = bullets, call = error_call)\n . }, warning = function(w) {\n . if (check_muffled_warning(w)) {\n . maybe_restart(\"muffleWarning\")\n . }\n . local_error_context(dots = dots, .index = i, mask = mask)\n . warn(c(cnd_bullet_header(\"computing\"), i = cnd_header(w), \n . i = cnd_bullet_cur_group_label(what = \"warning\")))\n . maybe_restart(\"muffleWarning\")\n . })", "5. mask$eval_all_mutate(quo)", "6. .handleSimpleError(function (e) \n . {\n . local_error_context(dots = dots, .index = i, mask = mask)\n . bullets <- c(cnd_bullet_header(\"computing\"), mutate_bullets(e))\n . abort(bullets, class = \"dplyr:::mutate_error\", parent = skip_internal_condition(e), \n . bullets = bullets, call = error_call)\n . }, \"object 'qvdiff' not found\", base::quote(mask$eval_all_mutate(quo)))", "7. h(simpleError(msg, call))", "8. abort(bullets, class = \"dplyr:::mutate_error\", parent = skip_internal_condition(e), \n . bullets = bullets, call = error_call)", "9. signal_abort(cnd, .file)" ] } ], "source": [ "# Obviously, this is going to giv an error because we are trying to use qvdiff before it was created\n", "sat.dat.mutate<-mutate(sat.dat,absqvdiff=abs(qvdiff), qvdiff=SATQ-SATV)\n", "head(sat.dat.mutate)" ] }, { "cell_type": "markdown", "id": "0509424a", "metadata": {}, "source": [ "If we want to define a data frame with just the new columns, we can use `transmute` function:" ] }, { "cell_type": "code", "execution_count": 16, "id": "ad128811", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 2
qvdiffabsqvdiff
<int><int>
1 0 0
2-100100
3 -10 10
4 -30 30
5 -50 50
6 0 0
\n" ], "text/latex": [ "A data.frame: 6 × 2\n", "\\begin{tabular}{r|ll}\n", " & qvdiff & absqvdiff\\\\\n", " & & \\\\\n", "\\hline\n", "\t1 & 0 & 0\\\\\n", "\t2 & -100 & 100\\\\\n", "\t3 & -10 & 10\\\\\n", "\t4 & -30 & 30\\\\\n", "\t5 & -50 & 50\\\\\n", "\t6 & 0 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 2\n", "\n", "| | qvdiff <int> | absqvdiff <int> |\n", "|---|---|---|\n", "| 1 | 0 | 0 |\n", "| 2 | -100 | 100 |\n", "| 3 | -10 | 10 |\n", "| 4 | -30 | 30 |\n", "| 5 | -50 | 50 |\n", "| 6 | 0 | 0 |\n", "\n" ], "text/plain": [ " qvdiff absqvdiff\n", "1 0 0 \n", "2 -100 100 \n", "3 -10 10 \n", "4 -30 30 \n", "5 -50 50 \n", "6 0 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This just keeps the new created columns\n", "sat.dat.transmute<-transmute(sat.dat, qvdiff=SATQ-SATV, absqvdiff=abs(qvdiff))\n", "head(sat.dat.transmute)" ] }, { "cell_type": "markdown", "id": "5f9099de", "metadata": {}, "source": [ "# `summarize`: Collapse many values down to a single summary\n", "\n", "This functions collapses a data frame to a single row" ] }, { "cell_type": "code", "execution_count": 17, "id": "b43a2b4f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 1
mean.act
<dbl>
28.54714
\n" ], "text/latex": [ "A data.frame: 1 × 1\n", "\\begin{tabular}{l}\n", " mean.act\\\\\n", " \\\\\n", "\\hline\n", "\t 28.54714\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 1\n", "\n", "| mean.act <dbl> |\n", "|---|\n", "| 28.54714 |\n", "\n" ], "text/plain": [ " mean.act\n", "1 28.54714" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summarize(sat.dat, mean.act = mean(ACT))" ] }, { "cell_type": "markdown", "id": "e04ce731", "metadata": {}, "source": [ "Ok, so this is not very useful, right? It gets so though if we combine it with `group_by` from **tidyverse** as well." ] }, { "cell_type": "code", "execution_count": 18, "id": "99bbf3d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\n", "
A tibble: 2 × 2
gendermean.act
<fct><dbl>
128.78543
228.41722
\n" ], "text/latex": [ "A tibble: 2 × 2\n", "\\begin{tabular}{ll}\n", " gender & mean.act\\\\\n", " & \\\\\n", "\\hline\n", "\t 1 & 28.78543\\\\\n", "\t 2 & 28.41722\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 2 × 2\n", "\n", "| gender <fct> | mean.act <dbl> |\n", "|---|---|\n", "| 1 | 28.78543 |\n", "| 2 | 28.41722 |\n", "\n" ], "text/plain": [ " gender mean.act\n", "1 1 28.78543\n", "2 2 28.41722" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This first groups by gender and then calculates the mean ACT within each category\n", "by_gender <- group_by(sat.dat, gender)\n", "summarize(by_gender, mean.act = mean(ACT))" ] }, { "cell_type": "code", "execution_count": 19, "id": "ff8ef275", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1m\u001b[22m`summarise()` has grouped output by 'gender'. You can override using the\n", "`.groups` argument.\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A grouped_df: 12 × 3
gendereducationmean.act
<fct><fct><dbl>
1029.03704
1126.70000
1226.65217
1328.56250
1428.94118
1530.82609
2026.06667
2128.12000
2227.33333
2328.18462
2429.44828
2529.01053
\n" ], "text/latex": [ "A grouped\\_df: 12 × 3\n", "\\begin{tabular}{lll}\n", " gender & education & mean.act\\\\\n", " & & \\\\\n", "\\hline\n", "\t 1 & 0 & 29.03704\\\\\n", "\t 1 & 1 & 26.70000\\\\\n", "\t 1 & 2 & 26.65217\\\\\n", "\t 1 & 3 & 28.56250\\\\\n", "\t 1 & 4 & 28.94118\\\\\n", "\t 1 & 5 & 30.82609\\\\\n", "\t 2 & 0 & 26.06667\\\\\n", "\t 2 & 1 & 28.12000\\\\\n", "\t 2 & 2 & 27.33333\\\\\n", "\t 2 & 3 & 28.18462\\\\\n", "\t 2 & 4 & 29.44828\\\\\n", "\t 2 & 5 & 29.01053\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A grouped_df: 12 × 3\n", "\n", "| gender <fct> | education <fct> | mean.act <dbl> |\n", "|---|---|---|\n", "| 1 | 0 | 29.03704 |\n", "| 1 | 1 | 26.70000 |\n", "| 1 | 2 | 26.65217 |\n", "| 1 | 3 | 28.56250 |\n", "| 1 | 4 | 28.94118 |\n", "| 1 | 5 | 30.82609 |\n", "| 2 | 0 | 26.06667 |\n", "| 2 | 1 | 28.12000 |\n", "| 2 | 2 | 27.33333 |\n", "| 2 | 3 | 28.18462 |\n", "| 2 | 4 | 29.44828 |\n", "| 2 | 5 | 29.01053 |\n", "\n" ], "text/plain": [ " gender education mean.act\n", "1 1 0 29.03704\n", "2 1 1 26.70000\n", "3 1 2 26.65217\n", "4 1 3 28.56250\n", "5 1 4 28.94118\n", "6 1 5 30.82609\n", "7 2 0 26.06667\n", "8 2 1 28.12000\n", "9 2 2 27.33333\n", "10 2 3 28.18462\n", "11 2 4 29.44828\n", "12 2 5 29.01053" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This first groups by gender and education and then calculates the mean ACT within each combined category\n", "by_gender <- group_by(sat.dat, gender, education)\n", "summarize(by_gender, mean.act = mean(ACT))" ] }, { "cell_type": "markdown", "id": "adf63282", "metadata": {}, "source": [ "# The pipe (%>%) operator \n", "\n", "Ok, so what is the fuss about all this if, in the end, most of what we've just seen can be done without having to use any of the above functions. For example:" ] }, { "cell_type": "code", "execution_count": 20, "id": "dbc7bc9b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 7
XgendereducationageACTSATVSATQ
<int><fct><fct><int><int><int><int>
429503142726550520
529504123331600550
629518152628640640
829529131922520560
1129578132332760710
1329617143221600600
\n" ], "text/latex": [ "A data.frame: 6 × 7\n", "\\begin{tabular}{r|lllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t4 & 29503 & 1 & 4 & 27 & 26 & 550 & 520\\\\\n", "\t5 & 29504 & 1 & 2 & 33 & 31 & 600 & 550\\\\\n", "\t6 & 29518 & 1 & 5 & 26 & 28 & 640 & 640\\\\\n", "\t8 & 29529 & 1 & 3 & 19 & 22 & 520 & 560\\\\\n", "\t11 & 29578 & 1 & 3 & 23 & 32 & 760 & 710\\\\\n", "\t13 & 29617 & 1 & 4 & 32 & 21 & 600 & 600\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 7\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> |\n", "|---|---|---|---|---|---|---|---|\n", "| 4 | 29503 | 1 | 4 | 27 | 26 | 550 | 520 |\n", "| 5 | 29504 | 1 | 2 | 33 | 31 | 600 | 550 |\n", "| 6 | 29518 | 1 | 5 | 26 | 28 | 640 | 640 |\n", "| 8 | 29529 | 1 | 3 | 19 | 22 | 520 | 560 |\n", "| 11 | 29578 | 1 | 3 | 23 | 32 | 760 | 710 |\n", "| 13 | 29617 | 1 | 4 | 32 | 21 | 600 | 600 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ\n", "4 29503 1 4 27 26 550 520 \n", "5 29504 1 2 33 31 600 550 \n", "6 29518 1 5 26 28 640 640 \n", "8 29529 1 3 19 22 520 560 \n", "11 29578 1 3 23 32 760 710 \n", "13 29617 1 4 32 21 600 600 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Here filtering observations of one gender\n", "head(sat.dat[sat.dat$gender==1,])" ] }, { "cell_type": "code", "execution_count": 21, "id": "28d7c46d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 8
XgendereducationageACTSATVSATQqvdiff
<int><fct><fct><int><int><int><int><int>
129442231924500500 0
229457232335600500-100
329498232021480470 -10
429503142726550520 -30
529504123331600550 -50
629518152628640640 0
\n" ], "text/latex": [ "A data.frame: 6 × 8\n", "\\begin{tabular}{r|llllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ & qvdiff\\\\\n", " & & & & & & & & \\\\\n", "\\hline\n", "\t1 & 29442 & 2 & 3 & 19 & 24 & 500 & 500 & 0\\\\\n", "\t2 & 29457 & 2 & 3 & 23 & 35 & 600 & 500 & -100\\\\\n", "\t3 & 29498 & 2 & 3 & 20 & 21 & 480 & 470 & -10\\\\\n", "\t4 & 29503 & 1 & 4 & 27 & 26 & 550 & 520 & -30\\\\\n", "\t5 & 29504 & 1 & 2 & 33 & 31 & 600 & 550 & -50\\\\\n", "\t6 & 29518 & 1 & 5 & 26 & 28 & 640 & 640 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 8\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> | qvdiff <int> |\n", "|---|---|---|---|---|---|---|---|---|\n", "| 1 | 29442 | 2 | 3 | 19 | 24 | 500 | 500 | 0 |\n", "| 2 | 29457 | 2 | 3 | 23 | 35 | 600 | 500 | -100 |\n", "| 3 | 29498 | 2 | 3 | 20 | 21 | 480 | 470 | -10 |\n", "| 4 | 29503 | 1 | 4 | 27 | 26 | 550 | 520 | -30 |\n", "| 5 | 29504 | 1 | 2 | 33 | 31 | 600 | 550 | -50 |\n", "| 6 | 29518 | 1 | 5 | 26 | 28 | 640 | 640 | 0 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ qvdiff\n", "1 29442 2 3 19 24 500 500 0 \n", "2 29457 2 3 23 35 600 500 -100 \n", "3 29498 2 3 20 21 480 470 -10 \n", "4 29503 1 4 27 26 550 520 -30 \n", "5 29504 1 2 33 31 600 550 -50 \n", "6 29518 1 5 26 28 640 640 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Here creating a new score as the difference between SATQ and SATV scores\n", "new.sat.dat<-sat.dat\n", "new.sat.dat$qvdiff<-new.sat.dat$SATQ - new.sat.dat$SATV\n", "\n", "head(new.sat.dat)" ] }, { "cell_type": "markdown", "id": "a5f1179c", "metadata": {}, "source": [ "Apart from the user-friendly sintax of the **dplyr** functions, imagine that you would want to apply many transformation steps to your data frame. One should save intermediate objects or nest many functions within the same line. For example:" ] }, { "cell_type": "markdown", "id": "ce513a95", "metadata": {}, "source": [ "Nesting:" ] }, { "cell_type": "code", "execution_count": 22, "id": "3809c1db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 1
mean.act
<dbl>
20.51429
\n" ], "text/latex": [ "A data.frame: 1 × 1\n", "\\begin{tabular}{l}\n", " mean.act\\\\\n", " \\\\\n", "\\hline\n", "\t 20.51429\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 1\n", "\n", "| mean.act <dbl> |\n", "|---|\n", "| 20.51429 |\n", "\n" ], "text/plain": [ " mean.act\n", "1 20.51429" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This selects observations for gender 1, then creates a new variable and finally computes its mean \n", "summarize(mutate(filter(sat.dat, sat.dat$gender==1), qvdiff=SATQ-SATV), mean.act = mean(qvdiff, na.rm = TRUE))" ] }, { "cell_type": "markdown", "id": "f9ed7fd5", "metadata": {}, "source": [ "Saving intermediate results:" ] }, { "cell_type": "code", "execution_count": 23, "id": "23e9057f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 8
XgendereducationageACTSATVSATQqvdiff
<int><fct><fct><int><int><int><int><int>
429503142726550520-30
529504123331600550-50
629518152628640640 0
829529131922520560 40
1129578132332760710-50
1329617143221600600 0
\n" ], "text/latex": [ "A data.frame: 6 × 8\n", "\\begin{tabular}{r|llllllll}\n", " & X & gender & education & age & ACT & SATV & SATQ & qvdiff\\\\\n", " & & & & & & & & \\\\\n", "\\hline\n", "\t4 & 29503 & 1 & 4 & 27 & 26 & 550 & 520 & -30\\\\\n", "\t5 & 29504 & 1 & 2 & 33 & 31 & 600 & 550 & -50\\\\\n", "\t6 & 29518 & 1 & 5 & 26 & 28 & 640 & 640 & 0\\\\\n", "\t8 & 29529 & 1 & 3 & 19 & 22 & 520 & 560 & 40\\\\\n", "\t11 & 29578 & 1 & 3 & 23 & 32 & 760 & 710 & -50\\\\\n", "\t13 & 29617 & 1 & 4 & 32 & 21 & 600 & 600 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 8\n", "\n", "| | X <int> | gender <fct> | education <fct> | age <int> | ACT <int> | SATV <int> | SATQ <int> | qvdiff <int> |\n", "|---|---|---|---|---|---|---|---|---|\n", "| 4 | 29503 | 1 | 4 | 27 | 26 | 550 | 520 | -30 |\n", "| 5 | 29504 | 1 | 2 | 33 | 31 | 600 | 550 | -50 |\n", "| 6 | 29518 | 1 | 5 | 26 | 28 | 640 | 640 | 0 |\n", "| 8 | 29529 | 1 | 3 | 19 | 22 | 520 | 560 | 40 |\n", "| 11 | 29578 | 1 | 3 | 23 | 32 | 760 | 710 | -50 |\n", "| 13 | 29617 | 1 | 4 | 32 | 21 | 600 | 600 | 0 |\n", "\n" ], "text/plain": [ " X gender education age ACT SATV SATQ qvdiff\n", "4 29503 1 4 27 26 550 520 -30 \n", "5 29504 1 2 33 31 600 550 -50 \n", "6 29518 1 5 26 28 640 640 0 \n", "8 29529 1 3 19 22 520 560 40 \n", "11 29578 1 3 23 32 760 710 -50 \n", "13 29617 1 4 32 21 600 600 0 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "20.5142857142857" ], "text/latex": [ "20.5142857142857" ], "text/markdown": [ "20.5142857142857" ], "text/plain": [ "[1] 20.51429" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This selects observations for gender 1, then create a new variable and finally computes its mean\n", "sat.dat.subset<-sat.dat[sat.dat$gender==1,]\n", "sat.dat.subset$qvdiff<-sat.dat.subset$SATQ - sat.dat.subset$SATV\n", "\n", "head(sat.dat.subset)\n", "mean(sat.dat.subset$qvdiff, na.rm = TRUE)" ] }, { "cell_type": "markdown", "id": "6b9a2bce", "metadata": {}, "source": [ "Now is when the %>% operator from **tidyverse** brings it magic, since it allows you to apply multiple operations from left-to-right, top-to-bottom (i.e. it is read as “then”)." ] }, { "cell_type": "code", "execution_count": 24, "id": "d0d94d1d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 1
mean.act
<dbl>
20.51429
\n" ], "text/latex": [ "A data.frame: 1 × 1\n", "\\begin{tabular}{l}\n", " mean.act\\\\\n", " \\\\\n", "\\hline\n", "\t 20.51429\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 1\n", "\n", "| mean.act <dbl> |\n", "|---|\n", "| 20.51429 |\n", "\n" ], "text/plain": [ " mean.act\n", "1 20.51429" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sat.dat %>% \n", "filter(gender==1) %>%\n", "mutate(qvdiff=SATQ-SATV) %>%\n", "summarize(mean.act = mean(qvdiff, na.rm = TRUE))" ] }, { "cell_type": "markdown", "id": "e90eef5d", "metadata": {}, "source": [ "This considerably improves the readability of code. It's pretty cool, right?" ] }, { "cell_type": "code", "execution_count": 25, "id": "7c14abd0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 8
mupop_medsigmapop_iqrpop_minpop_maxpop_q1pop_q3
<dbl><dbl><dbl><dbl><int><int><dbl><dbl>
28.54714294.8235673362532
\n" ], "text/latex": [ "A data.frame: 1 × 8\n", "\\begin{tabular}{llllllll}\n", " mu & pop\\_med & sigma & pop\\_iqr & pop\\_min & pop\\_max & pop\\_q1 & pop\\_q3\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t 28.54714 & 29 & 4.82356 & 7 & 3 & 36 & 25 & 32\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 8\n", "\n", "| mu <dbl> | pop_med <dbl> | sigma <dbl> | pop_iqr <dbl> | pop_min <int> | pop_max <int> | pop_q1 <dbl> | pop_q3 <dbl> |\n", "|---|---|---|---|---|---|---|---|\n", "| 28.54714 | 29 | 4.82356 | 7 | 3 | 36 | 25 | 32 |\n", "\n" ], "text/plain": [ " mu pop_med sigma pop_iqr pop_min pop_max pop_q1 pop_q3\n", "1 28.54714 29 4.82356 7 3 36 25 32 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This was the example of the previous tutorial\n", "sat.dat %>% \n", " summarize(mu = mean(ACT), \n", " pop_med = median(ACT), \n", " sigma = sd(ACT), \n", " pop_iqr = IQR(ACT),\n", " pop_min = min(ACT), \n", " pop_max = max(ACT),\n", " pop_q1 = quantile(ACT, 0.25), # first quartile, 25th percentile\n", " pop_q3 = quantile(ACT, 0.75)) # third quartile, 75th percentile)" ] }, { "cell_type": "markdown", "id": "80a5671c", "metadata": {}, "source": [ "
Practice: Using the operator %>%, perform the following operations:\n", "
    \n", "
  1. Filter observations such that we keep only those with an age within 18 and 45 years.
  2. \n", "
  3. Create a new variable, \"SATV.avg\", as the average of SATQ and SATV.
  4. \n", "
  5. Select the columns \"gender\", \"education\", and \"SATV.avg\".
  6. \n", "
  7. Group the data frame by gender and education
  8. \n", "
  9. Summarize the data frame by computing the mean and standard deviation of \"SATV.avg\".
  10. \n", "
  11. With the resulting data frame, create a barplot using geom_bar, such that the x position corresponds to each level of education and the height of each bar (i.e. the y position) is the mean of \"SATV.avg\". In order to have a different bar per gender, pass this variable to the fill argument in aesthetics (set position=\"dodge\" to have these bars next to each other instead of stacked). Finally, use geom_errorbar to add the standard deviation as an error bar (set position=\"dodge\" again here). Visit https://r-graph-gallery.com/4-barplot-with-error-bar.html to see an example of how to generate this kind of plots
" ] }, { "cell_type": "code", "execution_count": 26, "id": "d959323e", "metadata": {}, "outputs": [], "source": [ "# Your response here" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.2" } }, "nbformat": 4, "nbformat_minor": 5 }