--- title: "Introduction to Data Manipulation" author: "ENS-215" date: "17-Jan-2025" output: html_document: df_print: paged theme: journal toc: TRUE toc_float: TRUE ---
Let's now jump into some data manipulation using the functionality in the `dplyr` package. This package is going to dramatically increase your data handling skills. You'll see how with `dplyr` data manipulation operations that may have required long, complex, and difficult to decipher code can often be reduced to a single line that is easy to code and easy to read! As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don't fully understand something. Note: The **dplyr** cheatsheet will be very helpful when working through today's lesson
## Load in the `dplyr` package We'll actually load in `tidyverse` which contains `dplyr` along with many other packages. ```{r message = FALSE, warning=FALSE} library(tidyverse) ```
## Load in the `gapminder` package We'll also load in the the `gapminder` package. This package has a great dataset to use when learning and showcasing the functionality of `dplyr`. The `gapminder` dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology) this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy. If you haven't installed the `gapminder` package yet, you should go to your package window and do this. Once you have the package let's load it in. ```{r} library(gapminder) ```
## Look at the Gapminder dataset Let's print out some of the data and take a look. ```{r} gapminder ```
Let's also get a `summary()` to learn a bit more ```{r} summary(gapminder) ```
You should take a minute to learn more about the dataset by typing `?gapminder` in your CONSOLE. Also take a minute to look at the data by using the `View()` function. Do this in your CONSOLE. You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select,... the data. `dplyr` is going to come to the rescue here. Before we proceed, let's assign the gapminder data to a new object (just so you aren't worrying about messing up the original data) ```{r} gap_data <- gapminder ```
## Using `filter()` to subset rows of data No longer do we need to use complex logical statements as our indices when trying to subset data along rows (_recall the complex statements that you would put into the `[]` when subsetting_). `filter()` makes this operation painless. The `filter()` function takes your logical comparison and selects all of the rows for which the test is `TRUE`. ```{r} filter(gap_data, gdpPercap > 5000) ``` We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy! We can add even more conditions to filter on. ```{r} filter(gap_data, gdpPercap > 5000, year == 2007) ``` We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.
You see that the syntax is `filter(dataset, criteria 1, criteria 2,..., criteria n)` Note that `filter()` returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code. Furthermore, you'll see that since we didn't assign the output from `filter()` to a data object, we haven't created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments ```{r} data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007) ``` However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that generally we can pass the filtered data directly to another function (e.g. a plotting function, mean, sum,...) which may eliminate the need to creating tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).
Here's how you would filter based on a variable that has text values ```{r} filter(gap_data, gdpPercap > 5000, continent == "Asia") ```
Now I want you to try a few more filter operations on the `gap_data`. Take your time and get comfortable with the basics here. ```{r} # Your code here ```
### `filter()` and `in` If you want to filter based on several values for a given variable, you could use the OR `|` operator in your criteria. For example, ```{r} filter(gap_data, country == "Canada" | country == "China") ``` This will select the rows for the both Canada and China. What I did above works, but it is cumbersome and somewhat difficult to read the code. We can achieve the same result and greatly simplify our code ```{r} filter(gap_data, country %in% c("Canada", "China")) ```
You can imagine if you have a very long vector of items, the `%in%` approach is much easier than writing a bunch of OR `|` statements. Try filtering the data to get only rows with Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use this vector in your `filter()` statement. This will make your code much, much easier to read then defining the vector in your `filter()` directly after the `%in%` (as I did above). ```{r} # Your code here ```
### `filter()` is pretty nice right? Remember those days before `dplyr` when we wanted to select rows from a data frame according to criteria. **In base R** ```{r} gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R ```
**With `dplyr`** ```{r} filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr ``` Both of the above code blocks do the exact same thing: select the rows in `gap_data` where the year is 2007 AND the per capita GDP > 5000. The `dplyr` method is so much easier to code and read.
## Sort data with `arrange()` Let's use the `arrange()` function to sort our dataset. We'll save to sorted data to another object, so that you can take a look at it easier. ```{r} gap_sorted <- arrange(gap_data, gdpPercap) ``` `View()` the sorted data (remember to type do this in the console) and confirm that everything worked. `arrange()` sorts data in ascending order (smallest to largest). To sort data in descending order use the `desc()` function ```{r} gap_sorted <- arrange(gap_data, desc(gdpPercap)) ``` Take a look and make sure everything looks good. Were you surprised to see the country that had the highest GDP (and the year when this occurred)? As a interesting aside, discuss with your neighbor why you think this might be the case.
Test out a few more sorts on the `gap_data` using the `arrange()` function ```{r} # Your code here ```
### `arrange()` by multiple variables You can sort data according to multiple variables using the `arrange()` function. Below we'll sort the `gap_data` by continent first and then by life expectancy. This will give us a sorted dataset that will have the life expectancies sorted within each continent ```{r} gap_sorted <- arrange(gap_data, continent, lifeExp) ```
That worked well, but you'll see that we didn't consider the year in our sorting. Let's now sort by continent, then year, then life expectancy ```{r} # your code here ``` Take a look at the data. Does it look different than when we sorted without the year. Does this particular organization seem more useful? ```{r eval = FALSE, echo = FALSE} gap_sorted <- arrange(gap_data, continent, year, lifeExp) ```
Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame. ```{r} # Your code here ``` ```{r eval = FALSE, echo = FALSE} gap_sorted <- arrange(gap_data, continent, desc(year), lifeExp) ```
## Piping Before we go further, I want to introduce the **pipe operator**. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another. + The pipe operator is `%>%` (you can also use `|>` as the pipe operator) + Prounced "_then_" when reading code in "English" + Shortcut to insert `%>%` is + PC: CTRL + SHIFT + M + MAC: COMMAND + SHIFT + M The pipe `%>%` operator allows you to 1. Take **output** of one function and **pipe** it as **input** to the next function 2. String together many **pipes** to create a single chain of operations
Let's take a look at an example ```{r} filter(gap_data, year == 2007, continent == "Asia") %>% arrange(gdpPercap) ``` The above code first filtered the data to select only the observations for year 2007 and in Asia. Then it sent this filtered dataset `arrange()` where it was sorted by per capita GDP. Notice, how I did not specify the data frame to use in `arrange()`. This is because the **pipe** operator passes the output from the previous operation so this output is necessarily the input to the `arrange()` function.
We could have piped the `gap_data` to the `filter()` function to make it even easier to read. ```{r} gap_data %>% filter(year == 2007, continent == "Asia") %>% arrange(gdpPercap) ```
The code below does the exact same thing but without the **pipe** operator. See how much easier it is to read the code that used **pipe**. With the **piped** code we can read the sequence of operation from left to right, whereas in the nested code below we had to read from inside out (which is much more difficult from a human perspective). ```{r} arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator ```
FYI, the pipe operator is actually in the `magrittr` package, but this package is part of `tidyverse` so it loads in everytime we load `tidyverse`.
## `select()` columns We can select a subset of columns from a data frame using the `select()` function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested. Let's give it a test with our `gap_data`. We'll select the country, year, lifeExp, and gdpPercap columns ```{r} select(gap_data, country, year, lifeExp, gdpPercap) ```
You can also rename the columns when you select them. Let's give that a try and rename the `gdpPercap` variable to `GDP_percap` ```{r} select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap) ``` This is helpful when the original dataset has column names that we don't like for some reason (e.g. not meaningful, too long,...)
Try selecting and renaming some variables ```{r} # Your code here ```
If you want to select most of the columns, then it is easier to specify which ones you don't want to keep as opposed to which ones you want to keep. Imagine we want all of the columns in `gap_data` except for `pop` (population) and `continent`. It is less typing to specify the ones we don't want. We can do this as follows ```{r} gap_data %>% select(-pop,-continent) ``` ## Exercises Using the `gap_data` do the following: 1. Create a new data frame that meets the following conditions + Only has entries with years after 1980 + Only has entries in Asia and Africa + Data is sorted by continent and then by life expectancy Think about what the above data can tell you
2. Create a new data frame that meets the following conditions + Has the continent column removed + Only has entries for year 2007 + Only has entries for life expectancies less than 60 + Only has entries with with a gdp per capita that is greater than 10000 Think about what the above data can tell you
3. Create a new data frame that meets the following conditions + Has entries with life expectancies greater than the mean life expectancy of all the data + Has entries with populations greater than 20 million Think about what the above data can tell you
4. Which country had the lowest life expectancy in 1977? 5. Which country in Europe had the lowest life expectancy in 2007? 6. What were the top countries by total GDP (note that this is different from per capita GDP) in 1952?
Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate and use the pipe and `dplyr` tools to help answer them ```{r} # Your code blocks here ```