--- output: hugodown::md_document title: "Session 1: Backyard Birds" subtitle: "Starting with RStudio Projects and reading in our data" summary: "In the first session of Code Club, we'll make sure that everyone is properly set up, create an RStudio Project, and start working with some data from the Great Backyard Bird Count." authors: [admin] tags: [codeclub, backyard-birds] categories: [] date: 2020-11-04 lastmod: 2020-11-04 featured: false draft: false toc: true # Featured image # To use, add an image named `featured.jpg/png` to your page's folder. # Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight. image: caption: "Red-breasted Nuthatch, *Sitta canadensis*" focal_point: "" preview_only: false # Projects (optional). # Associate this post with one or more of your projects. # Simply enter your project's folder or file name without extension. # E.g. `projects = ["internal-project"]` references `content/project/deep-learning/index.md`. # Otherwise, set `projects = []`. projects: [] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval=TRUE) ```


----- ## Prep homework {#prep} #### Basic computer setup If you didn't already do this, please follow the [Code Club Computer Setup](/codeclub-setup/) instructions. #### Test if it works Please open RStudio locally or [start an OSC RStudio Server session](/codeclub-setup/#osc-run-rstudio). **Nov 19 addition: If you're working locally, test if you can load the *tidyverse* package with `library("tidyverse")` inside R.** (If you haven't installed the *tidyverse* yet, please go to the [Code Club Computer Setup](/codeclub-setup/#install-tidy) instructions.) If you have not used RStudio before, take a moment to explore what's in the panels and tabs. (It may help to check out [Mike Sovic's 1-minute intro to the RStudio interface](https://www.youtube.com/watch?v=ByxF3xjN2JQ&list=PLxhIMi78eQegFm3XqsylVa-Lm7nfiUshe&t=2m15s) or [RStudio's 3-minute intro](https://fast.wistia.net/embed/iframe/520zbd3tij?videoFoam=true).) If you're able to do so, please open RStudio again a bit before Code Club starts -- and in case you run into issues, please join the Zoom call early and we'll troubleshoot. #### New to R? If you're completely new to R, it will be useful to have a look at some of the resources listed on our [New to R?](/codeclub-novice/) page prior to Code Club.
---- ## Slides On Friday, we started with a couple of [introductory slides](/slides/CC01/).
---- ---- ## 1 - Create an RStudio Project Projects are an RStudio-specific concept that create a special file (`.Rproj`), primarily to designate a directory as the working directory for everything within it. We recommend *creating exactly one separate Project for each research project* with an R component -- and for things like Code Club.
**Why use Projects?** In brief, Projects help you to organize your work and to make it more portable. - They record which scripts (and R Markdown files) are open in RStudio, and will reopen all of those when you reopen the project. This becomes quite handy, say, when you work on three different projects, each of which uses a number of scripts. - When using Projects, you generally don't have to manually set your working directory, and can use *relative file paths* to refer to files within the project. This way, even if you move the project directory, or copy it to a different computer, the same paths will still work. (This would not be the case if you used `setwd()` which will generally require you to use an absolute path, e.g. `setwd("C:/Users/Jelmer/Documents/")`.) - Projects encourage you to organize research projects inside self-contained directories, rather than with files spread around your computer. This can save you a lot of headaches and increases reproducibility. And because R will restart whenever you switch Projects, there is no risk of unwanted cross-talk between your projects.

**Let's create an RStudio Project for Code Club:** - Open RStudio locally or [start an OSC RStudio Server session](/codeclub-setup/#osc-run-rstudio). (*If you're at OSC*, you should see a file `0_CODECLUB.md` that's open in your top-left panel. You can ignore/close this file.) - *If you're working locally*, create a directory wherever you like on your computer for all things Code Club. You can do this in R using `dir.create("path/to/your/dir")`, or outside of R. (*If you're at OSC*, skip this step because you're automatically inside a Code Club-specific, personal directory.) - Click `File` (top menu bar) > `New Project`, and then select `Existing Directory`. - *If you're working locally*, select the Code Club directory that you created in the previous step. - *If you're working at OSC*, keep the default choice "`~`" (i.e., *home*), which is the directory you started in when entering the RStudio Server session. - After RStudio automatically reloads, you should see the file ending in `.Rproj` in the RStudio `Files` tab in the lower right pane, and you will have the Project open. All done for now! (For future Code Club sessions: RStudio will by default reopen the most recently used Project, and therefore, OSC users will have the Project automatically opened. If you're working locally and are also using other Projects, you can open this Project with `File` > `Open Project` inside RStudio, or by clicking the `.Rproj` file in your file browser, which will open RStudio *and* the Project.)
---- ## 2 - Orienting ourselves #### Where are we? We don't need to set our working directory, because our newly created Project is open, and therefore, our working directory is the directory that contains the `.Rproj` file. To see where you are, type or copy into the console (bottom left): ```{r, eval = FALSE} # Print the working directory: getwd() # List the files in your current directory: dir() # This should print at least the `.RProj` file. ``` #### Create directories Create two new directories -- one for this session, and one for a dataset that we will download shortly (and will be reusing across sessions): ```{r, eval = FALSE} # Dir for Code Club Session 1: dir.create("S01") # Dir for our bird data: # ("recursive" to create two levels at once.) dir.create("data/birds/", recursive = TRUE) ``` #### Create a script To keep a record of what we are doing, and to easily modify and rerun earlier commands, we'll want to save our commands in a script and execute them from there, rather than typing our commands directly in the console. - Click `File` (top menu bar) > `New File` > `R script`. - Save the script (`File` > `Save`) as `S01.R` inside your `S01` directory. #### First line of the script We will now load the core set of 8 *tidyverse* packages all at once. To do so, type/copy the command below on the first line of the script, and then **execute it** by clicking `Run` (top right of script pane) or by pressing `Ctrl Enter` (Windows/Linux, this should also work in your browser) or `⌘ Enter` (Mac). ```{r, eval = TRUE} # If you're working locally, and did not install it yet: # install.packages("tidyverse") # Load the tidyverse (meta)package: library(tidyverse) ``` If this worked, you should get the same output as shown in the code block above: it attached 8 packages, and it warns that some of its functions are now "masking" base R functions.
The *tidyverse* is a very popular and useful ecosystem of R packages for data analysis, which we will be using a lot in Code Club. When we refer to "*base R*" as opposed to the *tidyverse*, we mean functions that are loaded in R by default (without loading a package), and that can perform similar operations in a different way.

---- ## 3 - Getting our dataset We downloaded a Great Backyard Bird Count (GBBC) [dataset](https://www.gbif.org/dataset/82cb293c-f762-11e1-a439-00145eb45e9a) from the [Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/). Because the file was 3.1 GB large, we selected only the records from Ohio and removed some uninformative columns. We also added columns with English names and the breeding range for each species. We'll download the resulting much smaller file (41.5 MB) from our Github repo.
### The Great Backyard Bird Count

The [GBBC](https://gbbc.birdcount.org/) is an annual citizen science event where everyone is encouraged to to identify and count birds in their backyard -- or anywhere else -- for at least 15 minutes, and report their sightings online. Since 2013, it is a global event, but it has been organized in the US and Canada since 1998.
#### Download the data Let's download the dataset using the `download.file()` function: ```{r} # The URL to our file: birds_file_url <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/birds/backyard-birds_Ohio.tsv" # The path to the file we want to download to: birds_file <- "data/birds/backyard-birds_Ohio.tsv" # Download: download.file(url = birds_file_url, destfile = birds_file) ``` #### Read the data Now, let's read the file into R. The `.tsv` extension ("tab-separated values") tells us this is a plain text file in which columns are separated by tabs, so we will use a convenience function from the *readr* package (which is loaded as part of the core set *tidyverse* packages) for exactly this type of file: ```{r} # Read the data: birds <- read_tsv(file = birds_file) ``` Done! We have now read our data into a *tibble*, which is a type of data frame (formally a *data.frame*): R's object class to deal with tabular data wherein each column can contain a different type of data (numeric, characters/strings, etc).
---- ## 4 - Exploring backyard birds :::puzzle ### Exercise 1 **What's in the dataset?** - Explore the dataset using some functions and methods you may know to get a quick overview of data(frames), and try to understand what you see. What does a single row represent, and what is in each column? (Be sure to check out the hints below at some point, especially if you're stuck.) - Pay attention to the data types (e.g., "character" or `chr`) of the different columns, which several of these functions print. The output of our `read_tsv()` command also printed this information -- this function parsed our columns as the types we see now. Were all the columns parsed correctly? - How many rows and how many columns does the dataset have? - What are some questions you would like to explore with this dataset? We'll collect some of these and try to answer them in later sessions. If your group has sufficient R skills already, you are also welcome to go ahead and try to answer one or more of these questions.
Hints (click here)
```{r, eval = FALSE} # Type an object's name to print it to screen: birds # Same as above, but explicitly calling print(): print(birds) # For column-wise information (short for "structure"): str(birds) # tidyverse version of str(): glimpse(birds) # In RStudio, open object in a separate tab: View(birds) ``` - Note that in R, `dbl` (for "double") and `num` (for "numeric") are both used, and almost interchangeably so, for floating point numbers. (Integers are a separate type that are simply called "integers" and abbreviated as `int`, but we have no integer columns in this dataset.) - `read_tsv()` parsed our date as a "date-time" (`dttm` or `POSIXct` for short), which contains both a date and a time. In our case, it looks like the time is always "00:00:00" and thus doesn't provide any information.

Solutions (click here)
```{r} # Just printing the glimpse() output, # which will show the number of rows and columns: glimpse(birds) ``` ```{r} # You can also check the number of rows and columns directly using: dim(birds) # Will return the number of rows and columns nrow(birds) # Will return the number of rows ncol(birds) # Will return the number of columns ```
:::

---- ---- ## Bonus material If your breakout group is done with Exercise 1, you can have a look at the bonus material below which includes another exercise. You can also have a look at this as homework. Or not at all!
### `readr` options for challenging files Earlier, we successfully read in our file without specifying any arguments other than the file name to the `read_tsv()` function, i.e. with all the default options. It is not always this easy! Some options for more complex cases: - The more general counterpart of this function is `read_delim()`, which allows you to specify the delimiter using the `sep` argument, e.g. `delim="\t"` for tabs. - There are also arguments to these functions for when you need to skip lines, when you don't have column headers, when you need to specify the column types of some or all the columns, and so forth -- see this example: ```{r, eval = FALSE} my_df <- read_delim( file = "file.txt", delim = "\t", # Specify tab as delimiter col_names = FALSE, # First line is not a header skip = 3, # Skip the first three lines comment = "#", # Skip any line beginning with a "#" col_types = cols( # Specify column types col1 = col_character(), # ..We only need to specify columns for col2 = col_double() # ..which we need non-automatic typing ) ) ```

:::puzzle ### Exercise 2 (Optional) **Read this file!** Try to read the following file into R, which is a modified and much smaller version of the bird dataset. Make the function parse the "order" column as a factor, and the "year", "month", and "day" columns as whatever you think is sensible. ```{r} # Download and read the file: birds2_file_url <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/birds/backyard-birds_read-challenge.txt" birds2_file <- "data/birds/backyard-birds_read-challenge.txt" download.file(url = birds2_file_url, destfile = birds2_file) ``` ```{r, eval = FALSE} # Your turn! birds2 <- read_ # Complete the command ```
Hints (click here)
- The file is saved as `.txt`, so the delimiter is not obvious -- first have a look at it (open it in RStudio, a text editor, or the terminal) to determine the delimiter. Then, use `read_delim()` with manual specification of the delimiter using the `delim` argument, or use a specialized convenience function. - Besides a leading line with no data, there is another problematic line further down. You will need both the `skip` and `comment` arguments to circumvent these. - Note that *readr* erroneously parses `month` as a character column if you don't manually specify its type. - Note that you can also use a succinct column type specification like `col_types = "fc"`, which would parse, for a two-column file, the first column as a factor and the second as a character -- type e.g. `?read_tsv` for details.

Bare solution (click here)
```{r} # With succint column type specification: birds2 <- read_csv( file = birds2_file, skip = 1, comment = "$", col_types = "fcdiii" ) # With long column type specification: birds2 <- read_csv( file = birds2_file, skip = 1, comment = "$", col_types = cols( order = col_factor(), year = col_integer(), month = col_integer(), day = col_integer() ) ) ```

Solution with explanations (click here)
```{r} # With succinct column type specification: birds2 <- read_csv( # `read_csv()`: file is comma-delimited file = birds2_file, skip = 1, # First line is not part of the dataframe comment = "$", # Line 228 is a comment that starts with `$` col_types = "fcdiii" # "f" for factor, "c" for character, ) # .."d" for double (=numeric), # .."i" for integer. # With long column type specification: birds2 <- read_csv( file = birds2_file, skip = 1, comment = "$", col_types = cols( # We can omit columns for which we order = col_factor(), # ..accept the automatic parsing, year = col_integer(), # ..when using the long specification. month = col_integer(), day = col_integer() ) ) ```

:::
### Other options for reading tabular data There are also functions in *base R* that read tabular data, such as `read.table()` and `read.delim()`. These are generally slower than the *readr* functions, and have less sensible default options to their arguments. Particularly relevant is how columns with characters (strings) are parsed -- until R 4.0, which was released earlier this year, base R's default behavior was to parse them as **factors**, and this is generally not desirable^[You can check which version of R you are running by typing `sessionInfo()`. You can also check directly how strings are read by default with `default.stringsAsFactors()`. To avoid conversion to factors, specify `stringsAsFactors = FALSE` in your `read.table()` / `read.delim()` function call.]. *readr* functions will never convert columns with strings to factors. If speed is important, such as when reading in very large files (~ 100s of MBs or larger), you should consider using the `fread()` function from the *data.table* package. Finally, some examples of reading other types of files: - Read excel files directly using the *readxl* package. - Read Google Sheets directly from the web using the *googlesheets4* package. - Read non-tabular data using the base R `readLines()` function.