---
title: "Reading data files"
editor: 
  markdown: 
    wrap: 72
---

## Introduction

-   First thing we need to do is to read in data, so that we can use our
    software to analyze.
-   Consider these:
    -   Spreadsheet data saved as `.csv` file.
    -   "Delimited" data such as values separated by spaces.
    -   Actual Excel spreadsheets.

## Packages for this section

```{r readfile-R-1}
library(tidyverse)
```

## A spreadsheet

![](spreadsheet.png)

## Save as `.csv`

-   `.csv` or "comma-separated values" is a way of turning spreadsheet
    values into plain text.
-   Easy to read into R
-   but does not preserve formulas. (This is a reason for doing all your
    calculations in your statistical software, and only having data in
    your spreadsheet.)
-   File, Save As Text CSV (or similar).
-   used name `test1.csv`.

## The `.csv` file

```         
id,x,y,group
p1,10,21,upper
p2,11,20,lower
p3,13,25,upper
p4,15,27,lower
p5,16,30,upper
p6,17,31,lower
```

To read this in:

-   Fire up R Studio at `r.datatools.utoronto.ca`
-   Upload this .csv file. (Bottom right, next to New Folder, Upload.)
    Click Choose File, find the file, click Open. Click OK. See the file
    appear bottom right.

## Make a new Quarto document

-   File, New File, Quarto Document
-   ...and get rid of the template document (leaving the first four
    lines).
-   Make a code chunk and in it put this. Run it.

```{r readfile-R-2, eval=F}
library(tidyverse)
```

## Reading in the file

-   Use `read_csv` with the name of the file, in quotes. Save the
    read-in file in something, here called `mydata`. Make a new code
    chunk for this:

```{r readfile-R-3}
mydata <- read_csv("test1.csv")
mydata
```

## More on the above

-   read_csv guesses what kind of thing is in each column. Here it
    correctly guesses that:
    -   id and group are text (categorical variables). id is actually
        "identifier variable": identifies individuals.
    -   x and y are "double": numbers that might have a decimal point in
        them.

## R Studio on your own computer

-   Put the .csv file in the same folder as your project. Then read it
    in as above like `read_csv("test1.csv")`.
-   Or, use

```{r, eval=FALSE}
# f <- file.choose()
f
```

which brings up a file selector (as if you were going to find a file to
load or save it). Find your `.csv` file, the address of which will be
saved in `f`, and then:

```{r, eval=FALSE}
mydata <- read_csv(f)
```

-   When you have selected the file, comment out the `file.choose` line
    by putting a \# on the front of it. That will save you having to
    find the file again by mistake. (Keyboard shortcut: go to the line,
    type control-shift-C or Mac equivalent with Cmd.)

## Looking at what we read in

-   Again, type the name of the thing to display it:

\scriptsize

```{r readfile-R-4}
mydata
```

\normalsize

-   This is a "tibble" or data frame, the standard way of storing a data
    set in R.
-   Tibbles print as much as will display on the screen. If there are
    more rows or columns, it will say so.
-   You will see navigation keys to display more rows or columns (if
    there are more).

## `View`-ing your data frame

-   Another way to examine your data frame is to View it, like this:

```{r readfile-R-5, eval=F}
View(mydata)
```

...or find your data frame in the Global Environment top right and click
it. - This pops up a "data frame viewer" top left:

![](viewview.png){height="500px"}

## This View

-   Read-only: cannot edit data
-   Can display data satisfying conditions: click on Filter, then:
    -   for a categorical variable, type name of category you want
    -   for a quantitative variable, use slider to describe values you
        want.
-   Can sort a column into ascending or descending order (click little
    arrows next to column name).
-   Clicking the symbol with arrow on it left of Filter "pops out" View
    into separate (bigger) window.

## Summarizing what we read in

-   It is always a good idea to look at your data after you have read it
    in, to make sure you have believable numbers (and the right number
    of individuals and variables).
-   Quick check for errors: these often show up as values too high or
    too low, so the min and/or max will be unreasonable.
-   Five-number summary:

```{r readfile-R-6}
summary(mydata)
```

-   Quantitative, five-number summary plus mean.
-   Categorical, how many rows.

## Reading from a URL

-   Any data file on the Web can be read directly.
-   [Example data link:](http://ritsokiguess.site/datafiles/global.csv)
-   Use URL instead of filename.
-   I like to save the URL in a variable first (because URLs tend to be
    long), and then put that variable in the `read_` function:

```{r readfile-R-7}
my_url <- "http://ritsokiguess.site/datafiles/global.csv"
my_url
global <- read_csv(my_url)
```

## The data

```{r readfile-R-8}
global
```

## Space-delimited files

-   Another common format for data is a text file with the values
    separated by spaces. Top of some other data:

```         
cup tempdiff            
Starbucks 13            
Starbucks 7        
Starbucks 7        
Starbucks 17.5     
Starbucks 10       
Starbucks 15.5     
Starbucks 6        
Starbucks 6        
SIGG 12            
SIGG 16            
SIGG 9             
SIGG 23            
SIGG 11            
SIGG 20.5          
SIGG 12.5      
SIGG 20.5      
SIGG 24.5      
CUPPS 6    
CUPPS 6    
CUPPS 18.5 
CUPPS 10   
```

## Reading the coffee data

-   This file was on my computer so I uploaded it to
    `r.datatools.utoronto.ca` first.
-   This time, `read_delim`, and we also have to say what the thing is
    separating the values:

```{r readfile-R-9}
coffee <- read_delim("coffee.txt", " ")
coffee
```

-   Name of the cup, text, and tempdiff, a decimal number.

## Looking at the values (some)

```{r readfile-R-10}
coffee
```

These were four brands of travel mug (in cup), and for each, how much
the temperature of the coffee in the mug decreased over 30 minutes.

## Reading from the Web; the soap data

-   Use the URL in place of the filename.
-   Save the URL in a variable first:

```{r readfile-R-11}
url <- "http://ritsokiguess.site/datafiles/soap.txt"
soap <- read_delim(url, " ")
```

## The soap data (some)

```{r readfile-R-12}
soap
```

## Data aligned in columns

-   Sometimes you see data aligned in columns, thus:

![](Screenshot_2019-04-24_20-16-24.png){height="400px"}

-   `read_delim` will not work: values separated by more than one space.
-   The number of spaces between values is not constant, because there
    is one fewer space before the 10.
-   `read_table` works for this.

## Reading in column-aligned data

```{r readfile-R-13}
drugs <- read_table("migraine.txt")
drugs
```

## Reading an Excel sheet directly

-   Here is [my spreadsheet](test2.xlsx) from before, but tarted up a
    bit:

![](excel.png){height="450px"}

-   It is now a workbook with a second sheet called "notes" (that we
    don't want).
-   Install package `readxl` first.

```{r}

```

## Reading it in

-   Read into R, saying that we only want the sheet "data". Upload
    spreadsheet first.
-   Excel spreadsheets must be "local": cannot read one in from a URL.

```{r readfile-R-15, warning=F}
# install.packages("readxl")
library(readxl)
mydata2 <- read_excel("test2.xlsx", sheet = "data")
mydata2
```