--- title: Managing and Manipulating Data Using R # potentially push to header subtitle: Lecture 1.1 author: Karina Salazar date: classoption: dvipsnames # for colors fontsize: 8pt urlcolor: blue output: beamer_presentation: keep_tex: true toc: true slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble # push to header? latex_engine: pdflatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) ``` # Student introductions ### Student introductions 1. Preferred name 1. Preferred pronouns 1. Academic program (and how far along) 1. GA, RA, TA, and/or job? 1. Why are you interested in this course? # About your instructor ### Karina Salazar, instructor My start in data management/statistical analysis - SPSS - evaluated retention programs within institutional research and assessment offices - student-level data on math remediation courses - College Academy for Parents, Think Tank, Assessment Institute - Stata - used loops and user-defined functions to work with national datasets (IPEDS, Survey of Earned Doctorates) Got sick of the limitations of survey data and/or available data - No survey asked questions on what I was interested in - universities pledge commitment to access, but enrollments don't tell the whole story - who do they actually recruit? - We realized "data science" could create data from publicly available data sources - Twitter - travel schedules on admissions websites ### Recruiting research program and "data science" - Python - web-scraping - connecting to Application Program Interfaces (API) (e.g., census data, Twitter, LinkedIn) - Natural Language Processing - R - R can do all "data science" tasks Python can - R can do all statistical analyses that Stata can (and more!) - R has amazing mapping capabilities Examples: - [The off-campus recruiting project](https://emraresearch.org/) - [Dissertation Defense](https://ksalazar3.github.io/defense/#/title) # What is R ### What is R According to the Inter-university consortium for political and social research [(ICPSR)](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-r.html): > R is "an alternative to traditional statistical packages such as SPSS, SAS, and Stata such that it is an extensible, open-source language and computing environment for Windows, Macintosh, UNIX, and Linux platforms. Such software allows for the user to freely distribute, study, change, and improve the software under the [Free Software Foundation's GNU General Public License](https://www.gnu.org/home.en.html)." - For more info visit [R-project.org](https://www.r-project.org/about.html) ### Base R vs. R packages There are "default" packages that come with [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html). Some of these include: - `as.character` - `print` - `setwd` And there are [R packages](http://r-pkgs.had.co.nz/intro.html) developed and shared by others. Some R packages include: - `tidyverse` - `stargazer` - `foreign` more about these in later weeks... ### Installing and Loading R packages You only need to install a package once. To install an R package use `install.package()` function. ```{r warning=FALSE, message=FALSE} #install.packages("tidyverse") ``` However, you need to load a package everytime you plan to use it. To load a package use the `library()` function. ```{r} library(tidyverse) ``` ### RStudio "[RStudio](https://www.rstudio.com/products/rstudio/features/) is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management." ![](pane_layout.png) ### R Markdown Documents [R Markdown](https://rmarkdown.rstudio.com/) produces dynamic output formats in html, pdf, MS Word, dashboards, Beamer presentations, etc. - We will be using R Markdown for lectures and homeworks. - These files names end with `.Rmd` ### R Scripts R Scripts are simply a text file containing all the same commands that you would enter on the command line of R. The "text" you can include in these files are in the form of comments. - We will be using R Markdown for some homework assignments. - These files names end with `.R` ### Why R? Capabilities of R - [Graphs](https://ggplot2.tidyverse.org/) - [Presentation](https://bookdown.org/yihui/rmarkdown/presentations.html) - [Websites](https://bookdown.org/yihui/rmarkdown/websites.html) - [Journals](https://bookdown.org/yihui/rmarkdown/journals.html) - [Interactive tutorials](https://rstudio.github.io/learnr/) - [Web apps](http://shiny.rstudio.com/) - [Dashbaords](https://rmarkdown.rstudio.com/flexdashboard/) - [Books](https://bookdown.org/) - [Web scraping](https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/) - [Maps](http://pierreroudier.github.io/teaching/20170626-Pedometrics/20170626-soil-data.html) For more info [visit](https://bookdown.org/yihui/rmarkdown/) ### Graphs - Create graphs with [ggplot2](https://ggplot2.tidyverse.org/) package ```{r echo=FALSE, warning=FALSE} # Source: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html # install.packages("ggplot2") # load package and data options(scipen=999) # turn-off scientific notation like 1e+48 library(ggplot2) theme_set(theme_bw()) # pre-set the bw theme. data("midwest", package = "ggplot2") # midwest <- read.csv("http://goo.gl/G1K41K") # bkup data source # Scatterplot gg <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point(aes(col=state, size=popdensity)) + geom_smooth(method="loess", se=F, formula = 'y ~ x') + xlim(c(0, 0.1)) + ylim(c(0, 500000)) + labs(subtitle="Area Vs Population", y="Population", x="Area", title="Scatterplot", caption = "Source: midwest") plot(gg) ``` ### Journal articles - Journal articles with [rticles](https://github.com/rstudio/rticles) package ![](rticles.png) ### Interactive web apps - Interactive web apps with [shiny](http://shiny.rstudio.com/) package ![](shiny.png) ### Mapping - Mapping with [sf](http://strimas.com/r/tidy-sf/) package & ggplot ![](sf.png) # What is this course about? ### What is data management? - All the stuff you have to do to create analysis datasets that are ready to analyze: - collect data - read/import data into statistical programming language - clean data - integrate data from multiple sources (e.g, join/merge, append) - change organizational structure of data so it is suitable for analysis - create "analysis variables" from "input variables" - Make sure that you have created analysis variables correctly ### Why I don't call this class "R for data science" Learn to walk before you can run! - "data science" implies doing "fancy" things like mapping, network analysis, web-scraping, etc. - But if you don't know how to clean data, these "fancy" analyses and visualizations will be useless - "80% of data science is data cleaning" - The skills you learn in this data management class are foundational to data science tasks! (and a prerequisite to taking data science seminar) ### Who is this class for? This class is for anyone who wants to work with data, that is people who want to be: - researchers working with survey data and doing traditional statistical analyses - researchers who want to do "data science" oriented research involving mapping, NLP, connecting to APIs - analysts working at think tanks or non-profits - "Data scientists" # Course logistics ### Course logistics - follow the syllabus # Create "R project" and directory structure ### What is an R project? Why are you doing this? What is an "R project"? - helps you keep all files for a project in one place - When you open an R project, the file-path of your current working directory is automatically set to the file-path of your R-project Why am I asking you to create R project and download a specific directory structure? - I want you to be able to run the .Rmd files for each lecture on your own computer - Sometimes these .Rmd files point to certain sub-folders - If you create R project and create directory structure I recommend, you will be able to run .Rmd files from your own computer without making any changes to file-paths! ### Follow these steps to create "R project" and directory structure 1. Download this zip folder: [LINK HERE](https://github.com/ozanj/rclass/raw/master/rclass.zip) - Unzip the folder: this is a shell of the file directory you should use for this class - Change the name to "rclass" - Move it to your preferred location (e.g, documents, desktop, dropbox, etc) 2. In RStudio, click on "File" >> "New Project" >> "Existing Directory" >> Select the rclass folder >> Create Directory 3. Save the following files in "rclass/lectures/lecture1" - lecture1.1_ua.Rmd - lecture1.1_ua.pdf - lecture1.2_ua.Rmd - lecture1.2_ua.Pdf - lecture1.2_ua.R ### After you follow these steps - you can add any additional sub-folders you want to the "rclass" folder - e.g., "syllabus", "resources" - You can add any additional files you want to the sub-directory folders you unzipped - e.g., in "rclass/lectures/lecture1" you might add an additional document of notes you took # Directories and filepaths ### Working directory __(Current) Working directory__ - the folder/directory in which you are currently working - this is where R looks for files - Files located in your current working directory can be accessed without specifying a filepath because R automatically looks in this folder Function `getwd()` shows current working directory ```{r} getwd() ``` Command `list.files()` lists all files located in working directory ```{r} getwd() list.files() ``` ### Working directory, "Code chunks" vs. "console" and "R scripts" When you run __code chunks__ in RMarkdown files (.Rmd), the working directory is set to the filepath where the .Rmd file is stored ```{r} getwd() list.files() ``` When you run code from the __R Console__ or an __R Script__, the working directory is.... Command `getwd()` shows current working directory ```{r} getwd() ``` ### Absolute vs. relative filepath **Absolute file path**: The absolute file path is the complete list of directories needed to locate a file or folder. `setwd("Users/Karina/rclass/lectures/lecture2")` **Relative file path**: The relative file path is the path relative to your current location/directory. Assuming your current working directory is in the "lecture2" folder and you want to change your directory to the data folder, your relative file path would look something like this: `setwd("../../data")` File path shortcuts | **Key** | **Description** | | ------ | -------- | | ~ | tilde is a shortcut for user's home directory (mine is my name) | | ../ | moves up a level | | ../../ | moves up two levels |