--- title: "Week 2: Markdown and introduction" format: html: default code-annotations: select --- # Introduction Welcome! Each week, in-class we will be answering questions on the reading and performing analyses based on a code notebook. Throughout these sessions we will be replicating the analysis from [Beheshti et. al. 2021](https://pubmed.ncbi.nlm.nih.gov/33892837/)[^1] [^1]: Beheshti, Mahdieh et al. "Association of Diabetes and Dental Caries Among U.S. Adolescents in the NHANES Dataset." Pediatric dentistry vol. 43,2 (2021): 123-128. This is an analysis of the association between diabetes and dental caries in U.S. adolescents. By the end of the semester we will be able to replicate and extend the analysis in this paper. # Code Notebooks {#sec-notebook} A **code notebook** is a document which typically consists of different **chunks**. Each chunk is either code or text. There are a variety of different notebook platforms for different languages, such as Jupyter notebooks in Python. In R, notebooks have historically been written using R Markdown. However, recently Quarto has been created by Posit (the organization behind RStudio) as an updated version of R Markdown. R Markdown/Quarto notebooks can be *rendered* do different formats such as html (a webpage viewable in your web browser), pdf, Word, powerpoint, and others. Their power lies in their ability to make code an output document. We can write our report in the same document we actually perform the analysis, integrating the two together. Quarto and R Markdown syntax are almost identical. We will mainly be using Quarto in this course. ## Code Chunks You can start a and end code chunk using three back ticks "\`\`\`". To have a chunk run as R code, you need to assign the chunk using `{r}`. You can then specify options for the chunk on subsequent lines using the "hash-pipe" `|#`. Code chinks have a lot of options, but some of the most important are `label`, `eval`, `echo`, and `output`. ```{r} #| label: first-code-chunk x <- 5 x ``` ::: {.callout-tip appearance="minimal"} ## Exercise Try changing these options in the first of the two chunks below and re-rendering the document. What do each of these arguments do? Pay attention to both chunk's output. ```{r} #| label: chunk-options #| echo: true #| eval: true #| output: true y = 8 x y x <- x + y ``` ```{r} #| label: options-followup x # Show the value of x ``` ::: ## Markdown Markdown is a language used to quickly create formatted text. It's great to know as it is used in R Markdown, Quarto, Jupyter, Github documents, and many other places. A pure markdown file has a `.md` file extension. You can find a quick [guide to markdown here](https://quarto.org/docs/authoring/markdown-basics.html), throughout the course we will see various things markdown can do in the readings and in-class materials. ::: {.callout-note collapse="true"} ## Quarto Vs. R Markdown For those familiar with R Markdown, you can find a rundown of changes [here](https://quarto.org/docs/faq/rmarkdown.html). Due to Quarto being written as an evolution of R Markdown, it also supports most R Markdown syntax. While we could technically mix and match different types of syntax in a single document, this is bad practice. **Readable code is consistent**. Even if there are multiple ways to do something, it's best to choose one way and stick with it throughout a code or document. For an example of how passionate programmers can get about consistencies in their code, check out the [wikipedia article on indentation style](https://en.wikipedia.org/wiki/Indentation_style). ::: # Data Let's dive into the dataset we will be using today. We will begin with a mostly processed dataset from NHANES as described in *Beheshti 2021*. Specifically, this dataset contains the 3346 adolescents recorded in NHANES from 2005 to 2010 with non-missing dental decay data. You can find the dataset [here](https://raw.githubusercontent.com/ccb-hms/hsdm-r-course/main/session-materials/session2/session2.qmd) in the session 2 materials folder. Download it and place it into an appropriate location in your project folder. We will use the `read_csv` function for reading in this `.csv` file. ```{r} #| label: load-data #| output: false library(tidyverse) # <1> nhanes_ado <- read_csv("week2Data.csv") # <2> ``` 1. This line is loading the `tidyverse` library, which is actually a [family of different packages](https://www.tidyverse.org/packages/). Note that this code will fail if you do not have `tidyverse` installed, which you can do with `install.packages("tidyverse")` 2. You could change the filename to `file.choose()` to manually select a file location. Let's take a look at the first 200 rows of the dataframe. Note that you will need to have the `DT` package installed `install.packages('DT')`. ```{r} #| label: preview-data #| echo: false #| message: false DT::datatable(head(nhanes_ado,200), rownames = FALSE, options = list(pageLength = 10, scrollX=T), class = 'white-space: nowrap' ) ``` # Summarizing data There are a wide variety of ways to examine our data. Let's start by using the `summary` function to get an overview of `nhanes_ado`. ```{r} summary(nhanes_ado) # <1> ``` 1. While `summary` gives us a lot of information, we can't easily extract or use each piece of information easily. Thus, we need to also be able to individually calculate the values shown. Recall from the reading that we can use `filter` to get a subset of rows and `select` to get a subset of columns when we're using the tidyverse. We can also use the `$` or `[[]]` notations to get columns by name or the `[]` notation to get rows and columns by index. ::: {.callout-tip appearance="minimal"} ## Exercise Below is an incomplete set of code blocks summarizing the `nhanes_ado` dataset. Try to fill in the missing parts of the blocks. You need to add or modify code wherever you see `TODO`. We've already looked at the first few rows of the dataset. Now let's check the **last 15 rows** using the tail function. You can type `?tail` into the R console to check how tail's arugments work. ```{r} tail(nhanes_ado) ``` We can display a table showing the min, max, and mean for age and BMI. Replace the 0's below with the r expressions to correctly fill in the table. ```{r} min_age <- min(select(nhanes_ado, age.years)) # Tidyverse way max_age <- max(nhanes_ado$age.years) # base R way mean_age <- mean(nhanes_ado$age.years) min_bmi <- min(nhanes_ado$bmi, na.rm = TRUE) # BMI has some NA's so we need na.rm max_bmi <- max(nhanes_ado$bmi, na.rm = TRUE) mean_bmi <- mean(nhanes_ado$bmi, na.rm = TRUE) ``` This is a basic markdown table. We will see more advanced ways to make tables later (and we already saw one above using the `DT` package). Here we are using *inline* code to fill put variables into the markdown text. | Variable | Minimum | Maximum | Mean | |----------|-------------|-------------|--------------| | Age | `r min_age` | `r max_age` | `r mean_age` | | BMI | `r min_bmi` | `r max_bmi` | `r mean_bmi` | ::: # Practice with Conditionals While the summarizing above is nice, we need more tools to ask more interesting questions about the data. We can use conditional statements to dive into the data a bit deeper. In tidyverse we can use dplyr's `filter` with conditional statements to see how many rows meet various criteria. We can also sum a conditional statement directly. ```{r} filter(nhanes_ado, gender=="Female") %>% nrow() # Tidyverse way sum(nhanes_ado$gender == "Female") # Base r way ``` Let's try it now. ::: {.callout-tip appearance="minimal"} ## Exercise Fill in the TODO's below with the correct expressions. ```{r} under_15 <- nrow(filter(nhanes_ado, age.years < 15)) underweight <- nrow(filter(nhanes_ado, bmi < 18.5)) overweight <- nrow(filter(nhanes_ado, bmi >= 25 & bmi <= 29.9)) decay_or_restore <- nrow(filter(nhanes_ado, dental.decay.present | dental.restoration.present)) med_glu <- median(nhanes_ado$plasma.glucose, na.rm = TRUE) white_and_medpg <- nrow(filter(nhanes_ado, ethnicity == "Non-Hispanic White" & plasma.glucose > med_glu)) ``` This is an example of a markdown list. - There are `r under_15` samples with an age under 15. - There are `r underweight` samples who are underweight (BMI below 18.5). - There are `r overweight` samples who are of a overweight (BMI between 25 and 29.9). - There are `r decay_or_restore` samples who either have dental caries experience. - There are `r white_and_medpg` samples who are Non-Hispanic White and whose plasma glucose is greater than the median.\ :::