---
title: "Lab: data visualization"
format: html
---
## Reading in data
Let's start again by reading in the data from yesterday using the `read_csv()` function after loading the `tidyverse`:
```{r}
library(tidyverse)
heart_disease <- read_csv("https://raw.githubusercontent.com/36-SURE/2024/main/data/heart_disease.csv")
```
## Previewing the data
Write code that displays the column names of `heart_disease`. Also, look at the first six rows of your dataset to get an idea of what these variables look like. Which variables are quantitative, and which are categorical?
```{r}
# INSERT CODE HERE
```
As it turns out, even though `Drugs` and `Complications` appear to be quantitative, they are actually categorical variables. Specifically, `Drugs` represents the categorized number of drugs prescribed: 0 if none, 1 if one, 2 if more than one; `Complications` indicates whether or not the subscriber had complications: 1 if yes, 0 if no. To address this issue for our plots, we can manually recode the variables so that `R` treats them as categorical, for example as **factors**. For instance, we can modify the `Complications` variable using a simple `ifelse()` statement:
```{r}
heart_disease <- heart_disease |>
mutate(Complications = ifelse(Complications == 0, "No", "Yes"))
```
This is a quick fix for the binary indicator variable: `ifelse()` returns text labels, which `ggplot2` treats as categorical, and by default `R` orders the levels of a categorical variable alphabetically. In this case, "No" comes before "Yes" because "N" comes before "Y". We may not always want alphabetical order, however; we will see how to change this in lecture.
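To see the default ordering for yourself, here is a minimal sketch on a made-up vector (`answers` is a toy example, not part of `heart_disease`):

```{r}
# Toy vector for illustration only (not from heart_disease)
answers <- c("Yes", "No", "No", "Yes")
# Converting to a factor sorts the levels alphabetically
answers_fct <- factor(answers)
levels(answers_fct)
```

The levels come back as "No" then "Yes", regardless of the order in which the values first appear.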
Next, to update the `Drugs` variable we will use the [`fct_recode()`](https://forcats.tidyverse.org/reference/fct_recode.html) function which allows us to manually change the labels of a factor variable:
```{r}
heart_disease <- heart_disease |>
mutate(Drugs = fct_recode(as.factor(Drugs),
"None" = "0", "One" = "1", "More than one" = "2"))
```
*Why did we have to specify `as.factor(Drugs)` first and then place the numbers in quotation marks?*
## Always make a bar chart...
Now we'll use the `ggplot()` function to create a **bar chart** of the `Drugs` variable. To make things easier, we provide the code for you to do this below; just uncomment the code and run it to create the bar chart. In what follows, you must answer some questions about the code and plot.
```{r}
# Create the bar chart of Drugs:
# heart_disease |>
# ggplot(aes(x = Drugs)) +
# geom_bar(fill = "darkblue") +
# labs(title = "Number of patients by number of drugs",
# x = "Number of drugs",
# y = "Number of patients")
```
Answer the following questions about the code and plot:
- In general, `ggplot()` code takes the following format: `ggplot(blank1, aes(x = blank2))`. Looking at the above code, what kind of `R` object should `blank1` be, and what should `blank2` be?
- What do you think the line `geom_bar(fill = "darkblue")` does?
- What do you think the remaining lines of code do (contained in `labs()`)?
## More area plots (but bar charts are better!)
Now we'll make a few other **area plots**:
- spine chart
- pie chart
- rose diagram
Your goal for this part is to create each of these plots. These plots can be created by copy-and-pasting the bar chart code from above and modifying it slightly. Follow these directions to create each of these plots:
- **spine chart**: First, copy-and-paste the bar chart code from above. Then, delete the `fill = "darkblue"` within `geom_bar()`. Finally, within `ggplot()`, replace `aes(x = Drugs)` with `aes(x = "", fill = Drugs)`. Also, change the labels in `labs()` if necessary.
```{r}
# PUT YOUR SPINE CHART CODE HERE
```
- **pie chart**: First, copy-and-paste the **spine chart code** you just made. Then, after `geom_bar()`, "add" `coord_polar("y")`. Be sure to put plus signs before and after `coord_polar("y")`. Also, change the labels in `labs()` if necessary.
```{r}
# PUT YOUR PIE CHART CODE HERE
```
- **rose diagram**: First, copy-and-paste your original bar chart code. Then, after `geom_bar(fill = "darkblue")`, "add" `coord_polar() + scale_y_sqrt()`. Be sure to put plus signs before and after `coord_polar() + scale_y_sqrt()`. Also, change the labels in `labs()` if necessary. After you make the rose diagram: In 1-2 sentences, what do you think `scale_y_sqrt()` does, and what is a benefit to including `scale_y_sqrt()` when making the rose diagram?
```{r}
# PUT YOUR ROSE DIAGRAM CODE HERE
```
## Notes on colors in plots
Three types of color scales to work with:
1. **Qualitative**: for distinguishing discrete items that don't have an order (nominal categorical). Colors should be distinct and carry equal visual weight, with none standing out unless you want it to for emphasis.
    - Do **NOT** use a discrete scale on a continuous variable
2. **Sequential**: when data values are mapped to shades of one color, e.g., for an ordered categorical variable or a low-to-high continuous variable
    - Do **NOT** use a sequential scale on an unordered variable
3. **Divergent**: think of it as two sequential scales with a natural midpoint. The midpoint could represent 0 (assuming +/- values) or 50% if your data spans the full scale
    - Do **NOT** use a divergent scale on data without a natural midpoint
### Options for `ggplot2` colors
The default color scheme is pretty bad, to put it bluntly, but `ggplot2` has ColorBrewer built in, which makes it easy to customize your color scales. For instance, we can make a scatterplot with `Cost` on the y-axis and `Duration` on the x-axis using the `geom_point()` layer, with each point colored by `Drugs`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Drugs)) +
geom_point(alpha = 0.5) +
labs(x = "Duration",
y = "Cost",
color = "Number of drugs") +
theme_light()
```
*What does `alpha` change?* We can change the color palette for this plot using the `scale_color_brewer()` function:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Drugs)) +
geom_point(alpha = 0.5) +
scale_color_brewer(palette = "Set2") +
labs(x = "Duration",
y = "Cost",
color = "Number of drugs") +
theme_light()
```
Which do you prefer, the default palette or this new one? You can [check out more color palettes here.](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html)
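ColorBrewer palettes fall into the same three types described in the notes above. As a rough sketch, assuming the `RColorBrewer` package is installed (it usually comes along with `ggplot2`), we can look up the type of a few palettes; the three palette names below are just one example of each type:

```{r}
# Assumes RColorBrewer is installed; palette names chosen for illustration
example_palettes <- c("Set2", "Blues", "RdBu")
# brewer.pal.info lists each palette's type: "qual", "seq", or "div"
palette_info <- RColorBrewer::brewer.pal.info[example_palettes, ]
palette_info
# Hex codes for a five-color version of the sequential "Blues" palette
blues <- RColorBrewer::brewer.pal(n = 5, name = "Blues")
blues
```

Any of these names can be passed to the `palette` argument of `scale_color_brewer()`, as long as the palette type matches the kind of variable being mapped to color.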
Something you should keep in mind is to pick a [color-blind friendly palette](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/). One simple way to do this is by using the `ggthemes` package (you need to install it first before running this code!) which has color-blind friendly palettes included:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Drugs)) +
geom_point(alpha = 0.5) +
# call the function directly from the package using `::` instead of library(ggthemes)
ggthemes::scale_color_colorblind() +
labs(x = "Duration",
y = "Cost",
color = "Number of drugs") +
theme_light()
```
In terms of displaying color from low to high, the [viridis scales](https://ggplot2.tidyverse.org/reference/scale_viridis.html) are excellent choices (and are also color-blind friendly!). For instance, we can map another quantitative variable (`Interventions`) to the color:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_light()
```
What does this reveal about the plot? What happens if you delete `scale_color_viridis_c() +` from above? Which do you prefer?
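The `_c` at the end of `scale_color_viridis_c()` stands for continuous; there is a matching `scale_color_viridis_d()` for discrete variables. Here is a small sketch of the pairing, using the built-in `mtcars` dataset as a stand-in for `heart_disease` (the object names `p_continuous` and `p_discrete` are just for illustration):

```{r}
library(ggplot2)
# mtcars is a built-in dataset used here as a stand-in for heart_disease
# Continuous variable (hp) -> continuous viridis scale
p_continuous <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  scale_color_viridis_c()
# Categorical variable (cyl converted to a factor) -> discrete viridis scale
p_discrete <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_viridis_d()
p_continuous
p_discrete
```

Swapping the two (for example, using `scale_color_viridis_d()` with `hp`) should produce an error, because a continuous value cannot be supplied to a discrete scale.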
## Notes on themes
You might have noticed that the plots above include various changes to the `theme` for customization. **You will constantly be changing the theme of your plots to optimize the display.** Fortunately, there are a number of built-in themes you can use to start with rather than the default `theme_gray()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_gray()
```
For instance, Quang's go-to theme is `theme_light()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_light()
```
There are options such as `theme_minimal()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_minimal()
```
or `theme_classic()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_classic()
```
or `theme_bw()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_bw()
```
There are also packages with popular themes, such as the `ggthemes` package which includes, for example, `theme_economist()`:
```{r}
library(ggthemes)
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_economist()
```
and `theme_fivethirtyeight()`, to name a couple:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
color = "Interventions") +
theme_fivethirtyeight()
```
With any theme you have picked, you can then modify specific components directly using the `theme()` layer. There are [many aspects of the plot's theme to modify](https://ggplot2.tidyverse.org/reference/theme.html), such as my decision to move the legend to the bottom of the figure, drop the legend title, increase the font size of the y-axis text, and shrink the font size of the x-axis text:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost, color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
title = "Joint distribution of patients' duration and cost",
color = "Interventions") +
theme_light() +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 6))
```
If you're tired of explicitly customizing every plot in the same way all the time, then you should make a custom theme. It's quite easy to make a custom theme for `ggplot2` and of course [there are an incredible number of ways to customize your theme](https://themockup.blog/posts/2020-12-26-creating-and-using-custom-ggplot2-themes/). Below, we modify `theme_bw()` using the `%+replace%` operator to create a new customized theme named `theme_cus()`, which is stored as a function:
```{r}
theme_cus <- function() {
# start from theme_bw() with a base font size of 10
theme_bw(base_size = 10) %+replace%
theme(
panel.background = element_blank(),
plot.background = element_rect(fill = "transparent", color = NA),
legend.position = "bottom",
legend.background = element_rect(fill = "transparent", color = NA),
legend.key = element_rect(fill = "transparent", color = NA),
axis.ticks = element_blank(),
panel.grid.major = element_line(color = "grey90", linewidth = 0.3),
panel.grid.minor = element_blank(),
plot.title = element_text(size = 15, hjust = 0, vjust = 0.5, face = "bold",
margin = margin(b = 0.2, unit = "cm")),
plot.subtitle = element_text(size = 12, hjust = 0, vjust = 0.5,
margin = margin(b = 0.2, unit = "cm")),
plot.caption = element_text(size = 7, hjust = 1, face = "italic",
margin = margin(t = 0.1, unit = "cm")),
axis.text.x = element_text(size = 13),
axis.text.y = element_text(size = 13)
)
}
```
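The difference between `+` and `%+replace%` is easy to miss: `+` updates only the properties you specify within a theme element, while `%+replace%` replaces the whole element, so anything you leave out is dropped. A minimal sketch of the difference (`added` and `replaced` are illustrative names, separate from `theme_cus()`):

```{r}
library(ggplot2)
# `+` merges the new settings into the existing element
added <- theme_bw() + theme(axis.text.x = element_text(size = 13))
# `%+replace%` swaps in the new element wholesale
replaced <- theme_bw() %+replace% theme(axis.text.x = element_text(size = 13))
# vjust is set by theme_bw() and is kept when using `+`
added$axis.text.x$vjust
# With `%+replace%` it is dropped, because we did not specify it
replaced$axis.text.x$vjust
```

In other words, every element listed inside `theme_cus()` is fully redefined rather than tweaked.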
Now create the plot from before with `theme_cus()`:
```{r}
heart_disease |>
ggplot(aes(x = Duration, y = Cost,
color = Interventions)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Duration",
y = "Cost",
title = "Joint distribution of patients' duration and cost",
color = "Interventions") +
theme_cus()
```
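If every plot in a document should share one theme, `theme_set()` from `ggplot2` changes the default so the theme layer does not have to be added each time. Here is a minimal sketch, using the built-in `theme_light()` and `mtcars` as stand-ins; a custom theme function such as `theme_cus()` could be passed in the same way:

```{r}
library(ggplot2)
# theme_set() returns the previous default theme, so keep it to restore later
old_theme <- theme_set(theme_light())
# No theme layer here: the plot picks up the new default when it is drawn
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5)
p
# Restore the original default
theme_set(old_theme)
```

Setting the theme once near the top of a document keeps the plotting code shorter, at the cost of making the theme choice less visible in each chunk.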