---
title: "Lab 11: Numeric Summaries"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(ggplot2)
library(dplyr)
library(smodels)
library(ggrepel)
theme_set(theme_minimal())
```
## Cereal Data
Today, we start by looking at a collection of breakfast cereals:
```{r}
cereal <- read_csv("https://statsmaths.github.io/stat_data/cereal.csv")
```
With variables:
- name: name of the specific cereal
- brand: name of the cereal's manufacturer
- sugar: amount of sugar per serving (g)
- score: healthiness score; 0-100; 100 is the best
- shelf: what shelf the cereal is typically stocked on in the store
Produce a histogram of the sugar variable.
```{r}
```
Now, compute the standard deviation of the variable `sugar`:
```{r}
```
What are the units of this measurement?
**Answer**:
Now, compute the deciles of the variable `score`:
```{r}
```
What is the value of the 30th percentile. Describe what this means in words:
**Answer**:
Produce a boxplot of score and brand.
```{r}
```
Which brand seems to have the healthiest cereals?
**Answer**:
Produce a boxplot of score and shelf.
```{r}
```
Produce a boxplot of sugar and shelf.
```{r}
```
If I want a healthy but reasonably sweet cereal which shelf would be the
best to look on?
**Answer**:
## Tea Reviews
Next, we will take another look at a dataset of tea reviews that I used in
a previous lecture:
```{r}
tea <- read_csv("https://statsmaths.github.io/stat_data/tea.csv")
```
With variables:
- name: the full name of the tea
- type: the type of tea. One of:
- black
- chai
- decaf
- flavors
- green
- herbal
- masters
- matcha
- oolong
- pu_erh
- rooibos
- white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews
Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a
regression line (recall: `geom_bestfit()`).
```{r}
```
Does the score tend to increase, decrease, or remain the same as the number
of reviews increases?
**Answer**:
Calculate the ventiles of the variable price.
```{r}
```
What is the 80th percentile? Describe it in words, include the units of the
problem in your answer.
**Answer**:
Plot the number of reviews (x-axis) against the score variable. Color
the points according to price binned into 5 buckets.
```{r}
```
What tends to be true about the number of reviews for the most expensive
20% of teas?
**Answer**:
Create a dataset named `white` that consists of only white teas.
```{r}
```
Calculate the standard deviation of the price for white teas and the
standard deviation of the price for all of the teas.
```{r}
```
Is the variation of the white tea prices smaller, larger, or about the same
as the entire dataset?
**Answer**:
### Group summarize
Now, make use of the group summarize command. Summarize the dataset by the
type of tea and save the results as a variable named `tea_type`.
```{r}
```
Plot the average price (x-axis) against the average score (y-axis) of
each type of tea. Make the size of the points proportional to the number
of teas in each category and label the points with `geom_text_repel` (it is
used like `geom_text` but will produce a better looking output here) using the
tea type as the label.
```{r}
```
Describe an interesting pattern or set of outliers that you found in the
previous plot. This does not need to take more than 1-2 sentences.
**Answer**: