---
title: "Regression with categorical variables"
editor: 
  markdown: 
    wrap: 72
---

## Packages for this section

```{r with-categ-R-1}
library(tidyverse)
library(broom)
```

## The pigs revisited

```{r with-categ-R-2, echo=FALSE}
options(dplyr.summarise.inform = FALSE)
```

-   Recall pig feed data, after we tidied it:

```{r with-categ-R-3, message=F}
my_url <- "http://ritsokiguess.site/datafiles/pigs2.txt"
pigs <- read_delim(my_url, " ")
pigs 
```

## Summaries

```{r with-categ-R-4}
pigs %>%
  group_by(feed) %>%
  summarize(n = n(), mean_wt = mean(weight), 
            sd_wt = sd(weight))
```

## Running through `aov` and `lm`

-   What happens if we run this through `lm` rather than `aov`?
-   Recall `aov` first:

```{r with-categ-R-5}
pigs.1 <- aov(weight ~ feed, data = pigs)
summary(pigs.1)
```

## and now `lm`

\footnotesize

```{r with-categ-R-6}
pigs.2 <- lm(weight ~ feed, data = pigs)
summary(pigs.2)
tidy(pigs.2)
glance(pigs.2)
```

\normalsize

## Understanding those slopes {.scrollable}

-   Get one slope for each category of categorical variable feed, except
    for first.
-   feed1 treated as "baseline", others measured relative to that.
-   Thus prediction for feed 1 is intercept, 60.62 (mean weight for feed
    1).
-   Prediction for feed 2 is 60.62 + 8.68 = 69.30 (mean weight for feed
    2).
-   Or, mean weight for feed 2 is 8.68 bigger than for feed 1.
-   Mean weight for feed 3 is 33.48 bigger than for feed 1.
-   Slopes can be negative, if mean for a feed had been smaller than for
    feed 1.

## Reproducing the ANOVA

-   Pass the fitted model object into `anova`:

\footnotesize

```{r with-categ-R-7}
anova(pigs.2)
```

\normalsize

-   Same as before.
-   But no Tukey this way:

\footnotesize

```{r with-categ-R-8, error=TRUE}
TukeyHSD(pigs.2)
```

\normalsize

## The crickets

-   Male crickets rub their wings together to produce a chirping sound.
-   Rate of chirping, called "pulse rate", depends on species and
    possibly on temperature.
-   Sample of crickets of two species' pulse rates measured; temperature
    also recorded.
-   Does pulse rate differ for species, especially when temperature
    accounted for?

## The crickets data

Read the data:

```{r with-categ-R-9, message=F}
my_url <- "http://ritsokiguess.site/datafiles/crickets2.csv"
crickets <- read_csv(my_url)
crickets %>% slice_sample(n = 10)
```

## Fit model with `lm`

```{r with-categ-R-10}
crickets.1 <- lm(pulse_rate ~ temperature + species, 
                 data = crickets)
```

Can I remove anything? No:

```{r with-categ-R-11}
drop1(crickets.1, test = "F") 
```

`drop1` is right thing to use in a regression with categorical
(explanatory) variables in it: "can I remove this categorical variable
*as a whole*?"

## The summary

```{r with-categ-R-12}
summary(crickets.1)
```

## Conclusions

-   Slope for temperature says that increasing temperature by 1 degree
    increases pulse rate by 3.6 (same for both species)
-   Slope for `speciesniveus` says that pulse rate for `niveus` about 10
    lower than that for `exclamationis` at same temperature (latter
    species is baseline).
-   R-squared of almost 0.99 is very high, so that the prediction of
    pulse rate from species and temperature is very good.

## To end with a graph

-   Two quantitative variables and one categorical: scatterplot with
    categories distinguished by colour.
-   This graph seems to need a title, which I define first.

```{r with-categ-R-13}
t1 <- "Pulse rate against temperature for two species of crickets"
t2 <- "Temperature in degrees Celsius"
ggplot(crickets, aes(x = temperature, y = pulse_rate,
  colour = species)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) +
  ggtitle(t1, t2) -> g
```

## The graph

```{r}
g
```