---
title: "Lecture 10: GLM/logistic models and a review"
author: "Dr Nicolaas Puts and Dr Lucas França"
date: "12 December 2022"
output:
  html_document:
    df_print: paged
  pdf_document: default
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Background:

The data set consists of a sample of n=1000 sixteen years old children. Several continuous and categorical variables have been recorded:
- weight – weight in kg at age 16
- sex – gender of child (0 = male, 1 = female)
- height – height in cm at age 16
- class – registrar general's classification of social class (7 categories)
- malcat – Malaise score (0=no, 1=yes)
- reading – Reading score
- scd – CD rating
- sha – Hyperactivity (HA) rating


## Loading Packages

```{r}
#install.packages("gmodels")
library(foreign)
library(tidyverse)
library(car) # Remember to install the package 'car' if you do not have it yet!
library(gmodels) # Also install the package 'gmodels' if you do not have it!
```


## Loading data

```{r}
url <- 'https://github.com/fans-stats/intro_to_R_stats/blob/main/datasets/child_dev_a.sav?raw=true'
dataset <- foreign::read.spss(file = url, to.data.frame = TRUE) %>% 
  tibble::as_tibble()
dataset
```

# Task 1:

We want to compare the effects of a child’s gender on the risk of depression at the age of 22 measured as the dichotomised malaise score where 0 = No malaise and 1 = Yes malaise. Calculate your answer to 3 decimal places

```{r}
# First we look at the crosstables using the CrossTable() function from package 'gmodels'.
gmodels::CrossTable(dataset$malcat, dataset$sex, 
                    format = c('SPSS'), 
                    prop.r = FALSE, 
                    prop.t = FALSE, 
                    prop.chisq = FALSE)
```

Risk of having malaise in males = 26 ÷ 491 = 0·053
Risk of having malaise in females = 72 ÷ 509 = 0·141
We can compare the risk for each of the groups using the risk ratio.
(Risk when male) ÷ (Risk when female) = 0·053 ÷ 0·14 = 0.376
Those who were male had 0.37 times the risk of having malaise at 22 compared to those who were female.

We can also present risk as a percentage using the following formula % decrease = (1 - RR) x 100 Males had a 63% reduction in risk of malaise at 22 compared to females.

You could have also calculated this as
(Risk when female) ÷ (Risk when male) = 0.14 ÷ 0.053 = 2.642
Those who were female had 2.64 times the risk of having malaise at 22 compared to those who were male.

We can also present risk as a percentage using the following formula % increase = (RR - 1) x 100 Those who were female had a 164% increase in risk of having malaise at 22 compared to males.

# Task 2:

We want to compare the effects of a child’s gender on the odds of depression measured as the dichotomous malaise score where 0 = No malaise and 1 = Yes malaise. Calculate your answer to 3 decimal places (3dp)

Odds of having malaise in males = 26 ÷ 465 = 0·056
Odds of having malaise in females = 72 ÷ 437 = 0·165
We can compare the odds for each of the groups using the odds ratio.
(Odds when male) ÷ (Odds when female) = 0·056 ÷ 0·165 = 0.339
So the odds of having malaise at 22 when the child is male is about a 1/3 of the odds of females having malaise at 22.
You could have also calculated this as
(Odds when female) ÷ (Odds when male) = 0.165 ÷ 0.056 = 2.946
So the odds of having malaise at 22 when the child is female is about 3 times of the odds of a male child.

# Task 3:

Is there an association between a child’s gender and them suffering malaise at age 22?

```{r}
#install.packages('fmsb')
library(fmsb)

dataset %>% 
  glm(formula = malcat ~ sex, family = binomial(link = 'logit')) %>% 
  summary()

dataset %>% 
  glm(formula = malcat ~ sex, family = binomial(link = 'logit')) %>% 
  fmsb::NagelkerkeR2()
```

H0: there is no association between gender and malaise at age 22
H1: there is an association between gender and malaise at age 22

There was a significant improvement to the constant only model. Nagelkerke $R^2$ = 4.8% of the variation in malaise can be explained by the model including gender.


# Task 4

```{r}
dataset %>% 
  glm(formula = malcat ~ sex + reading + scd + sha + class, family = binomial(link = 'logit')) %>% 
  summary()

dataset %>% 
  glm(formula = malcat ~ sex + reading + scd + sha + class, family = binomial(link = 'logit')) %>% 
  fmsb::NagelkerkeR2()
```
There was a significant improvement to the constant only model. Nagelkerke R^2 = 10.5% of the variation in malaise can be explained by the model including gender, reading score, scd, sha and class.


# A review (with some data from the Office of National Statistics and NHS)


```{r}
url <- 'https://raw.githubusercontent.com/fans-stats/intro_to_R_stats_2022/main/lecture_10/covid_deaths_la.csv'
covid <- readr::read_csv(url) %>% 
  dplyr::rename(local_authority = `Area of usual residence name`,
                local_authority_code = `Area of usual residence code`)

covid
```

```{r}
url <- 'https://raw.githubusercontent.com/fans-stats/intro_to_R_stats_2022/main/lecture_10/deprivation_la.csv'

deprivation <- readr::read_csv(url) %>% 
  dplyr::rename(local_authority = `Local Authority District name (2019)`,
                local_authority_code = `Local Authority District code (2019)`, 
                income_score = `Income deprivation- Average score`, 
                income_rank = `Income deprivation - Rank of average score`) 

deprivation
  
```

```{r}
library(readxl)

url <- 'https://github.com/fans-stats/intro_to_R_stats_2022/blob/main/lecture_10/pop_la.xls?raw=true'

destfile <- './pop_la.xls'
download.file(url, destfile)

population <- readxl::read_xls(path = destfile) %>% 
  tibble::as_tibble() %>% 
  dplyr::rename(local_authority = `Area name`,
                local_authority_code = `Area code [note 2]`, 
                total_population = `All persons`)

population
```


```{r}
data <- inner_join(covid, deprivation) %>% 
  inner_join(population)
```

```{r}
data %>% 
  dplyr::filter(`Cause of death` == 'COVID-19', 
                Sex == c('Males', 'Females')) %>% 
  dplyr::mutate(deaths = March + April + May + June + July, 
                death_rate = deaths/total_population) %>% 
  
  ggplot2::ggplot(aes(x = income_score, y = death_rate)) + 
  geom_point(aes(colour = Sex)) + 
  theme_bw()
```

```{r}
data %>% 
  dplyr::filter(`Cause of death` == 'COVID-19', 
                Sex == c('Males', 'Females')) %>% 
  dplyr::mutate(deaths = March + April + May + June + July, 
                death_rate = deaths/total_population) %>% 
  
  lm(formula = death_rate ~ income_score + Sex) %>% 
  summary()
```

```{r}
data %>% 
  dplyr::filter(`Cause of death` == 'COVID-19', 
                Sex == "Persons") %>% 
  dplyr::mutate(deaths = March + April + May + June + July, 
                death_rate = deaths/total_population*100000) %>% 
  dplyr::group_by(death_rate) %>% 
  head(20) %>% 
  
  ggplot2::ggplot(aes(x = reorder(local_authority, death_rate), y = death_rate)) + 
    geom_col(aes(fill = income_score)) + 
    coord_flip() + 
    theme_bw() 
```

```{r}
library(sf)
url <- 'https://raw.githubusercontent.com/fans-stats/intro_to_R_stats_2022/main/lecture_10/Local_Authority_Districts_(April_2019)_UK_BUC.geojson'

lad <- st_read(url)

covid_data <- data %>% 
  dplyr::filter(`Cause of death` == 'COVID-19', 
                Sex == "Persons") %>% 
  dplyr::mutate(deaths = March + April + May + June + July, 
                death_rate = deaths/total_population*100000) %>% 
  dplyr::rename(LAD19NM = local_authority, 
                LAD19CD = local_authority_code)

lad_sf <- lad %>% 
  dplyr::left_join(covid_data) %>% 
  sf::st_as_sf(coords = c(LONG, LAT), crs = 4326)
```

```{r}
lad_sf %>% 
  ggplot2::ggplot() + 
  geom_sf(aes(fill = death_rate)) + 
  scale_fill_viridis_c(trans = 'log10') + 
  theme_minimal()
```

```{r}
url <- 'https://raw.githubusercontent.com/fans-stats/intro_to_R_stats_2022/main/lecture_10/trust_data.csv'

trusts <- readr::read_csv(file = url)
trusts
```

```{r}
lad_sf %>% 
  ggplot2::ggplot() + 
  geom_sf(aes(fill = death_rate)) + 
  geom_point(data = trusts, aes(x = Longitude, y = Latitude)) + 
  scale_fill_viridis_c(trans = 'log10') + 
  theme_minimal()
```