---
title: "8- Hypothesis testing with qualitative variables"
author: "Miriam Mota and Santi Pérez"
date: "Statistics and Bioinformatics Unit. Vall d'Hebron Institut de Recerca"
output:
  beamer_presentation:
    theme: "Copenhagen"
    colortheme: "dolphin"
    fonttheme: "structurebold"
    slide_level: 2
footer: "Statistical Analysis with R"
editor_options: 
  chunk_output_type: console
---

```{r setLicense, child = 'license.Rmd'}

```

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning=FALSE)
```

```{r echo=FALSE}
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before) 
    return(options$size)
})
```

# Introduction. Categorical Variables

- Categorical variables represent facts that can be better described with _labels_ than with numbers. 

    - Example: `Sex`, better choose from {`Male` , `Female`} than from: {1,2}.

- Sometimes ordering of labels makes sense, although _it is not reasonable to assign numbers to categories_:

    - Example: `Tumor stage`: {1,2,3,4}, but $1+2\neq 3$!!!

  - `Sex` is an example of a categorical variable in `nominal` scale
  
  - `Stage` is an example of a categorical variable in `ordinal` scale

## Representing categorical variables in R

- Categorical variables are well represented with _factors_

```{r}
sex <- factor(c("Female", "Male"))
blood_group <- factor(c("A", "B", "AB", "O"))

tabaco <- factor(c(" Si", "No"))
levels(tabaco)
```

<!-- - Besides, factors can be forced to be "ordereded" -->

```{r, echo=FALSE}
# tumorstage <- factor(1:4, ordered=TRUE)
```

- Be careful with the names of factors, by default, _levels_ assigned in alphabetical order.

```{r,size="tiny"}
levels(blood_group)
```

- To verify class of a variable

```{r}
class(blood_group)
```


## Creating factors

- Factors can be created ... 

  - automatically, when reading a file  or 
  
    - Not all functions for reading data from file will create a factor!!!
    - Usually levels will be defined from alphabetic order
  
  - using the `factor` or the `as.factor` commands.
    
    - more flexible
    
## Create factors automatically

- This is achieved by

  - Using the `read.table` or `read.delim` functions for reading
  
    - Setting the "character variables as.factors" to TRUE

- Example

  - Load the `diabetes` dataset using the `Import Dataset` feature of Rstudio
    - From text (base)      (use the file `osteoporosis.csv`)
    - From text (readr)     (use the file `osteoporosis.csv`)
    
    <!-- - From Excel            (use the file `diabetes.xls`) -->
    
  - What is the class of the variable `menop`
  

## Create factors automatically
  
```{r, size=5, results='hide'}

osteo1 <- read.csv("datasets/osteoporosis.csv",sep = "\t", stringsAsFactors = TRUE)
class(osteo1$menop)
summary(osteo1$menop)
str(osteo1)

library(pacman)
p_load(readr)
osteo2 <- read_delim("datasets/osteoporosis.csv", "\t", escape_double = FALSE)
class(osteo2$menop)
osteo2$menop <- as.factor(osteo2$menop)
class(osteo2$menop)
summary(osteo2$menop)
str(osteo2)
```

## Exercise 1

- Select `diabetes.xls` datasets 
    
 - Read the dataset into R and check that the categorical variables you are interested (mort, tabac, ecg) in are converted into factors.
 
 - Confirm the conversion by summarizing the variables

 
<!-- # Exercise 2 -->

<!-- - Use the diabetes.sav file and import it into R with the "Import from SPSS" feature. --> 

<!--   - What is the class of the "MORT" variable. -->

<!--   - Turn it into one factor so that it has the same levels as when you read it using `read.csv` -->
  

## The analysis of categorical variables

- The analysis of categorical data proceeds as usual:

 - Start exploring the data with the   tables and graphics
 - Proceed to estimation and/or testing if _appropriate_

  - Estimation
    - Proportions: Point estimates, confidence intervals
    
  - Testing
    - One variable (tests with proportions)
    - With two variables (chi-square and related)
  
  
## Types of test with categorical variables

- One variable (tests with proportions)
  - Does the proportion (% affected) match a given value?
  - Is the proportion (% affected) the same in two populations?

- With two variables (chi-square and related)

  - Is there an association between two categorical variables?
  - Is there a relationship between the values of a categorical variable before and after treatment?
  
# Types of test with categorical variables

![](images/arbrequali.PNG)


## Example

Consider the following study relating smoking and cancer.

```{r, echo=FALSE, out.width="80%", fig.cap=""}
knitr::include_graphics("images/cancerAndSmoking1png.png")
```

Our goal here would be to determine if there is an association between smoking and cancer.

## Crosstabulating a dataset

\scriptsize

- Data may come from a table (aggregated) or disagregated in a data file.

- In this case we need to build the table applying "cross-tabulation"

```{r}
dadescancer <- read.csv("datasets/dadescancer.csv", stringsAsFactors = TRUE, sep = ",")
head(dadescancer)
```

```{r}
#attach(dadescancer)
mytable <- table(dadescancer$cancer, dadescancer$fumar)
mytable
```

## Crosstabulation (2): Marginal tables

Marginal values are important to understand the structure of the data:


```{r}
mytable_margin <- addmargins(mytable)
mytable_margin
```


## Crosstabulation (3): In percentages
\scriptsize

Showing tables as percentages is useful for comparisons

```{r}
mytable_prop <- prop.table(mytable)
mytable_prop

```

## Crosstabulation (3): In percentages
\scriptsize

Showing tables as percentages is useful for comparisons

```{r}
gmodels::CrossTable(dadescancer$cancer, dadescancer$fumar, prop.chisq = F)
```

## Plot 
\scriptsize

Showing plot as percentages is useful for comparisons

```{r}
p_load(dplyr, ggplot2)

# Calcular frecuencias y porcentajes
datos_porcentaje_total <- dadescancer  %>% 
  group_by(fumar, cancer) %>%
  summarise(n = n()) %>%
  group_by(fumar) %>%
  mutate(porcentaje = n / sum(n) * 100)


ggplot(datos_porcentaje_total, aes(x = fumar, y = porcentaje, fill = cancer)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Proporción de fumadores por condición de cáncer",
    x = "Fuma",
    y = "Porcentaje",
    fill = "Cancer"
  )
```

## Plot 
\scriptsize

Showing plot as percentages is useful for comparisons

```{r, echo = FALSE, fig.width=6, fig.height=3 }
p_load(dplyr, ggplot2)

# Calcular frecuencias y porcentajes
datos_porcentaje_total <- dadescancer  %>% 
  group_by(fumar, cancer) %>%
  summarise(n = n()) %>%
  group_by(fumar) %>%
  mutate(porcentaje = n / sum(n) * 100)


ggplot(datos_porcentaje_total, aes(x = fumar, y = porcentaje, fill = cancer)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Proporción de fumadores por condición de cáncer",
    x = "Fuma",
    y = "Porcentaje",
    fill = "Cancer"
  )
```

## Exercise 2

- With the diabetes dataset repeat the crosstabulation done above using 
  - Two categorical variables
  - Variable "mort" and a newly created variable "bmi30" created by properly categorizing variable bmi.
  

# One variable: Proportion tests

- According to medical literature, in the period 1950-1980, the proportion of obese individuals (defined as: BMI $\geq 30$) was 15% in the population of men over  55 years old.

- A random sample obtained from the same population between 2000 and 2003 showed that, over a total of 723 men older than 55, 142 were obese.

- With a significance level of 5%, can we say that the population of men older than 55 in 2000-2003 had the same proportion of obese cases than that population had in 50'-80'?

## Proportion tests with R
\scriptsize


Alternative "NOT EQUAL". This is set by default.

```{r}
prop.test(x = 142, n = 723, p = 0.15)
prop.test(x = 142, n = 723)
```

---

Alternative "GREATER"

```{r}
prop.test(x = 142, n = 723, p = 0.15, alternative = "g")
```

---

Alternative "LESS THAN"

```{r}
prop.test(x=142, n=723, p=0.15, alternative="l")
```


Notice that _choosing the wrong alternative may yield unreasonable conclusions_.


## Estimation comes with proportion test

- `prop.test` does __three__ distinct calculations

  - A test for the hypothesis $H_0: p=p_0$ is performed
  - A confidence interval for $p$ is built based on the sample
  - A point estimate for $p$ is also provided.

```{r, echo=FALSE, out.width="100%", fig.cap=""}
knitr::include_graphics("images/propTestsAndCI.png")
```

## Exercise 3

- In the diabetes dataset.
  
  - Test the hypothesis that the proportion of patients with `bmi30` is higher than 40%
      - In the global population of the study
      - Only in patients with 'mort' equal "Muerto"
      
      
  <!-- - Select a sample of size 100 and repeat the test. How do the results change? -->

  <!-- - What sample size should we have taken so that th precision of the confidence intervals would have been at most 3% with a probability of 95%? -->
    
    
## Contingency tables

- A contingency table (a.k.a cross tabulation or cross tab) is a matrix-like table that displays the (multivariate) frequency distribution of the variables.

- It is bidimensional, and classifies all observations according to two categorical variables 
(A and B, rows and columns).

```{r, echo=FALSE, out.width="80%", fig.cap=""}
knitr::include_graphics("images/contingencytables1.png")
```

# Two variables. Chi-squared test

- A _familiy_ of tests receiving its name because they all rely on the _Chi-Squared distribution_ to compute the test probabilities.

**Chi squared independence test**

- When the sample comes from a single population with 2 categorical variables, the aim is to determine if there is relationship between them.

**Chi squared homogeneity test** 

- When each row is a sample from distinct populations (groups, subgroups...), the aim is to determine if both groups have significative differences in that variable

## Chi-squared tests

- When we have:

  - quantitative data, 
  - two or more categories, 
  - independent observations, 
  - adequate sample size (>10)

- and our questions are like...

   - _Do the number of individuals or objects that fall in each pair of categories differ significantly from the number you would expect if there was no association?_

   - _Is this difference between the expected and observed due to chance ("sampling variation"), or is it a real difference?_

## Chi squared.test: Observed vs expected

```{r, echo=FALSE, out.width="80%", fig.cap=""}
knitr::include_graphics("images/observedvsexpected.png")
```

## Chi squared tests: Observed vs expected with R

```{r, results='hide'}
require(gmodels)
mytable <- table(dadescancer$cancer, dadescancer$fumar)
CrossTable(mytable,expected = T,prop.chisq = F,prop.c = F,prop.r = F)
```
![](images/chi.PNG)

## Chi squared tests with R

```{r}
chisq.test (mytable)
```


## Fisher test. an assumptions-free alternative

Chi-squared test require that sample sizes are "big" and expected frequencies are, at least greater than 5.

Fisher test can be an alternative if these assumptions are not met, especially for two times two tables.

```{r}
fisher.test(mytable)
```


## Exercise 4

- Use the diabetes dataset to study if it can be detected an association between the variables `mort` and `tabac` in the diabetes dataset.

- Do not start with a test but with an appropriate summarization and visualization!

## Mcnemar test

Mcnemar test is used to compare the frequencies of paired samples of dichotomous data

- Ho: There is no significant change in individuals after the treatment

- H1: There is a significant change in individuals after the treatment


## Mcnemar test. Example 

![](images/mcnemar.PNG)


```{r}
.Table <- matrix(c(69,28,5,63), 2, 2, byrow=TRUE)
mcnemar.test(.Table)
```


## Cochran Test

Un investigador quiere evaluar si hay diferencia en la proporción de éxitos en 3 tratamientos diferentes aplicados a los mismos 10 pacientes 

- 1 = éxito, 

- 0 = falla
```{r}

p_load(DescTools)

datos <- data.frame(
  Tratamiento_A = c(1, 1, 0, 1, 0, 1, 0, 1, 1, 0),
  Tratamiento_B = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0),
  Tratamiento_C = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0)
)


```
## Cochran Test

```{r}


# Test de Cochran Q
CochranQTest(as.matrix(datos))
```


# Some about diagnosis

[diagnosis](https://github.com/uebvhir/Course_Basic_Statistics-VHIO-2019/blob/master/Sessio_9/EBB_2019_Session_9_Diagnostic.pdf)

![](images/diagnosis.png)

## Diagnostic test 

Most importan result of medical practice

- Classification of individuals in healthy or sick

- Need of reference method or “TRUE”

- Positive results of test in patients and negative in healthy


## Diagnostic test 

![](images/tab_diag.png)

## Sensitivity

![](images/sensitivity.png)

## Specificty

![](images/specificity.png)

## Diagnostic Measures

![](images/diag_measures.png)

## Positive Predictive Value

![](images/ppv.png)
## Negative Predictive Value

![](images/npv.png)

## Diagnostic measures in R 

\scriptsize

```{r}
pacman::p_load(epiR)
table1 <- as.table(matrix(c(634,269,487,1251),nrow=2, byrow=TRUE))
epi.tests(table1)
```