--- title: "8- Hypothesis testing with qualitative variables" author: "Alex Sanchez, Miriam Mota, Ricardo Gonzalo and \n Santiago Perez-Hoyos" date: "Statistics and Bioinformatics Unit. \n Vall d'Hebron Institut de Recerca" output: beamer_presentation: theme: "Copenhagen" colortheme: "dolphin" fonttheme: "structurebold" slide_level: 2 footer: "Statistical Analysis with R" editor_options: chunk_output_type: console --- ```{r setLicense, child = 'license.Rmd'} ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(message = FALSE, warning=FALSE) ``` ```{r echo=FALSE} knitr::knit_hooks$set(mysize = function(before, options, envir) { if (before) return(options$size) }) ``` # Categorical Variables - Categorical variables represent facts that can be better described with _labels_ than with numbers. - Example: `Sex`, better choose from {`Male` , `Female`} than from: {1,2}. - Sometimes ordering of labels makes sense, although _it is not reasonable to assign numbers to categories_: - Example: `Tumor stage`: {1,2,3,4}, but $1+2\neq 3$!!! - `Sex` is an example of a categorical variable in `nominal` scale - `Stage` is an example of a categorical variable in `ordinal` scale # Representing categorical variables in R - Categorical variables are well represented with _factors_ ```{r} sex <- factor(c("Female", "Male")) blood_group <- factor(c("A", "B", "AB", "O")) ``` ```{r, echo=FALSE} # tumorstage <- factor(1:4, ordered=TRUE) ``` - Be careful with the names of factors, by default, _levels_ assigned in alphabetical order. ```{r,size="tiny"} levels(blood_group) ``` - To verify class of a variable ```{r} class(sex) ``` # Creating factors - Factors can be created ... - automatically, when reading a file or - Not all functions for reading data from file will create a factor!!! - Usually levels will be defined from alphabetic order - using the `factor` or the `as.factor` commands. - more flexible # Create factors automatically - This is achieved by - Using the `read.table` or `read.delim` functions for reading - Setting the "character variables as.factors" to TRUE - Example - Load the `diabetes` dataset using the `Import Dataset` feature of Rstudio - From text (base) (use the file `osteoporosis.csv`) - From text (readr) (use the file `osteoporosis.csv`) - What is the class of the variable `menop` # Create factors automatically ```{r, size=5, results='hide'} osteo1 <- read.csv("datasets/osteoporosis.csv",sep = "\t", stringsAsFactors=TRUE) class(osteo1$menop) summary(osteo1$menop) str(osteo1) library(readr) osteo2 <- read_delim("datasets/osteoporosis.csv", "\t", escape_double = FALSE) class(osteo2$menop) osteo2$menop <- as.factor(osteo2$menop) class(osteo2$menop) summary(osteo2$menop) str(osteo2) ``` # Exercise 1 - Select `diabetes.xls` datasets - Read the dataset into R and check that the categorical variables you are interested (mort, tabac, ecg) in are converted into factors. - Confirm the conversion by summarizing the variables # The analysis of categorical variables - The analysis of categorical data proceeds as usual: - Start exploring the data with the tables and graphics - Proceed to estimation and/or testing if _appropriate_ - Estimation - Proportions: Point estimates, confidence intervals - Testing - One variable (tests with proportions) - With two variables (chi-square and related) # Types of test with categorical variables - One variable (tests with proportions) - Does the proportion (% affected) match a given value? - Is the proportion (% affected) the same in two populations? - With two variables (chi-square and related) - Is there an association between two categorical variables? - Is there a relationship between the values of a categorical variable before and after treatment? # Types of test with categorical variables ![](images/arbrequali.PNG) # Example Consider the following study relating smoking and cancer. ```{r, echo=FALSE, out.width="80%", fig.cap=""} knitr::include_graphics("images/cancerAndSmoking1png.png") ``` Our goal here would be to determine if there is an association between smoking and cancer. # Crosstabulating a dataset - Data may come from a table (aggregated) or disagregated in a data file. - In this case we need to build the table applying "cross-tabulation" ```{r} dadescancer <- read.csv("datasets/dadescancer.csv", stringsAsFactors = TRUE) ``` ```{r} #attach(dadescancer) mytable <-table(dadescancer$cancer, dadescancer$fumar) mytable ``` # Crosstabulation (2): Marginal tables Marginal values are important to understand the structure of the data: ```{r} mytable<- addmargins(mytable) mytable ``` # Crosstabulation (3): In percentages Showing tables as percentages is useful for comparisons ```{r} prop.table(mytable) # cell percentages prop.table(mytable, 1) # row percentages # prop.table(mytable, 2) # column percentages ``` # Exercise 2 - With the diabetes dataset repeat the crosstabulation done above using - Two categorical variables - Variable "mort" and a newly created variable "bmi30" created by properly categorizing variable bmi. # One variable: Proportion tests - According to medical literature, in the period 1950-1980, the proportion of obese individuals (defined as: BMI $\geq 30$) was 15% in the population of men over 55 years old. - A random sample obtained from the same population between 2000 and 2003 showed that, over a total of 723 men older than 55, 142 were obese. - With a significance level of 5%, can we say that the population of men older than 55 in 2000-2003 had the same proportion of obese cases than that population had in 50'-80'? # Proportion tests with R Alternative "NOT EQUAL". This is set by default. ```{r} prop.test(x=142, n=723, p=0.15) ``` --- Alternative "GREATER" ```{r} prop.test(x=142, n=723, p=0.15, alternative="g") ``` --- Alternative "LESS THAN" ```{r} prop.test(x=142, n=723, p=0.15, alternative="l") ``` Notice that _choosing the wrong alternative may yield unreasonable conclusions_. # Estimation comes with proportion test - `prop.test` does __three__ distinct calculations - A test for the hypothesis $H_0: p=p_0$ is performed - A confidence interval for $p$ is built based on the sample - A point estimate for $p$ is also provided. ```{r, echo=FALSE, out.width="100%", fig.cap=""} knitr::include_graphics("images/propTestsAndCI.png") ``` # Exercise 3 - In the diabetes dataset. - Test the hypothesis that the proportion of patients with `bmi30` is higher than 40% - In the global population of the study - Only in patients with 'mort' equal "Muerto" # Contingency tables - A contingency table (a.k.a cross tabulation or cross tab) is a matrix-like table that displays the (multivariate) frequency distribution of the variables. - It is bidimensional, and classifies all observations according to two categorical variables (A and B, rows and columns). ```{r, echo=FALSE, out.width="80%", fig.cap=""} knitr::include_graphics("images/contingencytables1.png") ``` # Chi-squared test - A _familiy_ of tests receiving its name because they all rely on the _Chi-Squared distribution_ to compute the test probabilities. **Chi squared independence test** - When the sample comes from a single population with 2 categorical variables, the aim is to determine if there is relationship between them. **Chi squared homogeneity test** - When each row is a sample from distinct populations (groups, subgroups...), the aim is to determine if both groups have significative differences in that variable # Chi-squared tests - When we have: - quantitative data, - two or more categories, - independent observations, - adequate sample size (>10) - and our questions are like... - _Do the number of individuals or objects that fall in each pair of categories differ significantly from the number you would expect if there was no association?_ - _Is this difference between the expected and observed due to chance ("sampling variation"), or is it a real difference?_ # Chi squared.test: Observed vs expected ```{r, echo=FALSE, out.width="80%", fig.cap=""} knitr::include_graphics("images/observedvsexpected.png") ``` # Chi squared tests: Observed vs expected with R ```{r, results='hide'} require(gmodels) mytable <- table(dadescancer$cancer, dadescancer$fumar) CrossTable(mytable,expected = T,prop.chisq = F,prop.c = F,prop.r = F) ``` ![](images/chi.PNG) # Chi squared tests with R ```{r} chisq.test (mytable) ``` # Fisher test. an assumptions-free alternative Chi-squared test require that sample sizes are "big" and expected frequencies are, at least greater than 5. Fisher test can be an alternative if these assumptions are not met, especially for two times two tables. ```{r} fisher.test(mytable) ``` # Exercise 4 - Use the diabetes dataset to study if it can be detected an association between the variables `mort` and `tabac` in the diabetes dataset. - Do not start with a test but with an appropriate summarization and visualization! # Mcnemar test Mcnemar test is used to compare the frequencies of paired samples of dichotomous data - Ho: There is no significant change in individuals after the treatment - H1: There is a significant change in individuals after the treatment # Mcnemar test. Example ![](images/mcnemar.PNG) ```{r} .Table <- matrix(c(69,28,5,63), 2, 2, byrow=TRUE) mcnemar.test(.Table) ```