---
title: "Practical Statistics 3: Powering up! Descriptive and Inferential Statistics"
format: html
editor: visual
---

### COVID-19 Vaccinations and Death in Malaysia

```{r, include=FALSE}
# load packages (install any that are missing first)
required_packages <- c("tidyverse", "lubridate", "gtsummary", "rstatix", "janitor", "corrr")
not_installed <- required_packages[!(required_packages %in% installed.packages()[ , "Package"])]
if(length(not_installed)) install.packages(not_installed)
suppressWarnings(lapply(required_packages, require, character.only = TRUE))

# load data
c19_df <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv")
```

### **Task 1: Descriptive statistics using `tidyverse`**

Question: Compute the summary statistics (count, mean, standard deviation, minimum, and maximum) of age using tidyverse functions.

Steps:

1. Install and load the `tidyverse` package.
2. Filter the dataset to remove missing values in the "age" column (Note: there are no missing values in the dataset; this step simply simulates the code that would be required if there were).
3. Use the summary functions from `dplyr` to compute the required summary statistics: count, mean, standard deviation, minimum, and maximum.

Solution:

```{r}
# Step 1
#install.packages("tidyverse")
library(tidyverse)

# Step 2 & 3
summary_age <- c19_df %>%
  filter(!is.na(age)) %>%   # drop rows with missing age
  summarise(
    count = n(),
    mean = mean(age),
    sd = sd(age),
    min = min(age),
    max = max(age)
  )

summary_age
```

### **Task 2: Descriptive statistics using `gtsummary`**

Question: Create a descriptive statistics table for the age, male, bid, and malaysian variables using gtsummary.

Steps:

1. Install and load the `gtsummary` package.
2. Create a subset of the data with the selected variables (Note: you may pick other variables; the solution uses the four listed above).
3. Use the `tbl_summary()` function to compute and display the descriptive statistics.
4. Stratify by any other selected variable.

Solution:

```{r}
# Step 1
#install.packages("gtsummary")
library(gtsummary)

# Step 2, 3 & 4
summary_table <- c19_df %>%
  select(age, male, bid, malaysian) %>%   # keep variables of interest
  tbl_summary(by = malaysian)             # summarise, stratified by malaysian

summary_table                             # display the table
```

### **Task 3: Inferential statistics using `rstatix`**

Question: Test if there is a significant difference in age between males and females using the t-test.

Steps:

1. Install and load the `rstatix` package.
2. Filter the dataset to remove missing values in the "age" and "male" columns.
3. Recode the "male" variable to a factor.
4. Conduct a t-test to compare the means.

Solution:

```{r}
# Step 1
#install.packages("rstatix")
library(rstatix)

# Step 2
c19_df <- c19_df %>%
  filter(!is.na(age), !is.na(male))

# Step 3
c19_df$male <- factor(c19_df$male, levels = c(0, 1), labels = c("Female", "Male"))

# Step 4
c19_df %>%
  t_test(age ~ male)
```

### **Task 4: Inferential statistics using `gtsummary`**

Question: Test if there is a significant difference in age between Malaysians and non-Malaysians using the t-test, and present the results in a table using gtsummary.

Steps:

1. Recode the "malaysian" variable to a factor (Tip: use the `factor()` function).
2. Use the `tbl_summary()` function to present the results, and `add_p()` to add the t-test p-value.

Solution:

```{r}
# Step 1
c19_df$malaysian <- factor(c19_df$malaysian, levels = c(0, 1), labels = c("Non-Malaysian", "Malaysian"))

# Step 2
t_test_result <- c19_df %>%
  select(age, malaysian) %>%              # keep variables of interest
  tbl_summary(                            # produce summary table
    statistic = age ~ "{mean} ({sd})",    # specify what statistics to show
    by = malaysian) %>%                   # specify the grouping variable
  add_p(test = age ~ "t.test")            # compare group means with a t-test

t_test_result
```

### **Task 5: Correlations using `corrr`**

Question: Compute the correlation between the age, male, bid, and malaysian variables, and represent it in a correlation plot (Note: the selection of categorical variables is by design, just to practice the selection and presentation).

Steps:

1. Install and load the `corrr` package.
2. Create a subset of the data with the selected variables.
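3. Compute the correlation matrix.
4. Plot the correlation matrix as a network (Note: try `?network_plot` to see how this can be used).

Solution:

```{r}
# Step 1
#install.packages("corrr")
library(corrr)

# Step 2
# male and malaysian were recoded to factors in Tasks 3 and 4, and correlate()
# needs numeric input. as.numeric() turns factor levels into integer codes
# (1, 2), which leaves the correlations unchanged.
df_subset <- c19_df %>%
  select(age, male, bid, malaysian) %>%
  mutate(across(everything(), as.numeric))

# Step 3
correlation_matrix <- df_subset %>%
  correlate()

# Step 4
correlation_matrix %>%
  network_plot()
```

![](/images/network_plot-1.png){fig-align="center"}

This is the kind of plot you would see if any of the variables in our data were highly correlated.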
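
The correlation matrix from Task 5 can also be presented as a table rather than a plot. The block below is a minimal sketch of that idea, not part of the original exercise: it assumes the `corrr` helpers `shave()`, `fashion()`, and `stretch()`, and it runs on a small simulated data frame (`toy_df`, a hypothetical stand-in for the real linelist) so that it works without downloading the data.

```r
library(corrr)

# Hypothetical toy data (NOT the MoH linelist): a numeric age plus three
# 0/1 indicator variables with the same names as in the exercise
set.seed(42)
toy_df <- data.frame(
  age       = round(rnorm(200, mean = 60, sd = 15)),
  male      = rbinom(200, size = 1, prob = 0.5),
  bid       = rbinom(200, size = 1, prob = 0.2),
  malaysian = rbinom(200, size = 1, prob = 0.9)
)

# Correlation data frame: one row and one column per variable,
# plus a leading column holding the variable names
toy_cor <- correlate(toy_df, method = "pearson", quiet = TRUE)

# Blank out the redundant upper triangle, then round and format for printing
toy_cor_table <- fashion(shave(toy_cor), decimals = 2)
toy_cor_table

# Long format: one row per unique pair of variables, handy for sorting or filtering
toy_cor_long <- stretch(toy_cor, na.rm = TRUE, remove.dups = TRUE)
toy_cor_long
```

To apply the same steps to the exercise data, replace `toy_df` with the numeric `df_subset` from Task 5. Because the toy variables are simulated independently, their correlations should all be close to zero.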