---
title: "Running pairwise chi-square tests with the `vivainsights` R library"
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    theme: "lumen"
---

This script shows an example of how to perform pairwise chi-square tests for categorical variables in a dataset.

A pairwise chi-square test helps detect associations between distinct pairs of categorical variables. By running multiple chi-square tests on each pair, it allows you to pinpoint which specific pairs exhibit significant relationships.

## Step 1: load libraries and sample data

In this example, we will use the sample `pq_data` dataset from the **vivainsights** package. We will also use **dplyr** and **purrr** for data manipulation and iteration respectively, and optionally you can just load the **tidyverse** package instead.

```{r setup, message=FALSE, warning=FALSE}
library(vivainsights)
library(dplyr)
library(purrr)

sample_data <- pq_data
```

## Step 2: simulating a dataset with categorical variables

The next step is to simulate additional categorical variables for the sample data. In this example, we will create three fake categorical variables: `Teams`, `Regions`, and `Functions`. We will then merge these variables with the sample data.

```{r}
# Set random seed for reproducibility
set.seed(123)

# Number of unique PersonId in data
n_personid <- length(unique(sample_data$PersonId))

# Create fake categorical variables for each PersonId
cat_data <- data.frame(
  PersonId = unique(sample_data$PersonId),
  Teams = sample(c('Team 1', 'Team 2', 'Team 3'), size = n_personid, replace = TRUE),
  Regions = sample(c('East', 'South', 'West', 'North'), size = n_personid, replace = TRUE),
  Functions = sample(c('HR', 'Finance', 'Operations', 'Sales'), size = n_personid, replace = TRUE)
)

# Merge the datasets
sample_data_merged <- merge(sample_data, cat_data, by = "PersonId")

# Assign categorical variables names to `cat_vars`, alongside existing variables
cat_vars <- c("Teams", "Regions", "Functions", "Organization", "LevelDesignation")
```

## Step 3: Perform pairwise chi-square tests for all categorical variables

In the following, we first use `combn()` to generate all combinations of variable pairs. `combn()` comes from the **utils** package, and generates all combinations of a vector of elements of a given size. Here, we set `m = 2` to yield pairs.

Next, we use `map_dfr()` from the **purrr** package to loop over each combination and perform a chi-square test. The operation is similar to a for loop, but the results are row-bound (similar to `rbind()` or `bind_rows()`) and returned as a data frame. In R, it is generally more efficient to use `map()` functions from the **purrr** package than to use for loops.

Towards the end of the code, we add a significance level to the results based on the p-value. The significance level is denoted by asterisks, where `***` indicates p < 0.001, `**` indicates p < 0.01, and `*` indicates p < 0.05.

```{r}
# Generate all combinations of variable pairs
cat_var_combinations <- combn(x = cat_vars, m = 2, simplify = FALSE)

# Use `map_dfr()` to loop over each combination
results_df <-
  map_dfr(cat_var_combinations, ~{
  var1 <- .x[1]
  var2 <- .x[2]
  
  # Create a contingency table
  contingency_table <- table(sample_data_merged[[var1]], sample_data_merged[[var2]])
  
  # Perform chi-square test
  chi_test <- chisq.test(contingency_table)
  
  # Return data frame with results
  tibble(
    var1 = var1,
    var2 = var2,
    chi2 = chi_test$statistic,
    p = chi_test$p.value,
    n = sum(contingency_table)
  ) %>%
    # Add significance level
    mutate(
      significance = case_when(
        p < 0.001 ~ "***",
        p < 0.01 ~ "**",
        p < 0.05 ~ "*",
        TRUE ~ ""
    ))
})

print(results_df)
```

Finally, you can export the results to csv or clipboard using the following code:

```{r eval=FALSE}
# Copy to clipboard
results_df %>% export(method = "clipboard")

# Export to csv
results_df %>% export(method = "csv", path = "chi-square-results", timestamp = FALSE)
```