---
title: "HHEAR: Code for Separating Out Phenotypes from Single Phenotype Column"
version: "1.0"
author: "MS, primary; JAS, secondary"
---

## Background
Within the HHEAR Knowledge Graph, many outcomes are classified as a phenotype. Examples of phenotypes include having asthma, not having asthma, having food allergies, not having cancer, etc. Because an individual within a study can have multiple phenotypes (e.g. having asthma, not having food allergies, not having cancer), harmonized datasets generated by the HHEAR Repository often contain a single column that holds multiple phenotype codes. This column has "Phenotype" in its header, the codes are separated by a space, and the meaning of each code is given in the generated codebook.
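
As an illustration of this format, the sketch below splits one phenotype cell into its individual codes. The cell value and the codes are made up for the example and do not come from a real HHEAR codebook.

```r
# Hypothetical phenotype cell: three space-separated codes (made-up values)
phenotype_cell <- "3 12 27"

# Split on the space separator and convert each piece to a number
codes <- as.numeric(strsplit(phenotype_cell, " ", fixed = TRUE)[[1]])
print(codes)
```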

This piece of code transforms the single column with multiple codes into a series of indicator columns that represent the distinct phenotypes. The code keeps the original phenotype column, moving it to the end of the dataset, and appends a separate column after it for each phenotype that appears in that column. Each new column header combines the name of the phenotype column and the specific phenotype (derived from the label in the codebook, with spaces replaced by hyphens), joined by "--". The value in the column is 1 if that observation has the specific phenotype and blank if it does not. Note that if you want a single, binary variable for a set of corresponding phenotypes (e.g. has asthma/does not have asthma), you will need to create that variable using the generated indicators for the two phenotypes.
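
A minimal sketch of building such a binary variable from two generated indicators is shown below. The column names and values are hypothetical; the real names depend on your phenotype column header and on the labels in your codebook. It also assumes that blank cells are read back in as `NA`.

```r
# Hypothetical indicator columns as they might look after reading UpdatedData.csv
# (1 = phenotype present, NA = blank). Names and values are made up.
indicators <- data.frame(
  "Phenotype--Has-asthma" = c(1, NA, NA),
  "Phenotype--Does-not-have-asthma" = c(NA, 1, NA),
  check.names = FALSE
)

has_yes <- !is.na(indicators[["Phenotype--Has-asthma"]])
has_no  <- !is.na(indicators[["Phenotype--Does-not-have-asthma"]])

# 1 = has asthma, 0 = does not have asthma, NA = neither phenotype recorded
indicators$asthma <- ifelse(has_yes, 1, ifelse(has_no, 0, NA))
print(indicators$asthma)
```

In this sketch an observation flagged with both phenotypes would be coded 1, so you may want to check for such rows before relying on the combined variable.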

## User Instructions
The 2nd code chunk below requires user input. You need to enter the file locations of your codebook and dataset, downloaded from the HHEAR Harmonized Data Repository. Once you enter those file locations, run this code in its entirety. It will produce a new .csv file called "UpdatedData.csv" with separate columns for each phenotype. If you want to change the name of the dataset, you can do so in the last line of the 3rd code chunk.

```{r package_setup, include=FALSE}
library(tidyverse)
library(knitr)
```

## Loading codebook & data - REQUIRES USER INPUT

Please replace the listed file location for codebook_path with the location of the codebook generated for your dataset.

Please replace the file location for data_path with the location that corresponds to your dataset.

(Optional) Please replace the path location for updated_data_location with the path where you want the new file to be saved.


```{r user_input, include=FALSE}
##Required - Specify path of the codebook in the "":
codebook_path <- ""

##Required - Specify path of the dataset in the "":
data_path <- ""

##Optional - Specify the folder where you want to save the updated data file, ending with a slash (e.g. "results/"). If you leave "" empty, the new file will be saved in your current session directory by default.
updated_data_location <- "" 
```
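
Once the paths in the chunk above are filled in, a quick check that both files exist can save a confusing error later. The sketch below is one way to do that; `find_path_problems` is a hypothetical helper written for this illustration and is not part of the HHEAR code.

```r
# Hypothetical helper (not part of the HHEAR code): returns one message for
# each path that is empty or does not point to an existing file.
find_path_problems <- function(codebook_path, data_path) {
  paths <- c(codebook_path = codebook_path, data_path = data_path)
  problems <- character()
  for (name in names(paths)) {
    if (!nzchar(paths[[name]])) {
      problems <- c(problems, paste0(name, " is empty"))
    } else if (!file.exists(paths[[name]])) {
      problems <- c(problems, paste0(name, " not found: ", paths[[name]]))
    }
  }
  problems
}

# Example call with both paths left empty
print(find_path_problems("", ""))
```

An empty result means both paths point to existing files.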

## Data Processing

```{r daprocess, include=FALSE}
codebook <- read_csv(codebook_path, na = c("", "NA"), 
                     show_col_types = FALSE) %>% 
  select(-class)

phenotype_data <- read_csv(data_path, na = c("", "NA"), #data_path is set in the user_input chunk
                           show_col_types = FALSE, guess_max = 16000) 

phenotype_data_subtype <- phenotype_data %>% #subsetting on phenotype variables 
  select(contains("Phenotype")) %>% 
  mutate(across(everything(), as.character)) 

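#padding each phenotype string with a leading and trailing space so that whole codes can be matched later
#lapply (rather than sapply) always returns one element per column, even for a single-row dataset
phenotype_data_subtype[] <- lapply(phenotype_data_subtype, function(x) gsub("^(.*)$", " \\1 ", x))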

phenotype_variables <- colnames(phenotype_data_subtype)

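unique_pheno <- numeric() #initiating vector for the phenotype codes 

#collecting every numeric code that appears in any phenotype column
#seq_len() is used so that the loops are skipped when there are no rows or columns
for (i in seq_len(nrow(phenotype_data_subtype))) {
  for (j in seq_len(ncol(phenotype_data_subtype))) {
    row_values <- phenotype_data_subtype[[j]][i]
    if (!is.na(row_values)) {
      split_pheno <- strsplit(row_values, " ")[[1]]

      for (value in split_pheno) {
        numeric_value <- as.numeric(value) #empty pieces from the padding become NA and are skipped
        if (!is.na(numeric_value)) {
          unique_pheno <- c(unique_pheno, numeric_value)
        }
      }
    }
  }
}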

unique_pheno <- unique(unique_pheno) #keeps all unique phenotype numbers

unique_pheno_df <- as.data.frame(unique_pheno) %>% 
  rename(code = unique_pheno) %>% 
  left_join(codebook, by = "code") %>% 
  mutate(value = gsub(" ", "-", value)) #adding labels to phenotype numbers

label_pheno <- c()

# Iterate over each unique value and suffix to generate labels
for (value in unique_pheno_df$value) {
  for (phenotype_variable in phenotype_variables) {
    new_variable <- paste0(phenotype_variable, "--", value) 
    label_pheno <- c(label_pheno, new_variable)
  }
}

#Adding columns of phenotype labels to dataframe
for (column_name in label_pheno) { 
  phenotype_data_subtype[[column_name]] <- NA
}

#setting the indicator to 1 wherever a code appears in a phenotype column
for (data_row in seq_len(nrow(phenotype_data_subtype))) { 
  for (col_name in phenotype_variables) {
    row_values <- phenotype_data_subtype[[col_name]][data_row]
    for (i in seq_len(nrow(unique_pheno_df))) {
      #pad the code with spaces so that, e.g., code 1 does not match inside code 10
      code <- paste0(" ", as.character(unique_pheno_df[i, "code"]), " ")
      value <- unique_pheno_df[i, "value"]

      if (grepl(code, row_values, fixed = TRUE)) {
        indicator_column <- paste0(col_name, "--", value)
        phenotype_data_subtype[data_row, indicator_column] <- 1
      }
    }
  }
}

phenotype_data_subtype2 <- phenotype_data_subtype %>% 
  select_if(~!(all(is.na(.)) | all(. == "")))

phenotype_data_1 <- phenotype_data %>% 
  select(-contains("Phenotype"))

updated_phenotype_data <- cbind(phenotype_data_1, phenotype_data_subtype2)

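#adding a trailing slash to the output folder if the user left it off
output_dir <- updated_data_location
if (nzchar(output_dir) && !grepl("[/\\\\]$", output_dir)) {
  output_dir <- paste0(output_dir, "/")
}

write.csv(updated_phenotype_data, 
          paste0(output_dir, "UpdatedData.csv"), row.names = FALSE, na = "") 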
```
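
The processing chunk above pads both the phenotype strings and the codes with spaces before matching them. The sketch below, using made-up phenotype strings, shows why: without the padding, a search for code 1 would also match a row that only contains code 10.

```r
# Made-up phenotype strings; the second row contains code 10 but not code 1
phenotype_strings <- c("1 4", "10 4", NA)

# Unpadded search: "1" also matches inside "10"
unpadded <- grepl("1", phenotype_strings, fixed = TRUE)

# Padded search: wrap both the string and the code in spaces first
padded_strings <- ifelse(is.na(phenotype_strings), NA,
                         paste0(" ", phenotype_strings, " "))
padded <- grepl(" 1 ", padded_strings, fixed = TRUE)

print(unpadded)
print(padded)
```

Only the padded search limits the match to the first row, and missing values are treated as "no match" in both cases.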