--- title: "R Notebook" output: html_notebook editor_options: chunk_output_type: console --- ```{r, message=FALSE} require(pacman) p_load(tidyverse) ``` ## Selecting rows with filter The dplyr function filter() provides a cleaner syntax for subsetting datasets. Conditions separated by , are joined by & (logical AND). ```{r, size="footnotesize", message=FALSE} require(readxl) diab <- read_excel("datasets/diabetes.xls") diab_filt <- filter(diab, tabac == "No fumador", edat >= 50) head(diab_filt, n = 4) ``` ## Selecting columns with select Use dplyr function select() to keep only the variables you need. ```{r} diab_small <- select(diab, mort, edat, tabac, sbp) head(diab_small, n = 4) ``` ## Sorting rows with arrange Sort the order of rows by variable values using **arrange()** from dplyr. Be default, ascending order will be used. Surround a sorting variable with **desc()** to sort by descending order instead. ```{r} # sort, with males before 'vivo', then by age, youngest first diab_sort <- arrange(diab, desc(mort), edat) head(diab_sort, n = 4) ``` ## R Logical operators and functions Here are some operators and functions to help with selection: * **==**: equality * **$>, >=$**: greater than, greater than or equal to * **!**: not * **&**: AND * **|**: OR * **%in%**: matches any of (2 %in% c(1,2,3) = TRUE) * **is.na()**: equality to NA * **near()**: checking for equality for floating point (decimal) numbers, has a built-in tolerance ## Transforming variables into new variables The function **mutate()** allows us to transform many variables in one step without having to respecify the data frame name over and over. Useful R functions for transforming: * **log()**: logarithm * **min_rank()**: rank values * **cut()**: cut a continuous variable into intervals with new integer value signifying into which interval original value falls * **scale()**: standardizes variable (substracts mean and divides by standard deviation) * **cumsum()**: cumulative sum * **rowMeans(), rowSums()**: means and sums of several columns ## Example: mutate() create age category variable, and highbmi binary variable ```{r, size="tiny"} diab_mut <- mutate(diab, edatcat = cut(edat, breaks = c(0,40,50,60,70,120)), highbmi = bmi > mean(bmi)) tail(diab_mut, n = 4) ``` ```{r, eval = FALSE} table(diab_mut$edatcat, diab_mut$highbmi) ``` ## EXERCISE 1. Find all individual that: 1.1 Had a sbp higher than 160 (**filter()**) 1.2 Had a sbp higher than 160 or tabac was 'Fumador' 2. What happens if you include the name of a variable multiple times in a **select()** call? 3. Sort individual to find the most 'tempsviu'. (**arrange() **) # The pipe operator %>% ## Examples of using the pipe operator As a first example, perhaps we want to create a dataset of just Vivo under 40, with only the age and pain variables selected. We could do this in 2 steps, like so: ```{r} diab40 <- filter(diab, mort == "Vivo" & edat < 40) diab40_small <- select(diab40, edat, dbp) head(diab40_small,n = 4) ``` ## Examples of using the pipe operator While that works fine, the intermediate dataset f40 is not of interest and is cluttering up memory and the workspace unnecessarily. We could use %>% instead: ```{r} diab40_small <- diab %>% filter(mort == "Vivo" & edat < 40) %>% select(edat, dbp) head(diab40_small,n = 4) ``` ## EXERCISE Replicate the last exercice using 'pipes' ```{r, eval = F} df <- filter(diab,sbp > 160 | tabac == "Fumador") dfs <- select(df, tempsviu ,bmi,sbp,sbp) dfsa <- arrange(dfs, desc(tempsviu)) ``` # Example This dataset contains patient information from a hospital survey or registry. **It was provided by a researcher and needs a lot of cleaning.** GOAL - Transform this chaotic dataset into a clean, structured, and analyzable format using R: - clean_names() (from janitor) - filter(), select(), mutate(), arrange() (from dplyr) ## Horrible data base ```{r, echo =F} horrible_base <- import("../datasets/horrible_base.csv") ``` ## Horrors in the dataset: - Long, unclear column names with strange characters. - Extra rows like repeated headers or summary lines. - Incorrectly typed variables (numbers stored as text, inconsistent date formats). - Categories with spelling mistakes, inconsistent capitalization, or extra spaces. - Missing and duplicate data. - Useless columns. ## Load Packages ```{r} require(pacman) p_load(dplyr,janitor) ``` ## Clean names Use clean_names() to fix the column names ```{r} names(horrible_base)[1:3] horrible_base_clean <- horrible_base %>% janitor::clean_names() names(horrible_base_clean)[1:3] ``` ## Filter rows Use filter() to remove unwanted or extra rows. You can remove: - Rows with "TOTAL" in any column. - Full duplicate rows. ```{r} dim(horrible_base_clean) horrible_base_clean <- horrible_base_clean %>% filter(!grepl("TOTAL", id_paciente)) %>% filter(!duplicated(.)) dim(horrible_base_clean) ``` ## Select columns Use select() to keep only useful columns ```{r} dim(horrible_base_clean) horrible_base_clean <- horrible_base_clean %>% select(id_paciente, edad_en_anos, fecha_de_ingreso_hospital, sexo_m_f, grupo_de_intervencion_o_no, notas_adicionales) dim(horrible_base_clean) ``` ## Mutate variables Use mutate() to fix data types and clean up categories - Convert age to numeric. - Parse various date formats. - Standardize text values in categorical variables. ## Mutate variables. Numerical Convert age to numeric. ```{r} head(horrible_base_clean$edad_en_anos) horrible_base_clean <- horrible_base_clean %>% mutate( edad_en_anos = as.numeric(edad_en_anos) ) head(horrible_base_clean$edad_en_anos) ``` ## Mutate variables. Categorical Standardize text values in categorical variables. ```{r} head(horrible_base_clean$sexo_m_f) horrible_base_clean <- horrible_base_clean %>% mutate( sexo_m_f = case_when( grepl("mascul", tolower(sexo_m_f)) ~ "Male", grepl("fem", tolower(sexo_m_f)) ~ "Female", TRUE ~ NA_character_ ), grupo_de_intervencion_o_no = case_when( tolower(grupo_de_intervencion_o_no) %in% c("intervencion", "intervención") ~ "Intervention", tolower(grupo_de_intervencion_o_no) %in% c("control", "ctrl") ~ "Control", TRUE ~ NA_character_ ) ) head(horrible_base_clean$sexo_m_f) ``` ## Mutate variables. Date Parse various date formats. ```{r} head(horrible_base_clean$fecha_de_ingreso_hospital) horrible_base_clean <- horrible_base_clean %>% mutate( fecha_de_ingreso_hospital = as.Date(fecha_de_ingreso_hospital, c("%Y/%m/%d")), ) head(horrible_base_clean$fecha_de_ingreso_hospital) ``` ## Arrange data base Use arrange() to sort the data For example, by admission date and age: ```{r} horrible_base_clean <- horrible_base_clean %>% arrange(fecha_de_ingreso_hospital, edad_en_anos) head(horrible_base_clean %>% select(id_paciente,fecha_de_ingreso_hospital, edad_en_anos)) ``` ## Final result: A cleaner dataset Now you have a much cleaner dataset with: - Clear and consistent column names. - Correct data types. - Unwanted rows removed. - Clean and standardized categories. - Would you like me to wrap this all into one clean code block for easy reuse? # Merging datasets ## Merging datasets Appending adds more rows of observations, whereas merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s). \begin{figure} \includegraphics[width=.83\linewidth]{images/merge.png} \end{figure} ## Data example ```{r} band_members band_instruments ``` ## Append row bind_rows() ```{r} bind_rows(band_members, band_instruments) ``` ## Append columns bind_cols() !!!!!!!!!! ```{r} bind_cols(band_members, band_instruments) ``` !!!!!!!!!! ## Merging datasets with dplyr joins The **dplyr** "join" functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by= argument to specify specific matching id variables. These joins all return a table with all columns from x and y, but differ in how they deal with mismatched rows: - **inner_join(x, y)**: returns all rows from x where there is a matching value in y (returns only matching rows). - **left_join(x, y)**: returns all rows from x, unmatched rows in x will have NA in the columns from y. Unmatched rows in y not returned. - **full_join(x, y)**: returns all rows from x and from y; unmatched rows in either will have NA in new columns ## Mutating joins **inner_join(x, y)**: returns all rows from x where there is a matching value in y (returns only matching rows). ```{r} band_members %>% inner_join(band_instruments, by = "name") ``` ## Mutating joins **Other joins **: `left_join`, `right_join`, `full_join` ```{r} band_members %>% left_join(band_instruments) ``` ## EXERCISE What happens if you run these lines? ```{r, eval = F} band_members %>% right_join(band_instruments) ``` ```{r, eval = F} band_members %>% full_join(band_instruments) ```