Utilities for data manipulation
Tiago Olivoto
2023-03-06
Source: vignettes/vignettes_utilities.Rmd
See the section Rendering engine to see how the HTML tables were generated.
Utilities for rows and columns
Add columns and rows
The functions add_cols() and add_rows() can be used to add columns and rows, respectively, to a data frame.
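As a minimal sketch of both functions, assuming the data_ge example data that ships with metan (the column name ROW_ID and the values in the new row are illustrative only):

```r
library(metan)

# Add a column of row identifiers; with no position given it is placed last
with_id <- add_cols(data_ge, ROW_ID = seq_len(nrow(data_ge)))

# Add one row after the first row, supplying values for GY and HM only
with_row <- add_rows(data_ge, GY = 10.3, HM = 100.11, .after = 1)

head(with_row, 3)
```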
It is also possible to add a column based on existing data. Note that the arguments .after and .before are used to select the position of the new column(s). This is particularly useful to put variables of the same category together.
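A short sketch of a derived column placed with .after, again assuming the data_ge example data (GY2 is a hypothetical name chosen for illustration):

```r
library(metan)

# Compute GY2 from the existing GY column and place it right after GY
with_gy2 <- add_cols(data_ge, GY2 = GY^2, .after = GY)

names(with_gy2)
```

Using .before = GY instead would be expected to place the new column immediately to the left of GY.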
Select or remove columns and rows
The functions select_cols() and select_rows() can be used to select columns and rows, respectively, from a data frame.
select_cols(data_ge2, ENV, GEN) %>% print_table()
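Rows can be selected in a similar way. The sketch below assumes that select_rows() accepts a vector of row positions:

```r
library(metan)

# Keep only the first three rows of data_ge2
first_three <- select_rows(data_ge2, 1:3)

nrow(first_three)
```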
Numeric columns can be selected quickly by using the function select_numeric_cols(). Non-numeric columns are selected with select_non_numeric_cols().
select_numeric_cols(data_ge2) %>% print_table()
We can select the first or last columns quickly with select_first_col() and select_last_col(), respectively.
select_first_col(data_ge2) %>% print_table()
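The complementary selectors mentioned above can be sketched the same way, assuming both are called with the data frame as the only argument:

```r
library(metan)

# Keep only the non-numeric columns (the factors in data_ge2)
non_num <- select_non_numeric_cols(data_ge2)

# Keep only the last column of the data frame
last_col <- select_last_col(data_ge2)

names(non_num)
names(last_col)
```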
To remove columns or rows, use remove_cols() and remove_rows().
remove_cols(data_ge2, ENV, GEN) %>% print_table()
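Removing rows follows the same pattern. This sketch assumes that remove_rows() takes the positions of the rows to drop:

```r
library(metan)

# Drop the second and third rows of data_ge2
trimmed <- remove_rows(data_ge2, 2:3)

nrow(data_ge2) - nrow(trimmed)
```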
Concatenating columns
The function concatenate() can be used to concatenate multiple columns of a data frame. It returns a data frame with all the original columns in .data plus the concatenated variable, placed after the last column (by default) or at any position when using the arguments .before or .after.
concatenate(data_ge, ENV, GEN, REP, .after = REP) %>% print_table()
To drop the existing variables and keep only the concatenated column, use the argument drop = TRUE. To use concatenate() within a given function like add_cols(), use the argument pull = TRUE to pull out the results to a vector.
concatenate(data_ge, ENV, GEN, REP, drop = TRUE) %>% head()
# # A tibble: 6 × 1
# new_var
# <chr>
# 1 E1_G1_1
# 2 E1_G1_2
# 3 E1_G1_3
# 4 E1_G2_1
# 5 E1_G2_2
# 6 E1_G2_3
concatenate(data_ge, ENV, GEN, REP, pull = TRUE) %>% head()
# [1] "E1_G1_1" "E1_G1_2" "E1_G1_3" "E1_G2_1" "E1_G2_2" "E1_G2_3"
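Because pull = TRUE returns a plain vector, the result can be passed straight to add_cols(). A minimal sketch, where ENV_GEN is a hypothetical column name:

```r
library(metan)

# Build an ENV_GEN key from two columns and place it right after GEN
with_key <- add_cols(
  data_ge,
  ENV_GEN = concatenate(data_ge, ENV, GEN, pull = TRUE),
  .after = GEN
)

head(with_key, 3)
```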
To check if a column exists in a data frame, use column_exists().
column_exists(data_ge, "ENV")
# [1] TRUE
Getting levels
To get the levels and the size of the levels of a factor, the functions get_levels() and get_level_size() can be used.
get_levels(data_ge, ENV)
# [1] "E1" "E10" "E11" "E12" "E13" "E14" "E2" "E3" "E4" "E5" "E6" "E7"
# [13] "E8" "E9"
get_level_size(data_ge, ENV)
# # A tibble: 14 × 5
# ENV GEN REP GY HM
# <fct> <int> <int> <int> <int>
# 1 E1 30 30 30 30
# 2 E10 30 30 30 30
# 3 E11 30 30 30 30
# 4 E12 30 30 30 30
# 5 E13 30 30 30 30
# 6 E14 30 30 30 30
# 7 E2 30 30 30 30
# 8 E3 30 30 30 30
# 9 E4 30 30 30 30
# 10 E5 30 30 30 30
# 11 E6 30 30 30 30
# 12 E7 30 30 30 30
# 13 E8 30 30 30 30
# 14 E9 30 30 30 30
Utilities for numbers and strings
Rounding whole data frames
The function round_cols() rounds selected columns or a whole data frame to the specified number of decimal places (default 0). If no variables are supplied, all numeric variables are rounded.
round_cols(data_ge2) %>% print_table()
Alternatively, select variables to round.
round_cols(data_ge2, PH, EP, digits = 1) %>% print_table()
Extracting and replacing numbers
The functions extract_number() and replace_number() can be used to extract or replace numbers. As an example, we will extract the number of each genotype in data_ge.
extract_number(data_ge, GEN) %>% print_table()
To replace numbers of a given column with a specified replacement, use replace_number(). By default, numbers are replaced with an empty string ("").
replace_number(data_ge, GEN) %>% print_table()
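A non-empty replacement can be supplied as well. The sketch below assumes replace_number() takes a replacement argument in the same way replace_string() does; the value "_X" is arbitrary:

```r
library(metan)

# Replace the digits in GEN with a placeholder instead of removing them
masked <- replace_number(data_ge, GEN, replacement = "_X")

unique(as.character(masked$GEN))
```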
Extracting, replacing, and removing strings
The functions extract_string() and replace_string() are used in the same way as extract_number() and replace_number(), but for handling strings.
extract_string(data_ge, GEN) %>% print_table()
To replace strings, we can use the function replace_string().
replace_string(data_ge,
               GEN,
               replacement = "GENOTYPE_") %>%
  print_table()
To remove all strings of a data frame, use remove_strings().
remove_strings(data_ge) %>% print_table()
Tidy strings
The function tidy_strings() tidies up character strings, non-numeric columns, or any selected columns in a data frame by putting all words in upper case, replacing any space, tab, or punctuation character with '_', and putting '_' between lower and upper cases. Consider the following character strings: messy_env by definition should represent a unique level of the factor environment (environment 1), messy_gen shows six genotypes, and messy_int represents the interaction of those genotypes with environment 1.
messy_env <- c("ENV 1", "Env 1", "Env1", "env1", "Env.1", "Env_1")
messy_gen <- c("GEN1", "gen 2", "Gen.3", "gen-4", "Gen_5", "GEN_6")
messy_int <- c("Env1Gen1", "Env1_Gen2", "env1 gen3", "Env1 Gen4", "ENV_1GEN5", "ENV1GEN6")
These character vectors are visually messy. Let’s tidy them.
tidy_strings(messy_env)
# [1] "ENV_1" "ENV_1" "ENV_1" "ENV_1" "ENV_1" "ENV_1"
tidy_strings(messy_gen)
# [1] "GEN_1" "GEN_2" "GEN_3" "GEN_4" "GEN_5" "GEN_6"
tidy_strings(messy_int)
# [1] "ENV_1_GEN_1" "ENV_1_GEN_2" "ENV_1_GEN_3" "ENV_1_GEN_4" "ENV_1_GEN_5"
# [6] "ENV_1_GEN_6"
tidy_strings() also works to tidy a whole data frame or specific columns. Let’s create a ‘messy’ data frame in the context of plant breeding trials.
library(tibble)
#
# Attaching package: 'tibble'
# The following objects are masked from 'package:metan':
#
# column_to_rownames, remove_rownames, rownames_to_column
df <- tibble(Env = messy_env,
             gen = messy_gen,
             Env_GEN = interaction(Env, gen),
             y = rnorm(6, 300, 10))
df %>% print_table()
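The two uses described above can be sketched as follows. The data frame is rebuilt here so the example runs on its own, and it assumes that columns to tidy are passed unquoted after the data frame:

```r
library(metan)

messy_env <- c("ENV 1", "Env 1", "Env1", "env1", "Env.1", "Env_1")
messy_gen <- c("GEN1", "gen 2", "Gen.3", "gen-4", "Gen_5", "GEN_6")

df <- tibble::tibble(Env = messy_env,
                     gen = messy_gen,
                     Env_GEN = interaction(Env, gen),
                     y = rnorm(6, 300, 10))

# Tidy every non-numeric column of the data frame
tidy_all <- tidy_strings(df)

# Tidy only the gen column and leave the others untouched
tidy_gen <- tidy_strings(df, gen)

tidy_all
tidy_gen
```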
Rendering engine
This vignette was built with pkgdown. All tables were produced with the package DT using the following function.
library(metan)
library(DT) # Used to make the tables
# Function to make HTML tables
print_table <- function(table, rownames = FALSE, digits = 3, ...){
  df <- datatable(table, rownames = rownames, extensions = 'Buttons',
                  options = list(scrollX = TRUE,
                                 dom = '<<t>Bp>',
                                 buttons = c('copy', 'excel', 'pdf', 'print')), ...)
  num_cols <- c(as.numeric(which(sapply(table, class) == "numeric")))
  if(length(num_cols) > 0){
    formatSignif(df, columns = num_cols, digits = digits)
  } else{
    df
  }
}