For this post, we need to load the gtsummary library.
The gtsummary package uses the tbl_summary() function to generate the summary table, and it works well with the %>% pipe operator.
It automatically detects the type of each column and uses it to decide which statistics to compute. By default, these are:
- median, 1st and 3rd quartiles for numeric columns
- number of observations and proportion for categorical columns
library(gtsummary)
# create dataset
data("Titanic")
df = as.data.frame(Titanic)
# create the table
df %>%
  tbl_summary()
Characteristic | N = 32¹ |
---|---|
Class | |
1st | 8 (25%) |
2nd | 8 (25%) |
3rd | 8 (25%) |
Crew | 8 (25%) |
Sex | |
Male | 16 (50%) |
Female | 16 (50%) |
Age | |
Child | 16 (50%) |
Adult | 16 (50%) |
Survived | 16 (50%) |
Freq | 14 (1, 77) |
¹ n (%); Median (IQR) |
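To make the default statistics more concrete, here is a minimal base R sketch that recomputes them by hand for two columns of the table above. It only uses functions from base R, and the variable names (n_rows, freq_quartiles, class_counts, class_props) are chosen for illustration. Note that as.data.frame(Titanic) contains one row per combination of Class, Sex, Age and Survived rather than one row per passenger, which is why the table reports N = 32.

```r
# as.data.frame(Titanic) yields one row per Class/Sex/Age/Survived combination
data("Titanic")
df = as.data.frame(Titanic)
n_rows = nrow(df)  # number of rows summarised by tbl_summary()

# numeric column: 1st quartile, median and 3rd quartile
freq_quartiles = quantile(df$Freq, probs = c(0.25, 0.5, 0.75))

# categorical column: count and proportion for each level
class_counts = table(df$Class)
class_props = prop.table(class_counts)

print(n_rows)
print(freq_quartiles)
print(class_counts)
print(class_props)
```

The quartiles of Freq come out as 0.75, 13.5 and 77, which the table displays after rounding as 14 (1, 77), and each Class level holds 8 of the 32 rows (25%).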
If you want to add p-values to the table, you have to add by=variable_name in the tbl_summary() function. This is because a p-value comes from a statistical test that compares groups, so the table needs a grouping variable.
The variable in the by argument is used to split the dataset into sub-samples (2 if it’s dichotomous, 3 if the variable has 3 distinct labels, etc.). These sub-samples are compared for each column in the dataset, and the test used depends on the type of data.
In this case, we add:
- add_p() to create a new column for p-values
- add_overall() to add a new column with descriptive statistics for the whole sample
library(gtsummary)
# create dataset
data("Titanic")
df = as.data.frame(Titanic)
# create the table
df %>%
  tbl_summary(by=Survived) %>%
  add_overall() %>%
  add_p()
Characteristic | Overall, N = 32¹ | No, N = 16¹ | Yes, N = 16¹ | p-value² |
---|---|---|---|---|
Class | | | | >0.9 |
1st | 8 (25%) | 4 (25%) | 4 (25%) | |
2nd | 8 (25%) | 4 (25%) | 4 (25%) | |
3rd | 8 (25%) | 4 (25%) | 4 (25%) | |
Crew | 8 (25%) | 4 (25%) | 4 (25%) | |
Sex | | | | >0.9 |
Male | 16 (50%) | 8 (50%) | 8 (50%) | |
Female | 16 (50%) | 8 (50%) | 8 (50%) | |
Age | | | | >0.9 |
Child | 16 (50%) | 8 (50%) | 8 (50%) | |
Adult | 16 (50%) | 8 (50%) | 8 (50%) | |
Freq | 14 (1, 77) | 9 (0, 96) | 14 (10, 75) | 0.6 |
¹ n (%); Median (IQR) |
² Fisher’s exact test; Pearson’s Chi-squared test; Wilcoxon rank sum test |
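As a rough illustration of what happens behind the by argument, the sketch below splits the data by Survived and runs two of the tests named in the footnote directly with base R. The exact options gtsummary passes to these functions are an assumption here (exact = FALSE and correct = FALSE are choices made for this example), so treat it as an approximation of the idea rather than a reproduction of the package internals.

```r
data("Titanic")
df = as.data.frame(Titanic)

# by = Survived splits the rows into one sub-sample per level
group_sizes = sapply(split(df, df$Survived), nrow)

# numeric column, two groups: Wilcoxon rank sum test
# (exact = FALSE is an assumption made here because Freq contains ties)
freq_test = wilcox.test(Freq ~ Survived, data = df, exact = FALSE)

# categorical column: Pearson's Chi-squared test on the cross-tabulation
# (correct = FALSE is an assumption made here; it disables the continuity correction)
sex_test = chisq.test(table(df$Sex, df$Survived), correct = FALSE)

print(group_sizes)
print(freq_test$p.value)
print(sex_test$p.value)
```

Each sub-sample has 16 rows, matching the N = 16 column headers. Because every level appears equally often in both groups, the categorical comparisons show no difference at all, which is consistent with the >0.9 entries in the table.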
Thanks to the add_stat() function, we can create a new column based on our own functions.
Below, we define a function that returns the p-value of an ANOVA and pass it to the add_stat() function.
library(gtsummary)
# create dataset
data("iris")
df = as.data.frame(iris)
my_anova = function(data, variable, by, ...) {
  result = aov(as.formula(paste(variable, "~", by)), data = data)
  summary(result)[[1]]$'Pr(>F)'[1] # Extracting the p-value for the group effect
}

# create the table
df %>%
  tbl_summary(by=Species) %>%
  add_overall() %>%
  add_p() %>%
  add_stat(fns = everything() ~ my_anova) %>%
  modify_header(
    list(
      add_stat_1 ~ "**p-value**",
      all_stat_cols() ~ "**{level}**"
    )
  ) %>%
  modify_footnote(
    add_stat_1 ~ "ANOVA")
Characteristic | Overall¹ | setosa¹ | versicolor¹ | virginica¹ | p-value² | p-value³ |
---|---|---|---|---|---|---|
Sepal.Length | 5.80 (5.10, 6.40) | 5.00 (4.80, 5.20) | 5.90 (5.60, 6.30) | 6.50 (6.23, 6.90) | <0.001 | 0.000 |
Sepal.Width | 3.00 (2.80, 3.30) | 3.40 (3.20, 3.68) | 2.80 (2.53, 3.00) | 3.00 (2.80, 3.18) | <0.001 | 0.000 |
Petal.Length | 4.35 (1.60, 5.10) | 1.50 (1.40, 1.58) | 4.35 (4.00, 4.60) | 5.55 (5.10, 5.88) | <0.001 | 0.000 |
Petal.Width | 1.30 (0.30, 1.80) | 0.20 (0.20, 0.30) | 1.30 (1.20, 1.50) | 2.00 (1.80, 2.30) | <0.001 | 0.000 |
¹ Median (IQR) |
² Kruskal-Wallis rank sum test |
³ ANOVA |
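A custom function like this one can also be called on its own, which is a convenient way to check what it returns before passing it to add_stat(). The sketch below does that for one variable. The format_p() helper is hypothetical (it is not part of the original post or of gtsummary) and simply shows one possible way to format very small p-values with base R's format.pval().

```r
data("iris")

# same function as in the post
my_anova = function(data, variable, by, ...) {
  result = aov(as.formula(paste(variable, "~", by)), data = data)
  summary(result)[[1]]$'Pr(>F)'[1]
}

# call it directly, outside of add_stat()
p_raw = my_anova(iris, "Sepal.Length", "Species")

# hypothetical helper (not in the post): format small p-values for display
format_p = function(p) format.pval(p, digits = 1, eps = 0.001)

print(p_raw)
print(format_p(p_raw))
```

The raw p-value is far below 0.001, which is why the ANOVA column shows 0.000 once the number is rounded to three decimals.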
This post explained how to create a summary table using the gtsummary library. For more on this package, see the dedicated section or the table section.