Descriptive statistics — desc

desc_stat() computes the most used measures of central tendency, position, and dispersion.
desc_wider() puts the variables in columns and the grouping variables in rows. The table is filled with a statistic chosen with the argument which.

Usage

desc_stat(
  .data = NULL,
  ...,
  by = NULL,
  stats = "main",
  hist = FALSE,
  level = 0.95,
  digits = 4,
  na.rm = FALSE,
  verbose = TRUE,
  plot_theme = theme_metan()
)

desc_wider(.data, which)

Arguments

.data

The data to be analyzed. It can be a data frame (possibly with grouped data passed from dplyr::group_by()) or a numeric vector. For desc_wider(), .data is an object of class desc_stat.

...

A single variable name or a comma-separated list of unquoted variable names. If no variable is given, all the numeric variables from .data will be used. Select helpers are allowed.

by

One variable (factor) to compute the function by. It is a shortcut to dplyr::group_by(). To compute the statistics by more than one grouping variable, use dplyr::group_by() directly.

stats

The descriptive statistics to show. This is used to filter the output after computation. Defaults to "main" (cv, max, mean, median, min, sd.amo, se, ci.t). Other allowed values are "all" to show all the statistics, "robust" to show robust statistics, "quantile" to show quantile statistics, or choose one (or more) of the following:

"av.dev": average deviation.
"ci.t": t-interval (95% confidence interval) of the mean.
"ci.z": z-interval (95% confidence interval) of the mean.
"cv": coefficient of variation.
"iqr": interquartile range.
"gmean": geometric mean.
"hmean": harmonic mean.
"Kurt": kurtosis.
"mad": median absolute deviation.
"max": maximum value.
"mean": arithmetic mean.
"median": median.
"min": minimum value.
"n": the length of the data.
"n.valid": the number of valid (not NA) elements.
"n.missing": the number of missing values.
"n.unique": the number of unique elements.
"ps": the pseudo-sigma (iqr / 1.35).
"q2.5", "q25", "q75", "q97.5": the 2.5% percentile, first quartile, third quartile, and 97.5% percentile.
"range": the range of the data.
"sd.amo", "sd.pop": the sample and population standard deviation.
"se": the standard error of the mean.
"skew": skewness.
"sum": the sum of the values.
"sum.dev": the sum of the absolute deviations.
"ave.sq.dev": the average of the squared deviations.
"sum.sq.dev": the sum of the squared deviations.
"var.amo", "var.pop": the sample and population variance.

Use names to select the statistics, for example stats = c("median, mean, cv, n"). The statistic names are not case-sensitive, and either commas or spaces can be used as separators.
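As a rough illustration of what several of these statistics measure, the following base R sketch computes a few of them by hand for the non-missing values used in Example 3. The formulas follow the descriptions in the list above; the object names are illustrative only, and the quantile method behind iqr is an assumption (base R's default), so it may not match desc_stat() exactly.

```r
# Base R sketch of a few statistics from the list above.
# Object names are hypothetical; this is not the package's own code.
x <- c(12, 13, 19, 21, 8, 23)

n      <- length(x)
mean_x <- mean(x)
sd_amo <- sd(x)                           # sample SD (divisor n - 1)
sd_pop <- sqrt(sum((x - mean_x)^2) / n)   # population SD (divisor n)
se     <- sd_amo / sqrt(n)                # standard error of the mean
cv     <- sd_amo / mean_x * 100           # coefficient of variation, in percent
iqr    <- IQR(x)                          # interquartile range (default quantile type)
ps     <- iqr / 1.35                      # pseudo-sigma

round(c(mean = mean_x, sd.amo = sd_amo, sd.pop = sd_pop,
        se = se, cv = cv, iqr = iqr, ps = ps), 2)
```

The mean, se, and cv obtained this way (16, 2.39, and 36.7 after rounding) agree with the values printed in Example 3.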

hist

Logical argument; defaults to FALSE. If hist = TRUE, a histogram is created for each selected variable.

level

The confidence level used to compute the confidence interval of the mean. Defaults to 0.95.
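A minimal base R sketch of how level enters a t-interval for the mean is given below. It is not the package's own code. The example output further down suggests that the ci.t column holds the half-width of the interval (for CD, se = 0.0939 and ci.t = 0.186 with n = 156), but that reading is an inference from the printed values.

```r
# t-interval of the mean at a given confidence level (base R sketch).
x     <- c(12, 13, 19, 21, 8, 23)
level <- 0.95

n          <- length(x)
se         <- sd(x) / sqrt(n)                          # standard error of the mean
t_crit     <- qt(1 - (1 - level) / 2, df = n - 1)      # two-sided critical value
half_width <- t_crit * se                              # margin added to / subtracted from the mean

c(lower = mean(x) - half_width, upper = mean(x) + half_width)
```

Raising level (for example to 0.99) increases t_crit and therefore widens the interval.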

digits

The number of significant digits.

na.rm

Logical. Should missing values be removed? Defaults to FALSE.
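The base R analogue below shows why removing missing values matters, using the vector from Example 3. It only illustrates base R behaviour; how desc_stat() itself reacts to NAs when na.rm = FALSE is not shown here.

```r
# Effect of missing values on a summary statistic (base R analogue).
x <- c(12, 13, 19, 21, 8, NA, 23, NA)

mean_with_na    <- mean(x)                 # NA: a missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)   # mean of the six valid values
n_total         <- length(x)               # counts every element, including NA
n_valid         <- sum(!is.na(x))          # counts only the non-missing elements

c(mean = mean_without_na, n = n_total, n.valid = n_valid)
```

These counts correspond to the n (8) and n.valid (6) columns printed in Example 3.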

verbose

Logical argument. If verbose = FALSE, the code is run silently.

plot_theme

The graphical theme of the plot. Default is plot_theme = theme_metan(). For more details, see ggplot2::theme().

which

A statistic to fill the table.

Value

desc_stat() returns a tibble with the statistics in the columns and variables (with possible grouping factors) in rows.
desc_wider() returns a tibble with variables in columns and grouping factors in rows.
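The relation between the two return shapes can be sketched with tidyr::pivot_wider() on a small, hand-typed subset of the Example 4 output (maximum values of CD and ED for H1 and H10). Whether desc_wider() is implemented this way is an assumption; the sketch only shows the long-to-wide reshaping.

```r
library(tidyr)

# Hand-typed subset of the Example 4 output: one row per GEN x variable.
long <- data.frame(
  GEN      = c("H1", "H1", "H10", "H10"),
  variable = c("CD", "ED", "CD", "ED"),
  max      = c(17.9, 53.3, 17.5, 54.1)
)

# One row per grouping level, one column per variable, filled with 'max'.
wide <- pivot_wider(long,
                    id_cols     = GEN,
                    names_from  = variable,
                    values_from = max)
wide
```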

Author

Tiago Olivoto tiagoolivoto@gmail.com

Examples

# \donttest{
library(metan)
#===============================================================#
# Example 1: main statistics (coefficient of variation, maximum,#
# mean, median, minimum, sample standard deviation, standard    #
# error and confidence interval of the mean) for all numeric    #
# variables in data                                             #
#===============================================================#

desc_stat(data_ge2)
#> # A tibble: 15 × 10
#>    variable    cv     max    mean  median     min  sd.amo     se    ci.t n.valid
#>    <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
#>  1 CD        7.34  18.6    16.0    16      12.9    1.17   0.0939  0.186      156
#>  2 CDED      5.71   0.694   0.586   0.588   0.495  0.0334 0.0027  0.0053     156
#>  3 CL        7.95  34.7    29.0    28.7    23.5    2.31   0.185   0.365      156
#>  4 CW       25.2   38.5    24.8    24.5    11.1    6.26   0.501   0.99       156
#>  5 ED        5.58  54.9    49.5    49.9    43.5    2.76   0.221   0.437      156
#>  6 EH       21.2    1.88    1.34    1.41    0.752  0.284  0.0228  0.045      156
#>  7 EL        8.28  17.9    15.2    15.1    11.5    1.26   0.101   0.199      156
#>  8 EP       10.5    0.660   0.537   0.544   0.386  0.0564 0.0045  0.0089     156
#>  9 KW       18.9  251.    173.    175.    106.    32.8    2.62    5.18       156
#> 10 NKE      14.2  697.    512.    509.    332.    72.6    5.82   11.5        156
#> 11 NKR      10.7   42      32.2    32      23.2    3.47   0.277   0.548      156
#> 12 NR       10.2   21.2    16.1    16      12.4    1.64   0.131   0.259      156
#> 13 PERK      2.17  91.8    87.4    87.5    81.2    1.90   0.152   0.300      156
#> 14 PH       13.4    3.04    2.48    2.52    1.71   0.334  0.0267  0.0528     156
#> 15 TKW      13.9  452.    339.    342.    218.    47.1    3.77    7.44       156

#===============================================================#
# Example 2: robust statistics using a numeric vector as input  #
# data                                                          #
#===============================================================#
vect <- data_ge2$TKW
desc_stat(vect, stats = "robust")
#> # A tibble: 1 × 5
#>   variable     n median   iqr    ps
#>   <chr>    <dbl>  <dbl> <dbl> <dbl>
#> 1 val        156   342.  57.8  42.8

#===============================================================#
# Example 3: Select specific statistics. In this example, NAs   #
# are removed before analysis with a warning message            #
#===============================================================#
desc_stat(c(12, 13, 19, 21, 8, NA, 23, NA),
          stats = c('mean, se, cv, n, n.valid'),
          na.rm = TRUE)
#> # A tibble: 1 × 6
#>   variable  mean    se    cv     n n.valid
#>   <chr>    <dbl> <dbl> <dbl> <dbl>   <dbl>
#> 1 val         16  2.39  36.7     8       6

#===============================================================#
# Example 4: Select specific variables and compute statistics by#
# levels of a factor variable (GEN)                             #
#===============================================================#
stats <-
  desc_stat(data_ge2,
            EP, EL, EH, ED, PH, CD,
            by = GEN)
stats
#> # A tibble: 78 × 11
#>    GEN   variable    cv    max   mean median    min sd.amo     se   ci.t n.valid
#>    <fct> <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#>  1 H1    CD        6.44 17.9   15.7   15.7   14.5   1.01   0.292  0.643       12
#>  2 H1    ED        2.66 53.3   51.2   50.8   49.2   1.36   0.393  0.864       12
#>  3 H1    EH       19.5   1.88   1.50   1.56   1.05  0.294  0.0848 0.187       12
#>  4 H1    EL        6.27 16.9   15.1   15.1   13.7   0.947  0.273  0.602       12
#>  5 H1    EP        9.91  0.658  0.570  0.574  0.492 0.0565 0.0163 0.0359      12
#>  6 H1    PH       11.7   3.00   2.62   2.70   2.11  0.307  0.0885 0.195       12
#>  7 H10   CD        6.32 17.5   15.9   15.7   14.4   1.00   0.290  0.638       12
#>  8 H10   ED        7.70 54.1   48.4   47.7   43.7   3.73   1.08   2.37        12
#>  9 H10   EH       23.2   1.71   1.26   1.25   0.888 0.293  0.0845 0.186       12
#> 10 H10   EL        6.83 16.7   15.1   14.9   13.6   1.03   0.298  0.656       12
#> # … with 68 more rows

# To get a 'wide' format with the maximum values for all variables
desc_wider(stats, max)
#> # A tibble: 13 × 7
#>    GEN      CD    ED    EH    EL    EP    PH
#>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 H1     17.9  53.3  1.88  16.9 0.658  3.00
#>  2 H10    17.5  54.1  1.71  16.7 0.660  2.83
#>  3 H11    18.0  52.3  1.67  17.4 0.600  2.77
#>  4 H12    16.2  52.7  1.58  15.7 0.616  2.79
#>  5 H13    17.8  54.0  1.77  16.3 0.615  2.93
#>  6 H2     17.0  53.6  1.87  16.1 0.615  3.03
#>  7 H3     18.0  52.2  1.80  17.6 0.640  3.04
#>  8 H4     17.7  52.8  1.82  16.8 0.617  3.02
#>  9 H5     17.4  52.7  1.76  16.6 0.632  2.90
#> 10 H6     18.3  54.9  1.69  17.9 0.631  2.94
#> 11 H7     18.6  52.1  1.67  17.5 0.617  2.87
#> 12 H8     18.4  53.3  1.57  17.7 0.585  2.76
#> 13 H9     18.1  53.6  1.71  17.5 0.630  3.00

#===============================================================#
# Example 5: Compute the main statistics for all numeric        #
# variables by two or more factors. Note that group_by() was    #
# used to pass grouped data to the function desc_stat()         #
#===============================================================#

data_ge2 %>%
  group_by(ENV, GEN) %>%
  desc_stat()
#> # A tibble: 780 × 12
#>    ENV   GEN   variable    cv     max    mean  median     min  sd.amo      se
#>    <fct> <fct> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 A1    H1    CD        6.91  16.4    15.7    16.3    14.5    1.09    0.627 
#>  2 A1    H1    CDED      2.04   0.561   0.550   0.551   0.538  0.0112  0.0065
#>  3 A1    H1    CL        1.48  28.4    28.1    28.1    27.6    0.415   0.239 
#>  4 A1    H1    CW        7.93  25.1    23.5    24.0    21.4    1.86    1.08  
#>  5 A1    H1    ED        1.98  52.2    51.1    50.7    50.3    1.01    0.583 
#>  6 A1    H1    EH        5.36   1.76    1.68    1.71    1.58   0.0902  0.0521
#>  7 A1    H1    EL        7.15  16.1    15.4    16.0    14.2    1.10    0.637 
#>  8 A1    H1    EP        5.34   0.658   0.626   0.628   0.591  0.0334  0.0193
#>  9 A1    H1    KW        8.31 217.    203.    208.    184.    16.8     9.72  
#> 10 A1    H1    NKE       6.80 565.    527.    521.    494.    35.8    20.7   
#> # … with 770 more rows, and 2 more variables: ci.t <dbl>, n.valid <dbl>

# }