The following functions quickly compute descriptive statistics by levels of a factor or a combination of factors.
- cv_by()
For computing the coefficient of variation.
- max_by()
For computing maximum values.
- mean_by()
For computing arithmetic means.
- min_by()
For computing minimum values.
- n_by()
For getting the length.
- sd_by()
For computing the sample standard deviation.
- var_by()
For computing the sample variance.
- sem_by()
For computing the standard error of the mean.
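For instance, assuming the data_ge2 example data shipped with metan (also used in the Examples below), a minimal sketch of the *_by() interface would be (output omitted):

library(metan)
# sample standard deviation of all numeric variables, by ENV
sd_by(data_ge2, ENV)
# number of observations for each GEN within ENV, pipe-friendly
data_ge2 %>% n_by(GEN, ENV)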
Useful functions for descriptive statistics. All of them work naturally with %>%, handle grouped data, and accept multiple variables (all numeric variables from .data by default).
- av_dev()
computes the average absolute deviation.
- ci_mean_t()
computes the t-interval for the mean.
- ci_mean_z()
computes the z-interval for the mean.
- cv()
computes the coefficient of variation.
- freq_table()
computes a frequency table for either numeric or categorical/discrete data. For numeric data, it is possible to define the number of classes to be generated.
- hmean(), gmean()
compute the harmonic and geometric means, respectively. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. The geometric mean is the nth root of the product of the n values (see the sketch after this list).
- kurt()
computes the kurtosis as used in SAS and SPSS.
- range_data()
computes the range of the values.
- n_valid()
the valid (not NA) length of the data.
- n_unique()
number of unique values.
- n_missing()
number of missing values.
- row_col_mean(), row_col_sum()
add a row with the mean/sum of each variable and a column with the mean/sum of each row of the data.
- sd_amo(), sd_pop()
compute the sample and population standard deviation, respectively.
- sem()
computes the standard error of the mean.
- skew()
computes the skewness as used in SAS and SPSS.
- ave_dev()
computes the average of the absolute deviations.
- sum_dev()
computes the sum of the absolute deviations.
- sum_sq()
computes the sum of the squared values.
- sum_sq_dev()
computes the sum of the squared deviations.
- var_amo(), var_pop()
compute the sample and population variance, respectively.
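As noted in the hmean()/gmean() entry above, the textbook definitions can be reproduced in base R; the vector x below is purely illustrative and not part of the package:

x <- c(2, 4, 8)
length(x) / sum(1 / x)   # harmonic mean: reciprocal of the mean of the reciprocals
prod(x)^(1 / length(x))  # geometric mean: nth root of the product of n values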
desc_stat() is a wrapper around the functions above and can be used to quickly compute all these statistics at once.
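A minimal usage sketch of desc_stat() follows; passing variable names in ... mirrors the other functions on this page, but check ?desc_stat for the exact interface:

library(metan)
desc_stat(data_ge2)          # all statistics for all numeric variables
desc_stat(data_ge2, EH, PH)  # restricted to the EH and PH variables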
Usage
av_dev(.data, ..., na.rm = FALSE)
ci_mean_t(.data, ..., na.rm = FALSE, level = 0.95)
ci_mean_z(.data, ..., na.rm = FALSE, level = 0.95)
cv(.data, ..., na.rm = FALSE)
freq_table(.data, var, k = NULL, digits = 3)
freq_hist(
table,
xlab = NULL,
ylab = NULL,
fill = "gray",
color = "black",
ygrid = TRUE
)
hmean(.data, ..., na.rm = FALSE)
gmean(.data, ..., na.rm = FALSE)
kurt(.data, ..., na.rm = FALSE)
n_missing(.data, ..., na.rm = FALSE)
n_unique(.data, ..., na.rm = FALSE)
n_valid(.data, ..., na.rm = FALSE)
pseudo_sigma(.data, ..., na.rm = FALSE)
range_data(.data, ..., na.rm = FALSE)
row_col_mean(.data, na.rm = FALSE)
row_col_sum(.data, na.rm = FALSE)
sd_amo(.data, ..., na.rm = FALSE)
sd_pop(.data, ..., na.rm = FALSE)
sem(.data, ..., na.rm = FALSE)
skew(.data, ..., na.rm = FALSE)
sum_dev(.data, ..., na.rm = FALSE)
ave_dev(.data, ..., na.rm = FALSE)
sum_sq_dev(.data, ..., na.rm = FALSE)
sum_sq(.data, ..., na.rm = FALSE)
var_pop(.data, ..., na.rm = FALSE)
var_amo(.data, ..., na.rm = FALSE)
cv_by(.data, ..., .vars = NULL, na.rm = FALSE)
max_by(.data, ..., .vars = NULL, na.rm = FALSE)
min_by(.data, ..., .vars = NULL, na.rm = FALSE)
means_by(.data, ..., .vars = NULL, na.rm = FALSE)
mean_by(.data, ..., .vars = NULL, na.rm = FALSE)
n_by(.data, ..., .vars = NULL, na.rm = FALSE)
sd_by(.data, ..., .vars = NULL, na.rm = FALSE)
var_by(.data, ..., .vars = NULL, na.rm = FALSE)
sem_by(.data, ..., .vars = NULL, na.rm = FALSE)
sum_by(.data, ..., .vars = NULL, na.rm = FALSE)
Arguments
- .data
A data frame or a numeric vector.
- ...
The argument depends on the function used. For the *_by functions, ... is one or more categorical variables for grouping the data; the required statistic will then be computed for all numeric variables in the data. If no variables are informed in ..., the statistic will be computed ignoring all non-numeric variables in .data. For the other statistics, ... is a comma-separated list of unquoted variable names for which to compute the statistics. If no variables are informed in ..., the statistic will be computed for all numeric variables in .data.
- na.rm
If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
- level
The confidence level for the confidence interval of the mean. Defaults to 0.95.
- var
The variable for which to compute the frequency table. See Details for more information.
- k
The number of classes to be created. See Details for more information.
- digits
The number of significant figures to show. Defaults to 3.
- table
A frequency table computed with freq_table().
- xlab, ylab
The x and y axis labels.
- fill, color
The color used to fill the bars and the color of the bar borders, respectively.
- ygrid
Should a grid line be shown on the y axis? Defaults to TRUE.
- .vars
Used to select variables in the *_by() functions. One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables. Defaults to NULL (all numeric variables are analyzed). See the sketch after this list.
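As referenced in the .vars entry, a sketch of how grouping variables in ... combine with a .vars range selection; the PH:EP range assumes the column order shown in the Examples below:

# means by ENV, restricted to the variables from PH to EP
mean_by(data_ge2, ENV, .vars = PH:EP)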
Value
The *_by() functions return a tbl_df with the computed statistics for each level of the factor(s) declared in the ... argument.
All other functions return a named integer if the input is a data frame or a numeric value if the input is a numeric vector.
freq_table()
Returns a list with the frequency table and the breaks used for class definition. These breaks can be used to construct a histogram of the variable.
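A minimal sketch of that workflow, using freq_hist() from the Usage section above:

ft <- freq_table(data_ge2, NR)
freq_hist(ft)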
Details
The function freq_table()
computes a frequency table for either
numerical or categorical variables. If a variable is categorical or
discrete (integer values), the number of classes will be the number of
levels that the variable contains.
If a variable (say, data) is continuous, the number of classes (k) is given by
the square root of the number of samples (n) if n <= 100,
or by 5 * log10(n) if n > 100.
The amplitude (\(A\)) of the data is used to define the size of the class (\(c\)), given by
\[c = \frac{A}{k - 1}\]
The lower limit of the first class (LL1) is given by min(data) - c / 2. The upper limit is given by LL1 + c. The limits of the other classes are defined in the same way. After the classes are created, the absolute and relative frequencies within each class are computed.
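To make the class-construction rule concrete, here is a hedged base-R sketch of the breaks for a continuous variable (object names are illustrative and not part of the package; whether metan rounds k up or to the nearest integer is an assumption here):

data <- data_ge2$NR
n <- length(data)
k <- ifelse(n <= 100, round(sqrt(n)), round(5 * log10(n)))  # number of classes
A <- diff(range(data))                                      # amplitude of the data
cl <- A / (k - 1)                                           # class size (c above)
LL1 <- min(data) - cl / 2                                   # lower limit of the first class
breaks <- LL1 + cl * (0:k)                                  # class limits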
Author
Tiago Olivoto tiagoolivoto@gmail.com
Examples
# \donttest{
library(metan)
# means of all numeric variables by ENV
mean_by(data_ge2, GEN, ENV)
#> # A tibble: 52 × 17
#> GEN ENV PH EH EP EL ED CL CD CW KW NR NKR
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 H1 A1 2.72 1.68 0.626 15.4 51.1 28.1 15.7 23.5 203. 16.3 33.3
#> 2 H1 A2 2.93 1.80 0.612 15.0 51.9 31.5 15.5 30.2 188. 17.1 31.4
#> 3 H1 A3 2.20 1.10 0.497 14.8 50.6 30.9 15.8 26.8 157. 15.9 28.4
#> 4 H1 A4 2.64 1.44 0.547 15.2 51.2 29.8 15.8 26.4 187. 17.2 35.7
#> 5 H10 A1 2.78 1.62 0.584 16.1 53.2 31.4 16.8 24.6 192. 16.7 31.2
#> 6 H10 A2 2.05 0.987 0.494 15.5 46.7 26.8 16.3 26.3 160. 14 33.5
#> 7 H10 A3 2.04 1.01 0.503 14.0 43.9 24.8 15.2 12.3 121. 15.3 33.3
#> 8 H10 A4 2.39 1.43 0.600 14.9 50.0 30.7 15.3 28.0 183. 16.4 31.6
#> 9 H11 A1 2.75 1.58 0.574 16.6 48.9 29.0 17.2 23.6 188. 15.2 34.6
#> 10 H11 A2 2.15 1.02 0.475 15.1 47.3 27.2 15.7 24.3 164. 13.7 35
#> # … with 42 more rows, and 4 more variables: CDED <dbl>, PERK <dbl>, TKW <dbl>,
#> # NKE <dbl>
# Coefficient of variation for all numeric variables
# by GEN and ENV
cv_by(data_ge2, GEN, ENV)
#> # A tibble: 52 × 17
#> GEN ENV PH EH EP EL ED CL CD CW KW NR
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 H1 A1 4.93 5.36 5.34 7.15 1.98 1.48 6.91 7.93 8.31 5.12
#> 2 H1 A2 3.46 5.98 2.71 2.04 1.92 1.92 1.98 3.75 5.44 3.58
#> 3 H1 A3 4.14 4.98 1.01 7.00 3.58 2.25 6.43 14.6 15.2 2.91
#> 4 H1 A4 8.66 9.84 1.89 9.88 3.72 5.51 11.4 19.1 17.9 4.65
#> 5 H10 A1 1.97 6.11 5.33 6.46 1.43 2.72 6.78 13.6 7.99 6.04
#> 6 H10 A2 5.31 6.83 12.8 5.29 0.800 1.38 4.89 8.81 9.34 7.56
#> 7 H10 A3 11.9 11.3 2.24 3.23 0.436 5.56 5.40 5.80 5.23 3.98
#> 8 H10 A4 4.38 3.53 8.82 3.11 3.42 3.89 3.09 5.02 6.07 2.44
#> 9 H11 A1 0.988 5.01 4.44 4.94 4.95 5.27 4.72 12.3 13.9 10.5
#> 10 H11 A2 1.43 3.00 4.00 4.84 0.762 2.48 5.02 15.8 5.37 3.36
#> # … with 42 more rows, and 5 more variables: NKR <dbl>, CDED <dbl>, PERK <dbl>,
#> # TKW <dbl>, NKE <dbl>
# Skewness of a numeric vector
set.seed(1)
nvec <- rnorm(200, 10, 1)
skew(nvec)
#> [1] 0.1977769
# Confidence interval 0.95 for the mean
# All numeric variables
# Grouped by levels of ENV
data_ge2 %>%
group_by(ENV) %>%
ci_mean_t()
#> # A tibble: 4 × 16
#> ENV PH EH EP EL ED CL CD CW KW NR NKR
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A1 0.0401 0.0415 0.0142 0.382 0.598 0.725 0.352 1.83 6.31 0.638 0.948
#> 2 A2 0.129 0.113 0.0206 0.442 0.882 0.819 0.436 2.24 11.9 0.467 1.06
#> 3 A3 0.0724 0.0625 0.0153 0.326 0.897 0.781 0.307 1.66 7.03 0.498 1.01
#> 4 A4 0.0539 0.0483 0.0128 0.423 0.700 0.579 0.394 1.48 8.70 0.433 1.20
#> # … with 4 more variables: CDED <dbl>, PERK <dbl>, TKW <dbl>, NKE <dbl>
# standard error of the mean
# Variable PH and EH
sem(data_ge2, PH, EH)
#> # A tibble: 1 × 2
#> PH EH
#> <dbl> <dbl>
#> 1 0.0267 0.0228
# Frequency table for variable NR
data_ge2 %>%
freq_table(NR)
#> $freqs
#> class abs_freq abs_freq_ac rel_freq rel_freq_ac
#> 1 11.96 |--- 12.84 1 1 0.006 0.006
#> 2 12.84 |--- 13.72 10 11 0.064 0.071
#> 3 13.72 |--- 14.6 18 29 0.115 0.186
#> 4 14.6 |--- 15.48 23 52 0.147 0.333
#> 5 15.48 |--- 16.36 31 83 0.199 0.532
#> 6 16.36 |--- 17.24 40 123 0.256 0.788
#> 7 17.24 |--- 18.12 21 144 0.135 0.923
#> 8 18.12 |--- 19 5 149 0.032 0.955
#> 9 19 |--- 19.88 1 150 0.006 0.962
#> 10 19.88 |--- 20.76 5 155 0.032 0.994
#> 11 20.76 |---| 21.64 1 156 0.006 1.000
#> 12 Total 156 156 1.000 1.000
#>
#> $LL
#> [1] 11.96 12.84 13.72 14.60 15.48 16.36 17.24 18.12 19.00 19.88 20.76
#>
#> $UL
#> [1] 12.84 13.72 14.60 15.48 16.36 17.24 18.12 19.00 19.88 20.76 21.64
#>
#> $vartype
#> [1] "continuous"
#>
#> attr(,"class")
#> [1] "freq_table"
# }