Sometimes discrete features have sparse categories. This function groups the sparse categories of a discrete feature based on a given cumulative-frequency threshold.
Usage
group_category(
data,
feature,
threshold,
measure,
update = FALSE,
category_name = "OTHER",
exclude = NULL
)
Arguments
- data
input data
- feature
name of the discrete feature to be collapsed.
- threshold
the bottom x% of categories to be grouped, expressed as a fraction, e.g., if set to 0.2, categories in the bottom 20% of cumulative frequency will be grouped.
- measure
name of a feature to be used as an alternative measure.
- update
logical, indicating whether the data should be modified. The default is FALSE. Setting it to TRUE will modify the input data.table object directly; otherwise, an object matching the input class will be returned.
- category_name
name of the new category if update is set to TRUE. The default is "OTHER".
- exclude
categories to be excluded from grouping when update is set to TRUE.
Value
If update is set to FALSE, returns the categories that will be retained, i.e., those whose cumulative frequency does not exceed 1 - threshold. If update is set to TRUE, returns the updated data. In both cases, the output class will match the class of the input data.
Details
If a continuous feature is passed to the argument feature, it will be coerced to character class before grouping.
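Because of this coercion, a numeric column can be grouped directly, with each distinct value treated as a category. A minimal sketch of this behavior (the column name `n` and the simulated values are illustrative only):

```r
# Load packages
library(data.table)

# A continuous feature: each distinct rounded value becomes its own category
dt <- data.table("n" = round(rexp(200, 2), 1))

# The numeric column is coerced to character before grouping;
# with the default update = FALSE, this returns the cumulative frequency table
group_category(dt, "n", 0.2)
```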
Examples
# Load packages
library(data.table)
# Generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 500))
# View cumulative frequency without collapsing categories
group_category(data, "a", 0.2)
#> a cnt pct cum_pct
#> 1: 13 41 0.082 0.082
#> 2: 7 40 0.080 0.162
#> 3: 11 39 0.078 0.240
#> 4: 12 39 0.078 0.318
#> 5: 6 36 0.072 0.390
#> 6: 9 36 0.072 0.462
#> 7: 10 35 0.070 0.532
#> 8: 8 32 0.064 0.596
#> 9: 5 28 0.056 0.652
#> 10: 14 27 0.054 0.706
#> 11: 17 21 0.042 0.748
#> 12: 3 19 0.038 0.786
# View cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")
#> a cnt pct cum_pct
#> 1: 7 0.08783473 0.08852054 0.08852054
#> 2: 13 0.08424364 0.08490141 0.17342195
#> 3: 8 0.08401002 0.08466596 0.25808791
#> 4: 9 0.08298358 0.08363151 0.34171942
#> 5: 11 0.07865060 0.07926470 0.42098413
#> 6: 12 0.07650944 0.07710682 0.49809095
#> 7: 6 0.06974521 0.07028978 0.56838072
#> 8: 5 0.06906425 0.06960350 0.63798422
#> 9: 10 0.06108373 0.06156067 0.69954489
#> 10: 17 0.04584915 0.04620714 0.74575203
#> 11: 14 0.04347191 0.04381133 0.78956337
# Group bottom 20% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)
# Exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)
# Return results from non-data.table input
df <- data.frame("a" = as.factor(round(rnorm(50, 10, 5))), "b" = rexp(50, 10))
group_category(df, "a", 0.2)
#> a cnt pct cum_pct
#> 1 10 6 0.12 0.12
#> 2 11 4 0.08 0.20
#> 3 14 4 0.08 0.28
#> 4 6 4 0.08 0.36
#> 5 15 3 0.06 0.42
#> 6 16 3 0.06 0.48
#> 7 13 3 0.06 0.54
#> 8 12 3 0.06 0.60
#> 9 19 2 0.04 0.64
#> 10 9 2 0.04 0.68
#> 11 20 2 0.04 0.72
#> 12 7 2 0.04 0.76
group_category(df, "a", 0.2, measure = "b", update = TRUE)
#> a b
#> 1 OTHER 0.020373305
#> 2 OTHER 0.052507619
#> 3 OTHER 0.081805921
#> 4 19 0.009925348
#> 5 11 0.123021504
#> 6 14 0.144656546
#> 7 16 0.083169172
#> 8 9 0.213948655
#> 9 OTHER 0.080024696
#> 10 10 0.124592741
#> 11 20 0.256929572
#> 12 14 0.003419408
#> 13 10 0.098786739
#> 14 OTHER 0.069425177
#> 15 10 0.001624242
#> 16 19 0.863716054
#> 17 OTHER 0.044421191
#> 18 OTHER 0.020331713
#> 19 9 0.103372486
#> 20 6 0.133517823
#> 21 OTHER 0.047434915
#> 22 OTHER 0.031668132
#> 23 OTHER 0.081896676
#> 24 OTHER 0.006090181
#> 25 14 0.071219310
#> 26 10 0.064981094
#> 27 6 0.285785574
#> 28 16 0.073774583
#> 29 OTHER 0.054916968
#> 30 6 0.005104824
#> 31 OTHER 0.014855876
#> 32 OTHER 0.092099438
#> 33 OTHER 0.070745972
#> 34 16 0.031218843
#> 35 10 0.067868470
#> 36 OTHER 0.066864912
#> 37 10 0.088797072
#> 38 OTHER 0.039481432
#> 39 OTHER 0.021024986
#> 40 18 0.202653392
#> 41 20 0.104911416
#> 42 11 0.077060431
#> 43 6 0.039111814
#> 44 OTHER 0.096057286
#> 45 OTHER 0.028048669
#> 46 14 0.059014897
#> 47 OTHER 0.041757003
#> 48 11 0.289007986
#> 49 OTHER 0.024127883
#> 50 11 0.112046068
group_category(df, "a", 0.2, update = TRUE)
#> a b
#> 1 OTHER 0.020373305
#> 2 15 0.052507619
#> 3 OTHER 0.081805921
#> 4 19 0.009925348
#> 5 11 0.123021504
#> 6 14 0.144656546
#> 7 16 0.083169172
#> 8 9 0.213948655
#> 9 13 0.080024696
#> 10 10 0.124592741
#> 11 20 0.256929572
#> 12 14 0.003419408
#> 13 10 0.098786739
#> 14 15 0.069425177
#> 15 10 0.001624242
#> 16 19 0.863716054
#> 17 13 0.044421191
#> 18 7 0.020331713
#> 19 9 0.103372486
#> 20 6 0.133517823
#> 21 OTHER 0.047434915
#> 22 13 0.031668132
#> 23 7 0.081896676
#> 24 12 0.006090181
#> 25 14 0.071219310
#> 26 10 0.064981094
#> 27 6 0.285785574
#> 28 16 0.073774583
#> 29 12 0.054916968
#> 30 6 0.005104824
#> 31 OTHER 0.014855876
#> 32 OTHER 0.092099438
#> 33 OTHER 0.070745972
#> 34 16 0.031218843
#> 35 10 0.067868470
#> 36 OTHER 0.066864912
#> 37 10 0.088797072
#> 38 OTHER 0.039481432
#> 39 12 0.021024986
#> 40 OTHER 0.202653392
#> 41 20 0.104911416
#> 42 11 0.077060431
#> 43 6 0.039111814
#> 44 OTHER 0.096057286
#> 45 15 0.028048669
#> 46 14 0.059014897
#> 47 OTHER 0.041757003
#> 48 11 0.289007986
#> 49 OTHER 0.024127883
#> 50 11 0.112046068