Data dummification is also known as one-hot encoding or feature binarization. It turns each category into a distinct column with binary (numeric) values.
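To make the idea concrete, here is a minimal base R sketch of the encoding on a toy data frame. The data and the `size_` column prefix are made up for illustration; this is not the function's actual implementation.

```r
# Toy input: one discrete feature with three categories (hypothetical data).
df <- data.frame(size = factor(c("S", "M", "L", "M")))

# Build one indicator column per category:
# 1L where the row belongs to that level, 0L otherwise.
lv <- levels(df$size)
dummies <- as.data.frame(
  lapply(setNames(lv, paste0("size_", lv)), function(l) as.integer(df$size == l))
)

print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column of the category that row belongs to.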
Arguments
- data
input data
- maxcat
maximum categories allowed for each discrete feature. Default is 50.
- select
names of selected features to be dummified. Default is NULL.
Value
dummified dataset. Only discrete features are dummified; all other features are preserved as they are. However, column order might differ from the input.
Details
Continuous features will be ignored if added in select.
Features listed in select will be ignored if their number of categories exceeds maxcat.
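The two rules above can be sketched in base R as a filter over column names. The helper `eligible_features` is hypothetical and not part of the package, and treating factor and character columns as "discrete" is an assumption made for this sketch.

```r
# Hypothetical helper (not part of the package): returns the names of the
# features that the two rules above would leave eligible for dummification.
eligible_features <- function(data, maxcat = 50L, select = NULL) {
  # Assumption: factor and character columns count as discrete features.
  is_discrete <- vapply(
    data, function(x) is.factor(x) || is.character(x), logical(1)
  )
  # Number of distinct categories observed in each column.
  n_cat <- vapply(data, function(x) length(unique(x)), integer(1))
  # Rule 2: drop discrete features with more than maxcat categories.
  keep <- is_discrete & n_cat <= maxcat
  # Rule 1: select can only narrow the discrete set; continuous names are ignored.
  if (!is.null(select)) keep <- keep & names(data) %in% select
  names(data)[keep]
}

toy <- data.frame(
  num  = c(1.5, 2.5, 3.5, 4.5),
  few  = factor(c("a", "b", "a", "b")),
  many = factor(c("w", "x", "y", "z"))
)

# "num" is continuous and "many" has 4 > 3 categories, so only "few" remains.
result <- eligible_features(toy, maxcat = 3, select = c("num", "few", "many"))
print(result)
```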
Note
This is different from model.matrix, which aims to create a full-rank matrix for regression-like use cases. If your intention is to create a design matrix, use model.matrix instead.
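A small base R comparison may help illustrate the difference. It assumes R's default treatment contrasts and uses a made-up factor `grp`; the one-hot matrix is built by hand here rather than with this function.

```r
# Toy factor with three levels (hypothetical data).
df <- data.frame(grp = factor(c("a", "b", "c", "a")))

# Design matrix with an intercept: the first level "a" becomes the baseline,
# so only two indicator columns are produced alongside the intercept.
mm <- model.matrix(~ grp, data = df)
print(colnames(mm))
# "(Intercept)" "grpb" "grpc"

# One indicator column per level, with no baseline dropped.
onehot <- sapply(levels(df$grp), function(l) as.integer(df$grp == l))
print(colnames(onehot))
# "a" "b" "c"

# Adding an intercept to the one-hot columns gives 4 columns of rank 3,
# which is the rank deficiency that model.matrix avoids.
print(qr(cbind(1, onehot))$rank)
```

The one-hot columns always sum to 1 across each row, so they are collinear with an intercept term.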
Examples
## Dummify iris dataset
str(dummify(iris))
#> 'data.frame': 150 obs. of 7 variables:
#> $ Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species_setosa : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Species_versicolor: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Species_virginica : int 0 0 0 0 0 0 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>
## Dummify diamonds dataset ignoring features with more than 5 categories
data("diamonds", package = "ggplot2")
str(dummify(diamonds, maxcat = 5))
#> 2 features with more than 5 categories ignored!
#> color: 7 categories
#> clarity: 8 categories
#> tibble [53,940 × 14] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> - attr(*, ".internal.selfref")=<externalptr>
## Dummify only selected features of the diamonds dataset
str(dummify(diamonds, select = c("cut", "color")))
#> tibble [53,940 × 20] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> $ color_D : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_E : int [1:53940] 1 1 1 0 0 0 0 0 1 0 ...
#> $ color_F : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_G : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_H : int [1:53940] 0 0 0 0 0 0 0 1 0 1 ...
#> $ color_I : int [1:53940] 0 0 0 1 0 0 1 0 0 0 ...
#> $ color_J : int [1:53940] 0 0 0 0 1 1 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>