Process the data by converting categorical predictors to dummy variables, standardizing continuous predictors, and applying subsampling techniques.
Usage
splendid_process(
  data,
  class,
  algorithms,
  convert = FALSE,
  standardize = FALSE,
  sampling = c("none", "up", "down", "smote"),
  seed_samp = NULL
)
Arguments
- data
data frame with rows as samples, columns as features
- class
true/reference class vector used for supervised learning
- algorithms
character vector of algorithms to use for supervised learning. See Algorithms section for possible options. By default, this argument is NULL, in which case all algorithms are used.
- convert
logical; if TRUE, converts all categorical variables in data to dummy variables. Certain algorithms (e.g. LDA) only work with data converted this way.
- standardize
logical; if TRUE, the training sets are standardized on features to have mean zero and unit variance. The test sets are standardized using the vectors of centers and standard deviations used in corresponding training sets.
- sampling
the default is "none", in which case no subsampling is performed. Other options include "up" (up-sampling the minority class), "down" (down-sampling the majority class), and "smote" (generating synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set.
- seed_samp
random seed used for reproducibility in subsampling training sets for model generation
Details
If all the variables in the original data are already continuous, nothing is
done. Otherwise, conversion is performed if convert = TRUE using dummify().
An error message is thrown if there are categorical variables and
convert = FALSE, indicating exactly which algorithms specified require data
conversion. Classification algorithms LDA and the MLR family have such a
limitation.
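As a rough illustration of what conversion to dummy variables produces, the following base R sketch uses model.matrix(). It is not the implementation of dummify(); the toy data frame df and its columns are hypothetical, and the sketch only assumes that a factor with k levels becomes k - 1 indicator columns, which matches the Speciesversicolor and Speciesvirginica columns shown in the Examples.

```r
# Hedged sketch: base R only, not the package's dummify() implementation
df <- data.frame(
  x = c(1.5, 2.0, 3.5, 4.0),
  g = factor(c("a", "b", "c", "a"))
)

# model.matrix() expands each factor into 0/1 indicator columns
# (one per non-reference level); drop the intercept column
mm <- model.matrix(~ ., data = df)[, -1, drop = FALSE]
dummies <- as.data.frame(mm)

colnames(dummies)
```

The first level of g ("a") acts as the reference level, so rows with g == "a" have zeros in both indicator columns.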
Continuous predictors can be scaled to have zero mean and unit variance with
standardize = TRUE. Dummy variables coded to 0 or 1 are never standardized.
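The standardization rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the train and test data frames are hypothetical, and a column is treated as a dummy variable when all of its values are 0 or 1. Centers and standard deviations are computed on the training set and reused for the test set, as described for the standardize argument.

```r
# Hedged sketch: base R only; the train/test data and column names are illustrative
train <- data.frame(x = c(1, 2, 3, 4), d = c(0, 1, 0, 1))
test <- data.frame(x = c(2, 6), d = c(1, 0))

# Treat columns whose values are all 0 or 1 as dummy variables and skip them
is_dummy <- vapply(train, function(col) all(col %in% c(0, 1)), logical(1))
cont <- names(train)[!is_dummy]

# Centers and standard deviations come from the training set only
centers <- vapply(train[cont], mean, numeric(1))
sds <- vapply(train[cont], sd, numeric(1))

train_std <- train
test_std <- test
train_std[cont] <- scale(train[cont], center = centers, scale = sds)
test_std[cont] <- scale(test[cont], center = centers, scale = sds)
```

After this, train_std$x has mean zero and unit variance, while test_std$x is expressed on the training scale and generally does not.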
Subsampling techniques can be applied with sampling methods passed to
subsample().
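To make the "up" and "down" options concrete, here is a minimal base R sketch of both on a hypothetical class vector. It is not the implementation of subsample(), and "smote" is left out because generating synthetic points needs more than index resampling. The call to set.seed() plays the role that seed_samp is described as having.

```r
# Hedged sketch: base R only, not the package's subsample() implementation
set.seed(1)  # stands in for seed_samp
cl <- factor(c(rep("a", 6), rep("b", 3)))
idx <- split(seq_along(cl), cl)
n_min <- min(lengths(idx))
n_max <- max(lengths(idx))

# Down-sampling: keep n_min rows per class, drawn without replacement
down_idx <- unlist(lapply(idx, function(i) i[sample.int(length(i), n_min)]))

# Up-sampling: keep every row, then draw extra rows with replacement up to n_max
up_idx <- unlist(lapply(idx, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}))

table(cl[down_idx])
table(cl[up_idx])
```

Both index vectors give balanced classes: three rows per class after down-sampling and six per class after up-sampling.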
Examples
data(hgsc)
cl <- attr(hgsc, "class.true")
# Nothing happens to the data if it is all continuous; the return value is a
# list, so identical() against the original input is FALSE
data_same <- splendid_process(hgsc, class = cl, algorithms = "lda", convert =
TRUE)
identical(hgsc, data_same)
#> [1] FALSE
# Dummy variables created if there are categorical variables
data_dummy <- splendid_process(iris, class = iris$Species, algorithms =
"lda", convert = TRUE)
head(data_dummy)
#> $data
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Speciesversicolor
#> 1 5.1 3.5 1.4 0.2 0
#> 2 4.9 3.0 1.4 0.2 0
#> 3 4.7 3.2 1.3 0.2 0
#> 4 4.6 3.1 1.5 0.2 0
#> 5 5.0 3.6 1.4 0.2 0
#> 6 5.4 3.9 1.7 0.4 0
#> 7 4.6 3.4 1.4 0.3 0
#> 8 5.0 3.4 1.5 0.2 0
#> 9 4.4 2.9 1.4 0.2 0
#> 10 4.9 3.1 1.5 0.1 0
#> 11 5.4 3.7 1.5 0.2 0
#> 12 4.8 3.4 1.6 0.2 0
#> 13 4.8 3.0 1.4 0.1 0
#> 14 4.3 3.0 1.1 0.1 0
#> 15 5.8 4.0 1.2 0.2 0
#> 16 5.7 4.4 1.5 0.4 0
#> 17 5.4 3.9 1.3 0.4 0
#> 18 5.1 3.5 1.4 0.3 0
#> 19 5.7 3.8 1.7 0.3 0
#> 20 5.1 3.8 1.5 0.3 0
#> 21 5.4 3.4 1.7 0.2 0
#> 22 5.1 3.7 1.5 0.4 0
#> 23 4.6 3.6 1.0 0.2 0
#> 24 5.1 3.3 1.7 0.5 0
#> 25 4.8 3.4 1.9 0.2 0
#> 26 5.0 3.0 1.6 0.2 0
#> 27 5.0 3.4 1.6 0.4 0
#> 28 5.2 3.5 1.5 0.2 0
#> 29 5.2 3.4 1.4 0.2 0
#> 30 4.7 3.2 1.6 0.2 0
#> 31 4.8 3.1 1.6 0.2 0
#> 32 5.4 3.4 1.5 0.4 0
#> 33 5.2 4.1 1.5 0.1 0
#> 34 5.5 4.2 1.4 0.2 0
#> 35 4.9 3.1 1.5 0.2 0
#> 36 5.0 3.2 1.2 0.2 0
#> 37 5.5 3.5 1.3 0.2 0
#> 38 4.9 3.6 1.4 0.1 0
#> 39 4.4 3.0 1.3 0.2 0
#> 40 5.1 3.4 1.5 0.2 0
#> 41 5.0 3.5 1.3 0.3 0
#> 42 4.5 2.3 1.3 0.3 0
#> 43 4.4 3.2 1.3 0.2 0
#> 44 5.0 3.5 1.6 0.6 0
#> 45 5.1 3.8 1.9 0.4 0
#> 46 4.8 3.0 1.4 0.3 0
#> 47 5.1 3.8 1.6 0.2 0
#> 48 4.6 3.2 1.4 0.2 0
#> 49 5.3 3.7 1.5 0.2 0
#> 50 5.0 3.3 1.4 0.2 0
#> 51 7.0 3.2 4.7 1.4 1
#> 52 6.4 3.2 4.5 1.5 1
#> 53 6.9 3.1 4.9 1.5 1
#> 54 5.5 2.3 4.0 1.3 1
#> 55 6.5 2.8 4.6 1.5 1
#> 56 5.7 2.8 4.5 1.3 1
#> 57 6.3 3.3 4.7 1.6 1
#> 58 4.9 2.4 3.3 1.0 1
#> 59 6.6 2.9 4.6 1.3 1
#> 60 5.2 2.7 3.9 1.4 1
#> 61 5.0 2.0 3.5 1.0 1
#> 62 5.9 3.0 4.2 1.5 1
#> 63 6.0 2.2 4.0 1.0 1
#> 64 6.1 2.9 4.7 1.4 1
#> 65 5.6 2.9 3.6 1.3 1
#> 66 6.7 3.1 4.4 1.4 1
#> 67 5.6 3.0 4.5 1.5 1
#> 68 5.8 2.7 4.1 1.0 1
#> 69 6.2 2.2 4.5 1.5 1
#> 70 5.6 2.5 3.9 1.1 1
#> 71 5.9 3.2 4.8 1.8 1
#> 72 6.1 2.8 4.0 1.3 1
#> 73 6.3 2.5 4.9 1.5 1
#> 74 6.1 2.8 4.7 1.2 1
#> 75 6.4 2.9 4.3 1.3 1
#> 76 6.6 3.0 4.4 1.4 1
#> 77 6.8 2.8 4.8 1.4 1
#> 78 6.7 3.0 5.0 1.7 1
#> 79 6.0 2.9 4.5 1.5 1
#> 80 5.7 2.6 3.5 1.0 1
#> 81 5.5 2.4 3.8 1.1 1
#> 82 5.5 2.4 3.7 1.0 1
#> 83 5.8 2.7 3.9 1.2 1
#> 84 6.0 2.7 5.1 1.6 1
#> 85 5.4 3.0 4.5 1.5 1
#> 86 6.0 3.4 4.5 1.6 1
#> 87 6.7 3.1 4.7 1.5 1
#> 88 6.3 2.3 4.4 1.3 1
#> 89 5.6 3.0 4.1 1.3 1
#> 90 5.5 2.5 4.0 1.3 1
#> 91 5.5 2.6 4.4 1.2 1
#> 92 6.1 3.0 4.6 1.4 1
#> 93 5.8 2.6 4.0 1.2 1
#> 94 5.0 2.3 3.3 1.0 1
#> 95 5.6 2.7 4.2 1.3 1
#> 96 5.7 3.0 4.2 1.2 1
#> 97 5.7 2.9 4.2 1.3 1
#> 98 6.2 2.9 4.3 1.3 1
#> 99 5.1 2.5 3.0 1.1 1
#> 100 5.7 2.8 4.1 1.3 1
#> 101 6.3 3.3 6.0 2.5 0
#> 102 5.8 2.7 5.1 1.9 0
#> 103 7.1 3.0 5.9 2.1 0
#> 104 6.3 2.9 5.6 1.8 0
#> 105 6.5 3.0 5.8 2.2 0
#> 106 7.6 3.0 6.6 2.1 0
#> 107 4.9 2.5 4.5 1.7 0
#> 108 7.3 2.9 6.3 1.8 0
#> 109 6.7 2.5 5.8 1.8 0
#> 110 7.2 3.6 6.1 2.5 0
#> 111 6.5 3.2 5.1 2.0 0
#> 112 6.4 2.7 5.3 1.9 0
#> 113 6.8 3.0 5.5 2.1 0
#> 114 5.7 2.5 5.0 2.0 0
#> 115 5.8 2.8 5.1 2.4 0
#> 116 6.4 3.2 5.3 2.3 0
#> 117 6.5 3.0 5.5 1.8 0
#> 118 7.7 3.8 6.7 2.2 0
#> 119 7.7 2.6 6.9 2.3 0
#> 120 6.0 2.2 5.0 1.5 0
#> 121 6.9 3.2 5.7 2.3 0
#> 122 5.6 2.8 4.9 2.0 0
#> 123 7.7 2.8 6.7 2.0 0
#> 124 6.3 2.7 4.9 1.8 0
#> 125 6.7 3.3 5.7 2.1 0
#> 126 7.2 3.2 6.0 1.8 0
#> 127 6.2 2.8 4.8 1.8 0
#> 128 6.1 3.0 4.9 1.8 0
#> 129 6.4 2.8 5.6 2.1 0
#> 130 7.2 3.0 5.8 1.6 0
#> 131 7.4 2.8 6.1 1.9 0
#> 132 7.9 3.8 6.4 2.0 0
#> 133 6.4 2.8 5.6 2.2 0
#> 134 6.3 2.8 5.1 1.5 0
#> 135 6.1 2.6 5.6 1.4 0
#> 136 7.7 3.0 6.1 2.3 0
#> 137 6.3 3.4 5.6 2.4 0
#> 138 6.4 3.1 5.5 1.8 0
#> 139 6.0 3.0 4.8 1.8 0
#> 140 6.9 3.1 5.4 2.1 0
#> 141 6.7 3.1 5.6 2.4 0
#> 142 6.9 3.1 5.1 2.3 0
#> 143 5.8 2.7 5.1 1.9 0
#> 144 6.8 3.2 5.9 2.3 0
#> 145 6.7 3.3 5.7 2.5 0
#> 146 6.7 3.0 5.2 2.3 0
#> 147 6.3 2.5 5.0 1.9 0
#> 148 6.5 3.0 5.2 2.0 0
#> 149 6.2 3.4 5.4 2.3 0
#> 150 5.9 3.0 5.1 1.8 0
#> Speciesvirginica
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
#> 7 0
#> 8 0
#> 9 0
#> 10 0
#> 11 0
#> 12 0
#> 13 0
#> 14 0
#> 15 0
#> 16 0
#> 17 0
#> 18 0
#> 19 0
#> 20 0
#> 21 0
#> 22 0
#> 23 0
#> 24 0
#> 25 0
#> 26 0
#> 27 0
#> 28 0
#> 29 0
#> 30 0
#> 31 0
#> 32 0
#> 33 0
#> 34 0
#> 35 0
#> 36 0
#> 37 0
#> 38 0
#> 39 0
#> 40 0
#> 41 0
#> 42 0
#> 43 0
#> 44 0
#> 45 0
#> 46 0
#> 47 0
#> 48 0
#> 49 0
#> 50 0
#> 51 0
#> 52 0
#> 53 0
#> 54 0
#> 55 0
#> 56 0
#> 57 0
#> 58 0
#> 59 0
#> 60 0
#> 61 0
#> 62 0
#> 63 0
#> 64 0
#> 65 0
#> 66 0
#> 67 0
#> 68 0
#> 69 0
#> 70 0
#> 71 0
#> 72 0
#> 73 0
#> 74 0
#> 75 0
#> 76 0
#> 77 0
#> 78 0
#> 79 0
#> 80 0
#> 81 0
#> 82 0
#> 83 0
#> 84 0
#> 85 0
#> 86 0
#> 87 0
#> 88 0
#> 89 0
#> 90 0
#> 91 0
#> 92 0
#> 93 0
#> 94 0
#> 95 0
#> 96 0
#> 97 0
#> 98 0
#> 99 0
#> 100 0
#> 101 1
#> 102 1
#> 103 1
#> 104 1
#> 105 1
#> 106 1
#> 107 1
#> 108 1
#> 109 1
#> 110 1
#> 111 1
#> 112 1
#> 113 1
#> 114 1
#> 115 1
#> 116 1
#> 117 1
#> 118 1
#> 119 1
#> 120 1
#> 121 1
#> 122 1
#> 123 1
#> 124 1
#> 125 1
#> 126 1
#> 127 1
#> 128 1
#> 129 1
#> 130 1
#> 131 1
#> 132 1
#> 133 1
#> 134 1
#> 135 1
#> 136 1
#> 137 1
#> 138 1
#> 139 1
#> 140 1
#> 141 1
#> 142 1
#> 143 1
#> 144 1
#> 145 1
#> 146 1
#> 147 1
#> 148 1
#> 149 1
#> 150 1
#>
#> $class
#> [1] setosa setosa setosa setosa setosa setosa
#> [7] setosa setosa setosa setosa setosa setosa
#> [13] setosa setosa setosa setosa setosa setosa
#> [19] setosa setosa setosa setosa setosa setosa
#> [25] setosa setosa setosa setosa setosa setosa
#> [31] setosa setosa setosa setosa setosa setosa
#> [37] setosa setosa setosa setosa setosa setosa
#> [43] setosa setosa setosa setosa setosa setosa
#> [49] setosa setosa versicolor versicolor versicolor versicolor
#> [55] versicolor versicolor versicolor versicolor versicolor versicolor
#> [61] versicolor versicolor versicolor versicolor versicolor versicolor
#> [67] versicolor versicolor versicolor versicolor versicolor versicolor
#> [73] versicolor versicolor versicolor versicolor versicolor versicolor
#> [79] versicolor versicolor versicolor versicolor versicolor versicolor
#> [85] versicolor versicolor versicolor versicolor versicolor versicolor
#> [91] versicolor versicolor versicolor versicolor versicolor versicolor
#> [97] versicolor versicolor versicolor versicolor virginica virginica
#> [103] virginica virginica virginica virginica virginica virginica
#> [109] virginica virginica virginica virginica virginica virginica
#> [115] virginica virginica virginica virginica virginica virginica
#> [121] virginica virginica virginica virginica virginica virginica
#> [127] virginica virginica virginica virginica virginica virginica
#> [133] virginica virginica virginica virginica virginica virginica
#> [139] virginica virginica virginica virginica virginica virginica
#> [145] virginica virginica virginica virginica virginica virginica
#> Levels: setosa versicolor virginica
#>
# Some algorithms are robust to the covariate data structure
data_robust <- splendid_process(iris, class = iris$Species, algorithms =
"rf", convert = FALSE)
identical(iris, data_robust)
#> [1] FALSE
# Standardize and down-sample
iris2 <- iris[1:130, ]
data_scale_down <- splendid_process(iris2, class = iris2$Species, algorithms
= "rf", standardize = TRUE, sampling = "down")
dim(data_scale_down)
#> NULL
# Other algorithms require conversion
if (FALSE) { # \dontrun{
splendid_process(iris, class = iris$Species, algorithms = "lda", convert =
FALSE)
} # }