Subsampling imbalanced data using up-sampling, down-sampling, or SMOTE.
subsample(
  data,
  class,
  sampling = c("none", "up", "down", "smote"),
  seed_samp = NULL
)
data: data frame with rows as samples and columns as features
class: true/reference class vector used for supervised learning
sampling: subsampling method. The default, "none", performs no subsampling. Other options are "up" (up-sampling the minority class), "down" (down-sampling the majority class), and "smote" (generating synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set.
seed_samp: random seed used for reproducibility when subsampling training sets for model generation
A subsampled dataset where corresponding strata of class are more balanced. The resulting class variable is not included in the data output.
To deal with class imbalances, we can subsample the data so that the class proportions are more uniform.
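Before choosing a sampling method, it can help to see how unbalanced the classes are. The following is a minimal sketch in base R only; the variable names `iris_imbal` and `class_counts` are illustrative, and the subset is the first 130 rows of iris.

```r
# Build an imbalanced subset: the first 130 rows of iris drop 20 virginica samples
iris_imbal <- iris[1:130, ]

# Count the samples per class: setosa 50, versicolor 50, virginica 30
class_counts <- table(iris_imbal$Species)
class_counts

# Express the same counts as proportions of the whole data set
prop.table(class_counts)
```

Here virginica is the minority class, so it is the one that up-sampling would grow, and the size that down-sampling would shrink the other classes toward.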
# Create imbalanced version of iris dataset
iris_imbal <- iris[1:130, ]
# Up-sampling
iris_up <- subsample(iris_imbal, iris_imbal$Species, sampling = "up")
nrow(iris_up)
#> [1] 150
# Down-sampling
iris_down <- subsample(iris_imbal, iris_imbal$Species, sampling = "down")
nrow(iris_down)
#> [1] 90
# SMOTE
iris_smote <- subsample(iris_imbal, iris_imbal$Species, sampling = "smote")
nrow(iris_smote)
#> [1] 101
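The row counts for "up" and "down" follow from equalising every class to the largest or smallest class size. The sketch below illustrates that idea in base R. It is not the implementation used by `subsample()`: `resample_to()` is a hypothetical helper written for this example, and unlike the function documented here it keeps the class column in its output.

```r
# Hypothetical helper (not part of the package): resample every class to `target` rows
resample_to <- function(data, class, target, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Row indices grouped by class; factor() drops any unused levels
  idx_by_class <- split(seq_len(nrow(data)), factor(class))
  picked <- lapply(idx_by_class, function(idx) {
    if (length(idx) < target) {
      # Up-sampling: keep all original rows and add draws with replacement
      extra <- idx[sample.int(length(idx), target - length(idx), replace = TRUE)]
      c(idx, extra)
    } else {
      # Down-sampling: draw `target` rows without replacement
      idx[sample.int(length(idx), target)]
    }
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

iris_imbal <- iris[1:130, ]
counts <- table(iris_imbal$Species)

# Grow each class to the majority size (50), or shrink each to the minority size (30)
up <- resample_to(iris_imbal, iris_imbal$Species, max(counts), seed = 1)
down <- resample_to(iris_imbal, iris_imbal$Species, min(counts), seed = 1)

nrow(up)    # 3 classes x 50 rows = 150
nrow(down)  # 3 classes x 30 rows = 90
```

SMOTE is not sketched here because it generates synthetic minority points rather than reusing existing rows, so its output size depends on the implementation's parameters.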