Subsampling imbalanced data using up-sampling, down-sampling, or SMOTE.

Usage
subsample(
  data,
  class,
  sampling = c("none", "up", "down", "smote"),
  seed_samp = NULL
)

Arguments

data

data frame with rows as samples, columns as features

class

true/reference class vector used for supervised learning

sampling

subsampling method. The default is "none", in which case no subsampling is performed. Other options are "up" (up-sampling the minority class), "down" (down-sampling the majority class), and "smote" (generating synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set.

seed_samp

random seed used for reproducibility in subsampling training sets for model generation

Value

A subsampled dataset where corresponding strata of class are more balanced. The resulting class variable is not included in the data output.

Details

To deal with class imbalances, we can subsample the data so that the class proportions are more uniform.
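
The idea behind up- and down-sampling can be sketched in base R. This is only an illustration under stated assumptions, not the implementation used by subsample(): the helper rebalance_idx() is hypothetical, and SMOTE is not covered because it requires generating synthetic points rather than resampling existing rows.

```r
# Hypothetical helper (not part of the package): return row indices that
# balance the classes by resampling within each class.
rebalance_idx <- function(class, how = c("up", "down")) {
  how <- match.arg(how)
  # Row indices per class; factor() drops unused levels
  idx <- split(seq_along(class), factor(class))
  n <- lengths(idx)
  unlist(lapply(idx, function(i) {
    if (how == "up") {
      # Keep all rows, then add draws with replacement up to the largest class
      extra <- max(n) - length(i)
      c(i, i[sample.int(length(i), extra, replace = TRUE)])
    } else {
      # Keep a random subset the size of the smallest class
      i[sample.int(length(i), min(n))]
    }
  }), use.names = FALSE)
}

iris_imbal <- iris[1:130, ]  # 50 setosa, 50 versicolor, 30 virginica

set.seed(1)  # makes the random draws reproducible
up_idx <- rebalance_idx(iris_imbal$Species, "up")
down_idx <- rebalance_idx(iris_imbal$Species, "down")

length(up_idx)    # 150: every class raised to 50 rows
length(down_idx)  # 90: every class reduced to 30 rows
```

Up-sampling keeps every original row and repeats minority rows, so no information is discarded but duplicates appear; down-sampling discards majority rows, so the result is smaller but contains no repeats.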

Author

Derek Chiu

Examples

# Create an imbalanced version of the iris dataset
# (50 setosa, 50 versicolor, 30 virginica)
iris_imbal <- iris[1:130, ]

# Up-sampling
iris_up <- subsample(iris_imbal, iris_imbal$Species, sampling = "up")
nrow(iris_up)
#> [1] 150

# Down-sampling
iris_down <- subsample(iris_imbal, iris_imbal$Species, sampling = "down")
nrow(iris_down)
#> [1] 90

# SMOTE
iris_smote <- subsample(iris_imbal, iris_imbal$Species, sampling = "smote")
nrow(iris_smote)
#> [1] 101