Run a multiclass classification algorithm on a given dataset and reference class.
Usage
classification(
data,
class,
algorithms,
rfe = FALSE,
ova = FALSE,
standardize = FALSE,
sampling = c("none", "up", "down", "smote"),
seed_samp = NULL,
sizes = NULL,
trees = 100,
tune = FALSE,
seed_alg = NULL,
convert = FALSE
)
Arguments
- data
data frame with rows as samples, columns as features
- class
true/reference class vector used for supervised learning
- algorithms
character string specifying the algorithm to use for supervised learning. See the Algorithms section for possible options.
- rfe
logical; if TRUE, run Recursive Feature Elimination as a feature selection method for the "lda", "rf", and "svm" algorithms.
- ova
logical; if TRUE, use the One-Vs-All approach for the "knn" algorithm.
- standardize
logical; if TRUE, the training sets are standardized on features to have mean zero and unit variance. The test sets are standardized using the vectors of centers and standard deviations from the corresponding training sets.
- sampling
the default is "none", in which no subsampling is performed. Other options include "up" (Up-sampling the minority class), "down" (Down-sampling the majority class), and "smote" (synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set.
- seed_samp
random seed used for reproducibility in subsampling training sets for model generation
- sizes
the range of feature subset sizes to test in the RFE algorithm
- trees
number of trees to use in "rf"
- tune
logical; if TRUE, algorithms with hyperparameters are tuned.
- seed_alg
random seed used for reproducibility when running algorithms with an intrinsic random element (e.g. random forests)
- convert
logical; if TRUE, converts all categorical variables in data to dummy variables. Certain algorithms only work with numeric inputs (e.g. LDA).
Details
Some of the classification algorithms implemented use pre-defined values that
specify settings and options, while others tune hyperparameters. "multinom"
and "nnet" use a maximum of 2000 weights, in case data is high-dimensional and
classification is time-consuming. "nnet" also tunes the number of nodes (1-5)
in the hidden layer. "pam" considers 100 thresholds when training and uses a
uniform prior. "adaboost_m1" calls adabag::boosting(), which supports
hyperparameter tuning.
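For instance, a minimal sketch of enabling tuning for one of the algorithms that supports it (reusing hgsc and class from the sketch above; tuning may be slow on high-dimensional data):

# AdaBoost.M1 with hyperparameter tuning enabled via tune = TRUE
mod_ada <- classification(hgsc, class, "adaboost_m1", tune = TRUE)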
When alg = "knn", the return value is NULL because class::knn() does not
output an intermediate model object; modelling and prediction are performed in
one step. However, the class attribute "knn" is still assigned to the result
in order to call the respective prediction() method. An additional class "ova"
is added if ova = TRUE.
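A minimal sketch of this behaviour (the class attributes checked here are assumed from the description above; reusing hgsc and class as before):

# No intermediate model object exists for knn, but the result still
# carries the "knn" and "ova" classes so prediction() can dispatch on it
mod_knn <- classification(hgsc, class, "knn", ova = TRUE)
class(mod_knn)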
Algorithms
The classification algorithms currently supported are:
- Prediction Analysis for Microarrays ("pam")
- Support Vector Machines ("svm")
- Random Forests ("rf")
- Linear Discriminant Analysis ("lda")
- Shrinkage Linear Discriminant Analysis ("slda")
- Shrinkage Diagonal Discriminant Analysis ("sdda")
- Multinomial Logistic Regression using
  - Generalized Linear Model with no penalization ("mlr_glm")
  - GLM with LASSO penalty ("mlr_lasso")
  - GLM with ridge penalty ("mlr_ridge")
  - GLM with elastic net penalty ("mlr_enet")
  - Neural Networks ("mlr_nnet")
- Neural Networks ("nnet")
- Naive Bayes ("nbayes")
- AdaBoost.M1 ("adaboost_m1")
- Extreme Gradient Boosting ("xgboost")
- K-Nearest Neighbours ("knn")
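Because the algorithm is selected by its character string, several of the supported algorithms can be compared by looping over their names. A hedged sketch, with an arbitrary subset of the names listed above and hgsc and class as in the Examples:

algs <- c("pam", "rf", "mlr_lasso")
# Fit one model per algorithm; the list is named for easy lookup
models <- lapply(setNames(algs, algs), function(a) classification(hgsc, class, a))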
Examples
data(hgsc)
class <- attr(hgsc, "class.true")
classification(hgsc, class, "xgboost")
#> ##### xgb.Booster
#> raw: 18.6 Kb
#> call:
#> xgboost::xgb.train(params = list(objective = "multi:softprob",
#> eval_metric = "mlogloss", num_class = nlevels(class)), data = xgboost::xgb.DMatrix(data = as.matrix(data),
#> label = as.integer(class) - 1), nrounds = 2)
#> params (as set within xgb.train):
#> objective = "multi:softprob", eval_metric = "mlogloss", num_class = "4", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> # of features: 321
#> niter: 2
#> nfeatures : 321