Supervised learning classification algorithms are performed on bootstrap replicates and an ensemble classifier is built and evaluated across these variants.
Usage
splendid(
data,
class,
algorithms = NULL,
n = 1,
seed_boot = NULL,
seed_samp = NULL,
seed_alg = NULL,
convert = FALSE,
rfe = FALSE,
ova = FALSE,
standardize = FALSE,
sampling = c("none", "up", "down", "smote"),
stratify = FALSE,
plus = TRUE,
threshold = 0,
trees = 100,
tune = FALSE,
vi = FALSE,
top = 3,
seed_rank = 1,
sequential = FALSE
)
Arguments
- data
data frame with rows as samples, columns as features
- class
true/reference class vector used for supervised learning
- algorithms
character vector of algorithms to use for supervised learning. See Algorithms section for possible options. By default, this argument is
NULL
, in which case all algorithms are used.- n
number of bootstrap replicates to generate
- seed_boot
random seed used for reproducibility in bootstrapping training sets for model generation
- seed_samp
random seed used for reproducibility in subsampling training sets for model generation
- seed_alg
random seed used for reproducibility when running algorithms with an intrinsic random element (random forests)
- convert
logical; if
TRUE
, converts all categorical variables indata
to dummy variables. Certain algorithms only work with such limitations (e.g. LDA).- rfe
logical; if
TRUE
, run Recursive Feature Elimination as a feature selection method for "lda", "rf", and "svm" algorithms.- ova
logical; if
TRUE
, a One-Vs-All classification approach is performed for every algorithm inalgorithms
. The relevant results are prefixed with the stringova_
.- standardize
logical; if
TRUE
, the training sets are standardized on features to have mean zero and unit variance. The test sets are standardized using the vectors of centers and standard deviations used in corresponding training sets.- sampling
the default is "none", in which no subsampling is performed. Other options include "up" (Up-sampling the minority class), "down" (Down-sampling the majority class), and "smote" (synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set.
- stratify
logical; if
TRUE
, the bootstrap resampling is performed within each strata ofclass
to ensure the bootstrap sample contains the same proportions of each strata as the original data.- plus
logical; if
TRUE
(default), the .632+ estimator is calculated. Otherwise, the .632 estimator is calculated.- threshold
a number between 0 and 1 indicating the lowest maximum class probability below which a sample will be unclassified.
- trees
number of trees to use in "rf"
- tune
logical; if
TRUE
, algorithms with hyperparameters are tuned- vi
logical; if
TRUE
, model-based variable importance scores are returned for each algorithm if available. Otherwise, SHAP-based VI scores are calculated.- top
the number of highest-performing algorithms to retain for ensemble
- seed_rank
random seed used for reproducibility in rank aggregation of ensemble algorithms
- sequential
logical; if
TRUE
, a sequential model is fit on the algorithms that had the best performance with one-vs-all classification.
Value
A nested list with five elements
models
: A list with an element for each algorithm, each of which is a list with lengthn
. Shows the model object for each algorithm and bootstrap replicate on the training set.preds
: A list with an element for each algorithm, each of which is a list with lengthn
. Shows the predicted classes for each algorithm and bootstrap replicate on the test set.evals
: For each bootstrap sample, we can calculate various evaluation measures for the predicted classes from each algorithm. Evaluation measures include macro-averaged precision/recall/F1-score, micro-averaged precision, and (micro-averaged MCC) The return value ofeval
is a tibble that shows some summary statistics (e.g. mean, median) of the evaluation measures across bootstrap samples, for each classification algorithm.bests
: best-performing algorithm for each bootstrapped replicate of the data, chosen by rank aggregation.ensemble_algs
: tallies the frequencies inbests
, returning the top algorithms chosen.ensemble
: list of model fits for each of the algorithms inensemble_algs
, fit on the full data.
Details
Training sets are bootstrap replicates of the original data sampled with replacement. Test sets comprise of all remaining samples left out from each training set, also called Out-Of-Bag samples. This framework uses the 0.632 bootstrap rule for large n.
An ensemble classifier is constructed using Rank Aggregation across multiple evaluation measures such as precision, recall, F1-score, and Matthew's Correlation Coefficient (MCC).
Algorithms
The classification algorithms currently supported are:
Prediction Analysis for Microarrays ("pam")
Support Vector Machines ("svm")
Random Forests ("rf")
Linear Discriminant Analysis ("lda")
Shrinkage Linear Discriminant Analysis ("slda")
Shrinkage Diagonal Discriminant Analysis ("sdda")
Multinomial Logistic Regression using
Generalized Linear Model with no penalization ("mlr_glm")
GLM with LASSO penalty ("mlr_lasso")
GLM with ridge penalty ("mlr_ridge")
GLM with elastic net penalty ("mlr_enet")
Neural Networks ("mlr_nnet")
Neural Networks ("nnet")
Naive Bayes ("nbayes")
AdaBoost.M1 ("adaboost_m1")
Extreme Gradient Boosting ("xgboost")
K-Nearest Neighbours ("knn")