Introduction

Supervised learning is the branch of statistical learning in which we construct a predictive model for regression or classification. The splendid package focuses on classification with an ensemble framework: several classification algorithms are trained and evaluated across bootstrap replicates of the data, and an ensemble classifier is built from the best-performing algorithms according to evaluation measures. Because no single classifier performs best on every data set, an ensemble approach is often useful. The objective is to use this ensemble to obtain highly accurate predictions in independent data sets for diagnostic identification. In genomic studies, for example, one might use the class labels of cancer subtypes in one cohort to predict the subtypes in an independent cohort.

splendid is currently only available on GitHub (the second line below will be uncommented once the repository becomes public).

# install.packages("devtools")
# devtools::install_github("AlineTalhouk/splendid")
library(splendid)
library(knitr)
data(hgsc)

Overview

The main function of splendid is splendid(), and its usage is intuitive. Its main arguments are:

  • data: your rectangular data in standard form (rows as samples/observations, columns as variables/features)
  • class: the true/reference class labels with length equal to the number of samples
  • algorithms: character vector of classification algorithms
  • n: number of bootstrap replicates to generate as training sets; the corresponding out-of-bag (OOB) samples serve as test sets
  • seed_boot: seed for reproducibility in bootstrapping training sets for model generation
  • seed_samp: seed for reproducibility in subsampling training sets for model generation
  • seed_alg: seed for reproducibility for algorithms with a random training aspect (e.g. random forests)
  • convert: whether to convert categorical predictors to dummy variables
  • rfe: whether to use Recursive Feature Elimination (RFE) for feature selection. Only implemented for certain algorithms.
  • ova: whether to run one-vs-all versions for each algorithm (e.g. class 1 vs. not class 1, class 2 vs. not class 2, etc.)
  • standardize: whether to standardize the training set before classification, and the test set during prediction (using the training set's centering and scaling values)
  • sampling: subsampling method applied to the training sets: up-sampling, down-sampling, or SMOTE (a hybrid method)
  • stratify: whether to use stratified bootstrap sampling to ensure that the proportion of classes in the bootstrap sample is the same as the original data
  • plus: whether to calculate the .632+ or .632 estimator for the log loss error rate
  • threshold: predicted probabilities below this value are classified as “unclassified” instead of imposing one of the labels from class
  • trees: number of trees/iterations to use in random forest/boosting models, respectively
  • tune: whether to tune hyperparameters in pertinent algorithms
  • top: number of top-performing algorithms to keep for the final ensemble
  • seed_rank: random seed used for reproducibility in rank aggregation of ensemble algorithms
  • sequential: whether to run the sequential ensemble
# Reference class labels bundled with the hgsc data
class <- attr(hgsc, "class.true")
table(class)
#> class
#> DIF.C4 IMM.C2 MES.C1 PRO.C5 
#>    135    107    109    138
# Run splendid on three algorithms with two bootstrap replicates
sl_result <- splendid(data = hgsc, class = class, n = 2,
                      algorithms = c("slda", "knn", "svm"), rfe = FALSE)

The resulting object is a list with the following elements:

  • models gives the model fits for each of the algorithms used. The fits are nested by bootstrap replicate.
  • preds has the same hierarchical structure as models, except that its elements are the predicted class labels.
  • evals is a list of data frames showing evaluation metrics for each algorithm. Rows are metrics and columns are bootstrap replicates. More details on this element will be provided later.
  • bests is a vector that shows the best-performing algorithm for each bootstrap replicate of the data, chosen by rank aggregation.
  • ensemble_algs tallies the algorithm frequencies in bests, returning the top (3, by default) algorithms chosen.
  • ensemble_mods is a list of model fits for each of the algorithms in ensemble_algs, fit on the full data.
  • seq_mods and seq_preds hold the sequential ensemble models and predictions; they are NULL here because the sequential ensemble was not run.
str(sl_result, max.level = 2)
#> List of 8
#>  $ models       :List of 3
#>   ..$ slda:List of 2
#>   ..$ knn :List of 2
#>   ..$ svm :List of 2
#>  $ preds        :List of 3
#>   ..$ slda:List of 2
#>   ..$ knn :List of 2
#>   ..$ svm :List of 2
#>  $ evals        :List of 3
#>   ..$ slda:'data.frame': 48 obs. of  2 variables:
#>   .. ..- attr(*, "err_632plus")= num 1.84
#>   ..$ knn :'data.frame': 48 obs. of  2 variables:
#>   .. ..- attr(*, "err_632plus")= num 2.47
#>   ..$ svm :'data.frame': 48 obs. of  2 variables:
#>   .. ..- attr(*, "err_632plus")= num 0.244
#>  $ bests        : Named chr [1:2] "svm" "svm"
#>   ..- attr(*, "names")= chr [1:2] "1" "2"
#>  $ ensemble_algs: chr "svm"
#>  $ ensemble_mods:List of 1
#>   ..$ :List of 29
#>   .. ..- attr(*, "class")= chr "svm"
#>  $ seq_mods     : NULL
#>  $ seq_preds    : NULL

Classification

The first step in the splendid pipeline is classification. Given explanatory data and a reference response of classes, we wish to build a classifier that can accurately predict class representation in a separate validation data set. To avoid overfitting, we split the given data into a training set and a test set. The training set is constructed by taking a random sample, with replacement, of all samples. The test set comprises all samples not included in the corresponding training set, also known as the out-of-bag (OOB) sample. This is a simple bootstrap resampling scheme, which we replicate a sufficient number of times to capture sampling variability.
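
A minimal sketch of a single bootstrap/OOB split (splendid performs this internally; the code here is for illustration only):

set.seed(1)
train_id <- sample(seq_along(class), replace = TRUE)  # bootstrap sample, with replacement
test_id <- which(!seq_along(class) %in% train_id)     # out-of-bag (OOB) samples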

Algorithms

A vast number of classification algorithms exist. Those currently available in splendid are:

  • Prediction Analysis for Microarrays ("pam")
  • Support Vector Machines using a Radial-Basis Kernel ("svm")
  • Random Forests ("rf")
  • Linear Discriminant Analysis ("lda")
  • Shrinkage Linear Discriminant Analysis ("slda") [1]
  • Shrinkage Diagonal Discriminant Analysis ("sdda") [1]
  • Multinomial Logistic Regression using
    • Generalized Linear Model with no penalization ("mlr_glm") [2]
    • GLM with LASSO penalty ("mlr_lasso") [2]
    • GLM with ridge penalty ("mlr_ridge") [2]
    • Neural Networks ("mlr_nnet")
  • Neural Networks with One Hidden Layer ("nnet")
  • Naive Bayes ("nbayes")
  • Adaptive Boosting using Decision Trees ("adaboost")
  • AdaBoost.m1 ("adaboost_m1")
  • Extreme Gradient Boosting using Decision Trees ("xgboost")
  • K-Nearest Neighbours ("knn")

These algorithms are implemented in classification().
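
Continuing the sketch above, a single algorithm can be fit directly, mirroring the call shown later in the Plots section (the choice of "rf" here is arbitrary):

mod_rf <- classification(hgsc[train_id, ], factor(class)[train_id], "rf")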

Hyperparameters and Other Details

Certain algorithms have hyperparameters that must be tuned to select the best model before prediction. We perform a grid search over pre-specified ranges of the hyperparameters and choose the optimal values using caret::train(); an illustrative call is sketched after the list of ranges below.

The ranges for the tuning parameters are:

  • svm
    • sigma: 1 / ncol(data) * 2 ^ (0:4)
    • C: 1, 2, 4, 8, 16
  • rf
    • mtry: 1, 4, 9, 16, 25
  • adaboost_m1
    • mfinal: 1, 2, 3, 4, 5
    • maxdepth: 1, 2, 3, 4, 5
    • coeflearn: “Breiman”, “Freund”, “Zhu”
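
For illustration, tuning the SVM over the grid above might look like the following sketch (not splendid's internal code; the "svmRadial" method and 5-fold cross-validation are assumptions made here):

library(caret)
# Grid matching the svm ranges listed above
svm_grid <- expand.grid(sigma = 1 / ncol(hgsc) * 2 ^ (0:4),
                        C = c(1, 2, 4, 8, 16))
svm_tuned <- train(x = hgsc, y = factor(class), method = "svmRadial",
                   tuneGrid = svm_grid,
                   trControl = trainControl(method = "cv", number = 5))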

We use e1071::tune() for neural network parameters. The ranges are:

  • nnet
    • size: 1, 2, 3, 4, 5
    • decay: 0, 0.125, 0.25, 0.375, 0.5

Some algorithms require additional handling or specific settings before classification:

  • slda and sdda: These shrinkage discriminant analysis algorithms estimate the regularization parameters analytically, without resorting to resampling methods.
  • mlr_nnet and nnet: We increase the maximum number of allowable weights to a sufficiently large number to avoid ill-fitted models.
  • knn: Since K-Nearest Neighbours is a non-parametric method, there is no fitted model object to predict from; predicted classes are obtained directly. To obtain class prediction probabilities, we use a distance matrix between the training and test set samples.
  • pam: a uniform prior is used in the training step.
  • xgboost: the objective function is soft probability, and the evaluation metric is log loss.
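
As a point of reference, the xgboost settings described above correspond to a call along these lines (a sketch only; the "multi:softprob" objective, "mlogloss" metric, and nrounds = 100 are assumptions, not splendid's exact defaults):

library(xgboost)
dtrain <- xgb.DMatrix(as.matrix(hgsc), label = as.integer(factor(class)) - 1)
params <- list(objective = "multi:softprob", eval_metric = "mlogloss",
               num_class = nlevels(factor(class)))
fit <- xgb.train(params = params, data = dtrain, nrounds = 100)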

Feature Selection

We use Recursive Feature Elimination (RFE) [3] on lda, rf, svm, and adaboost_m1 to reduce dimensionality before classification. For svm, feature selection is performed before tuning because of the computational cost; for the other three algorithms it is embedded within the tuning step. Set rfe = TRUE in splendid() to use feature selection.

One limitation of RFE is that an a priori set of feature subset sizes needs to be specified, determining the search space for the algorithm. By default, we set the sizes parameter to every fifth integer from 0 up to one-half of the smallest class size. Recall the class sizes:

table(class)
#> class
#> DIF.C4 IMM.C2 MES.C1 PRO.C5 
#>    135    107    109    138

We tell RFE to search for the best models with 5, 10, …, 50 features. Cross-validation with 2 folds is used in the algorithm.
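
A caret-based sketch of this setup (illustrative only; rfFuncs, caret's random forest helper functions, are used here merely as an example of the per-algorithm fitting functions that RFE requires):

library(caret)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 2)
rfe_fit <- rfe(x = hgsc, y = factor(class),
               sizes = seq(5, 50, by = 5), rfeControl = ctrl)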

Prediction

There is a different prediction method for each classifier, since each is imported from an external package. The prediction() function calls the appropriate method based on the class of the model output from classification(). prediction() also performs some manipulations so that the results all share the same data structure: unnamed factors, with factor levels given in the same order as those of the true class labels.

The method for "pam" is an exception: the output from prediction() contains not only the predicted class labels, but also delta, the cross-validated threshold value calculated from the training set and used for prediction on the test set.

str(sl_result$preds, max.level = 2, list.len = 2)
#> List of 3
#>  $ slda:List of 2
#>   ..$ : Factor w/ 4 levels "DIF.C4","IMM.C2",..: 3 2 4 1 2 2 1 2 1 1 ...
#>   .. ..- attr(*, "prob")= num [1:184, 1:4] 4.20e-06 2.70e-01 4.17e-02 1.00 2.12e-02 ...
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. ..- attr(*, "class.true")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 3 1 3 1 1 2 1 2 1 1 ...
#>   .. ..- attr(*, "class.thres")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 3 2 4 1 2 2 1 2 1 1 ...
#>   .. ..- attr(*, "class.prop")= num 1
#>   ..$ : Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 3 1 3 4 1 4 4 1 2 ...
#>   .. ..- attr(*, "prob")= num [1:188, 1:4] 0 0.00251 0.99704 0.00367 0 ...
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. ..- attr(*, "class.true")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 3 1 3 4 1 4 1 1 2 ...
#>   .. ..- attr(*, "class.thres")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 3 1 3 4 1 4 4 1 2 ...
#>   .. ..- attr(*, "class.prop")= num 1
#>  $ knn :List of 2
#>   ..$ : Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 1 3 4 1 2 1 2 1 1 ...
#>   .. ..- attr(*, "prob")= num [1:184, 1:4] 0 1 0 0.4 0.6 ...
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. ..- attr(*, "class.true")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 3 1 3 1 1 2 1 2 1 1 ...
#>   .. ..- attr(*, "class.thres")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 1 3 1 1 2 1 2 1 1 ...
#>   .. ..- attr(*, "class.prop")= num 1
#>   ..$ : Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 1 1 3 4 1 4 1 1 2 ...
#>   .. ..- attr(*, "prob")= num [1:188, 1:4] 0.4 0.6 0.667 0 0 ...
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. ..- attr(*, "class.true")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 4 3 1 3 4 1 4 1 1 2 ...
#>   .. ..- attr(*, "class.thres")= Factor w/ 4 levels "DIF.C4","IMM.C2",..: 1 1 1 3 4 1 4 1 1 2 ...
#>   .. ..- attr(*, "class.prop")= num 1
#>   [list output truncated]

Attributes

Note that each prediction object has an attribute containing the class probabilities as a matrix, stored in attr(*, "prob"). We ensure that the class probabilities for every sample sum to one by making a small adjustment to one of the classes whenever the total is slightly over or under one. These matrices are used to compute evaluation measures and to generate the discriminating plots detailed below.
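
For example, the probability matrix for the first slda bootstrap replicate can be inspected directly:

prob <- attr(sl_result$preds$slda[[1]], "prob")
head(prob, 3)
rowSums(head(prob, 3))  # each row sums to one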

To obtain better performance under the evaluation metrics, we may want to exclude samples whose maximum class probability falls below a threshold. For example, we may only compare the true test labels with the corresponding OOB predictions where the winning class has a probability of at least 50%. Samples below this threshold are labelled "unclassified". If the threshold results in all samples being unclassified, the unfiltered, original predicted labels are used for evaluation. The thresholded labels are stored in attr(*, "class.thres").

The final attribute of a prediction object, attr(*, "class.prop"), gives the proportion of classified cases. This proportion can only decrease as the threshold is raised.
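
Both attributes can be extracted in the same way, for example:

table(attr(sl_result$preds$slda[[1]], "class.thres"))
attr(sl_result$preds$slda[[1]], "class.prop")  # proportion of classified cases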

Evaluation

Metrics

Evaluation measures quantify the prediction performance of a classifier. The table below shows all measures for svm across the two bootstrap replicates. The logloss, auc, and pdi measures make use of the prediction probabilities, whereas the rest are computed from the multiclass confusion matrix of reference and predicted class labels.
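
A table like the one below can be produced from the evals element, for example:

kable(sl_result$evals$svm, digits = 7)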

SVM Evaluation Metrics
metric 1 2
logloss 0.3621245 0.2680567
auc 0.9782607 0.9882386
pdi 0.9416330 0.9692426
accuracy 0.8641304 0.9148936
macro_ppv 0.8774574 0.9180654
macro_npv 0.9548134 0.9717229
macro_sensitivity 0.8613753 0.9130733
macro_specificity 0.9534702 0.9711636
macro_f1 0.8652796 0.9148339
mcc 0.8187042 0.8862210
kappa 0.8169008 0.8857143
gmean 0.8582250 0.9127181
accuracy.DIF.C4 0.9076087 0.9574468
accuracy.IMM.C2 0.9456522 0.9627660
accuracy.MES.C1 0.9619565 0.9627660
accuracy.PRO.C5 0.9130435 0.9468085
ppv.DIF.C4 0.8518519 0.9387755
ppv.IMM.C2 0.9655172 0.9047619
ppv.MES.C1 0.8888889 0.9473684
ppv.PRO.C5 0.8035714 0.8813559
npv.DIF.C4 0.9307692 0.9640288
npv.IMM.C2 0.9419355 0.9794521
npv.MES.C1 0.9856115 0.9666667
npv.PRO.C5 0.9609375 0.9767442
sensitivity.DIF.C4 0.8363636 0.9019608
sensitivity.IMM.C2 0.7567568 0.9268293
sensitivity.MES.C1 0.9523810 0.8780488
sensitivity.PRO.C5 0.9000000 0.9454545
specificity.DIF.C4 0.9379845 0.9781022
specificity.IMM.C2 0.9931973 0.9727891
specificity.MES.C1 0.9647887 0.9863946
specificity.PRO.C5 0.9179104 0.9473684
f1.DIF.C4 0.8440367 0.9200000
f1.IMM.C2 0.8484848 0.9156627
f1.MES.C1 0.9195402 0.9113924
f1.PRO.C5 0.8490566 0.9122807
mcc.DIF.C4 0.7784736 0.8913611
mcc.IMM.C2 0.8249532 0.8918829
mcc.MES.C1 0.8955810 0.8888934
mcc.PRO.C5 0.7907590 0.8752894
kappa.DIF.C4 0.7784075 0.8910303
kappa.IMM.C2 0.8159632 0.8917763
kappa.MES.C1 0.8946680 0.8878664
kappa.PRO.C5 0.7882624 0.8741802
gmean.DIF.C4 0.8857179 0.9392603
gmean.IMM.C2 0.8669537 0.9495312
gmean.MES.C1 0.9585648 0.9306463
gmean.PRO.C5 0.9089111 0.9464110
  • logloss: multiclass log loss; lower values indicate better performance
  • auc: Area Under the Curve, summarizing the trade-off between sensitivity and specificity
  • pdi: Polytomous Discrimination Index
  • accuracy: proportion of correctly classified samples
  • ppv: positive predictive value, also known as precision
  • npv: negative predictive value
  • sensitivity: also known as recall
  • specificity: also known as selectivity
  • f1: harmonic mean of precision and recall
  • mcc: Matthews correlation coefficient

Note that in the multiclass case, we have variants of some of the measures listed above, depending on how the class-wise results are aggregated. A macro-averaged metric computes the metric on each of the one-vs-all confusion matrices and then takes the mean. A micro-averaged metric is computed on the element-wise sum of all one-vs-all confusion matrices.

Hence, there are macro_ppv, macro_npv, macro_sensitivity, macro_specificity, and macro_f1 as macro-averaged metrics, while mcc is the only metric reported at the micro level. Accuracy is mathematically equivalent to micro-averaged PPV, sensitivity, and F1-score, so those are not added redundantly. Furthermore, calculating MCC in a macro-averaged way is not recommended, so we compute it on the full confusion matrix.
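
A minimal sketch of the distinction, using the first svm bootstrap replicate from the run above (illustrative only, not splendid's internal code):

p <- sl_result$preds$svm[[1]]
cm <- table(true = attr(p, "class.true"), pred = p)  # multiclass confusion matrix
diag(cm) / rowSums(cm)                               # one-vs-all sensitivities
mean(diag(cm) / rowSums(cm))                         # macro-averaged sensitivity
sum(diag(cm)) / sum(cm)                              # micro-averaged sensitivity (= accuracy)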

Class-specific measures fill in the rest of the matrix, labelled with the naming scheme metric.class.

The .632 estimator is implemented for the multiclass log loss. This error estimate compromises between an upwardly biased estimate, the leave-one-out bootstrap error, and a downwardly biased estimate, the training (or apparent) error. If plus = TRUE in splendid() (the default), the .632+ estimator is calculated instead, an improvement over the .632 estimator that accounts for the amount of overfitting [5]. The value is stored as an attribute of the evaluation object:

vapply(sl_result$evals, attr, "err_632plus", FUN.VALUE = double(1))
#>      slda       knn       svm 
#> 1.8351057 2.4690499 0.2443584
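
For reference, the .632 estimator blends the two error estimates with fixed weights; the .632+ version replaces the fixed 0.632 weight with a data-dependent one derived from the relative overfitting rate [5]. A sketch of the .632 form:

# Illustration only, not splendid's internal code
err_632 <- function(err_apparent, err_oob) 0.368 * err_apparent + 0.632 * err_oob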

Plots

To assess the performance of a classifier, we can look at a discriminating plot and a reliability plot.

data(hgsc)
class <- factor(attr(hgsc, "class.true"))
set.seed(1)
# Bootstrap training sample and corresponding out-of-bag (OOB) test sample
training.id <- sample(seq_along(class), replace = TRUE)
test.id <- which(!seq_along(class) %in% training.id)
# Fit a single adaboost model, predict on the OOB samples, and evaluate with plots
mod <- classification(hgsc[training.id, ], class[training.id], "adaboost")
pred <- prediction(mod, hgsc, test.id, class = class)
evals <- evaluation(class[test.id], pred, plot = TRUE)

Ensemble Construction

For each bootstrap replicate training set, we want to find the top-performing classifier. We use Rank Aggregation with the Genetic Algorithm to choose the top algorithm by comparing across evaluation metrics [4]. In the case of log loss, we first invert its value, since lower values are better. Class-specific measures are not included in the rank aggregation because they are interdependent. For example, a sample with a true class of "A" should not be treated differently depending on whether it was misclassified into "B" or "C".

After obtaining the list of top classifiers for each bootstrap replicate, we sort them by decreasing frequency and keep the top 3 (by default) to use in the ensemble. The ensemble_mods output of splendid is a list of models fit on the full data for the top classifiers.
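
For the run above, these elements can be inspected directly:

sl_result$bests          # best algorithm per bootstrap replicate
sl_result$ensemble_algs  # top algorithms retained for the ensemble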

References

  1. Ahdesmäki, Miika, and Korbinian Strimmer. “Feature selection in omics prediction problems using cat scores and false nondiscovery rate control.” The Annals of Applied Statistics 4.1 (2010): 503-519.
  2. https://web.stanford.edu/~hastie/glmnet/glmnet_beta.html
  3. https://topepo.github.io/caret/recursive-feature-elimination.html
  4. Pihur, Vasyl, Susmita Datta, and Somnath Datta. “RankAggreg, an R package for weighted rank aggregation.” BMC bioinformatics 10.1 (2009): 62.
  5. Efron, Bradley, and Robert Tibshirani. "Improvements on Cross-Validation: The .632+ Bootstrap Method." Journal of the American Statistical Association 92.438 (1997): 548-560.