`splendid` supports two modelling extensions to how ensemble classification is performed. Both paradigms are constructed from a set of binary classifiers. The first extension is the one-vs-all approach, where each class is compared to its complement. The second extension uses a sequential algorithm to build the ensemble classifier, assigning class membership one class at a time, in order of highest performance. Our goal is to provide the user with additional options and flexibility in modelling; there is no guarantee that either extension improves prediction accuracy.
To build a one-vs-all model, or `ova` model for short, we first need to create a binarized class matrix, where each class is compared to its complement. Take the `hgsc` data set for example. The classes of the first 10 samples, as predicted by TCGA, are:
```r
class <- attr(hgsc, "class.true")
head(class, 10)
#> [1] "PRO.C5" "MES.C1" "DIF.C4" "MES.C1" "MES.C1" "PRO.C5" "DIF.C4" "MES.C1"
#> [9] "IMM.C2" "PRO.C5"
```
The corresponding binarized class matrix is shown in Table 1. An entry of "class_0" means "not equal to" the class indicated by the column name.
DIF.C4 | IMM.C2 | MES.C1 | PRO.C5 |
---|---|---|---|
class_0 | class_0 | class_0 | PRO.C5 |
class_0 | class_0 | MES.C1 | class_0 |
DIF.C4 | class_0 | class_0 | class_0 |
class_0 | class_0 | MES.C1 | class_0 |
class_0 | class_0 | MES.C1 | class_0 |
class_0 | class_0 | class_0 | PRO.C5 |
DIF.C4 | class_0 | class_0 | class_0 |
class_0 | class_0 | MES.C1 | class_0 |
class_0 | IMM.C2 | class_0 | class_0 |
class_0 | class_0 | class_0 | PRO.C5 |
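As a minimal sketch, this binarization can be reproduced in base R as follows; `splendid` performs this step internally, so the helper name `binarize_ova` is purely illustrative:

```r
# Illustrative one-vs-all binarization: for each class k, relabel samples
# not belonging to k as "class_0" (column order is alphabetical, as in Table 1).
binarize_ova <- function(class) {
  classes <- sort(unique(class))
  sapply(classes, function(k) ifelse(class == k, k, "class_0"))
}
head(binarize_ova(class), 10)  # reproduces Table 1
```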
Classification proceeds in this fashion for every bootstrap replicate and algorithm, as specified in `splendid()`. However, whereas before there was one model fit per iteration, there are now four. Note that k-Nearest Neighbour classification gives the same results under the multiclass and `ova` approaches because of its nonparametric nature.
Some data manipulation is required to obtain the predicted class and probability matrix. Let K be the number of classes and N the number of samples to classify. For each of the K fits, there is an N-by-2 probability matrix from classifying class k vs. not-k, for k = 1, …, K. We extract the column of probabilities pertaining to class k from each fit and construct an N-by-K probability matrix. Of course, the probabilities for each sample will generally not sum to one, so we normalize the matrix by dividing each row by its row sum. Finally, the `ova` predicted class is taken as the class with the largest probability in each row.
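A minimal sketch of this manipulation, assuming `prob_list` is a hypothetical list holding the K two-column probability matrices, each with the class-k probabilities in its second column:

```r
# prob_list: hypothetical list of K N-by-2 matrices, one per ova fit, with
# columns ("class_0", k); the list is assumed to be named by class, so the
# resulting matrix columns carry the class labels.
probs <- sapply(prob_list, function(p) p[, 2])  # N-by-K matrix of class-k probabilities
probs <- probs / rowSums(probs)                 # normalize each row to sum to one
ova_pred <- colnames(probs)[max.col(probs)]     # largest probability per row
```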
The `ova` approach hence produces a probability matrix with a pattern similar to that of the conventional multiclass approach, but with less extreme values. This can result in more unclassified cases when we choose to threshold the predictions.
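Continuing the sketch above, thresholding could look like this; the 0.5 cutoff is illustrative, not a `splendid` default:

```r
# Leave a sample unclassified (NA) when its top ova probability is too low.
top_prob <- apply(probs, 1, max)
ova_pred[top_prob < 0.5] <- NA
```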
The sequential algorithm follows a more complicated workflow. It is an ensemble method, meaning that we work with fitted models, predictions, and evaluation metrics, whereas the `ova` approach is integrated into the initial classification.
The idea is to sequentially assign classes in a binary fashion until all classes have been assigned. The diagram below is a simple example for a 4-class case.
Notice that in a 4-class case there are only 3 fits, since the last binary classification of class "2" is equivalent to class "Not 1". But how do we determine the sequence in which classes are assigned? Why did we classify 3, then 4, then 1 vs. 2 in the example above? We use the per-class F1-score evaluation metric. The F1-score has been shown to be more robust than traditional measures such as accuracy, and represents a balance between precision and recall.
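For a given class, with precision and recall computed from that class's binary confusion matrix, the F1-score is their harmonic mean:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$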
Suppose we fit a set of classifiers with 2 bootstrap replicates on the `xgboost` and `rf` algorithms.
```r
sm <- splendid_model(hgsc, class, n = 2, algorithms = c("xgboost", "rf"))
```
Table 2 shows the per-class F1-scores for each fitted model and bootstrap replicate. A logical column indicates whether `ova` was used for the particular algorithm.
class | model | boot | value |
---|---|---|---|
DIF.C4 | xgboost | 1 | 0.7047619 |
DIF.C4 | xgboost | 2 | 0.7339450 |
DIF.C4 | rf | 1 | 0.8453608 |
DIF.C4 | rf | 2 | 0.8571429 |
IMM.C2 | xgboost | 1 | 0.7252747 |
IMM.C2 | xgboost | 2 | 0.8064516 |
IMM.C2 | rf | 1 | 0.8941176 |
IMM.C2 | rf | 2 | 0.8235294 |
MES.C1 | xgboost | 1 | 0.7792208 |
MES.C1 | xgboost | 2 | 0.8607595 |
MES.C1 | rf | 1 | 0.9000000 |
MES.C1 | rf | 2 | 0.9135802 |
PRO.C5 | xgboost | 1 | 0.6153846 |
PRO.C5 | xgboost | 2 | 0.8043478 |
PRO.C5 | rf | 1 | 0.8627451 |
PRO.C5 | rf | 2 | 0.9090909 |
We then average `value` over the bootstrap replicates (`boot`) for each `class` and `model` to obtain the mean F1-score across replicates for each algorithm. The model associated with the maximum mean F1-score for a given class will be used for that class in the sequential algorithm. Finally, the sequence of fits is determined by descending magnitude of this maximum mean F1-score.
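This ranking step can be sketched with `dplyr`, assuming `f1` is a data frame shaped like Table 2 (columns `class`, `model`, `boot`, `value`):

```r
library(dplyr)

# Mean F1 per class and model (averaging over bootstrap replicates),
# then the best model per class, ordered by descending mean F1.
seq_order <- f1 %>%
  group_by(class, model) %>%
  summarise(mean_f1 = mean(value), .groups = "drop") %>%
  group_by(class) %>%
  slice_max(mean_f1, n = 1) %>%
  ungroup() %>%
  arrange(desc(mean_f1))
```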
Table 3 shows that the class "MES.C1" had the highest mean F1-score, and that the model producing it was `rf`, and so on. This means we first fit "MES.C1" vs. "Not MES.C1" using `rf`, then "PRO.C5" vs. "Not PRO.C5" using `rf`, and finally "IMM.C2" vs. "DIF.C4" using `rf`. The information in the last row of Table 3 is simply for reference.
rank | class | model |
---|---|---|
1 | MES.C1 | rf |
2 | PRO.C5 | rf |
3 | IMM.C2 | rf |
4 | DIF.C4 | rf |
A slightly different binarized class matrix is needed as a result:
IMM.C2 | MES.C1 | PRO.C5 |
---|---|---|
class_0 | class_0 | PRO.C5 |
class_0 | MES.C1 | class_0 |
DIF.C4 | class_0 | class_0 |
class_0 | MES.C1 | class_0 |
class_0 | MES.C1 | class_0 |
class_0 | class_0 | PRO.C5 |
DIF.C4 | class_0 | class_0 |
class_0 | MES.C1 | class_0 |
IMM.C2 | class_0 | class_0 |
class_0 | class_0 | PRO.C5 |
Table 4 is similar in nature to Table 1, except that the two "worst-classified" classes, "IMM.C2" and "DIF.C4", are grouped together in a single column. After each sequential fit, the rows assigned to the fitted class (the non-"class_0" entries in its column) are removed from the matrix, and that class's column is dropped. This means that for the final sequential fit, the matrix reduces to a single column of "IMM.C2" and "DIF.C4" values.
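The reduction can be sketched as follows, reconstructing the Table 4 matrix directly and, for brevity, letting the true labels stand in for each fit's predictions:

```r
# Conceptual sketch only; not splendid's internals. The two worst-classified
# classes ("IMM.C2" and "DIF.C4") share the final column.
bin <- cbind(
  IMM.C2 = ifelse(class %in% c("IMM.C2", "DIF.C4"), class, "class_0"),
  MES.C1 = ifelse(class == "MES.C1", "MES.C1", "class_0"),
  PRO.C5 = ifelse(class == "PRO.C5", "PRO.C5", "class_0")
)
# After each of the first two fits, drop the rows assigned to the fitted
# class along with that class's column.
for (k in c("MES.C1", "PRO.C5")) {
  assigned <- bin[, k] != "class_0"
  bin <- bin[!assigned, setdiff(colnames(bin), k), drop = FALSE]
}
head(bin)  # a single column of "IMM.C2" and "DIF.C4" values
```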
The prediction output is generated by applying the sequential algorithm to the full data. We can extract the binarized probability matrices as well as the 2-by-2 confusion matrices.
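For instance, a 2-by-2 confusion matrix for a single binary fit could be tabulated as below, where `pred_k` and `truth_k` are hypothetical vectors holding that fit's predicted and true binary labels:

```r
# pred_k, truth_k: binary labels (class k vs. "class_0") for one sequential fit.
table(Predicted = pred_k, Truth = truth_k)
```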
The `ova` approach can be used in `splendid()` by setting `ova = TRUE`. For every algorithm specified in `algorithms`, an analogous `ova` version is implemented. All pertinent results are labelled with the string `ova_algorithm_name`.
The `sequential` algorithm can be used to construct an ensemble classifier by setting `sequential = TRUE` in `splendid()`.
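For example, a call enabling both extensions might look like this, assuming `splendid()` accepts the same data arguments as `splendid_model()` above:

```r
# Illustrative call; argument values are examples, not recommendations.
result <- splendid(hgsc, class, n = 2,
                   algorithms = c("xgboost", "rf"),
                   ova = TRUE, sequential = TRUE)
```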
Please read the function documentation for further details.