Introduction
`splendid` supports two modelling extensions to how ensemble classification is performed. Both paradigms are constructed from a set of binary classifiers. The first extension is the one-vs-all approach, where each class is compared to its complement. The second extension uses a sequential algorithm to build the ensemble classifier, assigning class membership one class at a time, in order of highest performance. Our goal is to provide the user additional options and flexibility in modelling; there is no guarantee that either extension improves prediction accuracy.
One-Vs-All
To build a one-vs-all model, or `ova` model for short, we first need to create a binarized class matrix, where each class is compared to its complement. Take the `hgsc` data set for example. The first 10 classes as predicted by TCGA are:
```r
class <- attr(hgsc, "class.true")
head(class, 10)
#> [1] "PRO.C5" "MES.C1" "DIF.C4" "MES.C1" "MES.C1" "PRO.C5" "DIF.C4" "MES.C1"
#> [9] "IMM.C2" "PRO.C5"
```
The corresponding binarized class matrix is shown in Table 1. An entry of `class_0` means “not equal” to the class indicated by the column name.
| DIF.C4 | IMM.C2 | MES.C1 | PRO.C5 |
|---|---|---|---|
| class_0 | class_0 | class_0 | PRO.C5 |
| class_0 | class_0 | MES.C1 | class_0 |
| DIF.C4 | class_0 | class_0 | class_0 |
| class_0 | class_0 | MES.C1 | class_0 |
| class_0 | class_0 | MES.C1 | class_0 |
| class_0 | class_0 | class_0 | PRO.C5 |
| DIF.C4 | class_0 | class_0 | class_0 |
| class_0 | class_0 | MES.C1 | class_0 |
| class_0 | IMM.C2 | class_0 | class_0 |
| class_0 | class_0 | class_0 | PRO.C5 |

Table 1: Binarized class matrix for the first 10 samples of `hgsc`.
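For illustration, a matrix like Table 1 can be built with a sketch along these lines (this is not necessarily how `splendid` constructs it internally):

```r
# Illustrative construction of the binarized class matrix: each column keeps
# the class label where it matches and "class_0" everywhere else.
classes <- sort(unique(class))
binarized <- sapply(classes, function(k) ifelse(class == k, k, "class_0"))
head(binarized, 10)
```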
Classification proceeds in this fashion for every bootstrap replicate and algorithm, as specified in `splendid()`. However, whereas before we had one model fit, there are now four model fits at each iteration. Note that k-Nearest Neighbour classification gives the same results under the multiclass and `ova` approaches because of its nonparametric nature.
Some data manipulation is required to obtain the predicted class and probability matrix. Let K be the number of classes and N the number of samples to classify. For each of the K fits, classifying k vs. not k, k = 1, …, K, there is an N by 2 probability matrix. We extract the column of probabilities pertaining to class k from each fit and construct an N by K probability matrix. Of course, the probabilities in each row will generally not sum to one. To normalize the probability matrix, we divide each row by its sum. Finally, the `ova` predicted class for each sample is the class with the largest probability in its row.
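A minimal sketch of this aggregation follows, assuming `fits` is a named list of the K binary fits (a hypothetical structure, not `splendid`'s internal representation) whose `predict()` methods return N by 2 probability matrices with columns named after the two classes:

```r
# Sketch only: aggregate K one-vs-all fits into an N-by-K probability matrix.
prob <- sapply(names(fits), function(k) {
  p <- predict(fits[[k]], newdata = newdata, type = "prob")
  p[, k]  # keep the column of probabilities for class k (vs. "not k")
})
prob <- prob / rowSums(prob)            # normalize each row to sum to one
pred <- colnames(prob)[max.col(prob)]   # ova predicted class per sample
```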
The `ova` approach hence produces a probability matrix with a similar pattern to the conventional, multiclass approach, but with less extreme values. This results in potentially more unclassified cases when we choose to threshold the predictions.
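For instance, a simple thresholding rule on the `prob` and `pred` objects from the sketch above might look like this (the cutoff value is purely illustrative):

```r
# Illustrative only: leave a sample unclassified (NA) when its top ova
# probability does not reach the chosen cutoff.
cutoff <- 0.5                # hypothetical threshold
top <- apply(prob, 1, max)   # each sample's largest class probability
pred_thresholded <- ifelse(top >= cutoff, pred, NA)
```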
Sequential Algorithm
The sequential algorithm follows a more complicated workflow. It is an ensemble method, meaning that we work with fitted models, predictions, and evaluation metrics, whereas the `ova` approach is integrated into the initial classification.
The idea is to assign classes sequentially, in a binary fashion, until all classes have been assigned. The diagram below is a simple example for a 4-class case.
Notice that in a 4-class case, there are only 3 fits, since in the last binary classification, class “2” is the same as class “Not 1”. But how do we determine the sequence in which classes are assigned? Why did we classify 3, then 4, then 1 vs. 2 in the example above? We use the per-class F1-score evaluation metric. The F1-score has been shown to be more robust than traditional measures such as accuracy, and represents a balance between precision and recall.
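For reference, the F1-score is the harmonic mean of precision and recall:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$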
Suppose we fit a set of classifiers with 2 bootstrap replicates on the `xgboost` and `rf` algorithms.

```r
sm <- splendid_model(hgsc, class, n = 2, algorithms = c("xgboost", "rf"))
```
Table 2 shows the per-class F1-scores for each fitted model and bootstrap replicate. A logical column (not shown) indicates whether `ova` was used for the particular algorithm.
| class | model | boot | value |
|---|---|---|---|
| DIF.C4 | xgboost | 1 | 0.7272727 |
| DIF.C4 | xgboost | 2 | 0.6666667 |
| DIF.C4 | rf | 1 | 0.8155340 |
| DIF.C4 | rf | 2 | 0.8297872 |
| IMM.C2 | xgboost | 1 | 0.7750000 |
| IMM.C2 | xgboost | 2 | 0.7352941 |
| IMM.C2 | rf | 1 | 0.9066667 |
| IMM.C2 | rf | 2 | 0.8181818 |
| MES.C1 | xgboost | 1 | 0.8604651 |
| MES.C1 | xgboost | 2 | 0.8285714 |
| MES.C1 | rf | 1 | 0.9111111 |
| MES.C1 | rf | 2 | 0.8823529 |
| PRO.C5 | xgboost | 1 | 0.7472527 |
| PRO.C5 | xgboost | 2 | 0.8085106 |
| PRO.C5 | rf | 1 | 0.8636364 |
| PRO.C5 | rf | 2 | 0.9148936 |

Table 2: Per-class F1-scores by model and bootstrap replicate.
We then average `value` over `boot` for each `class` and `model` to obtain the mean F1-score per class and algorithm. For each class, the model associated with the maximum mean F1-score will be used for its sequential fit. Finally, the sequence of fits is determined by descending magnitude of this maximum mean F1-score.
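A sketch of how this ranking could be computed, assuming the Table 2 scores live in a data frame `f1` (a hypothetical name) with columns `class`, `model`, `boot`, and `value`, using `dplyr`:

```r
library(dplyr)

seq_order <- f1 %>%
  group_by(class, model) %>%
  summarise(mean_f1 = mean(value), .groups = "drop") %>%  # average over boots
  group_by(class) %>%
  slice_max(mean_f1, n = 1, with_ties = FALSE) %>%        # best model per class
  ungroup() %>%
  arrange(desc(mean_f1)) %>%                              # sequence of fits
  mutate(rank = row_number())
```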
Table 3 shows that the class “MES.C1” had the highest mean F1-score, and the model that produced it was `rf`, and so on. This means we first fit “MES.C1” vs. “Not MES.C1” using `rf`, then “PRO.C5” vs. “Not PRO.C5” using `rf`, and finally “IMM.C2” vs. “DIF.C4” using `rf`. The information in the last row of Table 3 is shown simply for reference.
| rank | class | model |
|---|---|---|
| 1 | MES.C1 | rf |
| 2 | PRO.C5 | rf |
| 3 | IMM.C2 | rf |
| 4 | DIF.C4 | rf |

Table 3: Sequence of fits by descending maximum mean F1-score.
A slightly different binarized class matrix is needed as a result (Table 4):
| IMM.C2 | MES.C1 | PRO.C5 |
|---|---|---|
| class_0 | class_0 | PRO.C5 |
| class_0 | MES.C1 | class_0 |
| DIF.C4 | class_0 | class_0 |
| class_0 | MES.C1 | class_0 |
| class_0 | MES.C1 | class_0 |
| class_0 | class_0 | PRO.C5 |
| DIF.C4 | class_0 | class_0 |
| class_0 | MES.C1 | class_0 |
| IMM.C2 | class_0 | class_0 |
| class_0 | class_0 | PRO.C5 |

Table 4: Binarized class matrix for the sequential algorithm.
Table 4 is similar in nature to Table 1, except that the two “worst-classified” classes (i.e. “IMM.C2” and “DIF.C4”, ranks 3 and 4 in Table 3) are grouped together. After each sequential fit, the rows that are nonzero in the fitted class’s column are removed from the entire matrix, along with that column. This means that in the final sequential fit, the matrix reduces to a single column of “IMM.C2” and “DIF.C4” values.
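A sketch of this reduction for the first fit, assuming `bin_seq` is a character matrix shaped like Table 4 (a hypothetical name, not `splendid`'s internal object):

```r
# Illustrative only: after fitting "MES.C1" vs. "Not MES.C1", drop the samples
# assigned to MES.C1 and then drop the MES.C1 column itself.
fitted <- "MES.C1"
keep_rows <- bin_seq[, fitted] == "class_0"  # samples not assigned to MES.C1
bin_seq <- bin_seq[keep_rows, colnames(bin_seq) != fitted, drop = FALSE]
```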
The prediction output is generated by applying the sequential algorithm to the full data. We can extract the binarized probability matrices as well as the 2-by-2 confusion matrices.
Usage
The `ova` approach can be used in `splendid()` by setting `ova = TRUE`. For every algorithm specified in `algorithms`, an analogous `ova` version is implemented. All pertinent results are labelled with the string `ova_algorithm_name`.
The sequential algorithm can be used to construct an ensemble classifier by setting `sequential = TRUE` in `splendid()`.
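Putting both options together, a usage sketch might look like the following (assuming `splendid()` accepts the same `data`, `class`, `n`, and `algorithms` arguments as the `splendid_model()` call shown earlier):

```r
result <- splendid(
  hgsc, class, n = 2,
  algorithms = c("xgboost", "rf"),
  ova = TRUE,        # also fit one-vs-all versions of each algorithm
  sequential = TRUE  # also build the sequential ensemble classifier
)
```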
Please read the function documentation for further details.