Runs consensus clustering across subsamples, algorithms, and number of clusters (k).
Usage
dice(
  data,
  nk,
  p.item = 0.8,
  reps = 10,
  algorithms = NULL,
  k.method = NULL,
  nmf.method = c("brunet", "lee"),
  hc.method = "average",
  distance = "euclidean",
  cons.funs = c("kmodes", "majority", "CSPA", "LCE", "LCA"),
  sim.mat = c("cts", "srs", "asrs"),
  prep.data = c("none", "full", "sampled"),
  min.var = 1,
  seed = 1,
  seed.data = 1,
  trim = FALSE,
  reweigh = FALSE,
  n = 5,
  evaluate = TRUE,
  plot = FALSE,
  ref.cl = NULL,
  progress = TRUE
)
Arguments
- data
data matrix with rows as samples and columns as variables
- nk
number of clusters (k) requested; can specify a single integer or a range of integers to compute multiple k (a sketch combining this with other arguments follows this list)
- p.item
proportion of items to be used in subsampling within an algorithm
- reps
number of subsamples
- algorithms
vector of clustering algorithms for performing consensus clustering. Must be any number of the following: "nmf", "hc", "diana", "km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A custom clustering algorithm can be used.
- k.method
determines the method to choose k when no reference class is given. When ref.cl is not NULL, k is the number of distinct classes of ref.cl. Otherwise the input from k.method chooses k. The default is to use the PAC to choose the best k(s). Specifying an integer as a user-desired k will override the best k chosen by PAC. Finally, specifying "all" will produce consensus results for all k. The "all" method is implicitly performed when there is only one k used.
- nmf.method
specify NMF-based algorithms to run. By default the "brunet" and "lee" algorithms are called. See NMF::nmf() for details.
- hc.method
agglomeration method for hierarchical clustering. The "average" method is used by default. See stats::hclust() for details.
- distance
a vector of distance functions. Defaults to "euclidean". Other options are given in stats::dist(). A custom distance function can be used.
- cons.funs
consensus functions to use. Current options are "kmodes" (k-modes), "majority" (majority voting), "CSPA" (Cluster-based Similarity Partitioning Algorithm), "LCE" (linkage clustering ensemble), "LCA" (latent class analysis)
- sim.mat
similarity matrix; choices are "cts", "srs", "asrs".
- prep.data
Prepare the data on the "full" dataset, the "sampled" dataset, or "none" (default).
- min.var
minimum variability measure threshold used to filter the feature space for only highly variable features. Only features with a minimum variability measure across all samples greater than min.var will be used. If type = "conventional", the standard deviation is the measure used, and if type = "robust", the MAD is the measure used.
- seed
random seed for knn imputation reproducibility
- seed.data
seed to use to ensure each algorithm operates on the same set of subsamples
- trim
logical; if TRUE, algorithms that score low on internal indices will be trimmed out
- reweigh
logical; if TRUE, after trimming out poor-performing algorithms, each algorithm is reweighed depending on its internal indices.
- n
an integer specifying the top n algorithms to keep after trimming off the poor-performing ones using Rank Aggregation. If the total number of algorithms is less than n, no trimming is done.
- evaluate
logical; if TRUE (default), validity indices are returned. Internal validity indices are always computed. If ref.cl is not NULL, then external validity indices will also be computed.
- plot
logical; if TRUE, graph_all is called and a summary evaluation heatmap of ranked algorithms vs. internal validity indices is plotted as well.
- ref.cl
reference class labels, used for choosing k and computing external validity indices
- progress
logical; should a progress bar be displayed?
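The arguments above combine freely. As an illustrative sketch (not taken from the package examples; the object name dice.sweep is ours), the call below sweeps two candidate k values over two base algorithms and lets the default PAC criterion select the best k. The hgsc dataset is the one used in the Examples section.
# Sketch: sweep k = 3 and 4 over two base algorithms; with k.method = NULL
# (the default) and no ref.cl, PAC chooses the best k among the candidates
data(hgsc)
dice.sweep <- dice(
  hgsc[1:100, 1:50],
  nk = 3:4,                    # range of candidate k
  reps = 5,                    # subsamples per algorithm
  algorithms = c("hc", "km"),  # hierarchical clustering and k-means
  cons.funs = "majority",      # combine subsample results by majority voting
  progress = FALSE
)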
Value
A list with the following elements
- E
raw clustering ensemble object
- Eknn
clustering ensemble object with knn imputation used on E
- Ecomp
flattened ensemble object with remaining missing entries imputed by majority voting
- clusters
final clustering assignment from the diverse clustering ensemble method
- indices
if evaluate = TRUE, shows cluster evaluation indices; otherwise NULL
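For orientation, here is a sketch of how these elements might be inspected, using the dice.obj fitted in the Examples section below (the dimension comments paraphrase the str() output shown there):
# Sketch: inspecting the returned list (dice.obj comes from the Examples below)
dim(dice.obj$E)          # raw ensemble: samples x reps x algorithms x k
head(dice.obj$clusters)  # final assignments from the consensus functions
dice.obj$indices$k       # the chosen k (4 in the Examples, since nk = 4)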
Details
There are three ways to handle the input data before clustering, via the prep.data argument. The default is to use the raw data as-is ("none"). Alternatively, prepare_data() can be applied to the full dataset ("full"), or to the bootstrap-sampled datasets ("sampled").
Examples
library(dplyr)
data(hgsc)
dat <- hgsc[1:100, 1:50]
ref.cl <- strsplit(rownames(dat), "_") %>%
  purrr::map_chr(2) %>%
  factor() %>%
  as.integer()
dice.obj <- dice(dat, nk = 4, reps = 5, algorithms = "hc",
                 cons.funs = "kmodes", ref.cl = ref.cl, progress = FALSE)
str(dice.obj, max.level = 2)
#> List of 5
#> $ E : int [1:100, 1:5, 1, 1] 1 1 NA NA NA 1 1 NA 1 NA ...
#> ..- attr(*, "dimnames")=List of 4
#> $ Eknn : int [1:100, 1:5, 1, 1] 1 1 1 1 1 1 1 1 1 3 ...
#> ..- attr(*, "dimnames")=List of 4
#> $ Ecomp : num [1:100, 1:5, 1] 1 1 1 1 1 1 1 1 1 3 ...
#> ..- attr(*, "dimnames")=List of 3
#> $ clusters: int [1:100, 1:2] 4 3 1 3 3 4 1 3 2 4 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ indices :List of 5
#> ..$ k : int 4
#> ..$ pac :'data.frame': 1 obs. of 2 variables:
#> ..$ ii :List of 1
#> ..$ ei :List of 1
#> ..$ trim:List of 5
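As a closing sketch, the final assignment can be cross-tabulated against the reference class; indexing clusters by a "kmodes" column name is an assumption about the output layout, not something stated above.
# Sketch: cross-tabulate a final labeling against the reference class
# (assumes the k-modes consensus result is a named column of dice.obj$clusters)
table(cluster = dice.obj$clusters[, "kmodes"], reference = ref.cl)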