Title: | Infrastructure for Ordering Objects Using Seriation |
---|---|
Description: | Infrastructure for ordering objects with an implementation of several seriation/sequencing/ordination techniques to reorder matrices, dissimilarity matrices, and dendrograms. Also provides (optimally) reordered heatmaps, color images and clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT). Hahsler et al (2008) <doi:10.18637/jss.v025.i03>. |
Authors: | Michael Hahsler [aut, cre, cph] , Christian Buchta [aut, cph], Kurt Hornik [aut, cph] , David Barnett [ctb], Michael Brusco [ctb, cph], Michael Friendly [ctb], Hans-Friedrich Koehn [ctb, cph], Fionn Murtagh [ctb, cph], Stephanie Stahl [ctb, cph] |
Maintainer: | Michael Hahsler <[email protected]> |
License: | GPL-3 |
Version: | 1.5.6.1 |
Built: | 2024-10-24 21:21:48 UTC |
Source: | https://github.com/mhahsler/seriation |
Plot a data matrix of cases and variables. Each value is represented by a symbol. Large values are highlighted. Note that Bertin arranges the cases horizontally and the variables as rows. The matrix can be rearranged using seriation techniques to make structure in the data visible (see Falguerolles et al 1997).
bertinplot( x, order = NULL, panel.function = panel.bars, highlight = TRUE, row_labels = TRUE, col_labels = TRUE, flip_axes = TRUE, ... ) panel.bars(value, spacing, hl) panel.circles(value, spacing, hl) panel.rectangles(value, spacing, hl) panel.squares(value, spacing, hl) panel.tiles(value, spacing, hl) panel.blocks(value, spacing, hl) panel.lines(value, spacing, hl) bertin_cut_line(x = NULL, y = NULL, col = "red") ggbertinplot( x, order = NULL, geom = "bar", highlight = TRUE, row_labels = TRUE, col_labels = TRUE, flip_axes = TRUE, prop = FALSE, ... )
bertinplot( x, order = NULL, panel.function = panel.bars, highlight = TRUE, row_labels = TRUE, col_labels = TRUE, flip_axes = TRUE, ... ) panel.bars(value, spacing, hl) panel.circles(value, spacing, hl) panel.rectangles(value, spacing, hl) panel.squares(value, spacing, hl) panel.tiles(value, spacing, hl) panel.blocks(value, spacing, hl) panel.lines(value, spacing, hl) bertin_cut_line(x = NULL, y = NULL, col = "red") ggbertinplot( x, order = NULL, geom = "bar", highlight = TRUE, row_labels = TRUE, col_labels = TRUE, flip_axes = TRUE, prop = FALSE, ... )
x |
a data matrix. Note that following Bertin, columns are variables
and rows are cases. This behavior can be reversed using |
order |
an object of class |
panel.function |
a function to produce the symbols. Currently available
functions are |
highlight |
a logical scalar indicating whether to use highlighting.
If |
row_labels , col_labels
|
a logical indicating if row and column labels
in |
flip_axes |
logical indicating whether to swap cases and variables in
the plot. The default ( |
... |
|
value , spacing , hl
|
are used internally for the panel functions. |
col , y
|
and x in |
geom |
visualization type. Available ggplot2 geometries are: |
prop |
logical; change the aspect ratio so cells in the image have a equal width and height. |
The plot is organized as a matrix of symbols. The symbols are drawn by a
panel function, where all symbols of a row are drawn by one call of the
function (using vectorization). The interface for the panel function is
panel.myfunction(value, spacing, hl)
. value
is the vector of
values for a row scaled between 0 and 1, spacing
contains the
relative space between symbols and hl
is a logical vector indicating
which symbol should be highlighted.
Cut lines can be added to an existing Bertin plot using
bertin_cut_line(x = NULL, y = NULL)
. x
/y
is can be a
number indicating where to draw the cut line between two columns/rows. If
both x
and y
is specified then one can select a row/column and
the other can select a range to draw a line which does only span a part of
the row/column. It is important to call bertinplot()
with the option
pop = FALSE
.
ggbertinplot()
calls ggpimage()
and all additional parameters are
passed on.
Nothing.
Michael Hahsler
de Falguerolles, A., Friedrich, F., Sawitzki, G. (1997): A Tribute to J. Bertin's Graphical Data Analysis. In: Proceedings of the SoftStat '97 (Advances in Statistical Software 6), 11–20.
Other plots:
VAT()
,
dissplot()
,
hmap()
,
palette()
,
pimage()
data("Irish") scale_by_rank <- function(x) apply(x, 2, rank) x <- scale_by_rank(Irish[,-6]) # Use the the sum of absolute rank differences order <- c( seriate(dist(x, "minkowski", p = 1)), seriate(dist(t(x), "minkowski", p = 1)) ) # Plot bertinplot(x, order) # Some alternative displays bertinplot(x, order, panel = panel.tiles, shading_col = bluered(100), highlight = FALSE) bertinplot(x, order, panel = panel.circles, spacing = -.2) bertinplot(x, order, panel = panel.rectangles) bertinplot(x, order, panel = panel.lines) # Plot with cut lines (we manually set the order here) order <- ser_permutation(c(6L, 9L, 29L, 10L, 32L, 22L, 2L, 35L, 24L, 30L, 33L, 25L, 37L, 36L, 8L, 27L, 4L, 39L, 3L, 40L, 38L, 1L, 31L, 34L, 28L, 23L, 5L, 11L, 7L, 41L, 13L, 26L, 17L, 15L, 12L, 20L, 14L, 18L, 19L, 16L, 21L), c(4L, 2L, 1L, 6L, 7L, 8L, 5L, 3L)) bertinplot(x, order, pop=FALSE) bertin_cut_line(, 4) ## horizontal line between rows 4 and 5 bertin_cut_line(, 7) ## separate "Right to Life" from the rest bertin_cut_line(18, c(0, 4)) ## separate a block of large values (vertically) # ggplot2-based plots if (require("ggplot2")) { library(ggplot2) # Default plot uses bars and highlighting values larger than the mean ggbertinplot(x, order) # highlight values in the 4th quartile ggbertinplot(x, order, highlight = quantile(x, probs = .75)) # Use different geoms. "none" lets the user specify their own geom. # Variables set are row, col and x (for the value). ggbertinplot(x, order, geom = "tile", prop = TRUE) ggbertinplot(x, order, geom = "rectangle") ggbertinplot(x, order, geom = "rectangle", prop = TRUE) ggbertinplot(x, order, geom = "circle") ggbertinplot(x, order, geom = "line") # Tiles with diverging color scale ggbertinplot(x, order, geom = "tile", prop = TRUE) + scale_fill_gradient2(midpoint = mean(x)) # Custom geom (geom = "none"). Defined variables are row, col, and x for the value ggbertinplot(x, order, geom = "none", prop = FALSE) + geom_point(aes(x = col, y = row, size = x, color = x > 30), pch = 15) + scale_size(range = c(1, 10)) # Use a ggplot2 theme with theme_set() old_theme <- theme_set(theme_minimal() + theme(panel.grid = element_blank()) ) ggbertinplot(x, order, geom = "bar") theme_set(old_theme) }
data("Irish") scale_by_rank <- function(x) apply(x, 2, rank) x <- scale_by_rank(Irish[,-6]) # Use the the sum of absolute rank differences order <- c( seriate(dist(x, "minkowski", p = 1)), seriate(dist(t(x), "minkowski", p = 1)) ) # Plot bertinplot(x, order) # Some alternative displays bertinplot(x, order, panel = panel.tiles, shading_col = bluered(100), highlight = FALSE) bertinplot(x, order, panel = panel.circles, spacing = -.2) bertinplot(x, order, panel = panel.rectangles) bertinplot(x, order, panel = panel.lines) # Plot with cut lines (we manually set the order here) order <- ser_permutation(c(6L, 9L, 29L, 10L, 32L, 22L, 2L, 35L, 24L, 30L, 33L, 25L, 37L, 36L, 8L, 27L, 4L, 39L, 3L, 40L, 38L, 1L, 31L, 34L, 28L, 23L, 5L, 11L, 7L, 41L, 13L, 26L, 17L, 15L, 12L, 20L, 14L, 18L, 19L, 16L, 21L), c(4L, 2L, 1L, 6L, 7L, 8L, 5L, 3L)) bertinplot(x, order, pop=FALSE) bertin_cut_line(, 4) ## horizontal line between rows 4 and 5 bertin_cut_line(, 7) ## separate "Right to Life" from the rest bertin_cut_line(18, c(0, 4)) ## separate a block of large values (vertically) # ggplot2-based plots if (require("ggplot2")) { library(ggplot2) # Default plot uses bars and highlighting values larger than the mean ggbertinplot(x, order) # highlight values in the 4th quartile ggbertinplot(x, order, highlight = quantile(x, probs = .75)) # Use different geoms. "none" lets the user specify their own geom. # Variables set are row, col and x (for the value). ggbertinplot(x, order, geom = "tile", prop = TRUE) ggbertinplot(x, order, geom = "rectangle") ggbertinplot(x, order, geom = "rectangle", prop = TRUE) ggbertinplot(x, order, geom = "circle") ggbertinplot(x, order, geom = "line") # Tiles with diverging color scale ggbertinplot(x, order, geom = "tile", prop = TRUE) + scale_fill_gradient2(midpoint = mean(x)) # Custom geom (geom = "none"). Defined variables are row, col, and x for the value ggbertinplot(x, order, geom = "none", prop = FALSE) + geom_point(aes(x = col, y = row, size = x, color = x > 30), pch = 15) + scale_size(range = c(1, 10)) # Use a ggplot2 theme with theme_set() old_theme <- theme_set(theme_minimal() + theme(panel.grid = element_blank()) ) ggbertinplot(x, order, geom = "bar") theme_set(old_theme) }
Several 2D data sets created to evaluate the CHAMELEON clustering algorithm in the paper by Karypis et al (1999).
chameleon_ds4
: The format is a 8,000 x 2 data.frame.
chameleon_ds5
: The format is a 8,000 x 2 data.frame.
chameleon_ds7
: The format is a 10,000 x 2 data.frame.
chameleon_ds8
: The format is a 8,000 x 2 data.frame.
Karypis, G., EH. Han, V. Kumar (1999): CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, 32(8): 68–75. doi:10.1109/2.781637
Other data:
Irish
,
Munsingen
,
SupremeCourt
,
Townships
,
Wood
,
Zoo
,
create_lines_data()
,
is.robinson()
data(Chameleon) plot(chameleon_ds4, cex = .1) plot(chameleon_ds5, cex = .1) plot(chameleon_ds7, cex = .1) plot(chameleon_ds8, cex = .1)
data(Chameleon) plot(chameleon_ds4, cex = .1) plot(chameleon_ds5, cex = .1) plot(chameleon_ds7, cex = .1) plot(chameleon_ds8, cex = .1)
Several functions to create simulated data to evaluate different aspects of seriation algorithms and criterion functions.
create_lines_data(n = 250) create_ordered_data( n = 250, k = 2, size = NULL, spacing = 6, path = "linear", sd1 = 1, sd2 = 0 )
create_lines_data(n = 250) create_ordered_data( n = 250, k = 2, size = NULL, spacing = 6, path = "linear", sd1 = 1, sd2 = 0 )
n |
number of data points to create. |
k |
number of Gaussian components. |
size |
relative size (number of points) of components (length of k).
If |
spacing |
space between the centers of components. The default of 6
means that the components will barely touch at |
path |
Are the components arranged along a |
sd1 |
variation in the direction along the components. A value greater than one means the components are mixing. |
sd2 |
variation perpendicular to the direction along the components. A value greater than 0 will introduce anti-Robinson violation events. |
create_lines_data()
recreates the lines data set used in for iVAT()
in
Havens and Bezdeck (2012).
create_ordered_data()
(Hahsler et al, 2021) is a versatile
function which creates "orderable"
2D data using Gaussian components along a linear or circular path. The
components are equally spaced (spacing
) along the path. The default
spacing of 6 ensures that 2 adjacent components with a standard deviation of
one along the direction of the path will barely touch. The standard
deviation along the path is set by sd1
. The standard deviation
perpendicular to the path is set by sd2
. A value larger than zero
will result in the data not being perfectly orderable (i.e., the resulting
distance matrix will not be a perfect pre-anti-Robinson matrix and contain
anti-Robinson violation events after seriation). Note that a circular path
always creates anti-Robinson violation since the circle has to be broken at
some point to create a linear order.
a data.frame with the created data.
Michael Hahsler
Havens, T.C. and Bezdek, J.C. (2012): An Efficient Formulation of the Improved Visual Assessment of Cluster Tendency (iVAT) Algorithm, IEEE Transactions on Knowledge and Data Engineering, 24(5), 813–822.
Michael Hahsler, Christian Buchta and Kurt Hornik (2021). seriation: Infrastructure for Ordering Objects Using Seriation. R package version 1.3.2. https://github.com/mhahsler/seriation
seriate()
, criterion()
, iVAT()
.
Other data:
Chameleon
,
Irish
,
Munsingen
,
SupremeCourt
,
Townships
,
Wood
,
Zoo
,
is.robinson()
## lines data set from Havens and Bezdek (2011) x <- create_lines_data(100) plot(x, xlim = c(-5, 5), ylim = c(-3, 3), cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO_single"), col = bluered(100, bias = .5), key = TRUE) ## create_ordered_data can produce many types of "orderable" data ## perfect pre-Anti-Robinson matrix (with a single components) x <- create_ordered_data(100, k = 1) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias=.5), key = TRUE) ## separated components x <- create_ordered_data(100, k = 5) plot(x, cex =.2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## overlapping components x <- create_ordered_data(100, k = 5, sd1 = 2) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## introduce anti-Robinson violations (a non-zero y value) x <- create_ordered_data(100, k = 5, sd1 = 2, sd2 = 5) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## circular path (has always violations) x <- create_ordered_data(100, k = 5, path = "circular", sd1 = 2) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO"), col = bluered(100, bias = .5), key = TRUE) ## circular path (with more violations violations) x <- create_ordered_data(100, k = 5, path = "circular", sd1 = 2, sd2 = 1) plot(x, cex=.2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO"), col = bluered(100, bias = .5), key = TRUE)
## lines data set from Havens and Bezdek (2011) x <- create_lines_data(100) plot(x, xlim = c(-5, 5), ylim = c(-3, 3), cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO_single"), col = bluered(100, bias = .5), key = TRUE) ## create_ordered_data can produce many types of "orderable" data ## perfect pre-Anti-Robinson matrix (with a single components) x <- create_ordered_data(100, k = 1) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias=.5), key = TRUE) ## separated components x <- create_ordered_data(100, k = 5) plot(x, cex =.2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## overlapping components x <- create_ordered_data(100, k = 5, sd1 = 2) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## introduce anti-Robinson violations (a non-zero y value) x <- create_ordered_data(100, k = 5, sd1 = 2, sd2 = 5) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "MDS"), col = bluered(100, bias = .5), key = TRUE) ## circular path (has always violations) x <- create_ordered_data(100, k = 5, path = "circular", sd1 = 2) plot(x, cex = .2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO"), col = bluered(100, bias = .5), key = TRUE) ## circular path (with more violations violations) x <- create_ordered_data(100, k = 5, path = "circular", sd1 = 2, sd2 = 1) plot(x, cex=.2, col = attr(x, "id")) d <- dist(x) pimage(d, seriate(d, "OLO"), col = bluered(100, bias = .5), key = TRUE)
Compute the value for different loss functions and merit function
for data given a permutation.
criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'array' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'dist' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'matrix' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'data.frame' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'table' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...)
criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'array' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'dist' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'matrix' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'data.frame' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...) ## S3 method for class 'table' criterion(x, order = NULL, method = NULL, force_loss = FALSE, ...)
x |
an object of class dist or a matrix (currently no functions are implemented for array). |
order |
an object of class ser_permutation suitable for
|
method |
a character vector with the names of the criteria to be
employed (see |
force_loss |
logical; should merit function be converted into loss functions by multiplying with -1? |
... |
additional parameters passed on to the criterion method. |
Criteria for distance matrices (dist)
For a symmetric dissimilarity matrix with elements
where
, the aim is generally to place low distance
values close to the diagonal. The following criteria to judge the quality of
a certain permutation of the objects in a dissimilarity matrix are currently
implemented (for a more detailed description and an experimental comparison
see Hahsler (2017)):
Gradient measures: "Gradient_raw"
, "Gradient_weighted"
(Hubert et al, 2001)
A symmetric dissimilarity matrix where the values in all rows and columns only increase when moving away from the main diagonal is called a perfect anti-Robinson matrix (Robinson 1951). A suitable merit measure which quantifies the divergence of a matrix from the anti-Robinson form is
where is a function which defines how a
violation or satisfaction of a gradient condition for an object triple
(
) is counted.
Hubert et al (2001) suggest two functions. The first function is given by:
It results in raw number of triples satisfying the gradient constraints minus triples which violate the constraints.
The second function is defined as:
It weights the each satisfaction or violation by the difference by its magnitude given by the absolute difference between the values.
Anti-Robinson events: "AR_events"
, "AR_deviations"
(Chen, 2002)
"AR_events"
counts the number of violations of the anti-Robinson form.
with
where is an indicator function returning 1 only for violations.
Chen (2002) presented a formulation for an equivalent loss function and
called the violations anti-Robinson events.
"AR_deviations"
: Chen (2002) also introduced a
weighted versions of the loss function by using
which weights each violation by the deviation.
Relative generalized Anti-Robinson events: "RGAR"
(Tien et al, 2008)
Counts Anti-Robinson events in a variable band (window specified
by w
defaults to the maximum of ) around the main diagonal
and normalizes by the maximum of possible events.
where , the maximal number of possible
anti-Robinson events in the window. The window size
represents the
number of neighboring objects (number of entries from the diagonal of the
distance matrix) are considered. The window size is
, where
smaller values result in focusing on the local structure while larger values
look at the global structure.
...
parameters are:
w
window size. Default is to use a pct
of 100% of .
pct
and alternative specification of w as a percentage of in
.
relative
logical; can be set to FALSE
to get the GAR, i.e., the absolute number of AR
events in the window.
Banded anti-Robinson form criterion: "BAR"
(Earle and Hurley, 2015)
Simplified measure for closeness to the anti-Robinson form in a band of size
with
around the diagonal.
For the measure reduces to the Hamiltonian path length. For
the measure is equivalent to ARc defined (Earle and Hurley,
2015). Note that ARc is equivalent to the Linear Seriation criterion (scaled
by 1/2).
...
parameter is: b
band size defaults to a band of 20% of .
Hamiltonian path length: "Path_length"
(Caraux and Pinloche, 2005)
The order of the objects in a dissimilarity matrix corresponds to a path through a graph where each node represents an object and is visited exactly once, i.e., a Hamilton path. The length of the path is defined as the sum of the edge weights, i.e., dissimilarities.
The length of the Hamiltonian path is equal to the value of the minimal span loss function (as used by Chen 2002). Both notions are related to the traveling salesperson problem (TSP).
If order
is not unique or there are non-finite distance values
NA
is returned.
Lazy path length: "Lazy_path_length"
(Earl and Hurley, 2015)
A weighted version of the Hamiltonian path criterion. This loss function postpones larger distances to later in the order (i.e., a lazy traveling sales person).
Earl and Hurley (2015) proposed this criterion for reordering in visualizations to concentrate on closer objects first.
Inertia criterion: "Inertia"
(Caraux and Pinloche, 2005)
Measures the moment of the inertia of dissimilarity values around the diagonal as
is used as a measure for the distance to the diagonal and
gives the weight. This criterion gives higher weight to values
farther away from the diagonal. It increases with quality.
Least squares criterion: "Least_squares"
(Caraux and Pinloche, 2005)
The sum of squared differences between distances and the rank differences:
where is an element of the
dissimilarity matrix
and
is the rank difference between
the objects.
Note that if Euclidean distance is used to calculate from a data
matrix
, the order of the elements in
by projecting them on
the first principal component of
minimizes this criterion. The
least squares criterion is related to unidimensional scaling.
Linear Seriation Criterion: "LS"
(Hubert and Schultz, 1976)
Weights the distances with the absolute rank differences.
2-Sum Criterion: "2SUM"
(Barnard, Pothen and Simon, 1993)
The 2-Sum loss criterion multiplies the similarity between objects with the squared rank differences.
where represents the similarity between objects
and
.
Absolute Spearman Correlation "Rho"
The absolute value of the Spearman rank correlation between the original distances and the rank differences in the order.
Matrix measures: "ME"
, "Moore_stress"
, "Neumann_stress"
These criteria are defined on general matrices (see
below for definitions). The dissimilarity matrix is first converted into a
similarity matrix using . If a different transformation is
required, then perform the transformation first and supply a matrix instead
of a dist object.
Criteria for matrices (matrix)
For a general matrix ,
and
, currently the following loss/merit functions are implemented:
Measure of Effectiveness: "ME"
(McCormick, 1972).
The measure of effectiveness (ME) for matrix , is defined as
with, by convention
ME is a merit measure, i.e. a higher ME indicates a better arrangement.
Maximizing ME is the objective of the bond energy algorithm (BEA). ME is not
defined for matrices with negative values. NA
is returned in this
case.
Weighted correlation coefficient: "Cor_R"
(Deutsch and Martin, 1971)
Developed as the Measure of Effectiveness for the Moment
Ordering Algorithm.
R is a merit measure normalized so that its value always lies in
. For the special case of a square matrix
corresponds
to only the main diagonal being filled,
to a random distribution
of value throughout the array, and
to the opposite diagonal only
being filled.
Matrix Stress: "Moore_stress"
, "Neumann_stress"
(Niermann, 2005)
Stress measures the conciseness of the presentation of a matrix/table and can be seen as a purity function which compares the values in a matrix/table with its neighbors. The stress measure used here is computed as the sum of squared distances of each matrix entry from its adjacent entries.
The following types of neighborhoods are available:
Moore: comprises the eight adjacent entries.
Neumann: comprises the four adjacent entries.
The major difference between the Moore and the Neumann neighborhood is that for the later the contribution of row and column permutations to stress are independent and thus can be optimized independently.
A named vector of real values.
Michael Hahsler
Barnard, S.T., A. Pothen, and H. D. Simon (1993): A Spectral Algorithm for Envelope Reduction of Sparse Matrices. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, 493–502. Supercomputing '93. New York, NY, USA: ACM.
Caraux, G. and S. Pinloche (2005): Permutmatrix: A Graphical Environment to Arrange Gene Expression Profiles in Optimal Linear Order, Bioinformatics, 21(7), 1280–1281.
Chen, C.-H. (2002): Generalized association plots: Information visualization via iteratively generated correlation matrices, Statistica Sinica, 12(1), 7–29.
Deutsch, S.B. and J.J. Martin (1971): An ordering algorithm for analysis of data arrays. Operational Research, 19(6), 1350–1362. doi:10.1287/opre.19.6.1350
Earle, D. and C.B. Hurley (2015): Advances in Dendrogram Seriation for Application to Visualization. Journal of Computational and Graphical Statistics, 24(1), 1–25. doi:10.1080/10618600.2013.874295
Hahsler, M. (2017): An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257, 133–143. doi:10.1016/j.ejor.2016.08.066
Hubert, L. and J. Schultz (1976): Quadratic Assignment as a General Data Analysis Strategy. British Journal of Mathematical and Statistical Psychology, 29(2). Blackwell Publishing Ltd. 190–241. doi:10.1111/j.2044-8317.1976.tb00714.x
Hubert, L., P. Arabie, and J. Meulman (2001): Combinatorial Data Analysis: Optimization by Dynamic Programming. Society for Industrial Mathematics. doi:10.1137/1.9780898718553
Niermann, S. (2005): Optimizing the Ordering of Tables With Evolutionary Computation, The American Statistician, 59(1), 41–46. doi:10.1198/000313005X22770
McCormick, W.T., P.J. Schweitzer and T.W. White (1972): Problem decomposition and data reorganization by a clustering technique, Operations Research, 20(5), 993-1009. doi:10.1287/opre.20.5.993
Robinson, W.S. (1951): A method for chronologically ordering archaeological deposits, American Antiquity, 16, 293–301. doi:10.2307/276978
Tien, Y-J., Yun-Shien Lee, Han-Ming Wu and Chun-Houh Chen (2008): Methods for simultaneously identifying coherent local clusters with smooth global patterns in gene expression profiles, BMC Bioinformatics, 9(155), 1–16. doi:10.1186/1471-2105-9-155
Other criterion:
registry_for_criterion_methods
## create random data and calculate distances m <- matrix(runif(20),ncol=2) d <- dist(m) ## get an order for rows (optimal for the least squares criterion) o <- seriate(d, method = "MDS") o ## compare the values for all available criteria rbind( unordered = criterion(d), ordered = criterion(d, o) ) ## compare RGAR by window size (from local to global) w <- 2:(nrow(m)-1) RGAR <- sapply(w, FUN = function (w) criterion(d, o, method="RGAR", w = w)) plot(w, RGAR, type = "b", ylim = c(0,1), xlab = "Windows size (w)", main = "RGAR by window size")
## create random data and calculate distances m <- matrix(runif(20),ncol=2) d <- dist(m) ## get an order for rows (optimal for the least squares criterion) o <- seriate(d, method = "MDS") o ## compare the values for all available criteria rbind( unordered = criterion(d), ordered = criterion(d, o) ) ## compare RGAR by window size (from local to global) w <- 2:(nrow(m)-1) RGAR <- sapply(w, FUN = function (w) criterion(d, o, method="RGAR", w = w)) plot(w, RGAR, type = "b", ylim = c(0,1), xlab = "Windows size (w)", main = "RGAR by window size")
Visualizes a dissimilarity matrix using seriation and matrix shading using the method developed by Hahsler and Hornik (2011). Entries with lower dissimilarities (higher similarity) are plotted darker. Dissimilarity plots can be used to uncover hidden structure in the data and judge cluster quality.
dissplot( x, labels = NULL, method = "spectral", control = NULL, lower_tri = TRUE, upper_tri = "average", diag = TRUE, cluster_labels = TRUE, cluster_lines = TRUE, reverse_columns = FALSE, options = NULL, ... ) ## S3 method for class 'reordered_cluster_dissimilarity_matrix' plot( x, lower_tri = TRUE, upper_tri = "average", diag = TRUE, options = NULL, ... ) ## S3 method for class 'reordered_cluster_dissimilarity_matrix' print(x, ...) ggdissplot( x, labels = NULL, method = "spectral", control = NULL, lower_tri = TRUE, upper_tri = "average", diag = TRUE, cluster_labels = TRUE, cluster_lines = TRUE, reverse_columns = FALSE, ... )
dissplot( x, labels = NULL, method = "spectral", control = NULL, lower_tri = TRUE, upper_tri = "average", diag = TRUE, cluster_labels = TRUE, cluster_lines = TRUE, reverse_columns = FALSE, options = NULL, ... ) ## S3 method for class 'reordered_cluster_dissimilarity_matrix' plot( x, lower_tri = TRUE, upper_tri = "average", diag = TRUE, options = NULL, ... ) ## S3 method for class 'reordered_cluster_dissimilarity_matrix' print(x, ...) ggdissplot( x, labels = NULL, method = "spectral", control = NULL, lower_tri = TRUE, upper_tri = "average", diag = TRUE, cluster_labels = TRUE, cluster_lines = TRUE, reverse_columns = FALSE, ... )
x |
an object of class dist. |
labels |
|
method |
A single character string indicating the seriation method used
to reorder the clusters (inter cluster seriation) as well as the objects
within each cluster (intra cluster seriation). If different algorithms for
inter and intra cluster seriation are required, Set method to A third list element (named |
control |
a list of control options passed on to the seriation
algorithm. In case of two different seriation algorithms, |
upper_tri , lower_tri , diag
|
a logical indicating whether to show the upper triangle, the lower triangle or the diagonal of the distance matrix. The string "average" can also be used to display within and between cluster averages in the two triangles. |
cluster_labels |
a logical indicating whether to display cluster labels in the plot. |
cluster_lines |
a logical indicating whether to draw lines to separate clusters. |
reverse_columns |
a logical indicating if the clusters are displayed on
the diagonal from north-west to south-east ( |
options |
a list with options for plotting the matrix (
|
... |
|
The plot can also be used to visualize cluster quality (see Ling 1973). Objects belonging to the same cluster are displayed in consecutive order. The placement of clusters and the within cluster order is obtained by a seriation algorithm which tries to place large similarities/small dissimilarities close to the diagonal. Compact clusters are visible as dark squares (low dissimilarity) on the diagonal of the plot. Additionally, a Silhouette plot (Rousseeuw 1987) is added. This visualization is similar to CLUSION (see Strehl and Ghosh 2002), however, allows for using arbitrary seriating algorithms.
Note: Since pimage()
uses grid, it should not be mixed
with base R primitive plotting functions.
dissplot()
returns an invisible object of class
cluster_proximity_matrix
with the following elements:
order |
|
cluster_order |
|
method |
vector of character strings indicating
the seriation methods used for plotting |
k |
|
description |
a |
This object can be used for plotting via plot(x, options = NULL, ...)
,
where x
is the object and options
contains a list with
plotting options (see above).
ggdissplot()
returns a ggplot2 object representing the plot.
The plot description as an object of class reordered_cluster_dissimilarity_matrix
.
Michael Hahsler
Hahsler, M. and Hornik, K. (2011): Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics, 10(2):335–354. doi:10.1198/jcgs.2010.09139
Ling, R.F. (1973): A computer generated aid for cluster analysis. Communications of the ACM, 16(6), 355–361. doi:10.1145/362248.362263
Rousseeuw, P.J. (1987): Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65. doi:10.1016/0377-0427(87)90125-7
Strehl, A. and Ghosh, J. (2003): Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 15(2), 208–230. doi:10.1287/ijoc.15.2.208.14448
Other plots:
VAT()
,
bertinplot()
,
hmap()
,
palette()
,
pimage()
data("iris") # shuffle rows x_iris <- iris[sample(seq(nrow(iris))), -5] d <- dist(x_iris) # Plot original matrix dissplot(d, method = NA) # Plot reordered matrix using the nearest insertion algorithm (from tsp) dissplot(d, method = "TSP", main = "Seriation (TSP)") # Cluster iris with k-means and 3 clusters and reorder the dissimality matrix l <- kmeans(x_iris, centers = 3)$cluster dissplot(d, labels = l, main = "k-means") # show only distances as lower triangle dissplot(d, labels = l, main = "k-means", lower_tri = TRUE, upper_tri = FALSE) # Use a grid layout to place several plots on a page library("grid") grid.newpage() pushViewport(viewport(layout=grid.layout(nrow = 2, ncol = 2), gp = gpar(fontsize = 8))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) # Visualize the clustering (using Spectral between clusters and MDS within) res <- dissplot(d, l, method = list(inter = "Spectral", intra = "MDS"), main = "K-Means + Seriation", newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) # More visualization options. Note that we reuse the reordered object res! # color: use 10 shades red-blue, biased towards small distances plot(res, main = "K-Means + Seriation (red-blue + biased)", col= bluered(10, bias = .5), newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1)) # Threshold (using zlim) and cubic scale to highlight differences plot(res, main = "K-Means + Seriation (cubic + threshold)", zlim = c(0, 2), col = grays(100, power = 3), newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2)) # Use gray scale with logistic transformation plot(res, main = "K-Means + Seriation (logistic scale)", col = gray( plogis(seq(max(res$x_reordered), min(res$x_reordered), length.out = 100), location = 2, scale = 1/2, log = FALSE) ), newpage = FALSE) popViewport(2) # The reordered_cluster_dissimilarity_matrix object res names(res) ## -------------------------------------------------------------------- ## ggplot-based dissplot if (require("ggplot2")) { library("ggplot2") # Plot original matrix ggdissplot(d, method = NA) # Plot seriated matrix ggdissplot(d, method = "TSP") + labs(title = "Seriation (TSP)") # Cluster iris with k-means and 3 clusters l <- kmeans(x_iris, centers = 3)$cluster ggdissplot(d, labels = l) + labs(title = "K-means + Seriation") # show only lower triangle ggdissplot(d, labels = l, lower_tri = TRUE, upper_tri = FALSE) + labs(title = "K-means + Seriation") # No lines or cluster labels and add a label for the color key (fill) ggdissplot(d, labels = l, cluster_lines = FALSE, cluster_labels = FALSE) + labs(title = "K-means + Seriation", fill = "Distances\n(Euclidean)") # Diverging color palette with manual set midpoint and different seriation methods ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "biased color scale") + scale_fill_gradient2(midpoint = median(d)) # Use manipulate scale using package scales library("scales") # Threshold (using limit and na.value) and cubic scale to highlight differences cubic_dist_trans <- trans_new( name = "cubic", # note that we have to do the inverse transformation for distances trans = function(x) x^(1/3), inverse = function(x) x^3 ) ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "cubic + biased color scale") + scale_fill_gradient(low = "black", high = "white", limit = c(0,2), na.value = "white", trans = cubic_dist_trans) # Use gray scale with logistic transformation logis_2_.5_dist_trans <- trans_new( name = "Logistic transform (location, scale)", # note that we have to do the inverse transformation for distances trans = function(x) plogis(x, location = 2, scale = .5, log = FALSE), inverse = function(x) qlogis(x, location = 2, scale = .5, log = FALSE), ) ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "logistic color scale") + scale_fill_gradient(low = "black", high = "white", trans = logis_2_.5_dist_trans, breaks = c(0, 1, 2, 3, 4)) }
data("iris") # shuffle rows x_iris <- iris[sample(seq(nrow(iris))), -5] d <- dist(x_iris) # Plot original matrix dissplot(d, method = NA) # Plot reordered matrix using the nearest insertion algorithm (from tsp) dissplot(d, method = "TSP", main = "Seriation (TSP)") # Cluster iris with k-means and 3 clusters and reorder the dissimality matrix l <- kmeans(x_iris, centers = 3)$cluster dissplot(d, labels = l, main = "k-means") # show only distances as lower triangle dissplot(d, labels = l, main = "k-means", lower_tri = TRUE, upper_tri = FALSE) # Use a grid layout to place several plots on a page library("grid") grid.newpage() pushViewport(viewport(layout=grid.layout(nrow = 2, ncol = 2), gp = gpar(fontsize = 8))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) # Visualize the clustering (using Spectral between clusters and MDS within) res <- dissplot(d, l, method = list(inter = "Spectral", intra = "MDS"), main = "K-Means + Seriation", newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) # More visualization options. Note that we reuse the reordered object res! # color: use 10 shades red-blue, biased towards small distances plot(res, main = "K-Means + Seriation (red-blue + biased)", col= bluered(10, bias = .5), newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1)) # Threshold (using zlim) and cubic scale to highlight differences plot(res, main = "K-Means + Seriation (cubic + threshold)", zlim = c(0, 2), col = grays(100, power = 3), newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2)) # Use gray scale with logistic transformation plot(res, main = "K-Means + Seriation (logistic scale)", col = gray( plogis(seq(max(res$x_reordered), min(res$x_reordered), length.out = 100), location = 2, scale = 1/2, log = FALSE) ), newpage = FALSE) popViewport(2) # The reordered_cluster_dissimilarity_matrix object res names(res) ## -------------------------------------------------------------------- ## ggplot-based dissplot if (require("ggplot2")) { library("ggplot2") # Plot original matrix ggdissplot(d, method = NA) # Plot seriated matrix ggdissplot(d, method = "TSP") + labs(title = "Seriation (TSP)") # Cluster iris with k-means and 3 clusters l <- kmeans(x_iris, centers = 3)$cluster ggdissplot(d, labels = l) + labs(title = "K-means + Seriation") # show only lower triangle ggdissplot(d, labels = l, lower_tri = TRUE, upper_tri = FALSE) + labs(title = "K-means + Seriation") # No lines or cluster labels and add a label for the color key (fill) ggdissplot(d, labels = l, cluster_lines = FALSE, cluster_labels = FALSE) + labs(title = "K-means + Seriation", fill = "Distances\n(Euclidean)") # Diverging color palette with manual set midpoint and different seriation methods ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "biased color scale") + scale_fill_gradient2(midpoint = median(d)) # Use manipulate scale using package scales library("scales") # Threshold (using limit and na.value) and cubic scale to highlight differences cubic_dist_trans <- trans_new( name = "cubic", # note that we have to do the inverse transformation for distances trans = function(x) x^(1/3), inverse = function(x) x^3 ) ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "cubic + biased color scale") + scale_fill_gradient(low = "black", high = "white", limit = c(0,2), na.value = "white", trans = cubic_dist_trans) # Use gray scale with logistic transformation logis_2_.5_dist_trans <- trans_new( name = "Logistic transform (location, scale)", # note that we have to do the inverse transformation for distances trans = function(x) plogis(x, location = 2, scale = .5, log = FALSE), inverse = function(x) qlogis(x, location = 2, scale = .5, log = FALSE), ) ggdissplot(d, l, method = list(inter = "Spectral", intra = "MDS")) + labs(title = "K-Means + Seriation", subtitle = "logistic color scale") + scale_fill_gradient(low = "black", high = "white", trans = logis_2_.5_dist_trans, breaks = c(0, 1, 2, 3, 4)) }
Method to get the order information from an object of class ser_permutation or ser_permutation_vector. Order information can be extracted as a permutation vector, a vector containing each object's rank or a permutation matrix.
get_order(x, ...) ## S3 method for class 'ser_permutation_vector' get_order(x, ...) ## S3 method for class 'ser_permutation' get_order(x, dim = 1, ...) ## S3 method for class 'hclust' get_order(x, ...) ## S3 method for class 'dendrogram' get_order(x, ...) ## S3 method for class 'integer' get_order(x, ...) ## S3 method for class 'numeric' get_order(x, ...) get_rank(x, ...) get_permutation_matrix(x, ...)
get_order(x, ...) ## S3 method for class 'ser_permutation_vector' get_order(x, ...) ## S3 method for class 'ser_permutation' get_order(x, dim = 1, ...) ## S3 method for class 'hclust' get_order(x, ...) ## S3 method for class 'dendrogram' get_order(x, ...) ## S3 method for class 'integer' get_order(x, ...) ## S3 method for class 'numeric' get_order(x, ...) get_rank(x, ...) get_permutation_matrix(x, ...)
x |
an object of class ser_permutation or ser_permutation_vector. |
... |
further arguments are ignored for |
dim |
order information for which dimension should be returned? |
get_order()
returns the permutation as an integer vector which arranges the
objects in the seriation order. That is, a vector with the index of the first,
second, -th object in the order defined by the permutation.
These permutation vectors can directly be
used to reorder objects using subsetting with
"["
. Note: In
seriation we usually use these order-based permutation vectors.
Note on names: While R's order()
returns an unnamed vector,
get_order()
returns names (if available). The names are the object label
corresponding to the index at that position.
Therefore, the names in the order are in the order after
the permutation.
get_rank()
returns the seriation as an integer vector containing the
rank/position for each objects after the permutation is applied.
That is, a vector with the position of the first, second,
-th object after permutation. Note: Use
order()
to convert ranks back to an order.
get_permutation_matrix()
returns a permutation
matrix.
Returns an integer permutation vector/a permutation matrix.
Michael Hahsler
Other permutation:
permutation_vector2matrix()
,
permute()
,
ser_dist()
,
ser_permutation()
,
ser_permutation_vector()
## create a random ser_permutation_vector ## Note that ser_permutation_vector is a single permutation vector x <- structure(1:10, names = paste0("X", 1:10)) o <- sample(x) o p <- ser_permutation_vector(o) p get_order(p) get_rank(p) get_permutation_matrix(p) ## reorder objects using subsetting, the provided permute function or by ## multiplying the with the permutation matrix. We use here x[get_order(p)] permute(x, p) drop(get_permutation_matrix(p) %*% x) ## ser_permutation contains one permutation vector for each dimension p2 <- ser_permutation(p, sample(5)) p2 get_order(p2, dim = 2) get_rank(p2, dim = 2) get_permutation_matrix(p2, dim = 2)
## create a random ser_permutation_vector ## Note that ser_permutation_vector is a single permutation vector x <- structure(1:10, names = paste0("X", 1:10)) o <- sample(x) o p <- ser_permutation_vector(o) p get_order(p) get_rank(p) get_permutation_matrix(p) ## reorder objects using subsetting, the provided permute function or by ## multiplying the with the permutation matrix. We use here x[get_order(p)] permute(x, p) drop(get_permutation_matrix(p) %*% x) ## ser_permutation contains one permutation vector for each dimension p2 <- ser_permutation(p, sample(5)) p2 get_order(p2, dim = 2) get_rank(p2, dim = 2) get_permutation_matrix(p2, dim = 2)
Provides heatmaps reordered using several different seriation methods. This includes dendrogram based reordering with optimal leaf order and matrix seriation-based heat maps.
hmap( x, distfun = stats::dist, method = "OLO_complete", control = NULL, scale = c("none", "row", "column"), plot_margins = "auto", col = NULL, col_dist = grays(power = 2), row_labels = NULL, col_labels = NULL, ... ) gghmap( x, distfun = stats::dist, method = "OLO_complete", control = NULL, scale = c("none", "row", "column"), prop = FALSE, ... )
hmap( x, distfun = stats::dist, method = "OLO_complete", control = NULL, scale = c("none", "row", "column"), plot_margins = "auto", col = NULL, col_dist = grays(power = 2), row_labels = NULL, col_labels = NULL, ... ) gghmap( x, distfun = stats::dist, method = "OLO_complete", control = NULL, scale = c("none", "row", "column"), prop = FALSE, ... )
x |
a matrix or a dissimilarity matrix of class dist. If a
dissimilarity matrix is used, then the |
distfun |
function used to compute the distance (dissimilarity) between
both rows and columns. For |
method |
a character strings indicating the used seriation algorithm
(see |
control |
a list of control options passed on to the seriation
algorithm specified in |
scale |
character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Default is none. |
plot_margins |
character indicating what to show in the margins. Options are:
|
col |
a list of colors used. |
col_dist |
colors used for displaying distances. |
row_labels , col_labels
|
a logical indicating if row and column labels
in |
... |
further arguments passed on to |
prop |
logical; change the aspect ratio so cells in the image have a equal width and height. |
For dendrogram based heat maps, the arguments are passed on to
stats::heatmap()
in stats. The following arguments for heatmap()
cannot be used: margins
, Rowv
, Colv
, hclustfun
, reorderfun
.
For seriation-based heat maps further arguments include:
gp
an object of class gpar
containing graphical
parameters (see gpar()
in package grid).
newpage
a logical indicating whether to start plot on a new
page
prop
a logical indicating whether the height and width of x
should
be plotted proportional to its dimensions.
showdist
Display seriated dissimilarity matrices? Values are
"none"
, "both"
, "rows"
or "columns"
.
key
logical; show a colorkey?
key.lab
Label plotted next to the color key.
margins
bottom and right-hand-side margins are calculated
automatically or can be specifies as a vector of two numbers (in lines).
zlim
range of values displayed.
col
, col_dist
color palettes used.
An invisible list with elements:
rowInd , colInd
|
index permutation vectors. |
reorder_method |
name of the method used to reorder the matrix. |
The list may contain additional elements (dendrograms, colors, etc).
Michael Hahsler
Other plots:
VAT()
,
bertinplot()
,
dissplot()
,
palette()
,
pimage()
data("Wood") # regular heatmap from package stats heatmap(Wood, main = "Wood (standard heatmap)") # Default heatmap does Euclidean distance, hierarchical clustering with # complete-link and optimal leaf ordering. Note that the rows are # ordered top-down in the seriation order (stats::heatmap orders in reverse) hmap(Wood, main = "Wood (opt. leaf ordering)") hmap(Wood, plot_margins = "distances", main = "Wood (opt. leaf ordering)") hmap(Wood, plot_margins = "none", main = "Wood (opt. leaf ordering)") # Heatmap with correlation-based distance, green-red color (greenred is # predefined) and optimal leaf ordering and no row label dist_cor <- function(x) as.dist(sqrt(1 - cor(t(x)))) hmap(Wood, distfun = dist_cor, col = greenred(100), main = "Wood (reorded by corr. between obs.)") # Heatmap with order based on the angle in two-dimensional MDS space. hmap(Wood, method = "MDS_angle", col = greenred(100), row_labels = FALSE, main = "Wood (reorderd using ange in MDS space)") # Heatmap for distances d <- dist(Wood) hmap(d, main = "Wood (Euclidean distances)") # order-based with dissimilarity matrices hmap(Wood, method = "MDS_angle", col = greenred(100), col_dist = greens(100, power = 2), keylab = "norm. Expression", main = "Wood (reorderd with distances)") # without the distance matrices hmap(Wood, method = "MDS_angle", plot_margins = "none", col = greenred(100), main = "Wood (reorderd without distances)") # Manually create a simple heatmap with pimage. o <- seriate(Wood, method = "heatmap", control = list(dist_fun = dist, seriation_method = "OLO_ward")) o pimage(Wood, o) # Note: method heatmap calculates reorderd hclust objects which can be used # for many heatmap implementations like the standard implementation in # package stats. heatmap(Wood, Rowv = as.dendrogram(o[[1]]), Colv = as.dendrogram(o[[2]])) # ggplot 2 version does not support dendrograms in the margin (for now) if (require("ggplot2")) { library("ggplot2") gghmap(Wood) + labs(title = "Wood", subtitle = "Optimal leaf ordering") # More parameters (see ? ggpimage): reverse column order and flip axes, make a proportional plot gghmap(Wood, reverse_columns = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") gghmap(Wood, flip_axes = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") gghmap(Wood, flip_axes = TRUE, prop = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") dist_cor <- function(x) as.dist(sqrt(1 - cor(t(x)))) gghmap(Wood, distfun = dist_cor) + labs(title = "Wood", subtitle = "Reorded by correlation between observations") + scale_fill_gradient2(low = "darkgreen", high = "red") gghmap(d, prop = TRUE) + labs(title = "Wood", subtitle = "Euclidean distances, reordered") # Note: the ggplot2-based version currently cannot show distance matrices # in the same plot. # Manually seriate and plot as pimage. o <- seriate(Wood, method = "heatmap", control = list(dist_fun = dist, seriation_method = "OLO_ward")) o ggpimage(Wood, o) }
data("Wood") # regular heatmap from package stats heatmap(Wood, main = "Wood (standard heatmap)") # Default heatmap does Euclidean distance, hierarchical clustering with # complete-link and optimal leaf ordering. Note that the rows are # ordered top-down in the seriation order (stats::heatmap orders in reverse) hmap(Wood, main = "Wood (opt. leaf ordering)") hmap(Wood, plot_margins = "distances", main = "Wood (opt. leaf ordering)") hmap(Wood, plot_margins = "none", main = "Wood (opt. leaf ordering)") # Heatmap with correlation-based distance, green-red color (greenred is # predefined) and optimal leaf ordering and no row label dist_cor <- function(x) as.dist(sqrt(1 - cor(t(x)))) hmap(Wood, distfun = dist_cor, col = greenred(100), main = "Wood (reorded by corr. between obs.)") # Heatmap with order based on the angle in two-dimensional MDS space. hmap(Wood, method = "MDS_angle", col = greenred(100), row_labels = FALSE, main = "Wood (reorderd using ange in MDS space)") # Heatmap for distances d <- dist(Wood) hmap(d, main = "Wood (Euclidean distances)") # order-based with dissimilarity matrices hmap(Wood, method = "MDS_angle", col = greenred(100), col_dist = greens(100, power = 2), keylab = "norm. Expression", main = "Wood (reorderd with distances)") # without the distance matrices hmap(Wood, method = "MDS_angle", plot_margins = "none", col = greenred(100), main = "Wood (reorderd without distances)") # Manually create a simple heatmap with pimage. o <- seriate(Wood, method = "heatmap", control = list(dist_fun = dist, seriation_method = "OLO_ward")) o pimage(Wood, o) # Note: method heatmap calculates reorderd hclust objects which can be used # for many heatmap implementations like the standard implementation in # package stats. heatmap(Wood, Rowv = as.dendrogram(o[[1]]), Colv = as.dendrogram(o[[2]])) # ggplot 2 version does not support dendrograms in the margin (for now) if (require("ggplot2")) { library("ggplot2") gghmap(Wood) + labs(title = "Wood", subtitle = "Optimal leaf ordering") # More parameters (see ? ggpimage): reverse column order and flip axes, make a proportional plot gghmap(Wood, reverse_columns = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") gghmap(Wood, flip_axes = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") gghmap(Wood, flip_axes = TRUE, prop = TRUE) + labs(title = "Wood", subtitle = "Optimal leaf ordering") dist_cor <- function(x) as.dist(sqrt(1 - cor(t(x)))) gghmap(Wood, distfun = dist_cor) + labs(title = "Wood", subtitle = "Reorded by correlation between observations") + scale_fill_gradient2(low = "darkgreen", high = "red") gghmap(d, prop = TRUE) + labs(title = "Wood", subtitle = "Euclidean distances, reordered") # Note: the ggplot2-based version currently cannot show distance matrices # in the same plot. # Manually seriate and plot as pimage. o <- seriate(Wood, method = "heatmap", control = list(dist_fun = dist, seriation_method = "OLO_ward")) o ggpimage(Wood, o) }
A data matrix containing the results of 8 referenda for 41 Irish communities used in Falguerolles et al (1997).
The format is a 41 x 9 matrix. Two values are missing.
Column 6 contains the size of the Electorate in 1992.
The data was kindly provided by Guenter Sawitzki.
de Falguerolles, A., Friedrich, F., Sawitzki, G. (1997) A Tribute to J. Bertin's Graphical Data Analysis. In: Proceedings of the SoftStat '97 (Advances in Statistical Software 6), 11–20.
Other data:
Chameleon
,
Munsingen
,
SupremeCourt
,
Townships
,
Wood
,
Zoo
,
create_lines_data()
,
is.robinson()
data(Irish)
data(Irish)
Provides functions to create and recognize (anti) Robinson and pre-Robinson matrices. A (anti) Robinson matrix has strictly decreasing (increasing) values when moving away from the main diagonal. A pre-Robinson matrix is a matrix which can be transformed into a perfect Robinson matrix using simultaneous permutations of rows and columns.
is.robinson(x, anti = TRUE, pre = FALSE) random.robinson(n, anti = TRUE, pre = FALSE, noise = 0)
is.robinson(x, anti = TRUE, pre = FALSE) random.robinson(n, anti = TRUE, pre = FALSE, noise = 0)
x |
a symmetric, positive matrix or a dissimilarity matrix (a
|
anti |
logical; check for anti Robinson structure? Note that for distances, anti Robinson structure is appropriate. |
pre |
logical; recognize/create pre-Robinson matrices. |
n |
number of objects. |
noise |
noise intensity between 0 and 1. Zero means no noise. Noise more than zero results in non-Robinson matrices. |
Note that the default matrices are anti Robinson matrices. This is done because distance matrices (the default in R) are typically anti Robinson matrices with values increasing when moving away from the diagonal.
Robinson matrices are recognized using the fact that they have zero anti Robinson events. For pre-Robinson matrices we use spectral seriation first since spectral seriation is guaranteed to perfectly reorder pre-Robinson matrices (see Laurent and Seminaroti, 2015).
Random pre-Robinson matrices are generated by reversing the process of
unidimensional scaling. We randomly (uniform distribution with range
) choose
coordinates for
n
points on a straight
line and calculate the pairwise distances. For Robinson matrices, the points
are sorted first according to . For noise,
coordinates is
added. The coordinates are chosen uniformly between 0 and
noise
, with
noise
between 0 and 1.
A single logical value.
M. Laurent, M. Seminaroti (2015): The quadratic assignment problem is easy for Robinsonian matrices with Toeplitz structure, Operations Research Letters 43(1), 103–109.
Other data:
Chameleon
,
Irish
,
Munsingen
,
SupremeCourt
,
Townships
,
Wood
,
Zoo
,
create_lines_data()
## create a perfect anti Robinson structure m <- random.robinson(10) pimage(m) is.robinson(m) ## permute the structure to make it not Robinsonian. However, ## it is still pre-Robinson. o <- sample(10) m2 <- permute(m, ser_permutation(o,o)) pimage(m2) is.robinson(m2) is.robinson(m2, pre = TRUE) ## create a binary random Robinson matrix (not anti Robinson) m3 <- random.robinson(10, anti = FALSE) > .7 pimage(m3) is.robinson(m3, anti = FALSE) ## create matrices with noise (as distance matrices) m4 <- as.dist(random.robinson(50, pre = FALSE, noise = .1)) pimage(m4) criterion(m4, method = "AR") m5 <- as.dist(random.robinson(50, pre = FALSE, noise = .5)) pimage(m5) criterion(m5, method = "AR")
## create a perfect anti Robinson structure m <- random.robinson(10) pimage(m) is.robinson(m) ## permute the structure to make it not Robinsonian. However, ## it is still pre-Robinson. o <- sample(10) m2 <- permute(m, ser_permutation(o,o)) pimage(m2) is.robinson(m2) is.robinson(m2, pre = TRUE) ## create a binary random Robinson matrix (not anti Robinson) m3 <- random.robinson(10, anti = FALSE) > .7 pimage(m3) is.robinson(m3, anti = FALSE) ## create matrices with noise (as distance matrices) m4 <- as.dist(random.robinson(50, pre = FALSE, noise = .1)) pimage(m4) criterion(m4, method = "AR") m5 <- as.dist(random.robinson(50, pre = FALSE, noise = .5)) pimage(m5) criterion(m5, method = "AR")
Performs the non linear dimensionality reduction method locally linear embedding proposed in Roweis and Saul (2000).
lle(x, m, k, reg = 2)
lle(x, m, k, reg = 2)
x |
a matrix. |
m |
dimensions of the desired embedding. |
k |
number of neighbors. |
reg |
regularization method. 1, 2 and 3, by default 2. See details. |
LLE tries to find a lower-dimensional projection which preserves distances within local neighborhoods. This is done by (1) find for each object the k nearest neighbors, (2) construct the LLE weight matrix which represents each point as a linear combination of its neighborhood, and (2) perform partial eigenvalue decomposition to find the embedding.
The reg
parameter allows the decision between different regularization methods.
As one step of the LLE algorithm, the inverse of the Gram-matrix
has to be calculated. The rank of
equals
which is mostly smaller
than
- this is why a regularization
should be performed.
The calculation of regularization parameter
can be done using different methods:
reg = 1
: standardized sum of eigenvalues of (Roweis and Saul; 2000)
reg = 2
(default): trace of Gram-matrix divided by (Grilli, 2007)
reg = 3
: constant value 3*10e-3
a matrix of vector with the embedding.
Michael Hahsler (based on code by Holger Diedrich and Markus Abel)
Roweis, Sam T. and Saul, Lawrence K. (2000), Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science, 290(5500), 2323–2326. doi:10.1126/science.290.5500.2323
Grilli, Elisa (2007) Automated Local Linear Embedding with an application to microarray data, Dissertation thesis, University of Bologna. doi:10.6092/unibo/amsdottorato/380
data(iris) x <- iris[, -5] # project iris on 2 dimensions conf <- lle(x, m = 2, k = 30) conf plot(conf, col = iris[, 5]) # project iris onto a single dimension conf <- lle(x, m = 1, k = 30) conf plot_config(conf, col = iris[, 5], labels = FALSE)
data(iris) x <- iris[, -5] # project iris on 2 dimensions conf <- lle(x, m = 2, k = 30) conf plot(conf, col = iris[, 5]) # project iris onto a single dimension conf <- lle(x, m = 1, k = 30) conf plot_config(conf, col = iris[, 5], labels = FALSE)
Definition of different local neighborhood functions for the method "SA"
for seriate()
.
LS_swap(o, pos = sample.int(length(o), 2)) LS_insert(o, pos = sample.int(length(o), 2)) LS_reverse(o, pos = sample.int(length(o), 2)) LS_mixed(o, pos = sample.int(length(o), 2))
LS_swap(o, pos = sample.int(length(o), 2)) LS_insert(o, pos = sample.int(length(o), 2)) LS_reverse(o, pos = sample.int(length(o), 2)) LS_mixed(o, pos = sample.int(length(o), 2))
o |
an integer vector with the order |
pos |
random positions used for the local move. |
Local neighborhood functions are LS_insert
, LS_swap
, LS_reverse
, and LS_mix
(1/3 insertion, 1/3 swap and 1/3 reverse). Any neighborhood function can be defined.
returns the new order vector representing the random neighbor.
This data set contains a grave times artifact incidence matrix for the Celtic Münsingen-Rain cemetery in Switzerland as provided by Hodson (1968) and published by Kendall 1971.
A 59 x 70 0-1 matrix. Rows (graves) and columns (artifacts) are in the order determined by Hodson (1968).
Hodson, F.R. (1968). The La Tene Cemetery at Münsingen-Rain, Stämpfli, Bern.
Kendall, D.G. (1971): Seriation from abundance matrices. In: Hodson, F.R., Kendall, D.G. and Tautu, P., (Editors), Mathematics in the Archaeological and Historical Sciences, Edinburgh University Press, Edinburgh, 215–232.
Other data:
Chameleon
,
Irish
,
SupremeCourt
,
Townships
,
Wood
,
Zoo
,
create_lines_data()
,
is.robinson()
data("Munsingen") ## Seriation method after Kendall (1971) ## Kendall's square symmetric matrix S and SoS S <- function(x, w = 1) { sij <- function(i , j) w * sum(pmin(x[i,], x[j,])) h <- nrow(x) r <- matrix(ncol = h, nrow =h) for(i in 1:h) for (j in 1:h) r[i,j] <- sij(i,j) r } SoS <- function(x) S(S(x)) ## Kendall's horse shoe (Hamiltonian arc) horse_shoe_plot <- function(mds, sigma, threshold = mean(sigma), ...) { plot(mds, main = paste("Kendall's horse shoe with th =", threshold), ...) l <- which(sigma > threshold, arr.ind=TRUE) for(i in 1:nrow(l)) lines(rbind(mds[l[i,1],], mds[l[i,2],])) } ## shuffle data x <- Munsingen[sample(nrow(Munsingen)),] ## calculate matrix and do isoMDS (from package MASS) sigma <- SoS(x) library("MASS") mds <- isoMDS(1/(1+sigma))$points ## plot Kendall's horse shoe horse_shoe_plot(mds, sigma) ## find order using a TSP library("TSP") tour <- solve_TSP(insert_dummy(TSP(dist(mds)), label = "cut"), method = "2-opt", control = list(rep = 15)) tour <- cut_tour(tour, "cut") lines(mds[tour,], col = "red", lwd = 2) ## create and plot order order <- ser_permutation(tour, 1:ncol(x)) bertinplot(x, order, options= list(panel=panel.circles, rev = TRUE)) ## compare criterion values rbind( random = criterion(x), reordered = criterion(x, order), Hodson = criterion(Munsingen) )
data("Munsingen") ## Seriation method after Kendall (1971) ## Kendall's square symmetric matrix S and SoS S <- function(x, w = 1) { sij <- function(i , j) w * sum(pmin(x[i,], x[j,])) h <- nrow(x) r <- matrix(ncol = h, nrow =h) for(i in 1:h) for (j in 1:h) r[i,j] <- sij(i,j) r } SoS <- function(x) S(S(x)) ## Kendall's horse shoe (Hamiltonian arc) horse_shoe_plot <- function(mds, sigma, threshold = mean(sigma), ...) { plot(mds, main = paste("Kendall's horse shoe with th =", threshold), ...) l <- which(sigma > threshold, arr.ind=TRUE) for(i in 1:nrow(l)) lines(rbind(mds[l[i,1],], mds[l[i,2],])) } ## shuffle data x <- Munsingen[sample(nrow(Munsingen)),] ## calculate matrix and do isoMDS (from package MASS) sigma <- SoS(x) library("MASS") mds <- isoMDS(1/(1+sigma))$points ## plot Kendall's horse shoe horse_shoe_plot(mds, sigma) ## find order using a TSP library("TSP") tour <- solve_TSP(insert_dummy(TSP(dist(mds)), label = "cut"), method = "2-opt", control = list(rep = 15)) tour <- cut_tour(tour, "cut") lines(mds[tour,], col = "red", lwd = 2) ## create and plot order order <- ser_permutation(tour, 1:ncol(x)) bertinplot(x, order, options= list(panel=panel.circles, rev = TRUE)) ## compare criterion values rbind( random = criterion(x), reordered = criterion(x, order), Hodson = criterion(Munsingen) )
Defines several color palettes for pimage()
, dissplot()
and
hmap()
.
bluered(n = 100, bias = 1, power = 1, ...) greenred(n = 100, bias = 1, power = 1, ...) reds(n = 100, bias = 1, power = 1, ...) blues(n = 100, bias = 1, power = 1, ...) greens(n = 100, bias = 1, power = 1, ...) greys(n = 100, bias = 1, power = 1, ...) grays(n = 100, bias = 1, power = 1, ...)
bluered(n = 100, bias = 1, power = 1, ...) greenred(n = 100, bias = 1, power = 1, ...) reds(n = 100, bias = 1, power = 1, ...) blues(n = 100, bias = 1, power = 1, ...) greens(n = 100, bias = 1, power = 1, ...) greys(n = 100, bias = 1, power = 1, ...) grays(n = 100, bias = 1, power = 1, ...)
n |
number of different colors produces. |
bias |
a positive number. Higher values give more widely spaced colors at the high end. |
power |
used to control how chroma and luminance is increased (1 = linear, 2 = quadratic, etc.) |
... |
further parameters are passed on to |
The color palettes are created with colorspace::sequential_hcl()
and
colorspace::diverging_hcl()
.
The two sequential palettes are: reds()
and grays()
(or
greys()
).
The two diverging palettes are: bluered()
and greenred()
.
A vector with n
colors.
Michael Hahsler
Other plots:
VAT()
,
bertinplot()
,
dissplot()
,
hmap()
,
pimage()
m <- outer(1:10,1:10) m pimage(m) pimage(m, col = greys(100, power = 2)) pimage(m, col = greys(100, bias = 2)) pimage(m, col = bluered(100)) pimage(m, col = bluered(100, power = .5)) pimage(m, col = bluered(100, bias = 2)) pimage(m - 25, col = greenred(20, bias = 2)) ## choose your own color palettes library(colorspace) hcl_palettes(plot = TRUE) ## blues (with 20 shades) pimage(m, col = colorspace::sequential_hcl(20, "Blues", rev = TRUE)) ## blue to green (aka "Cork") pimage(m, col = colorspace::diverging_hcl(100, "Cork"))
m <- outer(1:10,1:10) m pimage(m) pimage(m, col = greys(100, power = 2)) pimage(m, col = greys(100, bias = 2)) pimage(m, col = bluered(100)) pimage(m, col = bluered(100, power = .5)) pimage(m, col = bluered(100, bias = 2)) pimage(m - 25, col = greenred(20, bias = 2)) ## choose your own color palettes library(colorspace) hcl_palettes(plot = TRUE) ## blues (with 20 shades) pimage(m, col = colorspace::sequential_hcl(20, "Blues", rev = TRUE)) ## blue to green (aka "Cork") pimage(m, col = colorspace::diverging_hcl(100, "Cork"))
Converts between permutation vectors and matrices.
permutation_vector2matrix(x) permutation_matrix2vector(x)
permutation_vector2matrix(x) permutation_matrix2vector(x)
x |
A permutation vector (any object that can be converted into a
permutation vector, e.g., a integer vector or a |
permutation_vector2matrix()
: returns a permutation matrix.
permutation_matrix2vector()
: returns the permutation as a integer vector.
Michael Hahsler
Other permutation:
get_order()
,
permute()
,
ser_dist()
,
ser_permutation()
,
ser_permutation_vector()
## create a random permutation vector pv <- structure(sample(5), names = paste0("X", 1:5)) pv ## convert into a permutation matrix pm <- permutation_vector2matrix(pv) pm ## convert back permutation_matrix2vector(pm)
## create a random permutation vector pv <- structure(sample(5), names = paste0("X", 1:5)) pv ## convert into a permutation matrix pm <- permutation_vector2matrix(pv) pm ## convert back permutation_matrix2vector(pm)
Provides the generic function and methods for permuting the order of various
objects including vectors, lists, dendrograms (also hclust
objects),
the order of observations in a dist
object, the rows and columns of a
matrix or data.frame, and all dimensions of an array given a suitable
ser_permutation object.
permute(x, order, ...) ## S3 method for class 'array' permute(x, order, margin = NULL, ...) ## S3 method for class 'matrix' permute(x, order, margin = NULL, ...) ## S3 method for class 'data.frame' permute(x, order, margin = NULL, ...) ## S3 method for class 'table' permute(x, order, margin = NULL, ...) ## S3 method for class 'numeric' permute(x, order, ...) ## S3 method for class 'character' permute(x, order, ...) ## S3 method for class 'list' permute(x, order, ...) ## S3 method for class 'dist' permute(x, order, ...) ## S3 method for class 'dendrogram' permute(x, order, dist = NULL, ...) ## S3 method for class 'hclust' permute(x, order, dist = NULL, ...)
permute(x, order, ...) ## S3 method for class 'array' permute(x, order, margin = NULL, ...) ## S3 method for class 'matrix' permute(x, order, margin = NULL, ...) ## S3 method for class 'data.frame' permute(x, order, margin = NULL, ...) ## S3 method for class 'table' permute(x, order, margin = NULL, ...) ## S3 method for class 'numeric' permute(x, order, ...) ## S3 method for class 'character' permute(x, order, ...) ## S3 method for class 'list' permute(x, order, ...) ## S3 method for class 'dist' permute(x, order, ...) ## S3 method for class 'dendrogram' permute(x, order, dist = NULL, ...) ## S3 method for class 'hclust' permute(x, order, dist = NULL, ...)
x |
an object (a list, a vector, a |
order |
an object of class ser_permutation which contains
suitable permutation vectors for |
... |
if |
margin |
specifies the dimensions to be permuted as a vector with dimension indices.
If |
dist |
the distance matrix used to create the dendrogram. Only needed if order is the name of a seriation method. |
The permutation vectors in ser_permutation are suitable if the number
of permutation vectors matches the number of dimensions of x
and if
the length of each permutation vector has the same length as the
corresponding dimension of x
.
For 1-dimensional/1-mode data (list, vector, dist
), order
can
also be a single permutation vector of class ser_permutation_vector
or data which can be automatically coerced to this class (e.g. a numeric
vector).
For dendrogram
and hclust
, subtrees are rotated to represent
the order best possible. If the order is not achieved perfectly then the
user is warned. See also reorder.hclust()
for
reordering hclust
objects.
A permuted object of the same class as x
.
Michael Hahsler
Other permutation:
get_order()
,
permutation_vector2matrix()
,
ser_dist()
,
ser_permutation()
,
ser_permutation_vector()
# List data types for permute methods("permute") # Permute matrix m <- matrix(rnorm(10), 5, 2, dimnames = list(1:5, LETTERS[1:2])) m # Permute rows and columns o <- ser_permutation(5:1, 2:1) o permute(m, o) ## permute only columns permute(m, o, margin = 2) ## permute using PCA seriation permute(m, "PCA") ## permute only rows using PCA permute(m, "PCA", margin = 1) # Permute data.frames using heatmap seration (= hierarchical # clustering + optimal leaf ordering) df <- as.data.frame(m) permute(df, "Heatmap") # Permute objects in a dist object d <- dist(m) d permute(d, c(3, 2, 1, 4, 5)) permute(d, "Spectral") # Permute a list l <- list(a = 1:5, b = letters[1:3], c = 0) l permute(l, c(2, 3, 1)) # Permute to reorder dendrogram (see also reorder.hclust) hc <- hclust(d) plot(hc) plot(permute(hc, 5:1)) plot(permute(hc, 5:1, incompartible = "stop")) plot(permute(hc, "OLO", dist = d)) plot(permute(hc, "GW", dist = d)) plot(permute(hc, "MDS", dist = d)) plot(permute(hc, "TSP", dist = d))
# List data types for permute methods("permute") # Permute matrix m <- matrix(rnorm(10), 5, 2, dimnames = list(1:5, LETTERS[1:2])) m # Permute rows and columns o <- ser_permutation(5:1, 2:1) o permute(m, o) ## permute only columns permute(m, o, margin = 2) ## permute using PCA seriation permute(m, "PCA") ## permute only rows using PCA permute(m, "PCA", margin = 1) # Permute data.frames using heatmap seration (= hierarchical # clustering + optimal leaf ordering) df <- as.data.frame(m) permute(df, "Heatmap") # Permute objects in a dist object d <- dist(m) d permute(d, c(3, 2, 1, 4, 5)) permute(d, "Spectral") # Permute a list l <- list(a = 1:5, b = letters[1:3], c = 0) l permute(l, c(2, 3, 1)) # Permute to reorder dendrogram (see also reorder.hclust) hc <- hclust(d) plot(hc) plot(permute(hc, 5:1)) plot(permute(hc, 5:1, incompartible = "stop")) plot(permute(hc, "OLO", dist = d)) plot(permute(hc, "GW", dist = d)) plot(permute(hc, "MDS", dist = d)) plot(permute(hc, "TSP", dist = d))
Provides methods for matrix shading, i.e., displaying a color image for
matrix (including correlation matrices and data frames) and dist
objects given an
optional permutation. The plot arranges colored rectangles to represent the
values in the matrix. This visualization is also know as a heatmap.
Implementations based on the
grid graphics engine and based n ggplot2 are provided.
pimage(x, order = FALSE, ...) ## S3 method for class 'matrix' pimage( x, order = FALSE, col = NULL, main = "", xlab = "", ylab = "", zlim = NULL, key = TRUE, keylab = "", symkey = TRUE, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = isSymmetric(x), flip_axes = FALSE, reverse_columns = FALSE, ..., newpage = TRUE, pop = TRUE, gp = NULL ) ## S3 method for class 'table' pimage(x, order = NULL, ...) ## S3 method for class 'data.frame' pimage(x, order = NULL, ...) ## S3 method for class 'dist' pimage( x, order = NULL, col = NULL, main = "", xlab = "", ylab = "", zlim = NULL, key = TRUE, keylab = "", symkey = TRUE, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = TRUE, flip_axes = FALSE, reverse_columns = FALSE, ..., newpage = TRUE, pop = TRUE, gp = NULL ) ggpimage(x, order = NULL, ...) ## S3 method for class 'matrix' ggpimage( x, order = NULL, zlim = NULL, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = isSymmetric(x), flip_axes = FALSE, reverse_columns = FALSE, ... ) ## S3 method for class 'dist' ggpimage( x, order = NULL, zlim = NULL, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = TRUE, flip_axes = FALSE, reverse_columns = FALSE, ... )
pimage(x, order = FALSE, ...) ## S3 method for class 'matrix' pimage( x, order = FALSE, col = NULL, main = "", xlab = "", ylab = "", zlim = NULL, key = TRUE, keylab = "", symkey = TRUE, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = isSymmetric(x), flip_axes = FALSE, reverse_columns = FALSE, ..., newpage = TRUE, pop = TRUE, gp = NULL ) ## S3 method for class 'table' pimage(x, order = NULL, ...) ## S3 method for class 'data.frame' pimage(x, order = NULL, ...) ## S3 method for class 'dist' pimage( x, order = NULL, col = NULL, main = "", xlab = "", ylab = "", zlim = NULL, key = TRUE, keylab = "", symkey = TRUE, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = TRUE, flip_axes = FALSE, reverse_columns = FALSE, ..., newpage = TRUE, pop = TRUE, gp = NULL ) ggpimage(x, order = NULL, ...) ## S3 method for class 'matrix' ggpimage( x, order = NULL, zlim = NULL, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = isSymmetric(x), flip_axes = FALSE, reverse_columns = FALSE, ... ) ## S3 method for class 'dist' ggpimage( x, order = NULL, zlim = NULL, upper_tri = TRUE, lower_tri = TRUE, diag = TRUE, row_labels = NULL, col_labels = NULL, prop = TRUE, flip_axes = FALSE, reverse_columns = FALSE, ... )
x |
a matrix, a data.frame, or an object of class |
order |
a logical where |
... |
if |
col |
a list of colors used. If |
main |
plot title. |
xlab , ylab
|
labels for the x and y axes. |
zlim |
vector with two elements giving the range (min, max) for representing the values in the matrix. |
key |
logical; add a color key? No key is available for logical matrices. |
keylab |
string plotted next to the color key. |
symkey |
logical; if |
upper_tri , lower_tri , diag
|
a logical indicating whether to show the upper triangle, the lower triangle or the diagonal of the (distance) matrix. |
row_labels , col_labels
|
a logical indicating if row and column labels
in |
prop |
logical; change the aspect ratio so cells in the image have a equal width and height. |
flip_axes |
logical; exchange rows and columns for plotting. |
reverse_columns |
logical; revers the order of how the columns are displayed. |
newpage , pop , gp
|
Start plot on a new page, pop the viewports after
plotting, and use the supplied |
Plots a matrix in its original row and column orientation (image in stats reverses the rows). This means, in a plot the columns become the x-coordinates and the rows the y-coordinates (in reverse order).
Grid-based plot: The viewports used for plotting are called:
"plot"
, "image"
and "colorkey"
. Use grid functions
to manipulate the plots (see Examples section).
ggplot2-based plot: A ggplot2 object is returned. Colors, axis limits
and other visual aspects can be added using standard ggplot2 functions
(labs
, scale_fill_continuous
, labs
, etc.).
Nothing.
Christian Buchta and Michael Hahsler
Other plots:
VAT()
,
bertinplot()
,
dissplot()
,
hmap()
,
palette()
set.seed(1234) data(iris) x <- as.matrix(iris[sample(nrow(iris), 20) , -5]) pimage(x) # Show all labels and flip axes, reverse columns, or change colors pimage(x, prop = TRUE) pimage(x, flip_axes = TRUE) pimage(x, reverse_columns = TRUE) pimage(x, col = grays(100)) # A matrix with positive and negative values x_scaled <- scale(x) pimage(x_scaled) # Use reordering pimage(x_scaled, order = TRUE) pimage(x_scaled, order = "Heatmap") ## Example: Distance Matrix # Show a reordered distance matrix (distances between rows). # Dark means low distance. The aspect ratio is automatically fixed to 1:1 # using prop = TRUE d <- dist(x) pimage(d) pimage(d, order = TRUE) # Supress the upper triangle and diagonal pimage(d, order = TRUE, upper = FALSE, diag = FALSE) # Show only distances that are smaller than 2 using limits on z. pimage(d, order = TRUE, zlim = c(0, 3)) ## Example: Correlation Matrix # we calculate correlation between rows and seriate the matrix # and seriate by converting the correlations into distances. # pimage reorders then rows and columns with c(o, o). r <- cor(t(x)) o <- seriate(as.dist(sqrt(1 - r))) pimage(r, order = c(o, o), upper = FALSE, diag = FALSE, zlim = c(-1, 1), reverse_columns = TRUE, main = "Correlation matrix") # Add to the plot using functions in package grid # Note: pop = FALSE allows us to manipulate viewports library("grid") pimage(x, order = TRUE, pop = FALSE) # available viewports are: "main", "colorkey", "plot", "image" current.vpTree() # Highlight cell 2/2 with a red arrow # Note: columns are x and rows are y. downViewport(name = "image") grid.lines(x = c(1, 2), y = c(-1, 2), arrow = arrow(), default.units = "native", gp = gpar(col = "red", lwd = 3)) # add a red box around the first 4 rows of the 2nd column grid.rect(x = 1 + .5 , y = 4 + .5, width = 1, height = 4, hjust = 0, vjust = 1, default.units = "native", gp = gpar(col = "red", lwd = 3, fill = NA)) ## remove the viewports popViewport(0) ## put several pimages on a page (use grid viewports and newpage = FALSE) # set up grid layout library(grid) grid.newpage() top_vp <- viewport(layout = grid.layout(nrow = 1, ncol = 2, widths = unit(c(.4, .6), unit = "npc"))) col1_vp <- viewport(layout.pos.row = 1, layout.pos.col = 1, name = "col1_vp") col2_vp <- viewport(layout.pos.row = 1, layout.pos.col = 2, name = "col2_vp") splot <- vpTree(top_vp, vpList(col1_vp, col2_vp)) pushViewport(splot) seekViewport("col1_vp") o <- seriate(d) pimage(x, c(o, NA), col_labels = FALSE, main = "Data", newpage = FALSE) seekViewport("col2_vp") ## add the reordered dissimilarity matrix for rows pimage(d, o, main = "Distances", newpage = FALSE) popViewport(0) ##------------------------------------------------------------- ## ggplot2 Examples if (require("ggplot2")) { library("ggplot2") set.seed(1234) data(iris) x <- as.matrix(iris[sample(nrow(iris), 20) , -5]) ggpimage(x) # Show all labels and flip axes, reverse columns ggpimage(x, prop = TRUE) ggpimage(x, flip_axes = TRUE) ggpimage(x, reverse_columns = TRUE) # A matrix with positive and negative values x_scaled <- scale(x) ggpimage(x_scaled) # Use reordering ggpimage(x_scaled, order = TRUE) ggpimage(x_scaled, order = "Heatmap") ## Example: Distance Matrix # Show a reordered distance matrix (distances between rows). # Dark means low distance. The aspect ratio is automatically fixed to 1:1 # using prop = TRUE d <- dist(x) ggpimage(d) ggpimage(d, order = TRUE) # Supress the upper triangle and diagonal ggpimage(d, order = TRUE, upper = FALSE, diag = FALSE) # Show only distances that are smaller than 2 using limits on z. ggpimage(d, order = TRUE, zlim = c(0, 2)) ## Example: Correlation Matrix # we calculate correlation between rows and seriate the matrix r <- cor(t(x)) o <- seriate(as.dist(sqrt(1 - r))) ggpimage(r, order = c(o, o), upper = FALSE, diag = FALSE, zlim = c(-1, 1), reverse_columns = TRUE) + labs(title = "Correlation matrix") ## Example: Custom themes and colors # Reorder matrix, use custom colors, add a title, # and hide colorkey. ggpimage(x) + theme(legend.position = "none") + labs(title = "Random Data") + xlab("Variables") # Add lines ggpimage(x) + geom_hline(yintercept = seq(0, nrow(x)) + .5) + geom_vline(xintercept = seq(0, ncol(x)) + .5) # Use ggplot2 themes with theme_set old_theme <- theme_set(theme_linedraw()) ggpimage(d) theme_set(old_theme) # Use custom color palettes: Gray scale, Colorbrewer (provided in ggplot2) and colorspace ggpimage(d, order = seriate(d), upper_tri = FALSE) + scale_fill_gradient(low = "black", high = "white", na.value = "white") ggpimage(d, order = seriate(d), upper_tri = FALSE) + scale_fill_distiller(palette = "Spectral", direction = +1, na.value = "white") ggpimage(d, order = seriate(d), upper_tri = FALSE) + colorspace::scale_fill_continuous_sequential("Reds", rev = FALSE, na.value = "white") }
set.seed(1234) data(iris) x <- as.matrix(iris[sample(nrow(iris), 20) , -5]) pimage(x) # Show all labels and flip axes, reverse columns, or change colors pimage(x, prop = TRUE) pimage(x, flip_axes = TRUE) pimage(x, reverse_columns = TRUE) pimage(x, col = grays(100)) # A matrix with positive and negative values x_scaled <- scale(x) pimage(x_scaled) # Use reordering pimage(x_scaled, order = TRUE) pimage(x_scaled, order = "Heatmap") ## Example: Distance Matrix # Show a reordered distance matrix (distances between rows). # Dark means low distance. The aspect ratio is automatically fixed to 1:1 # using prop = TRUE d <- dist(x) pimage(d) pimage(d, order = TRUE) # Supress the upper triangle and diagonal pimage(d, order = TRUE, upper = FALSE, diag = FALSE) # Show only distances that are smaller than 2 using limits on z. pimage(d, order = TRUE, zlim = c(0, 3)) ## Example: Correlation Matrix # we calculate correlation between rows and seriate the matrix # and seriate by converting the correlations into distances. # pimage reorders then rows and columns with c(o, o). r <- cor(t(x)) o <- seriate(as.dist(sqrt(1 - r))) pimage(r, order = c(o, o), upper = FALSE, diag = FALSE, zlim = c(-1, 1), reverse_columns = TRUE, main = "Correlation matrix") # Add to the plot using functions in package grid # Note: pop = FALSE allows us to manipulate viewports library("grid") pimage(x, order = TRUE, pop = FALSE) # available viewports are: "main", "colorkey", "plot", "image" current.vpTree() # Highlight cell 2/2 with a red arrow # Note: columns are x and rows are y. downViewport(name = "image") grid.lines(x = c(1, 2), y = c(-1, 2), arrow = arrow(), default.units = "native", gp = gpar(col = "red", lwd = 3)) # add a red box around the first 4 rows of the 2nd column grid.rect(x = 1 + .5 , y = 4 + .5, width = 1, height = 4, hjust = 0, vjust = 1, default.units = "native", gp = gpar(col = "red", lwd = 3, fill = NA)) ## remove the viewports popViewport(0) ## put several pimages on a page (use grid viewports and newpage = FALSE) # set up grid layout library(grid) grid.newpage() top_vp <- viewport(layout = grid.layout(nrow = 1, ncol = 2, widths = unit(c(.4, .6), unit = "npc"))) col1_vp <- viewport(layout.pos.row = 1, layout.pos.col = 1, name = "col1_vp") col2_vp <- viewport(layout.pos.row = 1, layout.pos.col = 2, name = "col2_vp") splot <- vpTree(top_vp, vpList(col1_vp, col2_vp)) pushViewport(splot) seekViewport("col1_vp") o <- seriate(d) pimage(x, c(o, NA), col_labels = FALSE, main = "Data", newpage = FALSE) seekViewport("col2_vp") ## add the reordered dissimilarity matrix for rows pimage(d, o, main = "Distances", newpage = FALSE) popViewport(0) ##------------------------------------------------------------- ## ggplot2 Examples if (require("ggplot2")) { library("ggplot2") set.seed(1234) data(iris) x <- as.matrix(iris[sample(nrow(iris), 20) , -5]) ggpimage(x) # Show all labels and flip axes, reverse columns ggpimage(x, prop = TRUE) ggpimage(x, flip_axes = TRUE) ggpimage(x, reverse_columns = TRUE) # A matrix with positive and negative values x_scaled <- scale(x) ggpimage(x_scaled) # Use reordering ggpimage(x_scaled, order = TRUE) ggpimage(x_scaled, order = "Heatmap") ## Example: Distance Matrix # Show a reordered distance matrix (distances between rows). # Dark means low distance. The aspect ratio is automatically fixed to 1:1 # using prop = TRUE d <- dist(x) ggpimage(d) ggpimage(d, order = TRUE) # Supress the upper triangle and diagonal ggpimage(d, order = TRUE, upper = FALSE, diag = FALSE) # Show only distances that are smaller than 2 using limits on z. ggpimage(d, order = TRUE, zlim = c(0, 2)) ## Example: Correlation Matrix # we calculate correlation between rows and seriate the matrix r <- cor(t(x)) o <- seriate(as.dist(sqrt(1 - r))) ggpimage(r, order = c(o, o), upper = FALSE, diag = FALSE, zlim = c(-1, 1), reverse_columns = TRUE) + labs(title = "Correlation matrix") ## Example: Custom themes and colors # Reorder matrix, use custom colors, add a title, # and hide colorkey. ggpimage(x) + theme(legend.position = "none") + labs(title = "Random Data") + xlab("Variables") # Add lines ggpimage(x) + geom_hline(yintercept = seq(0, nrow(x)) + .5) + geom_vline(xintercept = seq(0, ncol(x)) + .5) # Use ggplot2 themes with theme_set old_theme <- theme_set(theme_linedraw()) ggpimage(d) theme_set(old_theme) # Use custom color palettes: Gray scale, Colorbrewer (provided in ggplot2) and colorspace ggpimage(d, order = seriate(d), upper_tri = FALSE) + scale_fill_gradient(low = "black", high = "white", na.value = "white") ggpimage(d, order = seriate(d), upper_tri = FALSE) + scale_fill_distiller(palette = "Spectral", direction = +1, na.value = "white") ggpimage(d, order = seriate(d), upper_tri = FALSE) + colorspace::scale_fill_continuous_sequential("Reds", rev = FALSE, na.value = "white") }
A data set collected by Holzinger and Swineford (1939) which consists of the results of 24 psychological tests given to 145 seventh and eighth grade students in a Chicago suburb. This data set contains the correlation matrix for the 24 test results. The data set was also used as an example for visualization of cluster analysis by Ling (1973).
A 24 x 24 correlation matrix.
Holzinger, K. L., Swineford, F. (1939): A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monograph, No. 48. Chicago: University of Chicago Press.
Ling, R. L. (1973): A computer generated aid for cluster analysis. Communications of the ACM, 16(6), pp. 355–361.
data("Psych24") ## create a dist object and also get rid of the one negative entry in the ## correlation matrix d <- as.dist(1 - abs(Psych24)) pimage(d) ## do hclust as in Ling (1973) hc <- hclust(d, method = "complete") plot(hc) pimage(d, hc) ## use seriation order <- seriate(d, method = "tsp") #order <- seriate(d, method = "tsp", control = list(method = "concorde")) pimage(d, order)
data("Psych24") ## create a dist object and also get rid of the one negative entry in the ## correlation matrix d <- as.dist(1 - abs(Psych24)) pimage(d) ## do hclust as in Ling (1973) hc <- hclust(d, method = "complete") plot(hc) pimage(d, hc) ## use seriation order <- seriate(d, method = "tsp") #order <- seriate(d, method = "tsp", control = list(method = "concorde")) pimage(d, order)
Register the DendSer dendrogram seriation method and the ARc criterion
(Earle and Hurley, 2015) for use with seriate()
.
register_DendSer()
register_DendSer()
Registers the method "DendSer"
for seriate. DendSer is a fast
heuristic for reordering dendrograms developed by Earle and Hurley (2015)
able to use different criteria.
control
for seriate()
with
method "DendSer"
accepts the following parameters:
"h"
or "method"
: A dendrogram or a method for hierarchical clustering
(see hclust). Default: complete-link.
"criterion"
: A seriation criterion to optimize (see
list_criterion_methods("dist")
. Default: "BAR"
(Banded
anti-Robinson from with 20% band width).
"verbose"
: a logical; print progress information?
"DendSer_args"
: additional arguments for DendSer::DendSer()
.
For convenience, the following methods (for different cost functions) are also provided:
"DendSer_ARc"
(anti-robinson form),
"DendSer_BAR"
(banded anti-Robinson form),
"DendSer_LPL"
(lazy path length),
"DendSer_PL"
(path length).
Note: Package DendSer needs to be installed.
Nothing.
Michael Hahsler based on code by Catherine B. Hurley and Denise Earle
D. Earle, C. B. Hurley (2015): Advances in dendrogram seriation for application to visualization. Journal of Computational and Graphical Statistics, 24(1), 1–25.
Other seriation:
register_GA()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_DendSer() get_seriation_method("dist", "DendSer") d <- dist(random.robinson(20, pre=TRUE)) ## use Banded AR form with default clustering (complete-link) o <- seriate(d, "DendSer_BAR") pimage(d, o) ## use different hclust method (Ward) and AR as the cost function for ## dendrogram reordering o <- seriate(d, "DendSer", control = list(method = "ward.D2", criterion = "AR")) pimage(d, o) ## End(Not run)
## Not run: register_DendSer() get_seriation_method("dist", "DendSer") d <- dist(random.robinson(20, pre=TRUE)) ## use Banded AR form with default clustering (complete-link) o <- seriate(d, "DendSer_BAR") pimage(d, o) ## use different hclust method (Ward) and AR as the cost function for ## dendrogram reordering o <- seriate(d, "DendSer", control = list(method = "ward.D2", criterion = "AR")) pimage(d, o) ## End(Not run)
Register a GA-based seriation metaheuristic for use with seriate()
.
register_GA() gaperm_mixedMutation(ismProb = 0.8)
register_GA() gaperm_mixedMutation(ismProb = 0.8)
ismProb |
probability to use |
Registers the method "GA"
for seriate()
. This method can be used
to optimize any criterion in package seriation.
The GA uses by default the ordered cross-over (OX) operator. For mutation,
the GA uses a mixture of simple insertion and simple inversion operators.
This mixed operator is created using
seriation::gaperm_mixedMutation(ismProb = .8)
, where ismProb
is the probability that the simple insertion mutation operator is used. See
package GA for a description of other available cross-over and
mutation operators for permutations. The appropriate operator functions in
GA start with gaperm_
.
We warm start the GA using "suggestions"
given by several heuristics.
Set "suggestions"
to NA
to start with a purely random initial
population.
See Example section for available control parameters.
Note: Package GA needs to be installed.
Nothing.
Michael Hahsler
Luca Scrucca (2013): GA: A Package for Genetic Algorithms in R. Journal of Statistical Software, 53(4), 1–37. URL doi:10.18637/jss.v053.i04.
Other seriation:
register_DendSer()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_GA() get_seriation_method("dist", "GA") data(SupremeCourt) d <- as.dist(SupremeCourt) ## optimize for linear seriation criterion (LS) o <- seriate(d, "GA", criterion = "LS", verbose = TRUE) pimage(d, o) ## Note that by default the algorithm is already seeded with a LS heuristic. ## This run is no warm start (no suggestions) and increase run to 100 o <- seriate(d, "GA", criterion = "LS", suggestions = NA, run = 100, verbose = TRUE) pimage(d, o) o <- seriate(d, "GA", criterion = "LS", suggestions = NA, run = 100, verbose = TRUE, ) pimage(d, o) ## End(Not run)
## Not run: register_GA() get_seriation_method("dist", "GA") data(SupremeCourt) d <- as.dist(SupremeCourt) ## optimize for linear seriation criterion (LS) o <- seriate(d, "GA", criterion = "LS", verbose = TRUE) pimage(d, o) ## Note that by default the algorithm is already seeded with a LS heuristic. ## This run is no warm start (no suggestions) and increase run to 100 o <- seriate(d, "GA", criterion = "LS", suggestions = NA, run = 100, verbose = TRUE) pimage(d, o) o <- seriate(d, "GA", criterion = "LS", suggestions = NA, run = 100, verbose = TRUE, ) pimage(d, o) ## End(Not run)
Use ordering points to identify the clustering structure (OPTICS) for seriate()
.
register_optics()
register_optics()
Registers the method "optics"
for seriate()
. This method applies
the OPTICS ordering algorithm implemented in dbscan::optics()
to create an ordering.
Note: Package dbscan needs to be installed.
Nothing.
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD international conference on Management of data, ACM Press, pp. 49-60. doi:10.1145/304181.304187
Other seriation:
register_DendSer()
,
register_GA()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_optics() get_seriation_method("dist", "optics") d <- dist(random.robinson(50, pre=TRUE, noise=.1)) o <- seriate(d, method = "optics") pimage(d, o) ## End(Not run)
## Not run: register_optics() get_seriation_method("dist", "optics") d <- dist(random.robinson(50, pre=TRUE, noise=.1)) o <- seriate(d, method = "optics") pimage(d, o) ## End(Not run)
Registers the "MDS_smacof"
method for seriate()
based on multidimensional
scaling using stress majorization and the corresponding "smacof_stress0"
criterion implemented in package smacof (de Leeuw & Mair, 2009).
register_smacof()
register_smacof()
Seriation method "smacof"
implements stress majorization with several transformation functions.
These functions are passed on as the type control parameter. We default
to "ratio"
, which together with "interval"
performs metric MDS.
"ordinal"
can be used
for non-metric MDS. See smacof::smacofSym()
for details on the
control parameters.
The corresponding criterion called "smacof_stress0"
is also registered.
There additional parameter type
is used to specify the used
transformation function. It should agree with the function used for seriation.
See smacof::stress0()
for details on the stress calculation.
Note: Package smacof needs to be installed.
Nothing.
Jan de Leeuw, Patrick Mair (2009). Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31(3), 1-30. doi:10.18637/jss.v031.i03
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_smacof() get_seriation_method("dist", "MDS_smacof") d <- dist(random.robinson(20, pre = TRUE)) ## use Banded AR form with default clustering (complete-link) o <- seriate(d, "MDS_smacof", verbose = TRUE) pimage(d, o) # recalculate stress for the order MDS_stress(d, o) # ordinal MDS. stress needs to be calculated using the correct type with stress0 o <- seriate(d, "MDS_smacof", type = "ordinal", verbose = TRUE) criterion(d, o, method = "smacof_stress0", type = "ordinal") ## End(Not run)
## Not run: register_smacof() get_seriation_method("dist", "MDS_smacof") d <- dist(random.robinson(20, pre = TRUE)) ## use Banded AR form with default clustering (complete-link) o <- seriate(d, "MDS_smacof", verbose = TRUE) pimage(d, o) # recalculate stress for the order MDS_stress(d, o) # ordinal MDS. stress needs to be calculated using the correct type with stress0 o <- seriate(d, "MDS_smacof", type = "ordinal", verbose = TRUE) criterion(d, o, method = "smacof_stress0", type = "ordinal") ## End(Not run)
Use t-distributed stochastic neighbor embedding (t-SNE) for seriate()
.
register_tsne()
register_tsne()
Registers the method "tsne"
for seriate()
. This method applies
1D t-SNE to a data matrix or a distance matrix and extracts the order
from the 1D embedding. To speed up the process, an initial embedding is
created using 1D multi-dimensional scaling (MDS) or principal
comonents analysis (PCA) which is improved by t-SNE.
The control
parameter "mds"
or "pca"
controls if MDS (for distances)
or PCA (for data matrices) is used to create an
initial embedding. See Rtsne::Rtsne()
to learn about the other
available control
parameters.
Perplexity is automatically set as the minimum between 30 and the number of
observations. It can be also specified using the control parameter
"preplexity"
.
Note: Package Rtsne needs to be installed.
Nothing.
van der Maaten, L.J.P. & Hinton, G.E., 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9, pp.2579-2605.
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_smacof()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_tsne() # distances get_seriation_method("dist", "tsne") data(SupremeCourt) d <- as.dist(SupremeCourt) o <- seriate(d, method = "tsne", verbose = TRUE) pimage(d, o) # look at the returned configuration and plot it attr(o[[1]], "configuration") plot_config(o) # the t-SNE results are also available as an attribute (see ? Rtsne::Rtsne) attr(o[[1]], "model") ## matrix get_seriation_method("matrix", "tsne") data("Zoo") x <- Zoo x[,"legs"] <- (x[,"legs"] > 0) # t-SNE does not allow duplicates x <- x[!duplicated(x), , drop = FALSE] class <- x$class label <- rownames(x) x <- as.matrix(x[,-17]) o <- seriate(x, method = "tsne", eta = 10, verbose = TRUE) pimage(x, o, prop = FALSE, row_labels = TRUE, col_labels = TRUE) # look at the row embedding plot_config(o[[1]], col = class) ## End(Not run)
## Not run: register_tsne() # distances get_seriation_method("dist", "tsne") data(SupremeCourt) d <- as.dist(SupremeCourt) o <- seriate(d, method = "tsne", verbose = TRUE) pimage(d, o) # look at the returned configuration and plot it attr(o[[1]], "configuration") plot_config(o) # the t-SNE results are also available as an attribute (see ? Rtsne::Rtsne) attr(o[[1]], "model") ## matrix get_seriation_method("matrix", "tsne") data("Zoo") x <- Zoo x[,"legs"] <- (x[,"legs"] > 0) # t-SNE does not allow duplicates x <- x[!duplicated(x), , drop = FALSE] class <- x$class label <- rownames(x) x <- as.matrix(x[,-17]) o <- seriate(x, method = "tsne", eta = 10, verbose = TRUE) pimage(x, o, prop = FALSE, row_labels = TRUE, col_labels = TRUE) # look at the row embedding plot_config(o[[1]], col = class) ## End(Not run)
Use uniform manifold approximation and projection (UMAP) to embed the data
on the number line and create a order for seriate()
.
register_umap()
register_umap()
Registers the method "umap"
for seriate()
. This method applies
1D UMAP to a data matrix or a distance matrix and extracts the order from
the 1D embedding.
Control parameter n_epochs
can be increased to find a better embedding.
The returned seriation permutation vector has an attribute named
embedding
containing the umap embedding.
Note: Package umap needs to be installed.
Nothing.
McInnes, L and Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018.
umap::umap()
in umap.
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
registry_for_seriation_methods
,
seriate()
,
seriate_best()
## Not run: register_umap() ## distances get_seriation_method("dist", "umap") data(SupremeCourt) d <- as.dist(SupremeCourt) o <- seriate(d, method = "umap", verbose = TRUE) pimage(d, o) # look at the returned embedding and plot it attr(o[[1]], "configuration") plot_config(o) ## matrix get_seriation_method("matrix", "umap") data("Zoo") Zoo[,"legs"] <- (Zoo[,"legs"] > 0) x <- as.matrix(Zoo[,-17]) label <- rownames(Zoo) class <- Zoo$class o <- seriate(x, method = "umap", verbose = TRUE) pimage(x, o) plot_config(o[[1]], col = class) ## End(Not run)
## Not run: register_umap() ## distances get_seriation_method("dist", "umap") data(SupremeCourt) d <- as.dist(SupremeCourt) o <- seriate(d, method = "umap", verbose = TRUE) pimage(d, o) # look at the returned embedding and plot it attr(o[[1]], "configuration") plot_config(o) ## matrix get_seriation_method("matrix", "umap") data("Zoo") Zoo[,"legs"] <- (Zoo[,"legs"] > 0) x <- as.matrix(Zoo[,-17]) label <- rownames(Zoo) class <- Zoo$class o <- seriate(x, method = "umap", verbose = TRUE) pimage(x, o) plot_config(o[[1]], col = class) ## End(Not run)
A registry to manage methods used by criterion()
to calculate a criterion value given data and a
permutation.
registry_criterion list_criterion_methods(kind, names_only = TRUE) get_criterion_method(kind, name) set_criterion_method( kind, name, fun, description = NULL, merit = NA, control = list(), verbose = FALSE, ... ) ## S3 method for class 'criterion_method' print(x, ...)
registry_criterion list_criterion_methods(kind, names_only = TRUE) get_criterion_method(kind, name) set_criterion_method( kind, name, fun, description = NULL, merit = NA, control = list(), verbose = FALSE, ... ) ## S3 method for class 'criterion_method' print(x, ...)
kind |
the data type the method works on. For example, |
names_only |
logical; return only the method name. |
name |
the name for the method used to refer to the method in the
function |
fun |
a function containing the method's code. |
description |
a description of the method. For example, a long name. |
merit |
logical; indicating if the criterion measure is a merit
( |
control |
a list with control arguments and default values. |
verbose |
logical; print a message when a new method is registered. |
... |
further information that is stored for the method in the registry. |
x |
an object of class "criterion_method" to be printed. |
An object of class criterion_registry
(inherits from registry
) of length 21.
All methods below are convenience methods for the registry named
registry_criterion
.
list_criterion_method()
lists all available methods for a given data
type (kind
). The result is a vector of character strings with the
short names of the methods. If kind
is missing, then a list of
methods is returned.
get_criterion_method()
returns information (including the
implementing function) about a given method in form of an object of class
"criterion_method"
.
With set_criterion_method()
new criterion methods can be added by the
user. The implementing function (fun
) needs to have the formal
arguments x, order, ...
, where x
is the data object, order is
an object of class ser_permutation_vector and ...
can contain
additional information for the method passed on from criterion()
. The
implementation has to return the criterion value as a scalar.
list_criterion_method()
results is a vector of character strings with the
names of the methods used for criterion()
.
get_criterion_method()
returns a given method in form of an object of class
"criterion_method"
.
Michael Hahsler
This registry uses registry::registry.
Other criterion:
criterion()
## the registry registry_criterion # List all criterion calculation methods by type list_criterion_methods() # List methods for matrix list_criterion_methods("matrix") # get more description list_criterion_methods("matrix", names_only = FALSE) # get a specific method get_criterion_method(kind = "dist", name = "AR_d") # Define a new method (sum of the diagonal elements) ## 1. implement a function to calculate the measure criterion_method_matrix_foo <- function(x, order, ...) { if(!is.null(order)) x <- permute(x,order) sum(diag(x)) } ## 2. Register new method set_criterion_method("matrix", "DiagSum", criterion_method_matrix_foo, description = "Calculated the sum of all diagonal entries", merit = FALSE) list_criterion_methods("matrix") get_criterion_method("matrix", "DiagSum") ## 3. use all criterion methods (including the new one) criterion(matrix(1:9, ncol = 3))
## the registry registry_criterion # List all criterion calculation methods by type list_criterion_methods() # List methods for matrix list_criterion_methods("matrix") # get more description list_criterion_methods("matrix", names_only = FALSE) # get a specific method get_criterion_method(kind = "dist", name = "AR_d") # Define a new method (sum of the diagonal elements) ## 1. implement a function to calculate the measure criterion_method_matrix_foo <- function(x, order, ...) { if(!is.null(order)) x <- permute(x,order) sum(diag(x)) } ## 2. Register new method set_criterion_method("matrix", "DiagSum", criterion_method_matrix_foo, description = "Calculated the sum of all diagonal entries", merit = FALSE) list_criterion_methods("matrix") get_criterion_method("matrix", "DiagSum") ## 3. use all criterion methods (including the new one) criterion(matrix(1:9, ncol = 3))
A registry to manage methods used by seriate()
.
registry_seriate list_seriation_methods(kind, names_only = TRUE) get_seriation_method(kind, name) set_seriation_method( kind, name, definition, description = NULL, control = list(), randomized = FALSE, optimizes = NA_character_, verbose = FALSE, ... ) ## S3 method for class 'seriation_method' print(x, ...)
registry_seriate list_seriation_methods(kind, names_only = TRUE) get_seriation_method(kind, name) set_seriation_method( kind, name, definition, description = NULL, control = list(), randomized = FALSE, optimizes = NA_character_, verbose = FALSE, ... ) ## S3 method for class 'seriation_method' print(x, ...)
kind |
the data type the method works on. For example, |
names_only |
logical; return only the method name. |
name |
the name for the method used to refer to the method in
|
definition |
a function containing the method's code. |
description |
a description of the method. For example, a long name. |
control |
a list with control arguments and default values. |
randomized |
logical; does the algorithm use randomization and re-running
the algorithm several times will lead to different results (see: |
optimizes |
what criterion does the algorithm try to optimize
(see: |
verbose |
logical; print a message when a new method is registered. |
... |
further information that is stored for the method in the registry. |
x |
an object of class "seriation_method" to be printed. |
An object of class seriation_registry
(inherits from registry
) of length 58.
The functions below are convenience function for the registry
registry_seriate
.
list_seriation_method()
lists all available methods for a given data
type (kind
) (e.g., "dist", "matrix").
The result is a vector of character strings with the
method names that can be used in function seriate()
.
If kind
is missing, then a list of
methods is returned.
get_seriation_method()
returns detailed information for a given method in
form of an object of class "seriation_method"
.
The information includes a description, parameters and the
implementing function.
With set_seriation_method()
new seriation methods can be added by the
user. The implementing function (definition
) needs to have the formal
arguments x, control
and, for arrays and matrices margin
,
where x
is the data object and
control
contains a list with additional information for the method
passed on from seriate()
, and margin
is a vector specifying
what dimensions should be seriated.
The implementation has to return a list of
objects which can be coerced into ser_permutation_vector
objects
(e.g., integer vectors). The elements in the list have to be in
corresponding order to the dimensions of x
.
list_seriation_method()
result is a vector of character strings with the
names of the methods. These names are used for methods in seriate()
.
get_seriation_method()
returns a given method in form of an object of class
"seriation_method"
.
Michael Hahsler
This registry uses registry::registry.
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
seriate()
,
seriate_best()
# Registry registry_seriate # List all seriation methods by type list_seriation_methods() # List methods for matrix seriation list_seriation_methods("matrix") get_seriation_method(name = "BEA") # Example for defining a new seriation method (reverse identity function for matrix) # 1. Create the seriation method: Reverse the row order # (NA means no seriation is applied to columns) seriation_method_reverse_rows <- function(x, control = NULL, margin = c(1, 2)) { list(rev(seq(nrow(x))), NA)[margin] } # 2. Register new method set_seriation_method("matrix", "Reverse_rows", seriation_method_reverse_rows, description = "Reverse identity order", control = list()) list_seriation_methods("matrix") get_seriation_method("matrix", "reverse_rows") # 3. Use the new seriation methods seriate(matrix(1:12, ncol = 3), "reverse_rows")
# Registry registry_seriate # List all seriation methods by type list_seriation_methods() # List methods for matrix seriation list_seriation_methods("matrix") get_seriation_method(name = "BEA") # Example for defining a new seriation method (reverse identity function for matrix) # 1. Create the seriation method: Reverse the row order # (NA means no seriation is applied to columns) seriation_method_reverse_rows <- function(x, control = NULL, margin = c(1, 2)) { list(rev(seq(nrow(x))), NA)[margin] } # 2. Register new method set_seriation_method("matrix", "Reverse_rows", seriation_method_reverse_rows, description = "Reverse identity order", control = list()) list_seriation_methods("matrix") get_seriation_method("matrix", "reverse_rows") # 3. Use the new seriation methods seriate(matrix(1:12, ncol = 3), "reverse_rows")
Reorder method for dendrograms for optimal leaf ordering.
## S3 method for class 'hclust' reorder(x, dist, method = "OLO", ...)
## S3 method for class 'hclust' reorder(x, dist, method = "OLO", ...)
x |
an object of class |
dist |
an object of class |
method |
a character string with the name of the used measure. Available are:
|
... |
further arguments are currently ignored. |
Minimizes the distance between neighboring objects (leaf nodes) in the dendrogram by flipping the order of subtrees. The algorithm by Gruvaeus and Wainer is implemented in package gclus (Hurley 2004).
A reordered hclust
object.
Michael Hahsler
Bar-Joseph, Z., E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29.
Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206.
Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806.
## cluster European cities by distance data("eurodist") d <- as.dist(eurodist) hc <- hclust(eurodist) ## plot original dendrogram and the reordered dendrograms plot(hc) plot(reorder(hc, d, method = "GW")) plot(reorder(hc, d, method = "OLO"))
## cluster European cities by distance data("eurodist") d <- as.dist(eurodist) hc <- hclust(eurodist) ## plot original dendrogram and the reordered dendrograms plot(hc) plot(reorder(hc, d, method = "GW")) plot(reorder(hc, d, method = "OLO"))
Calculates dissimilarities/correlations between seriation orders in a list of type ser_permutation_vector.
ser_dist(x, y = NULL, method = "spearman", reverse = TRUE, ...) ser_cor(x, y = NULL, method = "spearman", reverse = TRUE, test = FALSE) ser_align(x, method = "spearman")
ser_dist(x, y = NULL, method = "spearman", reverse = TRUE, ...) ser_cor(x, y = NULL, method = "spearman", reverse = TRUE, test = FALSE) ser_align(x, method = "spearman")
x |
set of seriation orders as a list with elements which can be coerced into ser_permutation_vector objects. |
y |
if not |
method |
a character string with the name of the used measure.
Available measures are for correlation and distances are
|
reverse |
a logical indicating if the revers orders should also be checked in for rank-based methods. |
... |
Further arguments passed on to the method. |
test |
a logical indicating if a correlation test should be performed. |
For seriation, an order and its reverse are considered identical and are
often just an artifact due to the method that creates the order.
This is one of the major differences between seriation orders and rankings
which impacts how correlations and similarities between seriation orders are
calculated. The default setting reverse = TRUE
corrects for this issue.
ser_cor()
calculates the correlation between two seriation orders.
For ranking-based correlation measures (Spearman and
Kendall) the absolute value of the correlation is returned. This effectively
corrects for correlations between reversed orders but has the effect that
no negative correlations exist.
For test = TRUE
, the appropriate test for association is performed
and a matrix with p-values is returned as the attribute "p-value"
.
Note that no correction for multiple testing is performed.
For ser_dist()
, the correlation coefficients (Kendall's tau and
Spearman's rho) are converted into a dissimilarity by taking one minus the
correlation value. The Manhattan distance between the ranks in a
linear order is equivalent to Spearman's footrule metric (Diaconis 1988).
For the non-correlation based measures,
reverse = TRUE
returns the pairwise minima using also the reversed
order.
Two precedence invariant measure especially developed for seriation are
available. Here reverse
is ignored.
The positional proximity coefficient (ppc) is a precedence invariant measure based on product of the squared positional distances in two permutations defined as (see Goulermas et al 2016):
where and
are two seriation orders,
and
are the associated permutation vectors and
is a
normalization factor. The associated generalized correlation coefficient is
defined as
.
The absolute pairwise rank difference (aprd) is also precedence invariant and defined as a distance measure:
where is the power which can be passed on as parameter
p
and
is by default set to 2.
ser_align()
tries to normalize the direction in a list of seriations
such that ranking-based methods can be used. We add for each permutation
also the reversed order to the set and then use a modified version of Prim's
algorithm for finding a minimum spanning tree (MST) to choose if the
original seriation order or its reverse should be used. We retain the direction
of each order that is added to the MST first.
Every time an order is added, its reverse is removed from the possible
remaining orders.
ser_dist()
returns an object of class stats::dist.
ser_align()
returns a new list with elements of class
ser_permutation.
Michael Hahsler
P. Diaconis (1988): Group Representations in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.
J.Y. Goulermas, A. Kostopoulos, and T. Mu (2016): A New Measure for Analyzing and Fusing Sequences of Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(5):833-48. doi:10.1109/TPAMI.2015.2470671
Other permutation:
get_order()
,
permutation_vector2matrix()
,
permute()
,
ser_permutation()
,
ser_permutation_vector()
set.seed(1234) ## seriate dist of 50 flowers from the iris data set data("iris") x <- as.matrix(iris[-5]) x <- x[sample(1:nrow(x), 50), ] rownames(x) <- 1:50 d <- dist(x) ## Create a list of different seriations methods <- c("HC_complete", "OLO", "GW", "VAT", "TSP", "Spectral", "MDS", "Identity", "Random") os <- sapply(methods, function(m) { cat("Doing", m, "... ") tm <- system.time(o <- seriate(d, method = m)) cat("took", tm[3],"s.\n") o }) ## Compare the methods using distances. Default is based on ## Spearman's rank correlation coefficient where reverse orders are ## also considered. ds <- ser_dist(os) hmap(ds, margin = c(7,7)) ## Compare using correlation between orders. Reversed orders have ## negative correlation! cs <- ser_cor(os, reverse = FALSE) hmap(cs, margin = c(7,7)) ## Compare orders by allowing orders to be reversed. ## Now all but random and identity are highly positive correlated cs2 <- ser_cor(os, reverse = TRUE) hmap(cs2, margin=c(7,7)) ## A better approach is to align the direction of the orders first ## and then calculate correlation. os_aligned <- ser_align(os) cs3 <- ser_cor(os_aligned, reverse = FALSE) hmap(cs3, margin = c(7,7)) ## Compare the orders using clustering. We use Spearman's foot rule ## (Manhattan distance of ranks). In order to use rank-based method, ## we align the direction of the orders. os_aligned <- ser_align(os) ds <- ser_dist(os_aligned, method = "manhattan") plot(hclust(ds))
set.seed(1234) ## seriate dist of 50 flowers from the iris data set data("iris") x <- as.matrix(iris[-5]) x <- x[sample(1:nrow(x), 50), ] rownames(x) <- 1:50 d <- dist(x) ## Create a list of different seriations methods <- c("HC_complete", "OLO", "GW", "VAT", "TSP", "Spectral", "MDS", "Identity", "Random") os <- sapply(methods, function(m) { cat("Doing", m, "... ") tm <- system.time(o <- seriate(d, method = m)) cat("took", tm[3],"s.\n") o }) ## Compare the methods using distances. Default is based on ## Spearman's rank correlation coefficient where reverse orders are ## also considered. ds <- ser_dist(os) hmap(ds, margin = c(7,7)) ## Compare using correlation between orders. Reversed orders have ## negative correlation! cs <- ser_cor(os, reverse = FALSE) hmap(cs, margin = c(7,7)) ## Compare orders by allowing orders to be reversed. ## Now all but random and identity are highly positive correlated cs2 <- ser_cor(os, reverse = TRUE) hmap(cs2, margin=c(7,7)) ## A better approach is to align the direction of the orders first ## and then calculate correlation. os_aligned <- ser_align(os) cs3 <- ser_cor(os_aligned, reverse = FALSE) hmap(cs3, margin = c(7,7)) ## Compare the orders using clustering. We use Spearman's foot rule ## (Manhattan distance of ranks). In order to use rank-based method, ## we align the direction of the orders. os_aligned <- ser_align(os) ds <- ser_dist(os_aligned, method = "manhattan") plot(hclust(ds))
The class ser_permutation
is a collection of permutation vectors
(see class ser_permutation_vector), one for each dimension (mode)
of the data to be permuted.
ser_permutation(x, ...) ## S3 method for class 'ser_permutation' print(x, ...) ## S3 method for class 'ser_permutation' summary(object, ...) ## S3 method for class 'ser_permutation' c(..., recursive = FALSE) ## S3 method for class 'ser_permutation' object[i, ...]
ser_permutation(x, ...) ## S3 method for class 'ser_permutation' print(x, ...) ## S3 method for class 'ser_permutation' summary(object, ...) ## S3 method for class 'ser_permutation' c(..., recursive = FALSE) ## S3 method for class 'ser_permutation' object[i, ...]
x , object
|
an object of class |
... |
vectors for further dimensions. |
recursive |
ignored. |
i |
index of the dimension(s) to extract. |
An object of class ser_permutation
.
Michael Hahsler
Other permutation:
get_order()
,
permutation_vector2matrix()
,
permute()
,
ser_dist()
,
ser_permutation_vector()
o <- ser_permutation(1:5, 10:1) o ## length (number of dimensions) length(o) ## get permutation vector for 2nd dimension get_order(o, 2) ## reverse dimensions o[2:1] ## combine o <- c(o, ser_permutation(1:15)) o ## get an individual permutation o[[2]] ## reverse the order of a permutation o[[2]] <- rev(o[[2]]) get_order(o,2)
o <- ser_permutation(1:5, 10:1) o ## length (number of dimensions) length(o) ## get permutation vector for 2nd dimension get_order(o, 2) ## reverse dimensions o[2:1] ## combine o <- c(o, ser_permutation(1:15)) o ## get an individual permutation o[[2]] ## reverse the order of a permutation o[[2]] <- rev(o[[2]]) get_order(o,2)
The class ser_permutation_vector
represents a single permutation vector.
ser_permutation_vector(x, method = NULL) ## S3 method for class 'ser_permutation_vector' c(..., recursive = FALSE) ## S3 method for class 'ser_permutation_vector' rev(x) get_method(x, printable = FALSE) ## S3 method for class 'ser_permutation_vector' length(x) ## S3 method for class 'ser_permutation_vector' print(x, ...) ## S3 method for class 'ser_permutation_vector' summary(object, ...)
ser_permutation_vector(x, method = NULL) ## S3 method for class 'ser_permutation_vector' c(..., recursive = FALSE) ## S3 method for class 'ser_permutation_vector' rev(x) get_method(x, printable = FALSE) ## S3 method for class 'ser_permutation_vector' length(x) ## S3 method for class 'ser_permutation_vector' print(x, ...) ## S3 method for class 'ser_permutation_vector' summary(object, ...)
x , object
|
an object if class |
method |
a string representing the method used to obtain the permutation vector. |
... |
further arguments. |
recursive |
ignored |
printable |
a logical; prints "unknown" instead of |
A permutation vector
maps a set of objects
onto itself.
Ordering Representation:
In seriation we represent a permutation
as a vector which lists the objects' indices in their permuted order. This can
be seen as replacing the object in position
with the object
in position
.
For example, the permutation vector
indicates that in
first position is the object with index 3 then the object with index 1 and finally
the object with index 2. This representation is often called a (re)arrangement or ordering.
The ordering can be extracted from a permutation vector object
via
get_order()
. Such an ordering can be directly used
to subset the list of original objects with "["
to apply the permutation.
Rank Representation:
An alternative way to specify a permutation is via a list of the ranks
of the objects after permutation. This representation is often called
a map or substitution. Ranks can be extracted from a permutation vector using get_rank()
.
Permutation Matrix:
Another popular representation is a permutation matrix which performs
permutations using matrix multiplication. A permutation matrix can be obtained
using get_permutation_matrix()
.
ser_permutation_vector
objects are usually packed into
a ser_permutation object
which is a collection (a list
) of permutation vectors for
-mode data.
The constructor ser_permutation_vector()
checks if the permutation vector is valid
(i.e. if all integers occur exactly once).
The constructor ser_permutation_vector()
returns an
object a ser_permutation_vector
Michael Hahsler
Other permutation:
get_order()
,
permutation_vector2matrix()
,
permute()
,
ser_dist()
,
ser_permutation()
o <- structure(sample(10), names = paste0("X", 1:10)) o p <- ser_permutation_vector(o, "random") p ## some methods length(p) get_method(p) get_order(p) get_rank(p) get_permutation_matrix(p) r <- rev(p) r get_order(r) ## create a symbolic identity permutation vector (with unknown length) ## Note: This can be used to permute an object, but methods ## like length and get_order are not available. ip <- ser_permutation_vector(NA) ip
o <- structure(sample(10), names = paste0("X", 1:10)) o p <- ser_permutation_vector(o, "random") p ## some methods length(p) get_method(p) get_order(p) get_rank(p) get_permutation_matrix(p) r <- rev(p) r get_order(r) ## create a symbolic identity permutation vector (with unknown length) ## Note: This can be used to permute an object, but methods ## like length and get_order are not available. ip <- ser_permutation_vector(NA) ip
Tries to find a linear order for objects using data in the form of a
dissimilarity matrix (two-way one-mode data), a data matrix (two-way
two-mode data), or a data array (k-way k-mode data). The order can then be
used to reorder the dissimilarity matrix/data matrix using
permute()
.
seriate(x, ...) ## S3 method for class 'dist' seriate(x, method = "Spectral", control = NULL, rep = 1L, ...) ## S3 method for class 'matrix' seriate(x, method = "PCA", control = NULL, margin = c(1L, 2L), rep = 1L, ...) ## S3 method for class 'array' seriate( x, method = "PCA", control = NULL, margin = seq(length(dim(x))), rep = 1L, ... ) ## S3 method for class 'data.frame' seriate( x, method = "Heatmap", control = NULL, margin = c(1L, 2L), rep = 1L, ... ) ## S3 method for class 'table' seriate(x, method = "CA", control = NULL, margin = c(1L, 2L), ...)
seriate(x, ...) ## S3 method for class 'dist' seriate(x, method = "Spectral", control = NULL, rep = 1L, ...) ## S3 method for class 'matrix' seriate(x, method = "PCA", control = NULL, margin = c(1L, 2L), rep = 1L, ...) ## S3 method for class 'array' seriate( x, method = "PCA", control = NULL, margin = seq(length(dim(x))), rep = 1L, ... ) ## S3 method for class 'data.frame' seriate( x, method = "Heatmap", control = NULL, margin = c(1L, 2L), rep = 1L, ... ) ## S3 method for class 'table' seriate(x, method = "CA", control = NULL, margin = c(1L, 2L), ...)
x |
the data. |
... |
further arguments are added to the |
method |
a character string with the name of the seriation method (default: varies by data type). |
control |
a list of control options passed on to the seriation algorithm. |
rep |
number of random restarts for randomized methods.
Uses |
margin |
an integer vector giving the margin indices (dimensions) to be
seriated. For example, for a matrix, |
Seriation methods are managed via a registry. See
list_seriation_methods()
for help. In the following, we focus on
discussing the
built-in methods that are registered automatically by the package seriation.
The available control options, default settings, and
a description for each algorithm
can be retrieved using get_seriation_method(name = "<seriation method>")
.
Some control parameters are also described in more detail below.
Some methods are very slow, and progress can be printed using the control
parameter verbose = TRUE
.
Many seriation methods (heuristically) optimize (minimize or maximize) an
objective function often called seriation criterion.
The value of the seriation criterion for a given order can be
calculated using criterion()
. In this manual page, we
include the criterion, which is optimized by each method using bold font.
If no criterion is mentioned, then the method does not directly optimize a criterion.
A definition of the different seriation criteria can be found on the criterion()
manual page.
Seriation methods for distance matrices (dist)
One-mode two-way data must be provided as a dist object (not a symmetric matrix). Similarities have to be transformed into dissimilarities. Seriation algorithms fall into different groups based on the approach. In the following, we describe the currently implemented methods. A list with all methods and the available parameters is available here. Hahsler (2017) for a more detailed description and an experimental comparison of the most popular methods.
Dendrogram leaf order
These methods create a dendrogram using hierarchical clustering and then derive the seriation order from the leaf order in the dendrogram. Leaf reordering may be applied.
Hierarchical clustering: "HC"
, "HC_single"
, "HC_complete"
,
"HC_average"
, "HC_ward"
Uses the order of the leaf nodes in a dendrogram obtained by hierarchical
clustering as a simple seriation technique. This method
applies hierarchical clustering (stats::hclust()
) to x
. The clustering
method can be given using a "linkage"
element in the control
list. If omitted, the default "complete"
is used.
For convenience, the other methods are provided as shortcuts.
Reordered by the Gruvaeus and Wainer heuristic: "GW"
, "GW_single"
, "GW_average"
,
"GW_complete"
, "GW_ward"
(Gruvaeus and Wainer, 1972)
Method "GW"
uses an algorithm developed by Gruvaeus and Wainer (1972)
as implemented gclus::reorder.hclust()
(Hurley 2004). The clusters are
ordered at each level so that the objects at the edge of each cluster are
adjacent to the nearest object outside the cluster. The
method produces a unique order.
The methods start with a dendrogram created by hclust()
. As the
"linkage"
element in the control
list, a clustering method
(default "average"
) can be specified. Alternatively, an stats::hclust
object can be supplied using an element named "hclust"
.
A dendrogram (binary tree) has internal nodes (subtrees) and
the same number of leaf orderings. That is, at each internal node, the left
and right subtree (or leaves) can be swapped or, in terms of a dendrogram,
be flipped. The leaf-node reordering to minimize
Minimizes the Hamiltonian path length (restricted by the dendrogram).
Reordered by optimal leaf ordering: "OLO"
, "OLO_single"
,
"OLO_average"
, "OLO_complete"
, "OLO_ward"
(Bar-Joseph et al., 2001)
Starts with a dendrogram and
produces an optimal leaf ordering that minimizes the sum of
the distances along the (Hamiltonian) path connecting the leaves in the
given order. The algorithm's time complexity is . Note that
non-finite distance values are not allowed.
Minimizes the Hamiltonian path length (restricted by the dendrogram).
Dendrogram seriation: "DendSer"
(Earle and Hurley, 2015)
Use heuristic dendrogram seriation to optimize for various criteria.
The DendSer code has to be first registered. A
detailed description can be found on the manual page for
register_DendSer()
.
Dimensionality reduction
Find a seriation order by reducing the dimensionality to 1 dimension. This is typically done by minimizing a stress measure or the reconstruction error. Note that dimensionality reduction to a single dimension is a very difficult discrete optimization problem. For example, MDS algorithms used for a single dimension tend to end up in local optima (see Maier and De Leeuw, 2015). However, generally, ordering along a single component of MDS provides good results sufficient for applications like visualization.
Classical metric multidimensional scaling: "MDS"
Orders along the 1D classical metric multidimensional scaling.
control
parameters are passed on to stats::cmdscale()
.
Isometric feature mapping: "isomap"
(Tenenbaum, 2000)
Orders along the 1D isometric feature mapping.
control
parameters are passed on to vegan::isomap()
Kruskal's non-metric multidimensional scaling: "isoMDS"
, "monoMDS"
,
"metaMDS"
(Kruskal, 1964)
Orders along the 1D Kruskal's non-metric multidimensional scaling.
Package vegan provides an alternative implementation called monoMDS
and a version that uses random restarts for stability called metaMDS
.
control
parameters are passed on to MASS::isoMDS()
, vegan::monoMDS()
or vegan::metaMDS()
.
Sammon's non-linear mapping: "Sammon_mapping"
(Sammon, 1969)
Orders along the 1D Sammon's non-linear mapping.
control
parameters are passed on to MASS::sammon()
.
Angular order of the first two eigenvectors: "MDS_angle"
Finds a 2D configuration using MDS (stats::cmdscale()
)
to approximate the eigenvectors of the covariance matrix in the
original data matrix.
Orders by the angle in this space and splits the order by the
larges gap between adjacent angles. A similar method was used by
Friendly (2002) to order variables in correlation matrices
by angles of first two eigenvectors.
Smacof: "MDS_smacof"
(de Leeuw and Mair, 2009)
Perform seriation using stress majorization with several transformation functions.
This method has to be registered first using register_smacof()
.
Optimization
These methods try to optimize a seriation criterion directly, typically using a heuristic approach.
Anti-Robinson seriation by simulated annealing: "ARSA"
(Brusco et al 2008)
The algorithm automatically finds a suitable start temperature and calculates
the needed number of iterations. The algorithm gets slow for a large number of
objects. The speed can be improved by lowering the cooling parameter "cool"
or increasing the minimum temperature "tmin"
.
However, this will decrease the seriation quality.
Directly minimizes the linear seriation criterion (LS).
Complete Enumeration: "Enumerate"
This method finds the optimal permutation given a seriation criterion by complete enumeration
of all permutations.
The criterion is specified as the control
parameters "criterion"
.
Default is the weighted gradient measure. Use "verbose = TRUE"
to see
the progress.
Note: The number of permutations for objects is
.
Complete enumeration is only possible for tiny problems (<10 objects) and is limited on most systems
to a problem size of up to 12 objects.
Gradient measure seriation by branch-and-bound: "BBURCG"
, "BBWRCG"
(Brusco and Stahl 2005)
The method uses branch-and-bound to minimize the
unweighted gradient measure ("BBURCG"
) and the
weighted gradient measure ("BBWRCG"
).
This type of optimization is only feasible for a small number of objects (< 50 objects).
For BBURCG, the control parameter "eps"
can be used to relax the problem by defining
that a distance needs to be eps larger to count as a violation. This relaxation will improve the speed,
but miss some Robinson events. The default value is 0.
Genetic Algorithm: "GA"
The GA code has to be first registered. A detailed description can
be found on the manual page for register_GA()
.
Quadratic assignment problem seriation:
"QAP_LS"
, "QAP_2SUM"
, "QAP_BAR"
, "QAP_Inertia"
(Hahsler, 2017)
Formulates the seriation problem as a quadratic assignment problem and applies a simulated annealing solver to find a good solution. These methods minimize the Linear Seriation Problem (LS) formulation (Hubert and Schultz 1976), the 2-Sum Problem formulation (Barnard, Pothen, and Simon 1993), the banded anti-Robinson form (BAR), or the inertia criterion.
control
parameters are passed on to qap::qap()
.
An important parameter is rep
to return the best result from the
given number of repetitions with random restarts. The default is 1, but bigger
numbers result in better and more stable results.
General Simulated Annealing: "GSA"
Implement simulated annealing similar to the ARSA method. However, it
can optimize
for any criterion measure defined in seriation. By default, the
algorithm optimizes for the raw gradient measure, and is warm started with the
result of spectral seriation (2-Sum problem) since Hahsler (2017) shows that
2-Sum solutions are similar to solutions for the gradient measure.
Use warmstart = "random"
for no warm start.
The initial temperature t0
and minimum temperature tmin
can be set. If
t0
is not set, then it is estimated by sampling uphill moves and setting
t0
such that the median uphill move have a probability
of tinitialaccept
.
Using the cooling rate cool
, the number of iterations
to go for t0
to tmin
is calculated.
Several popular local neighborhood functions are
provided, and new ones can be defined (see LS). Local moves are tried in each
iteration nlocal
times the number of objects.
Note that this is an R implementation repeatedly calling the criterion funciton which is very slow.
Stochastic gradient descent: "SGD"
Starts with a solution and then performs stochastic gradient descent to find a close-by local optimum given a specified criterion.
Important control
parameters:
"criterion"
: the criterion to optimize
"init"
: initial seriation (an order or the name of a seriation method)
"max_iter"
: number of trials
Spectral seriation: "Spectral"
, "Spectral_norm"
(Ding and He, 2004)
Spectral seriation uses a relaxation to minimize the 2-Sum Problem (Barnard, Pothen, and Simon, 1993). It uses the order of the Fiedler vector of the similarity matrix's (normalized) Laplacian.
Spectral seriation gives a good trade-off between seriation quality, and scalability (see Hahsler, 2017).
Traveling salesperson problem solver: "TSP"
Uses a traveling salesperson problem solver to minimize the
Hamiltonian path length. The solvers in TSP are used (see
TSP::solve_TSP()
). The solver method can be passed on via the control
argument, e.g., control = list(method = "two_opt")
. Default is the est
of 10 runs of arbitrary insertion heuristic with 2-opt improvement.
Since a tour returned by a TSP solver is a connected circle and we are looking for a path representing a linear order, we need to find the best cutting point. Climer and Zhang (2006) suggest adding a dummy city with equal distance to each other city before generating the tour. The place of this dummy city in an optimal tour with minimal length is the best cutting point (it lies between the most distant cities).
Other Methods
Identity permutation: '"Identity"
Reverse Identity permutation: '"Reverse"
Random permutation: "Random"
Rank-two ellipse seriation: "R2E"
(Chen 2002)
Rank-two ellipse seriation starts with generating a sequence of correlation matrices
.
is the correlation matrix of the original
distance matrix
(supplied to the function as
x
), and
where calculates the correlation
matrix.
The rank of the matrix falls with increasing
. The first
in the sequence, which has a rank of 2 is found. Projecting all
points in this matrix on the first two eigenvectors, all points fall on an
ellipse. The order of the points on this ellipse is the resulting order.
The ellipse can be cut at the two interception points (top or bottom) of the vertical axis with the ellipse. In this implementation, the topmost cutting point is used.
Sorting Points Into Neighborhoods: "SPIN_STS"
, "SPIN_NH"
(Tsafrir, 2005)
Given a weight matrix , the SPIN algorithms try to
minimize the energy for a permutation (matrix
) given by
where denotes the matrix trace.
"SPIN_STS"
implements the Side-to-Side algorithm, which tries to push
out large distance values. The default weight matrix suggested in the paper
with and
is used. We run the algorithm form
step
(25) iteration and restart the algorithm nstart
(10) with
random initial permutations (default values in parentheses).
"SPIN_NH"
implements the neighborhood algorithm (concentrate low
distance values around the diagonal) with a Gaussian weight matrix
, where
is the size of the
dissimilarity matrix and
is the variance around the diagonal
that control the influence of global (large
) or local (small
) structure.
We use the heuristic suggested in the paper for the linear assignment
problem. We do not terminate as indicated in the algorithm but run all the
iterations since the heuristic does not guarantee that the energy is
strictly decreasing. We also implement the heuristic "annealing" scheme
where is successively reduced. The parameters in
control
are sigma
which can be a single value or a decreasing sequence
(default: 20 to 1 in 10 steps), and step
, which defines how many update
steps are performed before for each value of alpha
. Via
W_function
a custom function to create with the function
signature
function(n, sigma, verbose)
can be specified.
Visual Assessment of (Clustering) Tendency: "VAT"
(Bezdek and Hathaway, 2002).
Creates an order based on Prim's algorithm for finding a minimum spanning tree (MST) in a weighted connected graph representing the distance matrix. The order is given by the order in which the nodes (objects) are added to the MST.
Seriation methods for matrices (matrix)
Two-mode two-way data are general matrices. Some methods also require that the matrix is positive. Data frames and contingency tables (base::table) are converted into a matrix. However, the default methods are different.
Some methods find the row and column order simultaneously,
while others calculate them independently.
Currently, the
following methods are implemented for matrix
:
Seriating rows and columns simultaneously
Row and column order influence each other.
Bond Energy Algorithm: "BEA"
(McCormick, 1972).
The algorithm tries to maximize a non-negative matrix's Measure of Effectiveness. Due to the definition of this measure, the tasks of ordering rows and columns are separable and can be solved independently.
BEA consists of the following three steps:
Place one randomly chosen column.
Try to place each remaining column at each possible position left, right and between the already placed columns and calculate every time the increase in ME. Choose the column and position which gives the largest increase in ME and place the column. Repeat till all columns are placed.
Repeat procedure with rows.
The overall procedure amounts to two approximate traveling salesperson problems (TSP) where the distance is the -1 times the ME increase. The BEA algorithm is equivalent to a simple suboptimal TSP heuristic called 'cheapest insertion'. Several consecutive runs of the algorithm might improve the energy if a better initial column/row is chosen.
Arabie and Hubert (1990) question its use with non-binary data if the objective is to find a seriation or one-dimensional orderings of rows and columns.
TSP to optimize the Measure of Effectiveness: "BEA_TSP"
(Lenstra 1974).
Since BEA is equivalent to a simple TSP heuristic, we can use better TSP
solvers to get better results.
Distances between rows are calculated for a data matrix as
Distances between columns are calculated the same way from the transposed data matrix.
Solving the two TSP using these distances optimizes the measure of effectiveness. With an exact TSP solver, the optimal solution can be found.
control
parameter:
"method"
: a TSP solver method (see TSP::solve_TSP()
).
Unconstrained Brower and Kyle seriation: "BK_unconstrained"
(Brower and Kyle 1988).
Reorderes 0-1 matrices to create a block structure along the diagonal. It iteratively reorders by the mean row indices of 1s and mean column indices of 1s till the orders become stable.
control
parameter: None
Correspondence analysis "CA"
(Friendly, 2023)
This function is designed to help simplify a mosaic plot or other displays of a matrix of frequencies. It calculates a correspondence analysis of the matrix and an order for rows and columns according to the scores on a correspondence analysis dimension.
This is the default method for contingency tables.
control
parameters:
"dim"
: CA dimension used for reordering.
"ca_param"
: List with parameters for the call to ca::ca()
.
Seriating rows and columns separately using dissimilarities
Heatmap seriation: "Heatmap"
Calculates distances between rows and between columns and then applies seriation so each. This is the default method for data frames.
control
parameter:
"seriation_method"
: a list with row and column seriation methods.
The special method "HC_Mean"
is available to use hierarchical clustering
with reordering the leaves by the row/column means (see stats::heatmap()
).
Defaults to optimal leaf ordering "OLO"
.
"seriation_control"
: a list with control parameters for row and column
seriation methods.
"dist_fun"
: specify the distance calculation as a function.
"scale"
: "none"
, "row"
, or "col"
.
Seriate rows using the data matrix
These methods need access to the data matrix instead of dissimilarities to reorder objects (rows). Columns can also be reorderd by applying the same technique to the transposed data matrix.
Order along the 1D locally linear embedding: "LLE"
Performs 1D the non-linear dimensionality reduction method locally linear embedding
(see lle()
).
Order along the first principal component: "PCA"
Uses the projection of the data on its first principal component (using
stats::princomp()
) to
determine the order of rows. Performs the same procedure on the transposed
matrix to obtain the column order.
Note that for a distance matrix calculated from x
with Euclidean
distance, this method minimizes the least square criterion.
Angular order of the first two PCA components: "PCA_angle"
For rows, projects the data on the first two principal components and then orders by the angle in this space. The order is split by the larges gap between adjacent angles. A similar method was suggested by Friendly (2002) to order variables in correlation matrices by angles of first two eigenvectors. PCA also computes the eigenvectors of the covariance matrix of the data.
Performs the same process on the transposed matrix for the column order.
Other methods
Angular order of the first two eigenvectors: "AOE"
(Friendly 2002)
This method reordered correlation matrices by the angle in the space spanned by the two largest eigenvectors of the matrix. The order is split by the largest angle gap. This is the original method proposed by Friendly (2002).
By row/column mean: "Mean"
A transformation can be applied before calculating the means.
The function is specified as control
parameter "transformation"
. Any function that takes as an input a
matrix and returns the transformed matrix can be used. Examples
are scale
or \(x) x^.5
.
Identity permutation: "Identity"
Reverse Identity permutation: "Reverse"
Random permutation: "Random"
For general arrays no built-in methods are currently available.
Returns an object of class ser_permutation.
Michael Hahsler
Arabie, P. and L.J. Hubert (1990): The bond energy algorithm revisited, IEEE Transactions on Systems, Man, and Cybernetics, 20(1), 268–274. doi:10.1109/21.47829
Bar-Joseph, Z., E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29. doi:10.1093/bioinformatics/17.suppl_1.S22
Barnard, S. T., A. Pothen, and H. D. Simon (1993): A Spectral Algorithm for Envelope Reduction of Sparse Matrices. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, 493–502. Supercomputing '93. New York, NY, USA: ACM. https://ieeexplore.ieee.org/document/1263497
Bezdek, J.C. and Hathaway, R.J. (2002): VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), Volume: 3, 2225–2230. doi:10.1109/IJCNN.2002.1007487
Brower, J.C. and Kile, K.M. (1988): Sedation of an original data matrix as applied to paleoecology. Lethaia, 21, 79–93. doi:10.1111/j.1502-3931.1988.tb01756.x
Brusco, M., Koehn, H.F., and Stahl, S. (2008): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, 73(3), 503–522. doi:10.1007/s11336-007-9049-5
Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer. doi:10.1007/0-387-28810-4
Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.
Ding, C. and Xiaofeng He (2004): Linearized cluster assignment via spectral ordering. Proceedings of the Twenty-first International Conference on Machine learning (ICML '04). doi:10.1145/1015330.1015407
Climer, S. and Xiongnu Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.
D. Earle, C. B. Hurley (2015): Advances in dendrogram seriation for application to visualization. Journal of Computational and Graphical Statistics, 24(1), 1–25.
Friendly, M. (2002): Corrgrams: Exploratory Displays for Correlation Matrices. The American Statistician, 56(4), 316–324. doi:10.1198/000313002533
Friendly, M. (2023). vcdExtra: 'vcd' Extensions and Additions. R package version 0.8-5, https://CRAN.R-project.org/package=vcdExtra.
Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206. doi:10.1111/j.2044-8317.1972.tb00491.x
Hahsler, M. (2017): An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257, 133–143. doi:10.1016/j.ejor.2016.08.066
Hubert, Lawrence, and James Schultz (1976): Quadratic Assignment as a General Data Analysis Strategy. British Journal of Mathematical and Statistical Psychology, 29(2). Blackwell Publishing Ltd. 190–241. doi:10.1111/j.2044-8317.1976.tb00714.x
Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806. doi:10.1198/106186004X12425
Kruskal, J.B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115–129.
Lenstra, J.K (1974): Clustering a Data Array and the Traveling-Salesman Problem, Operations Research, 22(2) 413–414. doi:10.1287/opre.22.2.413
Mair P., De Leeuw J. (2015). Unidimensional scaling. In Wiley StatsRef: Statistics Reference Online, Wiley, New York. doi:10.1002/9781118445112.stat06462.pub2
McCormick, W.T., P.J. Schweitzer and T.W. White (1972): Problem decomposition and data reorganization by a clustering technique, Operations Research, 20(5), 993–1009. doi:10.1287/opre.20.5.993
Tenenbaum, J.B., de Silva, V. & Langford, J.C. (2000) A global network framework for nonlinear dimensionality reduction. Science 290, 2319-2323.
Tsafrir, D., Tsafrir, I., Ein-Dor, L., Zuk, O., Notterman, D.A. and Domany, E. (2005): Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices, Bioinformatics, 21(10) 2301–8. doi:10.1093/bioinformatics/bti329
Sammon, J. W. (1969) A non-linear mapping for data structure analysis. IEEE Trans. Comput., C-18 401–409.
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate_best()
# Show available seriation methods (for dist and matrix) list_seriation_methods() # show the description for ARSA get_seriation_method("dist", name = "ARSA") ### Seriate as distance matrix (for 50 flowers from the iris dataset) data("iris") x <- as.matrix(iris[-5]) x <- x[sample(nrow(x), size = 50), ] d <- dist(x) order <- seriate(d) order pimage(d, main = "Distances (Random Order)") pimage(d, order, main = "Distances (Reordered)") # Compare seriation quality rbind( random = criterion(d), reordered = criterion(d, order) ) # Reorder the distance matrix d_reordered <- permute(d, order) pimage(d_reordered, main = "Distances (Reordered)") ### Seriate a matrix (50 flowers from iris) # To make the variables comparable, we scale the data x <- scale(x, center = FALSE) # The iris flowers are ordered by species in the data set pimage(x, main = "original data", prop = FALSE) criterion(x) # Apply some methods order <- seriate(x, method = "BEA_TSP") pimage(x, order, main = "TSP to optimize ME", prop = FALSE) criterion(x, order) order <- seriate(x, method = "PCA") pimage(x, order, main = "First principal component", prop = FALSE) criterion(x, order) order <- seriate(x, method = "heatmap") pimage(x, order, main = "Heatmap seriation", prop = FALSE) criterion(x, order) # reorder the matrix x_reordered <- permute(x, order) # create a heatmap seriation manually by calculating # distances between rows and between columns order <- c( seriate(dist(x), method = "OLO"), seriate(dist(t(x)), method = "OLO") ) pimage(x, order, main = "Heatmap seriation", prop = FALSE) criterion(x, order) ### Seriate a correlation matrix corr <- cor(x) # plot in original order pimage(corr, main = "Correlation matrix") # reorder the correlation matrix using the angle of eigenvectors pimage(corr, order = "AOE", main = "Correlation matrix (AOE)") # we can also define a distance (we used d = sqrt(1 - r)) and # then reorder the matrix (rows and columns) using any seriation method. d <- as.dist(sqrt(1 - corr)) o <- seriate(d, method = "R2E") corr_reordered <- permute(corr, order = c(o, o)) pimage(corr_reordered, main = "Correlation matrix (R2E)")
# Show available seriation methods (for dist and matrix) list_seriation_methods() # show the description for ARSA get_seriation_method("dist", name = "ARSA") ### Seriate as distance matrix (for 50 flowers from the iris dataset) data("iris") x <- as.matrix(iris[-5]) x <- x[sample(nrow(x), size = 50), ] d <- dist(x) order <- seriate(d) order pimage(d, main = "Distances (Random Order)") pimage(d, order, main = "Distances (Reordered)") # Compare seriation quality rbind( random = criterion(d), reordered = criterion(d, order) ) # Reorder the distance matrix d_reordered <- permute(d, order) pimage(d_reordered, main = "Distances (Reordered)") ### Seriate a matrix (50 flowers from iris) # To make the variables comparable, we scale the data x <- scale(x, center = FALSE) # The iris flowers are ordered by species in the data set pimage(x, main = "original data", prop = FALSE) criterion(x) # Apply some methods order <- seriate(x, method = "BEA_TSP") pimage(x, order, main = "TSP to optimize ME", prop = FALSE) criterion(x, order) order <- seriate(x, method = "PCA") pimage(x, order, main = "First principal component", prop = FALSE) criterion(x, order) order <- seriate(x, method = "heatmap") pimage(x, order, main = "Heatmap seriation", prop = FALSE) criterion(x, order) # reorder the matrix x_reordered <- permute(x, order) # create a heatmap seriation manually by calculating # distances between rows and between columns order <- c( seriate(dist(x), method = "OLO"), seriate(dist(t(x)), method = "OLO") ) pimage(x, order, main = "Heatmap seriation", prop = FALSE) criterion(x, order) ### Seriate a correlation matrix corr <- cor(x) # plot in original order pimage(corr, main = "Correlation matrix") # reorder the correlation matrix using the angle of eigenvectors pimage(corr, order = "AOE", main = "Correlation matrix (AOE)") # we can also define a distance (we used d = sqrt(1 - r)) and # then reorder the matrix (rows and columns) using any seriation method. d <- as.dist(sqrt(1 - corr)) o <- seriate(d, method = "R2E") corr_reordered <- permute(corr, order = c(o, o)) pimage(corr_reordered, main = "Correlation matrix (R2E)")
Often the best seriation method for a particular dataset is not know and
heuristics may produce unstable results.
seriate_best()
and seriate_rep()
automatically try different seriation methods or
rerun randomized methods several times to find the best and order
given a criterion measure. seriate_improve()
uses a local improvement strategy
to imporve an existing solution.
seriate_best( x, methods = NULL, control = NULL, criterion = NULL, rep = 10L, parallel = TRUE, verbose = TRUE, ... ) seriate_rep( x, method = NULL, control = NULL, criterion = NULL, rep = 10L, parallel = TRUE, verbose = TRUE, ... ) seriate_improve( x, order, criterion = NULL, control = NULL, verbose = TRUE, ... )
seriate_best( x, methods = NULL, control = NULL, criterion = NULL, rep = 10L, parallel = TRUE, verbose = TRUE, ... ) seriate_rep( x, method = NULL, control = NULL, criterion = NULL, rep = 10L, parallel = TRUE, verbose = TRUE, ... ) seriate_improve( x, order, criterion = NULL, control = NULL, verbose = TRUE, ... )
x |
the data. |
methods |
a vector of character string with the name of the seriation methods to try. |
control |
a list of control options passed on to |
criterion |
|
rep |
number of times to repeat the randomized seriation algorithm. |
parallel |
logical; perform replications in parallel.
Uses |
verbose |
logical; show progress and results for different methods |
... |
further arguments are passed on to the |
method |
a character string with the name of the seriation method (default: varies by data type). |
order |
a |
seriate_rep()
rerun a randomized seriation methods to find the best solution
given the criterion specified for the method in the registry.
A specific criterion can also be specified.
Non-stochastic methods are automatically only run once.
seriate_best()
runs a set of methods and returns the best result given a
criterion. Stochastic methods are automatically randomly restarted several times.
seriate_improve()
improves a seriation order using simulated annealing using
a specified criterion measure. It uses seriate()
with method "GSA
",
a reduced probability to accept bad moves, and a lower minimum temperature. Control
parameters for this method are accepted.
Criterion
If no criterion is specified, then the criterion specified for the method in
the registry (see [get_seriation_method()]
) is used. For methods with no
criterion in the registry (marked as "other"), a default method is used.
The defaults are:
dist
: "AR_deviations"
- the study in Hahsler (2007) has shown that this
criterion has high similarity with most other criteria.
matrix
: "Moore_stress"
Parallel Execution
Some methods support for parallel execution is provided using the foreach package. To use parallel execution, a suitable backend needs to be registered (see the Examples section for using the doParallel backend).
Returns an object of class ser_permutation.
Michael Hahsler
Hahsler, M. (2017): An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257, 133–143. doi:10.1016/j.ejor.2016.08.066
Other seriation:
register_DendSer()
,
register_GA()
,
register_optics()
,
register_smacof()
,
register_tsne()
,
register_umap()
,
registry_for_seriation_methods
,
seriate()
data(SupremeCourt) d_supreme <- as.dist(SupremeCourt) # find best seriation order (tries by by default several fast methods) o <- seriate_best(d_supreme, criterion = "AR_events") o pimage(d_supreme, o) # run a randomized algorithms several times. It automatically chooses the # LS criterion. Repetition information is returned as attributes o <- seriate_rep(d_supreme, "QAP_LS", rep = 5) attr(o, "criterion") hist(attr(o, "criterion_distribution")) pimage(d_supreme, o) ## Not run: # Using parallel execution on a larger dataset data(iris) m_iris <- as.matrix(iris[sample(seq(nrow(iris))),-5]) d_iris <- dist(m_iris) library(doParallel) registerDoParallel(cores = detectCores() - 1L) # seriate rows of the iris data set o <- seriate_best(d_iris, criterion = "LS") o pimage(d_iris, o) # improve the order to minimize RGAR instead of LS o_improved <- seriate_improve(d_iris, o, criterion = "RGAR") pimage(d_iris, o_improved) # available control parameters for seriate_improve() get_seriation_method(name = "GSA") ## End(Not run)
data(SupremeCourt) d_supreme <- as.dist(SupremeCourt) # find best seriation order (tries by by default several fast methods) o <- seriate_best(d_supreme, criterion = "AR_events") o pimage(d_supreme, o) # run a randomized algorithms several times. It automatically chooses the # LS criterion. Repetition information is returned as attributes o <- seriate_rep(d_supreme, "QAP_LS", rep = 5) attr(o, "criterion") hist(attr(o, "criterion_distribution")) pimage(d_supreme, o) ## Not run: # Using parallel execution on a larger dataset data(iris) m_iris <- as.matrix(iris[sample(seq(nrow(iris))),-5]) d_iris <- dist(m_iris) library(doParallel) registerDoParallel(cores = detectCores() - 1L) # seriate rows of the iris data set o <- seriate_best(d_iris, criterion = "LS") o pimage(d_iris, o) # improve the order to minimize RGAR instead of LS o_improved <- seriate_improve(d_iris, o, criterion = "RGAR") pimage(d_iris, o_improved) # available control parameters for seriate_improve() get_seriation_method(name = "GSA") ## End(Not run)
Contains a (a subset of the) decisions for the stable 8-yr period 1995-2002 of the second Rehnquist Supreme Court. Decisions are aggregated to the joint probability for disagreement between judges.
A square, symmetric 9-by-9 matrix with the joint probability for disagreement.
Michael Hahsler
Sirovich, L. (2003). A pattern analysis of the second Rehnquist U.S. Supreme Court. _Proceedings of the National Academy of Sciences of the United States of America,_ **100**, 7432-7437. \doi{10.1073/pnas.1132164100}
Other data:
Chameleon
,
Irish
,
Munsingen
,
Townships
,
Wood
,
Zoo
,
create_lines_data()
,
is.robinson()
data("SupremeCourt") # a matrix with joint probability of disagreement SupremeCourt # show judges in original alphabetical order d <- as.dist(SupremeCourt) pimage(d, diag = TRUE, upper = TRUE) # reorder judges using seriation based on similar decisions o <- seriate(d) o pimage(d, o, diag = TRUE, upper = TRUE) # Use optimal leaf ordering (hierarchical clustering with reordering) # which uses a dendrogram o <- seriate(d, method = "OLO") o plot(o[[1]]) # Use multi-dimensional scaling and show the configuration o <- seriate(d, method = "sammon") o pimage(d, o, diag = TRUE, upper = TRUE) plot_config(o[[1]])
data("SupremeCourt") # a matrix with joint probability of disagreement SupremeCourt # show judges in original alphabetical order d <- as.dist(SupremeCourt) pimage(d, diag = TRUE, upper = TRUE) # reorder judges using seriation based on similar decisions o <- seriate(d) o pimage(d, o, diag = TRUE, upper = TRUE) # Use optimal leaf ordering (hierarchical clustering with reordering) # which uses a dendrogram o <- seriate(d, method = "OLO") o plot(o[[1]]) # Use multi-dimensional scaling and show the configuration o <- seriate(d, method = "sammon") o pimage(d, o, diag = TRUE, upper = TRUE) plot_config(o[[1]])
This data contains nine characteristics for 16 townships. The data set was used by Bertin (1981) to illustrate that the conciseness of presentation can be improved by seriating the rows and columns.
A matrix with 16 0-1 variables (columns) indicating the presence
(1
) or absence (0
) of characteristics of townships
(rows).
Michael Hahsler
Bertin, J. (1981): Graphics and Graphic Information Processing. Berlin, Walter de Gruyter.
Other data:
Chameleon
,
Irish
,
Munsingen
,
SupremeCourt
,
Wood
,
Zoo
,
create_lines_data()
,
is.robinson()
data("Townships") ## original data pimage(Townships) criterion(Townships) ## seriated data using an improved Bond-Energy Algorithm order <- seriate(Townships, method = "BEA_TSP") pimage(Townships, order) criterion(Townships, order)
data("Townships") ## original data pimage(Townships) criterion(Townships) ## seriated data using an improved Bond-Energy Algorithm order <- seriate(Townships, method = "BEA_TSP") pimage(Townships, order) criterion(Townships, order)
Fits an (approximate) unidimensional scaling configuration given an order.
uniscale(d, order, accept_reorder = FALSE, warn = TRUE, ...) MDS_stress(d, order, refit = TRUE, warn = FALSE) get_config(x, dim = 1L, ...) plot_config(x, main, pch = 19, labels = TRUE, pos = 1, cex = 1, ...)
uniscale(d, order, accept_reorder = FALSE, warn = TRUE, ...) MDS_stress(d, order, refit = TRUE, warn = FALSE) get_config(x, dim = 1L, ...) plot_config(x, main, pch = 19, labels = TRUE, pos = 1, cex = 1, ...)
d |
a dissimilarity matrix. |
order |
a precomputed permutation (configuration) order. |
accept_reorder |
logical; accept a configuration that does not preserve
the requested order. If |
warn |
logical; produce a warning if the 1D MDS fit does not preserve the given order. |
... |
additional arguments are passed on to the seriation method. |
refit |
logical; forces to refit a minimum-stress MDS configuration,
even if |
x |
a scaling returned by |
dim |
The dimension if |
main |
main plot label |
pch |
print character |
labels |
add the object names to the plot |
pos |
label position for 2D plot (see |
cex |
label expansion factor. |
This implementation uses the method describes in Maier and De Leeuw (2015) to calculate the
minimum stress configuration for a given (seriation) order by performing a 1D MDS fit.
If the 1D MDS fit does not preserve the given order perfectly, then a warning is
produced indicating
for how many positions order could not be preserved.
The seriation method which is consistent to uniscale is "MDS_smacof"
which needs to be registered with register_smacof()
.
The code is similar to smacof::uniscale()
(de Leeuw, 2090),
but scales to larger
datasets since it only uses the permutation given by order
.
MDS_stress()
calculates the normalized stress of a configuration given by a seriation order.
If the order does not contain a configuration, then a minimum-stress configuration if calculates
for the given order.
All distances are first normalized to an average distance of close to 1 using
.
Some seriation methods produce a MDS configuration (a 1D or 2D embedding). get_config()
retrieved the configuration attribute from the ser_permutation_vector
. NULL
is returned if the seriation did not produce a configuration.
plot_config()
plots 1D and 2D configurations. ...
is passed on
to plot.default
and accepts col
, labels
, etc.
A vector with the fitted configuration.
Michael Hahsler with code from Patrick Mair (from smacof::uniscale()
).
Mair P., De Leeuw J. (2015). Unidimensional scaling. In Wiley StatsRef: Statistics Reference Online, Wiley, New York. doi:10.1002/9781118445112.stat06462.pub2
Jan de Leeuw, Patrick Mair (2009). Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31(3), 1-30. doi:10.18637/jss.v031.i03
data(SupremeCourt) d <- as.dist(SupremeCourt) d # embedding-based methods return "configuration" attribute # plot_config visualizes the configuration o <- seriate(d, method = "sammon") get_order(o) plot_config(o) # the configuration (Note: objects are in the original order in d) get_config(o) # angle methods return a 2D configuration o <- seriate(d, method = "MDS_angle") get_order(o) get_config(o) plot_config(o, ) # calculate a configuration for a seriation method that does not # create a configuration o <- seriate(d, method = "ARSA") get_order(o) get_config(o) # find the minimum-stress configuration for the ARSA order sc <- uniscale(d, o) sc plot_config(sc)
data(SupremeCourt) d <- as.dist(SupremeCourt) d # embedding-based methods return "configuration" attribute # plot_config visualizes the configuration o <- seriate(d, method = "sammon") get_order(o) plot_config(o) # the configuration (Note: objects are in the original order in d) get_config(o) # angle methods return a 2D configuration o <- seriate(d, method = "MDS_angle") get_order(o) get_config(o) plot_config(o, ) # calculate a configuration for a seriation method that does not # create a configuration o <- seriate(d, method = "ARSA") get_order(o) get_config(o) # find the minimum-stress configuration for the ARSA order sc <- uniscale(d, o) sc plot_config(sc)
Implements Visual Analysis for Cluster Tendency Assessment (VAT; Bezdek and Hathaway, 2002) and Improved Visual Analysis for Cluster Tendency Assessment (iVAT; Wang et al, 2010).
VAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) iVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) path_dist(x) ggVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) ggiVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...)
VAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) iVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) path_dist(x) ggVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...) ggiVAT(x, upper_tri = TRUE, lower_tri = TRUE, ...)
x |
a |
upper_tri , lower_tri
|
a logical indicating whether to show the upper or lower triangle of the VAT matrix. |
... |
further arguments are passed on to |
path_dist()
redefines the distance between two objects as the minimum
over the largest distances in all possible paths between the objects as used
for iVAT.
Nothing.
Michael Hahsler
Bezdek, J.C. and Hathaway, R.J. (2002): VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), Volume: 3, 2225–2230.
Havens, T.C. and Bezdek, J.C. (2012): An Efficient Formulation of the Improved Visual Assessment of Cluster Tendency (iVAT) Algorithm, IEEE Transactions on Knowledge and Data Engineering, 24(5), 813–822.
Wang L., U.T.V. Nguyen, J.C. Bezdek, C.A. Leckie and K. Ramamohanarao (2010): iVAT and aVAT: Enhanced Visual Analysis for Cluster Tendency Assessment, Proceedings of the PAKDD 2010, Part I, LNAI 6118, 16–27.
Other plots:
bertinplot()
,
dissplot()
,
hmap()
,
palette()
,
pimage()
## lines data set from Havens and Bezdek (2011) x <- create_lines_data(250) plot(x, xlim=c(-5,5), ylim=c(-3,3), cex=.2) d <- dist(x) ## create regular VAT VAT(d, main = "VAT for Lines") ## same as: pimage(d, seriate(d, "VAT")) ## ggplot2 version if (require("ggplot2")) { ggVAT(d) + labs(title = "VAT") } ## create iVAT which shows visually the three lines iVAT(d, main = "iVAT for Lines") ## same as: ## d_path <- path_dist(d) ## pimage(d_path, seriate(d_path, "VAT for Lines")) ## ggplot2 version if (require("ggplot2")) { ggiVAT(d) + labs(title = "iVAT for Lines") } ## compare with dissplot (shows banded structures and relationship between ## center line and the two outer lines) dissplot(d, method = "OLO_single", main = "Dissplot for Lines", col = bluered(100, bias = .5)) ## compare with optimally reordered heatmap hmap(d, method = "OLO_single", main = "Heatmap for Lines (opt. leaf ordering)", col = bluered(100, bias = .5))
## lines data set from Havens and Bezdek (2011) x <- create_lines_data(250) plot(x, xlim=c(-5,5), ylim=c(-3,3), cex=.2) d <- dist(x) ## create regular VAT VAT(d, main = "VAT for Lines") ## same as: pimage(d, seriate(d, "VAT")) ## ggplot2 version if (require("ggplot2")) { ggVAT(d) + labs(title = "VAT") } ## create iVAT which shows visually the three lines iVAT(d, main = "iVAT for Lines") ## same as: ## d_path <- path_dist(d) ## pimage(d_path, seriate(d_path, "VAT for Lines")) ## ggplot2 version if (require("ggplot2")) { ggiVAT(d) + labs(title = "iVAT for Lines") } ## compare with dissplot (shows banded structures and relationship between ## center line and the two outer lines) dissplot(d, method = "OLO_single", main = "Dissplot for Lines", col = bluered(100, bias = .5)) ## compare with optimally reordered heatmap hmap(d, method = "OLO_single", main = "Heatmap for Lines (opt. leaf ordering)", col = bluered(100, bias = .5))
A data matrix containing a sample of the normalized gene expression data for 6 locations in the stem of Popla trees published in the study by Herzberg et al (2001). The sample of 136 genes selected by Caraux and Pinloche (2005).
The format is a 136 x 6 matrix.
The data was obtained from http://www.atgc-montpellier.fr/permutmatrix/manual/Exemples/Wood/Wood.htm.
Hertzberg M., H. Aspeborg, J. Schrader, A. Andersson, R.Erlandsson, K. Blomqvist, R. Bhalerao, M. Uhlen, T. T. Teeri, J. Lundeberg, Bjoern Sundberg, P. Nilsson and Goeran Sandberg (2001): A transcriptional roadmap to wood formation, PNAS, 98(25), 14732–14737.
Caraux G. and Pinloche S. (2005): PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order, Bioinformatics, 21(7) 1280–1281.
Other data:
Chameleon
,
Irish
,
Munsingen
,
SupremeCourt
,
Townships
,
Zoo
,
create_lines_data()
,
is.robinson()
data(Wood) head(Wood)
data(Wood) head(Wood)
A database containing characteristics of different animals. The database was created and donated by Richard S. Forsyth and is available from the UCI Machine Learning Repository (Newman et al, 1998).
A data frame with 101 observations on the following 17 variables.
hair
a numeric vector
feathers
a numeric vector
eggs
a numeric vector
milk
a numeric vector
airborne
a numeric vector
aquatic
a numeric vector
predator
a numeric vector
toothed
a numeric vector
backbone
a numeric vector
breathes
a numeric vector
venomous
a numeric vector
fins
a numeric vector
legs
a numeric vector
tail
a numeric vector
domestic
a numeric vector
catsize
a numeric vector
class
a factor with levels amphibian
bird
fish
insect
invertebrate
mammal
reptile
David Aha, Patrick Murphy, Christopher Merz, Eamonn Keogh, Cathy Blake, Seth Hettich, David Newman, Arthur Asuncion, Moshe Lichman, Dheeru Dua, Casey Graff (2023): UCI Machine Learning Repository, https://archive.ics.uci.edu/, University of California, Irvine.
Other data:
Chameleon
,
Irish
,
Munsingen
,
SupremeCourt
,
Townships
,
Wood
,
create_lines_data()
,
is.robinson()
data("Zoo") x <- scale(Zoo[, -17]) d <- dist(x) pimage(d) order <- seriate(d, method = "tsp") pimage(d, order)
data("Zoo") x <- scale(Zoo[, -17]) d <- dist(x) pimage(d) order <- seriate(d, method = "tsp") pimage(d, order)