Title: | Extensible Markov Model for Modelling Temporal Relationships Between Clusters |
---|---|
Description: | Implements TRACDS (Temporal Relationships between Clusters for Data Streams), a generalization of Extensible Markov Model (EMM). TRACDS adds a temporal or order model to data stream clustering by superimposing a dynamically adapting Markov Chain. Also provides an implementation of EMM (TRACDS on top of tNN data stream clustering). Development of this package was supported in part by NSF IIS-0948893 and R21HG005912 from the National Human Genome Research Institute. Hahsler and Dunham (2010) <doi:10.18637/jss.v035.i05>. |
Authors: | Michael Hahsler [aut, cre, cph] , Margaret H. Dunham [ctb] |
Maintainer: | Michael Hahsler <[email protected]> |
License: | GPL-2 |
Version: | 1.2.1 |
Built: | 2024-10-26 05:27:34 UTC |
Source: | https://github.com/mhahsler/rEMM |
This data set contains count data for 16S ribosomal RNA (rRNA) sequences for
the two phylogenetic classes Alphaproteobacteria and Mollicutes.
The counts for 30
sequences for each class were obtained by counting the occurrence of triplets
of nucleotides in windows of length 100 without any overlap. To
separate sequences a row of dummy count of NA
is used.
data("16S")
data("16S")
Alphaproteobacteria16S
and Mollicutes16S
are matrices
with about 449 rows and 64 (number of possible triplets) columns.
The raw sequence information was obtained from the National center for biotechnology information (NCBI) website at http://www.ncbi.nih.gov/
data("16S") emm <- EMM("Kullback", threshold=0.1) build(emm, Mollicutes16S+1) ## start state for sequences have an initial state probability >0 it <- initial_transition(emm) it[it>0]
data("16S") emm <- EMM("Kullback", threshold=0.1) build(emm, Mollicutes16S+1) ## start state for sequences have an initial state probability >0 it <- initial_transition(emm) it[it>0]
Add new data to an EMM.
build(x, newdata, ...)
build(x, newdata, ...)
x |
an |
newdata |
a vector (one observation), or a matrix or data.frame (each row is an observation) |
... |
further arguments. If |
build()
performs clustering and also updates the TRACDS temporal
layer.
NA
s are handled in the data by using only the other
dimensions if the data for dissimilarity computation
(see package proxy).
A reference to the changed EMM object with the data added.
Note: EMM objects store all variable data in an environment which
enables us to update partial data without copying the whole object. Assignment
will not create a copy! Use the provided method copy()
.
Class TRACDS
,
fade
and dist
in proxy.
## load EMMTraffic data data("EMMTraffic") EMMTraffic ## create EMM emm <- EMM(measure="eJaccard", threshold=0.2) ## build model using EMMTraffic data (note that the EMM object is ## changed without assignment!) build(emm, EMMTraffic) ## same as: emm <- build(emm, EMMTraffic) size(emm) plot(emm) ## emm2 <- emm does not create a copy (just a reference) ## a "deep" copy is created using copy() emm2<- copy(emm) ## convert the emm into a graph as.igraph(emm)
## load EMMTraffic data data("EMMTraffic") EMMTraffic ## create EMM emm <- EMM(measure="eJaccard", threshold=0.2) ## build model using EMMTraffic data (note that the EMM object is ## changed without assignment!) build(emm, EMMTraffic) ## same as: emm <- build(emm, EMMTraffic) size(emm) plot(emm) ## emm2 <- emm does not create a copy (just a reference) ## a "deep" copy is created using copy() emm2<- copy(emm) ## convert the emm into a graph as.igraph(emm)
Cluster new data into an existing tNN object.
cluster(x, newdata, ...)
cluster(x, newdata, ...)
x |
a |
newdata |
a vector (one observation), or a matrix or data.frame (each row is an observation). |
... |
further arguments like |
cluster()
implements tNN clustering The dissimilarity between
the new observation and the centers of the clusters is calculated. The
new observation is assigned to the closest cluster if the dissimilarity
value is smaller than the threshold (for the state). If no such state
exists, a new state is created for the observation. This simple
clustering algorithm is called nearest neighbor threshold nearest
neighbor (threshold NN).
NA
s are handled in the data by using only the other
dimensions if the data for dissimilarity computation
(see package~proxy).
The clusters which the data points in the last cluster()
operation where assigned to can be retrieved using the method
last_clustering()
.
A reference to the changed tNN object with the data added.
Note: tNN objects store all variable data in an environment which
enables us to update partial data without copying the whole object. Assignment
will not create a copy! Use the provided method copy()
.
Class tNN
,
fade
and dist
in proxy.
## load EMMTraffic data data(EMMTraffic) ## create empty clustering tnn <- tNN(th=0.2, measure="eJaccard") tnn ## cluster some data cluster(tnn, EMMTraffic) tnn ## what clusters were the data points assigned to? last_clustering(tnn) ## plot the clustering as a scatterplot matrix of the cluster centers plot(tnn)
## load EMMTraffic data data(EMMTraffic) ## create empty clustering tnn <- tNN(th=0.2, measure="eJaccard") tnn ## cluster some data cluster(tnn, EMMTraffic) tnn ## what clusters were the data points assigned to? last_clustering(tnn) ## plot the clustering as a scatterplot matrix of the cluster centers plot(tnn)
Combines two or more EMMs into a single object.
## S4 method for signature 'EMM' c(x, ..., copy=TRUE, recursive = FALSE)
## S4 method for signature 'EMM' c(x, ..., copy=TRUE, recursive = FALSE)
x |
first |
... |
further objects of the same class as |
copy |
a logical. Copy |
recursive |
a logical. If |
Returns invisibly an object of the same class as EMM
.
data("16S") ## create two EMMs for different data emm1 <- EMM("Kullback", threshold=0.1, data=Mollicutes16S+1) emm2 <- EMM("Kullback", threshold=0.1, data=Alphaproteobacteria16S+1) ## combine the two EMMs emm12 <- c(emm1, emm2) ## this is the same as: ## emm12 <- copy(emm1) ## c(emm12, emm2, copy=FALSE) ## recluster states so similar states in the to EMMs will be merged emm12r <- recluster_tNN(emm12) op <- par(mfrow = c(1, 2), pty = "s") plot(emm12, main="Two EMMs") plot(emm12r, main="Two EMMs (reclustered)") par(op)
data("16S") ## create two EMMs for different data emm1 <- EMM("Kullback", threshold=0.1, data=Mollicutes16S+1) emm2 <- EMM("Kullback", threshold=0.1, data=Alphaproteobacteria16S+1) ## combine the two EMMs emm12 <- c(emm1, emm2) ## this is the same as: ## emm12 <- copy(emm1) ## c(emm12, emm2, copy=FALSE) ## recluster states so similar states in the to EMMs will be merged emm12r <- recluster_tNN(emm12) op <- par(mfrow = c(1, 2), pty = "s") plot(emm12, main="Two EMMs") plot(emm12r, main="Two EMMs (reclustered)") par(op)
Data set with flow readings (in cubic meter per second) for six catchments of in the vicinity of the Derwent river in the northern UK. The data was collected daily from November 1, 1971 – January 31, 1977. The catchments are Long Bridge, Matlock Bath, Chat Sworth, What Stand Well, Ashford (Wye) and Wind Field Park (Amber).
The owner of the data is the Ridings Area Office of the Environment Agency North-East, UK.
data(Derwent)
data(Derwent)
A matrix of size 1918 days times 6 catchments.
UK National River Flow Archive (NRFA), https://nrfa.ceh.ac.uk/
The owner of the data is the Ridings Area Office of the Environment Agency North-East, UK.
Wikipedia, River Derwent, Yorkshire, https://en.wikipedia.org/wiki/River_Derwent,_Yorkshire
Wikipedia, River Wye, Derbyshire, https://en.wikipedia.org/wiki/River_Wye,_Derbyshire
Wikipedia, River Amber, https://en.wikipedia.org/wiki/River_Amber
data(Derwent) i <- 1 plot(Derwent[,i], type="l", main=colnames(Derwent[i]), ylab="Gauged Flows")
data(Derwent) i <- 1 plot(Derwent[,i], type="l", main=colnames(Derwent[i]), ylab="Gauged Flows")
Provides Data Stream Clusterer (DSC) interfaces for EMM and tNN so they can be used in the stream framework.
DSC_EMM(formula = NULL, threshold = 0.2, measure = "euclidean", distFun = NULL, centroids = identical(tolower(measure), "euclidean"), lambda = 0) DSC_tNN(formula = NULL, threshold = 0.2, measure = "euclidean", centroids = identical(tolower(measure), "euclidean"), lambda = 0) get_EMM(dsc) set_EMM(dsc, x)
DSC_EMM(formula = NULL, threshold = 0.2, measure = "euclidean", distFun = NULL, centroids = identical(tolower(measure), "euclidean"), lambda = 0) DSC_tNN(formula = NULL, threshold = 0.2, measure = "euclidean", centroids = identical(tolower(measure), "euclidean"), lambda = 0) get_EMM(dsc) set_EMM(dsc, x)
formula |
|
threshold |
A |
measure |
A |
distFun |
Specify a function passed on as method to |
centroids |
A |
lambda |
A |
dsc |
an object of class |
x |
an object of class |
DSC_tNN and DSC_EMM wrap the clustering algorithms so they can be used with the stream framework.
See DSC
for details.
get_EMM()
and set_EMM()
can be used to access the EMM object inside the DSC_EMM object.
An object of class "DSC_EMM"
or "DSC_tNN"
.
library(stream) ### tNN clustering example stream <- DSD_Gaussians() stream cl <- DSC_tNN(threshold = .1) cl update(cl, stream, 100) cl get_centers(cl) get_weights(cl) plot(cl, stream) ## EMM clustering example data("EMMsim") plot(EMMsim_train, pch = NA) lines(EMMsim_train, col = "gray") points(EMMsim_train, pch = EMMsim_sequence_train) stream <- DSD_Memory(EMMsim_train) stream cl <- DSC_EMM(threshold = 0.1, measure = "euclidean", lambda = .1) update(cl, stream, n = 200) cl reset_stream(stream) plot(cl, stream, n = 200, method = "pca") # inspect and recluster the EMM in the DSC_EMM object emm <- get_EMM(cl) plot(emm) emm <- recluster_hclust(emm, k = 4, method = "average") plot(emm) set_EMM(cl, emm) reset_stream(stream) plot(cl, stream, n = 200, method = "pca")
library(stream) ### tNN clustering example stream <- DSD_Gaussians() stream cl <- DSC_tNN(threshold = .1) cl update(cl, stream, 100) cl get_centers(cl) get_weights(cl) plot(cl, stream) ## EMM clustering example data("EMMsim") plot(EMMsim_train, pch = NA) lines(EMMsim_train, col = "gray") points(EMMsim_train, pch = EMMsim_sequence_train) stream <- DSD_Memory(EMMsim_train) stream cl <- DSC_EMM(threshold = 0.1, measure = "euclidean", lambda = .1) update(cl, stream, n = 200) cl reset_stream(stream) plot(cl, stream, n = 200, method = "pca") # inspect and recluster the EMM in the DSC_EMM object emm <- get_EMM(cl) plot(emm) emm <- recluster_hclust(emm, k = 4, method = "average") plot(emm) set_EMM(cl, emm) reset_stream(stream) plot(cl, stream, n = 200, method = "pca")
Create a new object of class "EMM"
.
EMM(threshold = 0.2, measure = "euclidean", distFun = NULL, centroids = identical(tolower(measure), "euclidean"), lambda = 0, data = NULL)
EMM(threshold = 0.2, measure = "euclidean", distFun = NULL, centroids = identical(tolower(measure), "euclidean"), lambda = 0, data = NULL)
threshold |
Object of class |
measure |
Object of class |
distFun |
Specify a function passed on as method to |
centroids |
Object of class |
lambda |
Object of class |
data |
Initial data to build the EMM.
This just calls |
An object of class "EMM"
.
## load EMMTraffic data data(EMMTraffic) ## create empty EMM emm <- EMM(threshold=0.2, measure="eJaccard", lambda=.1) emm ## cluster some data build(emm, EMMTraffic) emm ## what clusters were the data points assigned to? last_clustering(emm) ## plot the clustering as a graph plot(emm)
## load EMMTraffic data data(EMMTraffic) ## create empty EMM emm <- EMM(threshold=0.2, measure="eJaccard", lambda=.1) emm ## cluster some data build(emm, EMMTraffic) emm ## what clusters were the data points assigned to? last_clustering(emm) ## plot the clustering as a graph plot(emm)
This class represents the extensible Markov Model. It consists
of a simple data stream clustering algorithm (class "tNN"
) and
a temporal layer (class "TRACDS"
).
Objects can be created using the creator function EMM
or by
directly calling new("EMM", ...)
. Most slots for the extended
classes can be used as parameters for EMM
.
The slots are described in corresponding the extended classes (see section Extends).
Class "tNN"
, directly.
Class "TRACDS"
, directly.
signature(x = "EMM")
: Make a copy of the EMM object.
Making explicit copies is necessary since the subclasses store
information in environments which are not copied for regular
assignements.
signature(x = "EMM")
: Returns the size of
the EMM (number of clusters/states).
M.H. Dunham, Y. Meng, J. Huang (2004): Extensible Markov Model, In: ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 371–374.
build
,
fade
,
merge_clusters
,
plot
,
prune
,
rare_clusters
,
rare_transitions
,
remove_clusters
,
remove_transitions
,
remove_selftransitions
,
recluster
, and
score
.
A simulated data set with four clusters in . Each cluster is
represented by a bivariate normally distributed random variable.
are the coordinates of the means of the
distributions and
contains the covariance matrices.
Two data stream are created using a fixed sequence
through
the four clusters. For the training data, the sequence is repeated 40 times
(200 data points) and for the test data five times (25 data points).
The code to generate the data is shown in the Examples section below.
data(EMMsim)
data(EMMsim)
EMMsim_train
and EMMsim_test
are matrices containing the
data.
EMMsim_sequence_train
and EMMsim_sequence_test
contain the sequence of the data through the four clusters.
data(EMMsim) plot(EMMsim_train) points(EMMsim_test, col = "red") ## the data was generated by ## Not run: set.seed(1234) ## simulated data mu <- cbind(x = c(0, 0.2, 1, 0.9), y = c(0, 0.7, 1, 0.2)) sd_rho <- cbind( x = c(0.2, 0.15, 0.05, 0.02), y = c(0.1, 0.04, 0.03, 0.05), rho = c(0, 0.7, 0.3,-0.4) ) Sigma <- lapply( 1:nrow(sd_rho), FUN = function(i) rbind( c(sd_rho[i, "x"] ^ 2, sd_rho[i, "rho"] * sd_rho[i, "x"] * sd_rho[i, "y"]), c(sd_rho[i, "rho"] * sd_rho[i, "x"] * sd_rho[i, "y"], sd_rho[i, "y"] ^ 2) ) ) sequence <- c(1, 2, 1, 3, 4) EMMsim_sequence_train <- rep(sequence, 40) EMMsim_sequence_test <- rep(sequence, 5) library("MASS") EMMsim_train <- t(sapply( EMMsim_sequence_train, FUN = function(i) mvrnorm(1, mu = mu[i, ], Sigma = Sigma[[i]]) )) EMMsim_test <- t(sapply( rep(EMMsim_sequence_test), FUN = function(i) mvrnorm(1, mu = mu[i, ], Sigma = Sigma[[i]]) )) ## End(Not run)
data(EMMsim) plot(EMMsim_train) points(EMMsim_test, col = "red") ## the data was generated by ## Not run: set.seed(1234) ## simulated data mu <- cbind(x = c(0, 0.2, 1, 0.9), y = c(0, 0.7, 1, 0.2)) sd_rho <- cbind( x = c(0.2, 0.15, 0.05, 0.02), y = c(0.1, 0.04, 0.03, 0.05), rho = c(0, 0.7, 0.3,-0.4) ) Sigma <- lapply( 1:nrow(sd_rho), FUN = function(i) rbind( c(sd_rho[i, "x"] ^ 2, sd_rho[i, "rho"] * sd_rho[i, "x"] * sd_rho[i, "y"]), c(sd_rho[i, "rho"] * sd_rho[i, "x"] * sd_rho[i, "y"], sd_rho[i, "y"] ^ 2) ) ) sequence <- c(1, 2, 1, 3, 4) EMMsim_sequence_train <- rep(sequence, 40) EMMsim_sequence_test <- rep(sequence, 5) library("MASS") EMMsim_train <- t(sapply( EMMsim_sequence_train, FUN = function(i) mvrnorm(1, mu = mu[i, ], Sigma = Sigma[[i]]) )) EMMsim_test <- t(sapply( rep(EMMsim_sequence_test), FUN = function(i) mvrnorm(1, mu = mu[i, ], Sigma = Sigma[[i]]) )) ## End(Not run)
Each observation in this hypothetical data set is a vector of seven values obtained from sensors located at specific points on roads. Each sensor collects a count of the number of vehicles which have crossed this sensor in the preceding time interval.
data(EMMTraffic)
data(EMMTraffic)
A matrix with 12 observations (rows).
M.H. Dunham, Y. Meng, J. Huang (2004): Extensible Markov Model, In: ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 371–374.
data(EMMTraffic) EMMTraffic
data(EMMTraffic) EMMTraffic
Reduces the weight of old observations in the data stream.
build
has a learning rate parameter
lambda
. If this parameter is set, build
automatically
fades all counts before a new data point is added. The second
mechanism is to explicitly call the function~fade
whenever
fading is needed. This has the advantage that the overhead of manipulating
all counts in the EMM can be reduced and that fading can be used in a more
flexible manner. For example, if the data points are arriving at an irregular
rate, fade
could be called at regular time intervals
(e.g., every second).
fade(x, t, lambda)
fade(x, t, lambda)
x |
an object of class |
t |
number of time intervals (if missing 1 is used) |
lambda |
learning rate. If |
Old data points are faded by using a weight.
We define the weight
for data that is timesteps in the past by the following strictly
decreasing function:
Since the weight is multiplicative, it can be applied iteratively by
multiplying at each time step all counts by .
For the clustering algorithm the weight of the clusters (number of data
points assigned to the cluster) is faded. For the EMM the initial count vector
and all transition counts are faded.
Returns a reference to the changed object x
.
data("EMMTraffic") ## For the example we use a very high learning rate ## this calls fade after each new data point emm_l <- EMM(measure="eJaccard", threshold=0.2, lambda = 1) build(emm_l, EMMTraffic) ## build a regular EMM for comparison emm <- EMM(measure="eJaccard", threshold=0.2) build(emm, EMMTraffic) ## compare the transition matrix transition_matrix(emm) transition_matrix(emm_l) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm, main = "regular EMM") plot(emm_l, main = "EMM with high learning rate") par(op)
data("EMMTraffic") ## For the example we use a very high learning rate ## this calls fade after each new data point emm_l <- EMM(measure="eJaccard", threshold=0.2, lambda = 1) build(emm_l, EMMTraffic) ## build a regular EMM for comparison emm <- EMM(measure="eJaccard", threshold=0.2) build(emm, EMMTraffic) ## compare the transition matrix transition_matrix(emm) transition_matrix(emm_l) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm, main = "regular EMM") plot(emm_l, main = "EMM with high learning rate") par(op)
Finds the cluster and thus the EMM states for observations.
## S4 method for signature 'tNN,matrix' find_clusters(x, newdata, match_cluster=c("exact", "nn"), dist = FALSE)
## S4 method for signature 'tNN,matrix' find_clusters(x, newdata, match_cluster=c("exact", "nn"), dist = FALSE)
x |
an |
newdata |
a matrix/data.frame with observations. |
match_cluster |
find exact or nearest neighbor (nn) cluster/state. If a number is supplied then the threshold times this number is used for exact matching. |
dist |
also report the distance to the chosen cluster/state (as a data.frame). |
Returns the name of the matching clusters/states or a data.frame with
columns "state" and "dist" if dist=TRUE
.
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) find_clusters(emm, EMMTraffic) find_clusters(emm, EMMTraffic, dist=TRUE) ## add noise to the data set.seed(1234) newdata <- sapply(EMMTraffic, jitter, amount=15) ## default is exact match find_clusters(emm, newdata, dist=TRUE) ## match with nearest neighbor find_clusters(emm, newdata, match_cluster="nn", dist=TRUE) ## exact match only if within .5 times threshold find_clusters(emm, newdata, match_cluster=.5, dist=TRUE) ## exact match only if within 2 times threshold find_clusters(emm, newdata, match_cluster=2, dist=TRUE)
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) find_clusters(emm, EMMTraffic) find_clusters(emm, EMMTraffic, dist=TRUE) ## add noise to the data set.seed(1234) newdata <- sapply(EMMTraffic, jitter, amount=15) ## default is exact match find_clusters(emm, newdata, dist=TRUE) ## match with nearest neighbor find_clusters(emm, newdata, match_cluster="nn", dist=TRUE) ## exact match only if within .5 times threshold find_clusters(emm, newdata, match_cluster=.5, dist=TRUE) ## exact match only if within 2 times threshold find_clusters(emm, newdata, match_cluster=2, dist=TRUE)
Merge several clusters/states of an EMM into a single cluster/state.
## S4 method for signature 'EMM,character' merge_clusters(x, to_merge, clustering = FALSE, new_center = NULL, copy=TRUE)
## S4 method for signature 'EMM,character' merge_clusters(x, to_merge, clustering = FALSE, new_center = NULL, copy=TRUE)
x |
an |
to_merge |
vector of names of the states/clusters to merge. The
name of the first state in |
clustering |
is |
new_center |
supply new centers for the merged clusters.
New centroids are automatically
computed. If (pseudo) medoids are used, new medoids should be supplied.
If none is supplied, the medoid of the cluster in |
copy |
logical; make a copy of x before reclustering? Otherwise the function will change |
Returns the changed EMM with the states/clusters merged invisibly.
If copy=FALSE
then it returns a reference to the changes
object passed as x
.
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) build(emm, EMMTraffic) states(emm) ## create a new emm with states 1-3 merged emm_m123 <- merge_clusters(emm, c("1", "2", "3")) states(emm_m123)
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) build(emm, EMMTraffic) states(emm) ## create a new emm with states 1-3 merged emm_m123 <- merge_clusters(emm, c("1", "2", "3")) states(emm_m123)
Visualize EMM objects.
## S4 method for signature 'EMM,missing' plot(x, y, method=c("igraph", "interactive", "graph", "MDS", "cluster_counts", "transition_counts"), data = NULL, parameter=NULL, ...)
## S4 method for signature 'EMM,missing' plot(x, y, method=c("igraph", "interactive", "graph", "MDS", "cluster_counts", "transition_counts"), data = NULL, parameter=NULL, ...)
x |
an |
y |
unused (just for compatibility with the generic for plot in graphics) |
method |
see details section. |
data |
Project the state centers onto these data. Points which do not belong to any cluster are shown in blue. |
parameter |
a list of parameters for plotting (see Details section). |
... |
further arguments passed on to |
There are several methods for plotting:
"igraph"
produces a graph representation of the EMM using igraph.
Additional arguments like layout
are passed on to plot for igraph.
"interactive"
produces an interactive graph representation of the EMM (using igraph).
Arguments suitable for plot.igraph
in igraph can be
passed on as ...
.
"graph"
produces a graph representation of the EMM using Rgraphviz.
If Rgraphviz is not installed/available then the method reverts to
"igraph"
.
"MDS"
projects the cluster centers into 2-dimensional space.
"cluster_counts"
produces a barplot for cluster counts.
"transition_counts"
produces a barplot for transition counts.
The following plotting parameters are currently supported (by some of the visualizations):
represent state counts by vertex size?
(default: TRUE
)
represent transition counts/probabilities by arrow width?
(default: TRUE
)
use "counts"
or "probabilities"
for
arrow width. (default: "counts")
Controls the variation of vertex sizes and edge widths (default: 1).
add labels for centers (n/a for type = "graph"
).
cluster labels to use instead of 1,2,....
Use different markers for points depending on the
state they belong to (only available for MDS when
data
is specified).
draw a circle around state centers to indicate the area in which points
are assigned to the cluster (experimental, only available for
MDS when data
is specified).
a vector of state names to be marked and the color(s) used for marking (default: red).
a vector of transition names in the format "3->2" to be marked and the color(s) used for marking (default: red).
For some plots (e.g., "igraph"
) ...
is passed on to the
primitive plotting function and can be used to change the plot (colors, etc.)
See ? igraph.plotting
.
For "graph"
the two special parameters "nAttrs" and "eAttrs"
for node and edge attributes can be used.
data("EMMTraffic") emm <- EMM(threshold = 0.2, measure = "eJaccard", data = EMMTraffic) op <- par(mfrow = c(2, 2), pty = "s") plot(emm, main = "Graph") ## Plot the graph as a tree with a set root node and an aspect ratio of 1:1. g <- as.igraph(emm) plot(emm, main = "Graph (tree layout)", layout = igraph::layout_as_tree(g, root = 1), asp = 1) plot( emm, method = "MDS", main = "Graph (MDS projection)", xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5) ) plot(emm, method = "MDS", data = EMMTraffic, main = "Projection of cluster \ncenters on data") par(op)
data("EMMTraffic") emm <- EMM(threshold = 0.2, measure = "eJaccard", data = EMMTraffic) op <- par(mfrow = c(2, 2), pty = "s") plot(emm, main = "Graph") ## Plot the graph as a tree with a set root node and an aspect ratio of 1:1. g <- as.igraph(emm) plot(emm, main = "Graph (tree layout)", layout = igraph::layout_as_tree(g, root = 1), asp = 1) plot( emm, method = "MDS", main = "Graph (MDS projection)", xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5) ) plot(emm, method = "MDS", data = EMMTraffic, main = "Projection of cluster \ncenters on data") par(op)
Predict a state or the probability distribution over states in
time steps.
## S4 method for signature 'TRACDS' predict(object, current_state = NULL, n=1, probabilities = FALSE, randomized = FALSE, prior=FALSE)
## S4 method for signature 'TRACDS' predict(object, current_state = NULL, n=1, probabilities = FALSE, randomized = FALSE, prior=FALSE)
object |
an |
current_state |
use a specified current state.
If |
n |
number of time steps. |
probabilities |
if |
randomized |
if |
prior |
add one to each transition count. This is equal to starting with a uniform prior for the transition count distribution, i.e. initially all transitions are equally likely. It also prevents the product of probabilities to be zero if a transition was never observed. |
Prediction is done using where
is the transition
probability matrix maintained by the EMM.
Random tie-breaking is used.
The name of the predicted state or a vector with the probability distribution over all states.
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) #plot(emm) ## plot graph ## Predict state starting an state 1 after 1, 2 and 100 time intervals ## Note, state 7 is an absorbing state. predict(emm, n=1, current_state="1") predict(emm, n=2, current_state="1") predict(emm, n=100, current_state="1") ## Get probability distribution predict(emm, n=2, current_state="1", probabilities = TRUE)
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) #plot(emm) ## plot graph ## Predict state starting an state 1 after 1, 2 and 100 time intervals ## Note, state 7 is an absorbing state. predict(emm, n=1, current_state="1") predict(emm, n=2, current_state="1") predict(emm, n=100, current_state="1") ## Get probability distribution predict(emm, n=2, current_state="1", probabilities = TRUE)
Simplifies an EMM and/or the clustering by removing all clusters/states and/or transitions which have a count of equal or smaller than a given threshold.
## S4 method for signature 'EMM' prune(x, count_threshold, clusters = TRUE, transitions = FALSE, copy = TRUE, compact = TRUE) rare_clusters(x, count_threshold, ...) rare_transitions(x, count_threshold, ...)
## S4 method for signature 'EMM' prune(x, count_threshold, clusters = TRUE, transitions = FALSE, copy = TRUE, compact = TRUE) rare_clusters(x, count_threshold, ...) rare_transitions(x, count_threshold, ...)
x |
an object of class |
count_threshold |
all states/edges with a count of less or equal to the threshold are removed from the model. |
clusters |
logical; prune clusters? |
transitions |
logical; prune transitions? |
copy |
logical; make a copy of x before reclustering? Otherwise the function will change |
compact |
logical; tries make the data structure used for the temporal model more compact after pruning. |
... |
further arguments (currently not used). |
prune
returns invisibly an object of class EMM
.
If copy=FALSE
then it returns a reference to the changes
object passed as x
.
rare_clusters
returns a vector of names of rare clusters.
rare_transitions
returns a data.frame of rare transitions.
remove_transitions
,
remove_clusters
,
compact
data("EMMTraffic") ## For the example we use a very high learning rate emm_l <- EMM(threshold=0.2, measure="eJaccard", lambda = 1) build(emm_l, EMMTraffic) ## show state counts and transition counts cluster_counts(emm_l) transition_matrix(emm_l, type="counts") ## rare state/transitions rare_clusters(emm_l, count_threshold=0.1) rare_transitions(emm_l, count_threshold=0.1) ## remove all states with a threshold of 0.1 emm_lr <- prune(emm_l, count_threshold=0.1) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm_l, main = "EMM with high learning rate") plot(emm_lr, main = "Simplified EMM") par(op)
data("EMMTraffic") ## For the example we use a very high learning rate emm_l <- EMM(threshold=0.2, measure="eJaccard", lambda = 1) build(emm_l, EMMTraffic) ## show state counts and transition counts cluster_counts(emm_l) transition_matrix(emm_l, type="counts") ## rare state/transitions rare_clusters(emm_l, count_threshold=0.1) rare_transitions(emm_l, count_threshold=0.1) ## remove all states with a threshold of 0.1 emm_lr <- prune(emm_l, count_threshold=0.1) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm_l, main = "EMM with high learning rate") plot(emm_lr, main = "Simplified EMM") par(op)
Use various clustering methods to recluster states/clusters in an EMM. The centers of the clusters in the EMM object are used as data points by the reclustering algorithm. States/centers put by reclustering into the same cluster are merged to produce a new reclustered EMM.
## S4 method for signature 'EMM' recluster_hclust(x, k=NULL, h=NULL, method="average", ...,prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_kmeans(x, k, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_pam(x, k, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_reachability(x, h, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_tNN(x, threshold=NULL, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_transitions(x, threshold=NULL, ..., prune=NULL, copy=TRUE)
## S4 method for signature 'EMM' recluster_hclust(x, k=NULL, h=NULL, method="average", ...,prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_kmeans(x, k, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_pam(x, k, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_reachability(x, h, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_tNN(x, threshold=NULL, ..., prune=NULL, copy=TRUE) ## S4 method for signature 'EMM' recluster_transitions(x, threshold=NULL, ..., prune=NULL, copy=TRUE)
x |
an |
k |
number of clusters. |
h |
heights where the dendrogram tree should be cut. |
threshold |
threshold used on the dissimilarity to join clusters for tNN. If no threshold is specified then the threshold stored in the EMM is used. |
method |
clustering method used by |
... |
additional arguments passed on to the clustering algorithm. |
prune |
logical; prune states with less
than |
copy |
logical; make a copy of x before reclustering? Otherwise the function will change |
For recluster_kmeans
k
can also be a set of initial cluster
centers (see argument centers
for kmeans
in package stats).
For recluster_hclust
k
or h
can also be a vector.
The result is then a list with several (nested) EMMs, one for each value.
For recluster_reachability
reclusters all clusters which are reachable
from each other. A cluster is reachable from
if
's center is closer to
's center than
h
or if is
reachable by any cluster reachable by
.
For recluster_tNN
reclusters such that two clusters with
centers less than the threshold apart will be reclustered into a
single cluster. This is useful, for example, after combining two models.
For recluster_transitions
does not recluster clusters!
It find groups of clusters which are overlapping (centers are
less than 2 thresholds apart) and then redistributes the transition weights
such that all members of one group are connected to all the members of the
other group using the same weight.
An object of class "EMM"
or, if copy=FALSE
a refernece
to the changed object passed as x
.
Clustering information is available
as the attribute "cluster_info"
.
The information provided depends
in the clustering algorithm (see hclust
, kmeans
and pam
).
merge_clusters
, prune
,
kmeans
, hclust
,
pam
data(EMMsim) emm <- EMM(threshold = .2) build(emm, EMMsim_train) ## do reclustering on a copy of the emm and plot dendrogram emm_hc <- recluster_hclust(emm, h = 0.6) attr(emm_hc, "cluster_info") ## compare original and clustered EMM op <- par(mfrow = c(2, 2), pty = "m") plot(emm, method= "MDS", main ="original EMM", data = EMMsim_train) plot(attr(emm_hc, "cluster_info")$dendrogram) abline(h=0.6, col="red") plot(emm_hc, method="MDS", main ="clustered EMM", data = EMMsim_train) plot(emm_hc, method="MDS", main ="clustered EMM") par(op)
data(EMMsim) emm <- EMM(threshold = .2) build(emm, EMMsim_train) ## do reclustering on a copy of the emm and plot dendrogram emm_hc <- recluster_hclust(emm, h = 0.6) attr(emm_hc, "cluster_info") ## compare original and clustered EMM op <- par(mfrow = c(2, 2), pty = "m") plot(emm, method= "MDS", main ="original EMM", data = EMMsim_train) plot(attr(emm_hc, "cluster_info")$dendrogram) abline(h=0.6, col="red") plot(emm_hc, method="MDS", main ="clustered EMM", data = EMMsim_train) plot(emm_hc, method="MDS", main ="clustered EMM") par(op)
Remove states/clusters or transitions from an EMM.
remove_clusters(x, to_remove, copy = TRUE) remove_transitions(x, from, to,copy = TRUE) remove_selftransitions(x, copy = TRUE)
remove_clusters(x, to_remove, copy = TRUE) remove_transitions(x, from, to,copy = TRUE) remove_selftransitions(x, copy = TRUE)
x |
an |
to_remove |
Names of states/clusters to remove. |
from , to
|
Names of states for removing transitions. If |
copy |
logical; make a copy of x before reclustering? Otherwise the function will change |
remove_selftransitions
removes the transitions from each state to itself.
Returns a EMM with removed states/transitions.
If copy=FALSE
a reference to the object x
with the states/transistions removed is returned.
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) ## remove state 3 emm_rs3 <- remove_clusters(emm, "3") ## remove transition 5->2 emm_rt52 <- remove_transitions(emm, "5", "2") ## compare EMMs op <- par(mfrow = c(2, 2), pty = "m") plot(emm, method = "igraph", main = "original EMM") plot(emm_rs3, method = "igraph", main = "state 3 removed") plot(emm_rt52, method = "igraph", main = "transition 5->2 removed") par(op)
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) ## remove state 3 emm_rs3 <- remove_clusters(emm, "3") ## remove transition 5->2 emm_rt52 <- remove_transitions(emm, "5", "2") ## compare EMMs op <- par(mfrow = c(2, 2), pty = "m") plot(emm, method = "igraph", main = "original EMM") plot(emm_rs3, method = "igraph", main = "state 3 removed") plot(emm_rt52, method = "igraph", main = "transition 5->2 removed") par(op)
Calculates a score of how likely it is that a new sequence was generated by the same process as the sequences used to build the EMM.
## S4 method for signature 'EMM,matrix' score(x, newdata, method = c("product", "log_sum", "sum", "log_odds", "supported_transitions", "supported_states", "sum_transitions", "log_loss", "likelihood", "log_likelihood", "AIC"), match_cluster = "exact", random = FALSE, prior = TRUE, normalize = TRUE, initial_transition = FALSE, threshold = NA) ## S4 method for signature 'EMM,EMM' score(x, newdata, method = c("product", "log_sum", "sum", "supported_transitions"), match_cluster = "exact", random = FALSE, prior = TRUE, initial_transition = FALSE)
## S4 method for signature 'EMM,matrix' score(x, newdata, method = c("product", "log_sum", "sum", "log_odds", "supported_transitions", "supported_states", "sum_transitions", "log_loss", "likelihood", "log_likelihood", "AIC"), match_cluster = "exact", random = FALSE, prior = TRUE, normalize = TRUE, initial_transition = FALSE, threshold = NA) ## S4 method for signature 'EMM,EMM' score(x, newdata, method = c("product", "log_sum", "sum", "supported_transitions"), match_cluster = "exact", random = FALSE, prior = TRUE, initial_transition = FALSE)
x |
an |
newdata |
sequence or another |
method |
method to calculate the score (see details) |
match_cluster |
do the new observations have to fall within
the threshold of the cluster ( |
random |
logical; should the order of newdata be randomized? Can be used to compare the score with the actual score. |
prior |
logical; add one to each transition count. This is equal to start with a count of one for each transition, i.e. initially all transitions are equally likely. It prevents the product of probabilities to be zero if a transition was never observed. |
normalize |
logical; normalize the score by the length of the sequence. |
initial_transition |
logical; include the initial transition in the computation? |
threshold |
minimum count threshold used by supported transitions and supported states. |
The scores for a new sequence of length
can be computed
by the following methods. For
match_cluster="exact"
or "nn"
:
Product of transition probabilities along the path of in the
model. A single missing transition (transition probability of zero)
will result in
a score of 0. Use
prior
to avoid this.
where is the transition probability
between the state representing positions
and
in the sequence.
Average of transition probabilities along the path of in the
model.
Average of the log of the transition probabilities along the path of
in the model. The ranking of the scores is equivalent to
the product of probabilities, however, the calculation is more reliable
since the product of probabilities might become a very small number.
A single missing transition (transition probability of zero)
will result in a score of neg. infinity.
Use prior
to avoid this.
Fraction of transitions in the new sequence supported (present) in the model after assigning each data point in
to a state in
the model.
Fraction of points in the new sequence
for which a state (cluster) exists in the model.
match_cluster
is always "exact"
because for "nn"
this measure would
always give 1. Note that this measure ignores transition information.
If threshold is given, then only states with a count greater than the given threshold are counted as supported.
Sum of the counts on the edges in the model on the path of sequence normalized by the total number of transition counts in the model.
where is the transition count between the state representing positions
and
in the sequence.
If threshold is given, then only transitions with a count greater than the given threshold are counted as supported.
The likelihood of the model given the new data is the unnormalized product score (product of transition probabilities).
The average log loss is defined as
It represents the average compression rate of the new sequence given the model.
Akaike Information Criterion corrected for finite sample size.
where and
is the model complexity measured by the number of
non-zero entries in the transition matrix.
We use the likelihood of the model given by the proportion
of supported transitions. AIC can be used for model selection
where the smallest value indicates the preferred model.
where
represents the
-th data point in the new sequence,
is the transition probability from state
to state
in the model,
is the state the
-th data point (
) in
the new sequence is assigned to.
is an indicator function which is 0 for
and 1 otherwise.
For match_cluster="weighted"
:
Weighted version of the product of probabilities. The weight is the similarity between a new data point and the state in the model it is assigned to.
Weighted version of the sum of probabilities.
Weighted version of the sum of the log of probabilities.
Same as "supported_states"
but instead of counting the
supported states, the similarity
is used as a weight. Threshold is not implemented.
where is a modified and normalized
similarity function given by
where
is the distance measure and
is the threshold that
was used to create the model.
A scalar score value.
transition
to access transition probabilities
and find_clusters
for assigning observations to states/clusters.
data("EMMsim") emm <- EMM(threshold = .2) emm <- build(emm, EMMsim_train) # default is method "product". The score is much higher compared to a randomized order. score(emm, EMMsim_test) score(emm, EMMsim_test, random = TRUE) ### create shuffled data (destroy temporal relationship) ### and create noisy data test_shuffled <- EMMsim_test[sample(1:nrow(EMMsim_test)), ] test_noise <- jitter(EMMsim_test, amount = .3) ### helper for plotting mybars <- function(...) { oldpar <- par(mar = c(5, 10, 4, 2)) ss <- rbind(...) barplot( ss[, ncol(ss):1], xlim = c(-1, 4), beside = TRUE, horiz = TRUE, las = 2, legend = rownames(ss) ) par(oldpar) } ### compare various scores methods <- c( "product", "sum", "log_sum", "supported_states", "supported_transitions", "sum_transitions", "log_loss", "likelihood" ) ### default is exact matching clean <- sapply( methods, FUN = function(m) score(emm, EMMsim_test, method = m) ) shuffled <- sapply( methods, FUN = function(m) score(emm, test_shuffled, method = m) ) noise <- sapply( methods, FUN = function(m) score(emm, test_noise, method = m) ) mybars(shuffled, noise, clean) ### weighted matching is better for noisy data clean <- sapply( methods, FUN = function(m) score(emm, EMMsim_test, method = m, match = "weighted") ) shuffled <- sapply( methods, FUN = function(m) score(emm, test_shuffled, method = m, match = "weighted") ) noise <- sapply( methods, FUN = function(m) score(emm, test_noise, method = m, match = "weighted") ) mybars(shuffled, noise, clean)
data("EMMsim") emm <- EMM(threshold = .2) emm <- build(emm, EMMsim_train) # default is method "product". The score is much higher compared to a randomized order. score(emm, EMMsim_test) score(emm, EMMsim_test, random = TRUE) ### create shuffled data (destroy temporal relationship) ### and create noisy data test_shuffled <- EMMsim_test[sample(1:nrow(EMMsim_test)), ] test_noise <- jitter(EMMsim_test, amount = .3) ### helper for plotting mybars <- function(...) { oldpar <- par(mar = c(5, 10, 4, 2)) ss <- rbind(...) barplot( ss[, ncol(ss):1], xlim = c(-1, 4), beside = TRUE, horiz = TRUE, las = 2, legend = rownames(ss) ) par(oldpar) } ### compare various scores methods <- c( "product", "sum", "log_sum", "supported_states", "supported_transitions", "sum_transitions", "log_loss", "likelihood" ) ### default is exact matching clean <- sapply( methods, FUN = function(m) score(emm, EMMsim_test, method = m) ) shuffled <- sapply( methods, FUN = function(m) score(emm, test_shuffled, method = m) ) noise <- sapply( methods, FUN = function(m) score(emm, test_noise, method = m) ) mybars(shuffled, noise, clean) ### weighted matching is better for noisy data clean <- sapply( methods, FUN = function(m) score(emm, EMMsim_test, method = m, match = "weighted") ) shuffled <- sapply( methods, FUN = function(m) score(emm, test_shuffled, method = m, match = "weighted") ) noise <- sapply( methods, FUN = function(m) score(emm, test_noise, method = m, match = "weighted") ) mybars(shuffled, noise, clean)
Each state/cluster gets the average count if all the outgoing transitions of its neighbors (i.e., clusters which are within range x its threshold).
## S4 method for signature 'EMM' smooth_transitions(x, range = 2, copy = TRUE)
## S4 method for signature 'EMM' smooth_transitions(x, range = 2, copy = TRUE)
x |
an object of class |
range |
threshold multiplier for the smoothing range. |
copy |
logical; make a copy of x before reclustering? Otherwise the function will change |
smooth_transitions
returns invisibly an object of class EMM
.
If copy=FALSE
then it returns a reference to the changes
object passed as x
.
data("EMMTraffic") ## learn a model emm <- EMM(threshold=0.2, measure="eJaccard") build(emm, EMMTraffic) ## smooth the model by adding tansitions emm_s <- smooth_transitions(emm) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm, method="MDS", main="Original") plot(emm_s, method="MDS", main="Smoothed") par(op)
data("EMMTraffic") ## learn a model emm <- EMM(threshold=0.2, measure="eJaccard") build(emm, EMMTraffic) ## smooth the model by adding tansitions emm_s <- smooth_transitions(emm) ## compare graphs op <- par(mfrow = c(1, 2), pty = "m") plot(emm, method="MDS", main="Original") plot(emm_s, method="MDS", main="Smoothed") par(op)
This function creates a synthetic data stream
with data points in roughly by choosing
points form k clusters following a sequence
through these clusters. Each cluster has a density function following a
d-dimensional normal distributions. In the test set outliers are introduced.
synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0, n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))
synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0, n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))
k |
number of clusters. |
d |
dimensionality of data set. |
n_subseq |
length of subsequence which will be repeat to create the data set. |
p_transition |
probability that the next position in the subsequence will belong to a different cluster. |
p_swap |
probability that two data points are swapped. This represents measurement errors (e.g., a data points arrive out of order) or that the data stream does not exactly follow the subsequence. |
n_train |
size of training set (without outliers). |
n_test |
size of test set (with outliers). |
p_outlier |
probability that a data point is replaced by an outlier
(a randomly chosen point in |
rangeVar |
Used to create the random covariance matrices for the
clusters. See |
The data generation process creates a data set consisting of k
clusters in
roughly . The data points for each cluster are be drawn from a
multivariate normal distribution given a random mean and a random
variance/covariance matrix for each cluster. The temporal aspect is modeled by
a fixed subsequence (of length
n_subseq
) through the k
clusters. In each step in the subsequence we
have a transition probability p_transition
that the next data point
is in the same
cluster or in a randomly chosen other cluster, thus we can create slowly or
fast changing data. For the complete sequence, the subsequence is repeated
to create n_test
/n_train
data points.
The data set is generated by drawing a data point from
the cluster corresponding to each position in the sequence. Outliers are
introduced by replacing data points in the data set with probability
$p_outlier
by
randomly chosen data points in .
Finally, to introduce imperfection
in the temporal sequence (e.g., because the data does not follow exactly a
repeating sequence or because observations do not arrive in the correct order),
we swap two consecutive observations with probability
p_swap
.
A list with the following elements:
test |
test data. |
train |
training data. |
sequence_test |
sequence of the test data points through the clusters. |
sequence_train |
sequence of the training data points through the clusters. |
swap_test |
index where points are swapped. |
swap_train |
index where points are swapped. |
outlier_position |
logical vector for outliers in test data. |
model |
centers and covariance matrices for the clusters. |
## create only test data (with outliers) ds <- synthetic_stream(n_train = 0) ## plot test data plot(ds$test, pch = ds$sequence_test, col = "gray") text(ds$model$mu[, 1], ds$model$mu[, 2], 1:10) ## mark outliers points(ds$test[ds$outlier_position, ], pch = 3, lwd = 2, col = "red")
## create only test data (with outliers) ds <- synthetic_stream(n_train = 0) ## plot test data plot(ds$test, pch = ds$sequence_test, col = "gray") text(ds$model$mu[, 1], ds$model$mu[, 2], 1:10) ## mark outliers points(ds$test[ds$outlier_position, ], pch = 3, lwd = 2, col = "red")
Implements the threshold Nearest Neighbor clustering algorithm used by EMM.
Objects can be created with new()
or by the creator function
tNN
.
measure
:Object of class "character"
containing
the name of the dissimilarity measure used
(see dist
in proxy for available measures)
centroids
:Object of class "logical"
indicating
if centroids are used for clusters. If FALSE
,
pseudo medians (first observation of a cluster) are used
to represent a cluster.
threshold
:Object of class "numeric"
with the dissimilarity threshold used
by the NN clustering algorithm for assigning a new
observation to existing clusters.
lambda
:Object of class "numeric"
specifying the
rate for fading.
lambda_factor
:Object of class "numeric"
expressing
the fading rate expressed as a factor.
tnn_d
:An environment containing the variable data for the tNN object:
centers
:Object of class "matrix"
containing
the cluster centers.
counts
:Object of class "numeric"
with the
number of observations assigned to each cluster.
var_thresholds
:Object of class "numeric"
with the
dissimilarity thresholds for individual clusters (usually
the same as threshold).
last
:A "character"
vector containing the
cluster names the points for the previous call of
cluster()
were assigned to.
signature(x = "tNN")
: Make a copy of the tNN object.
Making explicit copies is necessary since
information is stored in an environment which is not copied
for regular assignements.
signature(x = "tNN")
: returns the cluster
centers as a matrix.
signature(x = "tNN")
: returns the cluster
counts as a vector.
signature(x = "tNN")
: returns the names of the
clusters.
signature(x = "tNN")
: returns the
indices of the clusters the data points in the last cluster
operation where assigned to. To save memory the last clustering
information can be removed by setting the formal parameter
remove
to TRUE
.
signature(x = "tNN")
: returns the number of clusters
in the clustering.
signature(x = "tNN", y = "missing")
: plots the cluster
centers using a scatterplot matrix (see pairs
).
M.H. Dunham, Y. Meng, J. Huang (2004): Extensible Markov Model, In: ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 371–374.
cluster
for adding new data to the clustering.
find_clusters
to find the nearest neighbor cluster
for given data points.
EMM
extends "tNN".
Create an Markov model from a regular clustering (k-means or PAM) of sequence data.
TRAC(x, data = NULL, centers = NULL, measure = "euclidean")
TRAC(x, data = NULL, centers = NULL, measure = "euclidean")
x |
a clustering object (result of kmeans or PAM), a data set (a data matrix), or a vector with (integer) cluster assignments. |
data |
the data used for clustering (only used if |
centers |
if |
measure |
used distance measure. |
The order is inferred from the order in the original data set.
A EMM
object representing the clustering of sequence data.
data("EMMsim") ## using kmeans cl <- kmeans(EMMsim_train, 10) emm <- TRAC(cl) emm plot(emm, method = "MDS") ## using a cluster assignment vector (taken from the k-means clustering above) x <- cl$cluster emm <- TRAC(x, data = EMMsim_train) emm plot(emm, method = "MDS")
data("EMMsim") ## using kmeans cl <- kmeans(EMMsim_train, 10) emm <- TRAC(cl) emm plot(emm, method = "MDS") ## using a cluster assignment vector (taken from the k-means clustering above) x <- cl$cluster emm <- TRAC(x, data = EMMsim_train) emm plot(emm, method = "MDS")
Representation of the temporal structure of a data stream clustering using a extensible Markov model.
Objects can be created using the creator function TRACDS
or by
directly calling new("TRACDS", ...)
. Most slots for the extended
classes can be used as parameters.
lambda
:Object of class "numeric"
specifying the
rate for fading.
lambda_factor
:Object of class "numeric"
expressing
the fading rate expressed as a factor.
tracds_d
:An environment containing all the variable data of the TRACDS object:
mm
:Object of class "SimpleMC"
representing the
first order Markov model of the EMM.
current_state
:Object of class "character"
with the
name of current state in the EMM. NA
means
no current state.
signature(x = "TRACDS")
: Make a copy of the TRACDS object.
Making explicit copies is necessary since
information is stored in an environment which is not copied for regular
assignements.
signature(x = "TRACDS")
: returns the name of
the current state.
signature(x = "TRACDS")
: returns the number of states.
signature(x = "TRACDS")
: returns the number of transitions with a count larger than 0 stored in the object.
signature(x = "TRACDS", y = "missing")
: Plots the
object as a directed graph.
signature(x = "TRACDS")
: returns the names of the
states.
signature(x = "TRACDS")
: returns all transitions as a matrix of state names with a from and a to column.
A TRACDS object can be coerced to igraph or graph objects using
as.igraph
() and as.graph()
.
Michael Hahsler and Margaret H. Dunham. Temporal structure learning for clustering massive data streams in real-time. In SIAM Conference on Data Mining (SDM11), pages 664–675. SIAM, April 2011. doi:10.1137/1.9781611972818.57
M. Hahsler, M. H. Dunham (2010): rEMM: Extensible Markov Model for Data Stream Clustering in R, Journal of Statistical Software, 35(5), 1-31, URL doi:10.18637/jss.v035.i05
M.H. Dunham, Y. Meng, J. Huang (2004): Extensible Markov Model, In: ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 371–374.
Look at
transition
,
transition_matrix
and
initial_transition
to access the transition information in
the EMM.
predict
is used to predict future states of an EMM.
EMM
extends "TRACDS"
.
Calculates individual transition probabilities/counts
or a complete transition matrix
for an EMM (which contains "TRACDS"
).
## S4 method for signature 'TRACDS,character,character' transition(x, from, to, type = c("probability", "counts", "log_odds"), prior = TRUE) ## S4 method for signature 'TRACDS' transition_matrix(x, type = c("probability", "counts", "log_odds"), prior = TRUE) ## S4 method for signature 'TRACDS' initial_transition(x, type = c("probability", "counts", "log_odds"), prior = TRUE)
## S4 method for signature 'TRACDS,character,character' transition(x, from, to, type = c("probability", "counts", "log_odds"), prior = TRUE) ## S4 method for signature 'TRACDS' transition_matrix(x, type = c("probability", "counts", "log_odds"), prior = TRUE) ## S4 method for signature 'TRACDS' initial_transition(x, type = c("probability", "counts", "log_odds"), prior = TRUE)
x |
an object of class |
from , to
|
Names a states. If |
type |
What should be calculated? |
prior |
add one to each transition count. This is equal to starting with a uniform prior for the transition count distribution, i.e., initially all transitions are equally likely. |
Log odds are calculated as where
is the probability
of the transition and
is the number of states in the EMM.
is
the probability of a transition under the null model which assumes that the
transition probability from each state to each other state (including staying
in the same state) is the same, i.e., the null model has a transition matrix
with all entries equal to
.
A scalar (for transition
), a square matrix
(for transition_matrix
) or a vector (for initial_transition
).
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) ## get transition matrix transition_matrix(emm, type="count", prior=FALSE) transition_matrix(emm, type="count") transition_matrix(emm, prior=FALSE) transition_matrix(emm) ## get initial state probabilities initial_transition(emm) ## access individual transition probability (state 1 -> 2) transition(emm, "1","2") ## get counts for all existing transitions tr <- transitions(emm) tr cbind(as.data.frame(tr), counts=transition(emm, tr, type="counts"))
data("EMMTraffic") emm <- EMM(measure="eJaccard", threshold=0.2) emm <- build(emm, EMMTraffic) ## get transition matrix transition_matrix(emm, type="count", prior=FALSE) transition_matrix(emm, type="count") transition_matrix(emm, prior=FALSE) transition_matrix(emm) ## get initial state probabilities initial_transition(emm) ## access individual transition probability (state 1 -> 2) transition(emm, "1","2") ## get counts for all existing transitions tr <- transitions(emm) tr cbind(as.data.frame(tr), counts=transition(emm, tr, type="counts"))
Finds the state sequence of a new sequence in an EMM and returns a table with the transition probabilities or counts.
## S4 method for signature 'EMM,matrix' transition_table(x, newdata, type = c("probability", "counts", "log_odds"), match_cluster = "exact", prior=TRUE, initial_transition = FALSE)
## S4 method for signature 'EMM,matrix' transition_table(x, newdata, type = c("probability", "counts", "log_odds"), match_cluster = "exact", prior=TRUE, initial_transition = FALSE)
x |
an |
newdata |
new sequence, |
type |
the measure to return. |
match_cluster |
do the new observations have to fall within
the threshold of the cluster ( |
prior |
add one to each transition count. This is equal to starting with a uniform prior for the transition count distribution, i.e. initially all transitions are equally likely. It also prevents the product of probabilities to be zero if a transition was never observed. |
initial_transition |
include the initial transition in the table? |
A data.frame with three columns (from state, to state and the transition probability/count.)
transition
to access transition probabilities
and find_clusters
for assigning observations to states/clusters.
data("EMMsim") emm <- EMM(threshold=.5) emm <- build(emm, EMMsim_train) head(transition_table(emm, EMMsim_test)) head(transition_table(emm, EMMsim_test, type ="prob", initial_transition=TRUE))
data("EMMsim") emm <- EMM(threshold=.5) emm <- build(emm, EMMsim_train) head(transition_table(emm, EMMsim_test)) head(transition_table(emm, EMMsim_test, type ="prob", initial_transition=TRUE))
Add a sequence of new state assignments to a TRACDS object.
## S4 method for signature 'TRACDS' update(object, newdata, verbose=FALSE, ...) reset(x) compact(x)
## S4 method for signature 'TRACDS' update(object, newdata, verbose=FALSE, ...) reset(x) compact(x)
x , object
|
a |
newdata |
a vector with a state assignemnt sequence (typically produced by clustering). |
verbose |
logical; verbose output? |
... |
further arguments. |
update()
adds a new state assignemnt sequenc to the TRACDS object by increasing the
transition counts and, if needed, creating new states.
reset()
resets the current state to NA
for reading in a
new sequence. An NA
in newdata
also resets the current state.
compact()
reduces the size (memory) used to store the temporal
transition matrix.
A reference to the changed TRACDS object with the data added.
Note: EMM objects store all variable data in an environment which
enables us to update partial data without copying the whole object. Assignment
will not create a copy! Use the provided method copy()
.
## create an empty TRACDS object tracds <- TRACDS() tracds ## update with an cluster assignment sequence update(tracds, c(1,2,5,5,2)) tracds plot(tracds)
## create an empty TRACDS object tracds <- TRACDS() tracds ## update with an cluster assignment sequence update(tracds, c(1,2,5,5,2)) tracds plot(tracds)