Density-based Spatial Clustering of Applications with Noise (DBSCAN)
Description
Fast reimplementation of the DBSCAN (Density-based spatial clustering of
applications with noise) clustering algorithm using a kd-tree.
Usage
dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...)
is.corepoint(x, eps, minPts = 5, ...)
predict(object, newdata, data, ...)
Arguments
x |
a data matrix, a data.frame, a dist object or a frNN object with
fixed-radius nearest neighbors.
|
eps |
size (radius) of the epsilon neighborhood. Can be omitted if
x is a frNN object.
|
minPts |
number of minimum points required in the eps neighborhood for
core points (including the point itself).
|
weights |
numeric; weights for the data points. Only needed to perform
weighted clustering.
|
borderPoints |
logical; should border points be assigned to clusters.
The default is TRUE for regular DBSCAN. If FALSE then border
points are considered noise (see DBSCAN* in Campello et al, 2013).
|
... |
additional arguments are passed on to the fixed-radius nearest
neighbor search algorithm. See frNN() for details on how to
control the search strategy.
|
object |
clustering object.
|
newdata |
new data points for which the cluster membership should be
predicted.
|
data |
the data set used to create the clustering object.
|
Details
The
implementation is significantly faster and can work with larger data sets
than fpc::dbscan()
in fpc. Use dbscan::dbscan()
(with specifying the package) to
call this implementation when you also load package fpc.
The algorithm
This implementation of DBSCAN follows the original
algorithm as described by Ester et al (1996). DBSCAN performs the following steps:
-
Estimate the density
around each data point by counting the number of points in a user-specified
eps-neighborhood and applies a used-specified minPts thresholds to identify
-
core points (points with more than minPts points in their neighborhood),
-
border points (non-core points with a core point in their neighborhood) and
-
noise points (all other points).
-
Core points form the backbone of clusters by joining them into
a cluster if they are density-reachable from each other (i.e., there is a chain of core
points where one falls inside the eps-neighborhood of the next).
-
Border points are assigned to clusters. The algorithm needs parameters
eps
(the radius of the epsilon neighborhood) and minPts
(the
density threshold).
Border points are arbitrarily assigned to clusters in the original
algorithm. DBSCAN* (see Campello et al 2013) treats all border points as
noise points. This is implemented with borderPoints = FALSE
.
Specifying the data
If x
is a matrix or a data.frame, then fast fixed-radius nearest
neighbor computation using a kd-tree is performed using Euclidean distance.
See frNN()
for more information on the parameters related to
nearest neighbor search. Note that only numerical values are allowed in x
.
Any precomputed distance matrix (dist object) can be specified as x
.
You may run into memory issues since distance matrices are large.
A precomputed frNN object can be supplied as x
. In this case
eps
does not need to be specified. This option us useful for large
data sets, where a sparse distance matrix is available. See
frNN()
how to create frNN objects.
Setting parameters for DBSCAN
The parameters minPts
and eps
define the minimum density required
in the area around core points which form the backbone of clusters.
minPts
is the number of points
required in the neighborhood around the point defined by the parameter eps
(i.e., the radius around the point). Both parameters
depend on each other and changing one typically requires changing
the other one as well. The parameters also depend on the size of the data set with
larger datasets requiring a larger minPts
or a smaller eps
.
-
minPts:
The original
DBSCAN paper (Ester et al, 1996) suggests to start by setting minPts≥d+1
,
the data dimensionality plus one or higher with a minimum of 3. Larger values
are preferable since increasing the parameter suppresses more noise in the data
by requiring more points to form clusters.
Sander et al (1998) uses in the examples two times the data dimensionality.
Note that setting minPts≤2
is equivalent to hierarchical clustering
with the single link metric and the dendrogram cut at height eps
.
-
eps:
A suitable neighborhood size
parameter eps
given a fixed value for minPts
can be found
visually by inspecting the kNNdistplot()
of the data using
k=minPts−1
(minPts
includes the point itself, while the
k-nearest neighbors distance does not). The k-nearest neighbor distance plot
sorts all data points by their k-nearest neighbor distance. A sudden
increase of the kNN distance (a knee) indicates that the points to the right
are most likely outliers. Choose eps
for DBSCAN where the knee is.
Predict cluster memberships
predict()
can be used to predict cluster memberships for new data
points. A point is considered a member of a cluster if it is within the eps
neighborhood of a core point of the cluster. Points
which cannot be assigned to a cluster will be reported as
noise points (i.e., cluster ID 0).
Important note: predict()
currently can only use Euclidean distance to determine
the neighborhood of core points. If dbscan()
was called using distances other than Euclidean,
then the neighborhood calculation will not be correct and only approximated by Euclidean
distances. If the data contain factor columns (e.g., using Gower's distance), then
the factors in data
and query
first need to be converted to numeric to use the
Euclidean approximation.
Value
dbscan()
returns an object of class dbscan_fast
with the following components:
eps |
value of the eps parameter.
|
minPts |
value of the minPts parameter.
|
cluster |
A integer vector with cluster assignments. Zero indicates noise points.
|
is.corepoint()
returns a logical vector indicating for each data point if it is a
core point.
Author(s)
Michael Hahsler
References
Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast
Density-Based Clustering with R. Journal of Statistical Software,
91(1), 1-30.
doi:10.18637/jss.v091.i01
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A
Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise. Institute for Computer Science, University of Munich.
Proceedings of 2nd International Conference on Knowledge Discovery and
Data Mining (KDD-96), 226-231.
https://dl.acm.org/doi/10.5555/3001460.3001507
Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based
Clustering Based on Hierarchical Density Estimates. Proceedings of the
17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD
2013, Lecture Notes in Computer Science 7819, p. 160.
doi:10.1007/978-3-642-37456-2_14
Sander, J., Ester, M., Kriegel, HP. et al. (1998). Density-Based
Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications.
Data Mining and Knowledge Discovery 2, 169-194.
doi:10.1023/A:1009745219419
See Also
Other clustering functions:
extractFOSC()
,
hdbscan()
,
jpclust()
,
optics()
,
sNNclust()
Examples
data(iris)
iris <- as.matrix(iris[, 1:4])
kNNdistplot(iris, minPts = 5)
abline(h=.7, col = "red", lty = 2)
res <- dbscan(iris, eps = .7, minPts = 5)
res
pairs(iris, col = res$cluster + 1L)
fr <- frNN(iris, eps = .7)
dbscan(fr, minPts = 5)
set.seed(665544)
n <- 100
x <- cbind(
x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)
res <- dbscan(x, eps = .3, minPts = 3)
res
plot(x, col = res$cluster)
points(x[res$cluster == 0, ], pch = 3, col = "grey")
hullplot(x, res)
newdata <- x[1:5,] + rnorm(10, 0, .3)
hullplot(x, res)
points(newdata, pch = 3 , col = "red", lwd = 3)
text(newdata, pos = 1)
pred_label <- predict(res, newdata, data = x)
pred_label
points(newdata, col = pred_label + 1L, cex = 2, lwd = 2)
if (requireNamespace("fpc", quietly = TRUE) &&
requireNamespace("microbenchmark", quietly = TRUE)) {
t_dbscan <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3), times = 10, unit = "ms")
t_dbscan_linear <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms")
t_dbscan_dist <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms")
t_fpc <- microbenchmark::microbenchmark(
fpc::dbscan(x, .3, 3), times = 10, unit = "ms")
r <- rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan)
r
boxplot(r,
names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'),
main = "Runtime comparison in ms")
median(t_fpc$time) / median(t_dbscan$time)
}
nn <- structure(list(ids = list(c(2,3), c(1,3), c(1,2,3), c(3,5), c(4,5)), eps = 1),
class = c("NN", "frNN"))
nn
dbscan(nn, minPts = 2)