Data streams are often processed in a distributed manner using multiple machines or multiple processes. For example, a data stream may be produced by a sensor attached to a remote machine, or multiple clustering algorithms may run in parallel in several R processes. Another application is to connect to other software components in a stream mining pipeline.
First, we show how socket connections together with the package stream can be used to connect multiple processes or machines.
Then we give examples of how the package streamConnect makes connecting stream mining components more convenient by providing an interface to connect stream processing using sockets or web services. While sockets are only used to connect data stream generating processes, web services are more versatile and can also be used to create data stream clustering processes as a service.
The final section of this paper shows how to deploy the server/web service.
The function write_stream() and the class DSD_ReadStream provided in package stream can be used to communicate via connections (files, sockets, URLs, etc.). In the first example, we manually set up the connection. The example is useful for understanding how sockets work, especially for users interested in implementing their own components in other programming languages or connecting with other data stream software.
A more convenient way to do this using package streamConnect is described later in this paper.
First, we find an available port.
## [1] 22741
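The call that produced this port number is not shown above. One possible way to pick a free port is sketched below in base R: it tries to open a listening socket on random candidate ports and returns the first one that works. The helper find_free_port() and its candidate range are illustrative assumptions and not part of stream or streamConnect; serverSocket() requires R 4.1 or later.

```r
# Hypothetical helper: return the first candidate port on which a
# listening socket can be opened (the socket is closed again right away).
find_free_port <- function(candidates = sample(10000:40000, 20)) {
    for (p in candidates) {
        srv <- tryCatch(serverSocket(p), error = function(e) NULL)
        if (!is.null(srv)) {
            close(srv)
            return(p)
        }
    }
    stop("no free port found among the candidates")
}

port <- find_free_port()
port
```

Another process could still grab the port between this check and its later use, so the result is a good guess rather than a guarantee.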
The server serves data from a data stream. We use the package callr to create a separate R process that serves a data stream, creating 10 points every second over a socket connection. Alternatively, you can put the code of the function passed to r_bg() in a file called server.R and run it (potentially on a different machine) with R CMD BATCH server.R from the command line.
##
## Attaching package: 'callr'
## The following object is masked from 'package:rmarkdown':
##
## run
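rp1 <- r_bg(function(port) {
    library(stream)
    # stream of 3-dimensional points drawn from 3 Gaussian clusters
    stream <- DSD_Gaussians(k = 3, d = 3)
    blocksize <- 10
    # create a server socket on the given port
    con <- socketConnection(port = port, server = TRUE)
    # write one block of points per second and keep the connection open
    while (TRUE) {
        write_stream(stream, con, n = blocksize, close = FALSE)
        Sys.sleep(1)
    }
    close(con)  # not reached: the loop runs until the process is stopped
},
    args = list(port = port))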
rp1
## PROCESS 'R', running, pid 4239.
The client consumes the data stream. We open the connection, which starts the data generating process. Note that streamConnect is not needed here; for convenience, we only use its helper retry() to make sure the connection to the server is established.
## A connection with
## description "->localhost:22741"
## class "sockconn"
## mode "r"
## text "text"
## opened "opened"
## can read "yes"
## can write "yes"
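The client code that opened this connection is not shown above. The idea of retrying until the server is ready can be sketched in base R as follows. The helper retry_call() is a hypothetical stand-in written for this illustration; it is not the retry() function from streamConnect, whose exact arguments are not described here.

```r
# Hypothetical helper: call f() until it succeeds or the attempts run out.
retry_call <- function(f, times = 5, wait = 0.5) {
    for (i in seq_len(times)) {
        res <- tryCatch(f(), error = function(e) e)
        if (!inherits(res, "error")) return(res)
        Sys.sleep(wait)  # give the server time to start listening
    }
    stop("all ", times, " attempts failed: ", conditionMessage(res))
}

# Demo with a function that fails twice before it succeeds.
attempts <- 0
flaky <- function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("server not ready")
    "connected"
}
result <- retry_call(flaky, times = 5, wait = 0)

# With a running server, the same helper could wrap the socket call:
# con <- retry_call(function() socketConnection("localhost", port = port, open = "r"))
```

Retrying is needed because the server process is started in the background and may not be listening yet when the client first tries to connect.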
We poll all available data (n = -1) several times. The first request should yield 10 points, the second none, and the third request, made after waiting 2 seconds, should yield 20 points.
## V1 V2 V3
## 1 0.5274059 0.3134461 0.9408616
## 2 0.5453364 0.3035071 0.9553197
## 3 0.5525617 0.2996399 0.9647634
## 4 0.5472777 0.3306940 0.9472553
## 5 0.3844658 0.5000385 0.7393791
## 6 0.5903006 0.2760730 0.9768200
## 7 0.5756423 0.2784506 0.9690516
## 8 0.8785989 0.4174093 0.3770402
## 9 0.5809484 0.2741703 0.9993021
## 10 0.4256463 0.3883936 0.6968576
## [1] V1 V2 V3
## <0 rows> (or 0-length row.names)
## V1 V2 V3
## 1 0.3790218 0.4196092 0.6743943
## 2 0.5449854 0.3264309 0.9849376
## 3 0.8524743 0.4178367 0.3446821
## 4 0.9025799 0.4679661 0.3943303
## 5 0.5522955 0.3254771 0.9675878
## 6 0.8544014 0.4285364 0.3528075
## 7 0.3912812 0.4470927 0.7083757
## 8 0.3667402 0.4630571 0.6955140
## 9 0.4760971 0.3595637 0.9052460
## 10 0.5481738 0.3081793 0.9300577
## 11 0.5720659 0.2850204 0.9974920
## 12 0.4535362 0.3834308 0.7222305
## 13 0.5470396 0.3285149 0.9927372
## 14 0.4184458 0.4156664 0.7127637
## 15 0.4113970 0.4691194 0.7449585
## 16 0.5391118 0.3017884 0.9534165
## 17 0.3931801 0.4033715 0.6778493
## 18 0.3742362 0.4635003 0.7036466
## 19 0.3794860 0.4948628 0.7310798
## 20 0.5379319 0.3351038 0.9355852
streamConnect provides a more convenient way to set up a connection using sockets. publish_DSD_via_Socket() creates a socket broadcasting the data stream and DSD_ReadSocket() creates a DSD object reading from that socket.
We will use an available port.
## [1] 21270
We create a DSD process sending data to the port.
library(streamConnect)
rp1 <- DSD_Gaussians(k = 3, d = 3) %>% publish_DSD_via_Socket(port = port)
rp1
## PROCESS 'R', running, pid 4294.
Next, we create a DSD that connects to the socket. DSD_ReadSocket() already retries the connection internally.
library(streamConnect)
dsd <- DSD_ReadSocket(port = port, col.names = c("x", "y", "z", ".class"))
dsd
## Data Stream from Connection (d = 3, k = NA)
## Class: DSD_ReadStream, DSD_R, DSD
## connection: ->localhost:21270 (opened)
## x y z .class
## 1 0.19907712 0.2138432 0.8207215 1
## 2 0.05321204 0.9762836 0.8730718 3
## 3 0.37642041 0.2800791 0.4610397 2
## 4 0.01385207 0.9103306 0.8555305 3
## 5 0.15676114 0.1519955 0.8148040 1
## 6 0.37506178 0.2747544 0.5235124 2
## 7 0.41135538 0.2742436 0.5044203 2
## 8 0.11216269 0.9298172 0.9651101 3
## 9 0.04816198 0.9525125 0.8444650 3
## 10 0.35266452 0.3379103 0.4634215 2
Web services are more versatile: they can be used to deploy data stream sources using publish_DSD_via_WebService()/DSD_ReadWebService or data stream tasks using publish_DSC_via_WebService()/DSC_WebService.
Here we only show how to deploy a clusterer, but a DSD can be published in a similar manner. Larger workflows can be created using DST_Runner from stream.
streamConnect uses the package plumber to manage web services. The data is transmitted in serialized form. The default serialization format is csv (comma separated values). Other formats are json and rds (see plumber::serializer_csv).
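To make the idea of serialized transmission concrete, the following base R sketch converts a small block of points to CSV text and to R's binary serialization and back. It only illustrates the two kinds of format; it is not how streamConnect or plumber implement their serializers, and the example data frame is made up.

```r
# A small block of points, as a stream might deliver it (made-up values).
points <- data.frame(x = c(0.25, 0.5, 0.75), y = c(0.125, 0.375, 0.625))

# csv: text form with a header line followed by one line per point
csv_lines <- capture.output(write.csv(points, row.names = FALSE))
from_csv <- read.csv(text = csv_lines)

# binary R serialization (the mechanism behind rds), kept in memory
raw_bytes <- serialize(points, connection = NULL)
from_rds <- unserialize(raw_bytes)

# both round trips should give back the original points
all.equal(points, from_csv)
all.equal(points, from_rds)
```

CSV is readable by components written in any language, while the rds form is specific to R.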
We will use an available port.
## [1] 17871
We create a clustering web service process listening for data on the port.
## PROCESS 'R', running, pid 4349.
Connect to the web service with a local DSC interface.
library(streamConnect)
dsc <- DSC_WebService(paste0("http://localhost", ":", port),
    verbose = TRUE, config = httr::verbose(info = TRUE))
## Connecting to DSC Web service at http://localhost:17871
## Success
## Web Service Data Stream Clusterer: DBSTREAM
## Served from: http://localhost:17871
## Class: DSC_WebService, DSC_R, DSC
## Number of micro-clusters: 0
## Number of macro-clusters: 0
Note that the verbose output can help with debugging connection issues.
Cluster some data.
## Web Service Data Stream Clusterer: DBSTREAM
## Served from: http://localhost:17871
## Class: DSC_WebService, DSC_R, DSC
## Number of micro-clusters: 21
## Number of macro-clusters: 3
## # A tibble: 21 × 2
## X1 X2
## <dbl> <dbl>
## 1 0.398 0.0570
## 2 0.691 0.128
## 3 0.470 0.736
## 4 0.416 0.756
## 5 0.771 0.0620
## 6 0.363 0.0875
## 7 0.377 0.809
## 8 0.502 0.768
## 9 0.741 0.137
## 10 0.532 0.681
## # ℹ 11 more rows
## [1] 55.828187 36.737336 64.323546 75.273460 52.998762 23.259867 35.202017
## [8] 35.032084 63.145635 15.645327 37.237636 72.602291 23.466981 58.094944
## [15] 40.887977 22.320892 5.292756 3.752464 13.139182 10.203561 6.778256
Web services and the socket-based server can be easily deployed to any server or cloud system, including containers. Make sure R, the package streamConnect, and all dependencies are installed.
Create a short R script to start the server/service and deploy it.
library(streamConnect)
port <- 8001
publish_DSC_via_WebService("DSC_DBSTREAM(r = .05)", port = port,
    background = FALSE)
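In container or cloud deployments, the port is often supplied by the environment rather than hard-coded. A minimal sketch of such a script preamble is shown below; the environment variable name STREAM_PORT and the helper get_port() are assumptions made for this illustration and are not defined by streamConnect.

```r
# Hypothetical helper: read the port from the environment variable
# STREAM_PORT and fall back to a default if it is not set.
get_port <- function(default = 8001L) {
    value <- Sys.getenv("STREAM_PORT", unset = NA)
    if (is.na(value) || !nzchar(value)) return(default)
    port <- suppressWarnings(as.integer(value))
    if (is.na(port) || port < 1L || port > 65535L)
        stop("STREAM_PORT is not a valid port number: ", value)
    port
}

port <- get_port()

# The service would then be started with the call from the script above:
# publish_DSC_via_WebService("DSC_DBSTREAM(r = .05)", port = port,
#     background = FALSE)
```

Failing early on an invalid value makes configuration mistakes visible when the container starts.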
Web services can also be deployed using a plumber task file. The following call does not create a server, but returns the name of the task file.
Open the file in RStudio to deploy it or read the plumber Hosting vignette.