Title: | Mining Association Rules and Frequent Itemsets |
---|---|
Description: | Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides C implementations of the association mining algorithms Apriori and Eclat. Hahsler, Gruen and Hornik (2005) <doi:10.18637/jss.v014.i15>. |
Authors: | Michael Hahsler [aut, cre, cph] , Christian Buchta [aut, cph], Bettina Gruen [aut, cph], Kurt Hornik [aut, cph] , Christian Borgelt [ctb, cph], Ian Johnson [ctb], Makhlouf Ledmi [ctb] |
Maintainer: | Michael Hahsler <[email protected]> |
License: | GPL-3 |
Version: | 1.7-8 |
Built: | 2024-10-27 05:20:42 UTC |
Source: | https://github.com/mhahsler/arules |
Provides the generic function and the methods to abbreviate long item labels
in transactions, associations (rules and itemsets) and transaction ID lists.
Note that abbreviate()
is not a generic and this arules defines a
generic with the base::abbreviate()
as the default.
abbreviate(names.arg, ...) ## S4 method for signature 'itemMatrix' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'transactions' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'rules' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'itemsets' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'tidLists' abbreviate(names.arg, minlength = 4, ..., method = "both.sides")
abbreviate(names.arg, ...) ## S4 method for signature 'itemMatrix' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'transactions' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'rules' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'itemsets' abbreviate(names.arg, minlength = 4, ..., method = "both.sides") ## S4 method for signature 'tidLists' abbreviate(names.arg, minlength = 4, ..., method = "both.sides")
names.arg |
an object of class transactions, itemMatrix, itemsets, rules or tidLists. |
... |
further arguments passed on to the default abbreviation function. |
minlength |
number of characters allowed in abbreviation |
method |
apply to level and value (both.sides) |
Sudheer Chelluboina and Michael Hahsler based on code by Martin Vodenicharov.
Other associations functions:
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data(Adult) inspect(head(Adult, 1)) Adult_abbr <- abbreviate(Adult, 15) inspect(head(Adult_abbr, 1))
data(Adult) inspect(head(Adult, 1)) Adult_abbr <- abbreviate(Adult, 15) inspect(head(Adult_abbr, 1))
Provides the generic function addComplement()
and a method for
transactions to add complement items. That is,
it adds an artificial item to each transaction which does not contain the
original item. Such items are also called negative items (Antonie et al,
2014).
addComplement(x, labels, complementLabels = NULL) ## S4 method for signature 'transactions' addComplement(x, labels, complementLabels = NULL)
addComplement(x, labels, complementLabels = NULL) ## S4 method for signature 'transactions' addComplement(x, labels, complementLabels = NULL)
x |
an object of class transactions. |
labels |
character strings; item labels for which complements should be created. |
complementLabels |
character strings; labels for the artificial complement-items. If omitted then the original label is prepended by "!" to form the complement-item label. |
Returns an object of class transactions with complement items added.
Michael Hahsler
Antonie L., Li J., Zaiane O. (2014) Negative Association Rules. In: Aggarwal C., Han J. (eds) Frequent Pattern Mining, Springer International Publishing, pp. 135-145. doi:10.1007/978-3-319-07821-2_6
data("Groceries") ## add a complement-items for "whole milk" and "other vegetables" g2 <- addComplement(Groceries, c("whole milk", "other vegetables")) g2 tail(itemInfo(g2)) inspect(head(g2, 3)) ## use a custom label for the complement-item g3 <- addComplement(g2, "coffee", complementLabels = "NO coffee") inspect(head(g2, 3)) ## add complements for all items (this is excessive for this dataset) g4 <- addComplement(Groceries, itemLabels(Groceries)) g4 ## add complements for all items with a minimum support of 0.1 g5 <- addComplement(Groceries, names(which(itemFrequency(Groceries) >= 0.1))) g5
data("Groceries") ## add a complement-items for "whole milk" and "other vegetables" g2 <- addComplement(Groceries, c("whole milk", "other vegetables")) g2 tail(itemInfo(g2)) inspect(head(g2, 3)) ## use a custom label for the complement-item g3 <- addComplement(g2, "coffee", complementLabels = "NO coffee") inspect(head(g2, 3)) ## add complements for all items (this is excessive for this dataset) g4 <- addComplement(Groceries, itemLabels(Groceries)) g4 ## add complements for all items with a minimum support of 0.1 g5 <- addComplement(Groceries, names(which(itemFrequency(Groceries) >= 0.1))) g5
The AdultUCI
data set contains the questionnaire data of the
Adult database (originally called the Census Income
Database) formatted as a data.frame. The Adult
data set contains the
data already prepared and coerced to transactions for
use with arules.
Adult
is an object of class transactions
with 48842 transactions
and 115 items. See below for details.
The AdultUCI data set contains a data frame with 48842 observations on the following 15 variables.
a numeric vector.
a factor with levels Federal-gov
,
Local-gov
, Never-worked
, Private
, Self-emp-inc
,
Self-emp-not-inc
, State-gov
, and Without-pay
.
an ordered factor with levels Preschool
<
1st-4th
< 5th-6th
< 7th-8th
< 9th
< 10th
< 11th
< 12th
< HS-grad
< Prof-school
<
Assoc-acdm
< Assoc-voc
< Some-college
<
Bachelors
< Masters
< Doctorate
.
a numeric vector.
a factor with
levels Divorced
, Married-AF-spouse
, Married-civ-spouse
,
Married-spouse-absent
, Never-married
, Separated
, and
Widowed
.
a factor with levels Adm-clerical
,
Armed-Forces
, Craft-repair
, Exec-managerial
,
Farming-fishing
, Handlers-cleaners
, Machine-op-inspct
,
Other-service
, Priv-house-serv
, Prof-specialty
,
Protective-serv
, Sales
, Tech-support
, and
Transport-moving
.
a factor with levels
Husband
, Not-in-family
, Other-relative
,
Own-child
, Unmarried
, and Wife
.
a factor
with levels Amer-Indian-Eskimo
, Asian-Pac-Islander
,
Black
, Other
, and White
.
a factor with levels Female
and Male
.
a numeric vector.
a numeric vector.
a numeric vector.
a numeric vector.
a factor with levels Cambodia
, Canada
, China
,
Columbia
, Cuba
, Dominican-Republic
, Ecuador
,
El-Salvador
, England
, France
, Germany
,
Greece
, Guatemala
, Haiti
, Holand-Netherlands
,
Honduras
, Hong
, Hungary
, India
, Iran
,
Ireland
, Italy
, Jamaica
, Japan
, Laos
,
Mexico
, Nicaragua
, Outlying-US(Guam-USVI-etc)
,
Peru
, Philippines
, Poland
, Portugal
,
Puerto-Rico
, Scotland
, South
, Taiwan
,
Thailand
, Trinadad&Tobago
, United-States
,
Vietnam
, and Yugoslavia
.
an ordered factor with
levels small
< large
.
The Adult database was extracted from the census bureau database
found at https://www.census.gov/ in 1994 by Ronny Kohavi and Barry
Becker (Data Mining and Visualization, Silicon Graphics). It was originally
used to predict whether income exceeds USD 50K/yr based on census data. We
added the attribute income
with levels small
and large
(>50K).
We prepared the data set for association mining as shown in the section
Examples. We removed the continuous attribute fnlwgt
(final weight).
We also eliminated education-num
because it is just a numeric
representation of the attribute education
. The other 4 continuous
attributes we mapped to ordinal attributes as follows:
age: cut into levels Young
(0-25), Middle-aged
(26-45),
Senior
(46-65) and Old
(66+)
hours-per-week: cut into levels Part-time
(0-25),
Full-time
(25-40), Over-time
(40-60) and Too-much
(60+)
capital-gain and capital-loss: each cut into levels None
(0),
Low
(0 < median of the values greater zero < max) and High
(>=max)
Michael Hahsler
A. Asuncion & D. J. Newman (2007): UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
The data set was first cited in Kohavi, R. (1996): Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
data("AdultUCI") dim(AdultUCI) AdultUCI[1:2, ] ## remove attributes AdultUCI[["fnlwgt"]] <- NULL AdultUCI[["education-num"]] <- NULL ## map metric attributes AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)), labels = c("Young", "Middle-aged", "Senior", "Old")) AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0,25,40,60,168)), labels = c("Part-time", "Full-time", "Over-time", "Workaholic")) AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]], c(-Inf,0,median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)), labels = c("None", "Low", "High")) AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]], c(-Inf,0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)), labels = c("None", "Low", "High")) ## create transactions Adult <- transactions(AdultUCI) Adult
data("AdultUCI") dim(AdultUCI) AdultUCI[1:2, ] ## remove attributes AdultUCI[["fnlwgt"]] <- NULL AdultUCI[["education-num"]] <- NULL ## map metric attributes AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)), labels = c("Young", "Middle-aged", "Senior", "Old")) AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0,25,40,60,168)), labels = c("Part-time", "Full-time", "Over-time", "Workaholic")) AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]], c(-Inf,0,median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)), labels = c("None", "Low", "High")) AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]], c(-Inf,0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)), labels = c("None", "Low", "High")) ## create transactions Adult <- transactions(AdultUCI) Adult
Provides the generic function affinity()
and methods to compute
and return a similarity matrix with the affinities between items for a set
itemsets stored in a matrix or in transactions via its superclass itemMatrix.
affinity(x) ## S4 method for signature 'matrix' affinity(x) ## S4 method for signature 'itemMatrix' affinity(x)
affinity(x) ## S4 method for signature 'matrix' affinity(x) ## S4 method for signature 'itemMatrix' affinity(x)
x |
a matrix or an object of class itemMatrix or transactions containing itemsets. |
Affinity between the two items and
is defined by Aggarwal et
al. (2002) as
where is the support measure. Note that affinity is equivalent to the
Jaccard similarity between items.
returns an object of class ar_similarity which represents the
affinities between items in x
.
Michael Hahsler
Charu C. Aggarwal, Cecilia Procopiuc, and Philip S. Yu (2002) Finding localized associations in market basket data, IEEE Trans. on Knowledge and Data Engineering, 14(1):51–62.
Other proximity classes and functions:
dissimilarity()
,
predict()
,
proximity-classes
data("Adult") ## choose a sample, calculate affinities s <- sample(Adult, 500) s a <- affinity(s) image(a)
data("Adult") ## choose a sample, calculate affinities s <- sample(Adult, 500) s a <- affinity(s) image(a)
Specifies the restrictions on the associations mined by
apriori()
. These restrictions can implement certain aspects of
rule templates described by Klemettinen (1994).
Note that appearance is only supported by the implementation of
apriori()
.
labels
character vectors giving the labels of the items which can appear in the specified place (rhs, lhs or both for rules and items for itemsets). none specifies, that the items mentioned there cannot appear anywhere in the rule/itemset. Note that items cannot be specified in more than one place (i.e., you cannot specify an item in lhs and rhs, but have to specify it as both).
default
one of
"both"
, "lhs"
, "rhs"
, "none"
. Specified the
default appearance for all items not explicitly mentioned in the other
elements of the list. Leave unspecified and the code will guess the correct
setting.
set
used internally.
items
used internally.
If appearance restrictions are used, an
appearance object will be created automatically within the
apriori()
function using the information in the named list of
the function's appearance
argument. In this case, the item labels
used in the list will be automatically matched against the items in the used
transactions.
Objects can also be created by calls of the form new("APappearance", ...)
. In this case, item IDs (column numbers of the transactions incidence
matrix) have to be used instead of labels.
as("NULL", "APappearance")
as("list", "APappearance")
Michael Hahsler and Bettina Gruen
Christian Borgelt (2004) Apriori — Finding Association Rules/Hyperedges with the Apriori Algorithm. https://borgelt.net/apriori.html
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo (1994). Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proceedings of the Third International Conference on Information and Knowledge Management, 401–407.
Other mining algorithms:
AScontrol-classes
,
ASparameter-classes
,
apriori()
,
eclat()
,
fim4r()
,
ruleInduction()
,
weclat()
data("Adult") ## find only frequent itemsets which do not contain small or large income is <- apriori(Adult, parameter = list(support= 0.1, target="frequent"), appearance = list(none = c("income=small", "income=large"))) itemFrequency(items(is))["income=small"] itemFrequency(items(is))["income=large"] ## find itemsets that only contain small or large income, or young age is <- apriori(Adult, parameter = list(support= 0.1, target="frequent"), appearance = list(items = c("income=small", "income=large", "age=Young"))) inspect(head(is)) ## find only rules with income-related variables in the right-hand-side. incomeItems <- grep("^income=", itemLabels(Adult), value = TRUE) incomeItems rules <- apriori(Adult, parameter = list(support=0.2, confidence = 0.5), appearance = list(rhs = incomeItems)) inspect(head(rules)) ## Note: For more complicated restrictions you have to mine all rules/itemsets and ## then filter the results afterwards.
data("Adult") ## find only frequent itemsets which do not contain small or large income is <- apriori(Adult, parameter = list(support= 0.1, target="frequent"), appearance = list(none = c("income=small", "income=large"))) itemFrequency(items(is))["income=small"] itemFrequency(items(is))["income=large"] ## find itemsets that only contain small or large income, or young age is <- apriori(Adult, parameter = list(support= 0.1, target="frequent"), appearance = list(items = c("income=small", "income=large", "age=Young"))) inspect(head(is)) ## find only rules with income-related variables in the right-hand-side. incomeItems <- grep("^income=", itemLabels(Adult), value = TRUE) incomeItems rules <- apriori(Adult, parameter = list(support=0.2, confidence = 0.5), appearance = list(rhs = incomeItems)) inspect(head(rules)) ## Note: For more complicated restrictions you have to mine all rules/itemsets and ## then filter the results afterwards.
Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm.
apriori(data, parameter = NULL, appearance = NULL, control = NULL, ...)
apriori(data, parameter = NULL, appearance = NULL, control = NULL, ...)
data |
object of class transactions. Any data structure which can be coerced into transactions (e.g., a binary matrix, a data.frame or a tibble) can also be specified and will be internally coerced to transactions. |
parameter |
object of class APparameter or named
list. The default behavior is to mine rules with minimum support of 0.1,
minimum confidence of 0.8, maximum of 10 items (maxlen), and a maximal time
for subset checking of 5 seconds ( |
appearance |
object of class APappearance or named list. With this argument item appearance can be restricted (implements rule templates). By default all items can appear unrestricted. |
control |
object of class APcontrol or named list. Controls the algorithmic performance of the mining algorithm (item sorting, report progress (verbose), etc.) |
... |
Additional arguments are for convenience added to the parameter list. |
The Apriori algorithm (Agrawal et al, 1993) employs level-wise search for frequent itemsets. The used C implementation of Apriori by Christian Borgelt (2003) includes some improvements (e.g., a prefix tree and item sorting).
Warning about automatic conversion of matrices or data.frames to transactions.
It is preferred to create transactions manually before
calling apriori()
to have control over item coding. This is especially
important when you are working with multiple datasets or several subsets of
the same dataset. To read about item coding, see itemCoding.
If a data.frame is specified as x
, then the data is automatically
converted into transactions by discretizing numeric data using
discretizeDF()
and then coercion to transactions. The discretization
may fail if the data is not well behaved.
Apriori only creates rules with one item in the RHS (Consequent).
The default value in APparameter for minlen
is 1.
This meains that rules with only one item (i.e., an empty antecedent/LHS)
like
will be created. These rules mean that no matter what other items are
involved, the item in the RHS will appear with the probability given by the
rule's confidence (which equals the support). If you want to avoid these
rules then use the argument parameter = list(minlen = 2)
.
Notes on run time and memory usage:
If the minimum support
is
chosen too low for the dataset, then the algorithm will try to create an
extremely large set of itemsets/rules. This will result in very long run
time and eventually the process will run out of memory. To prevent this,
the default maximal length of itemsets/rules is restricted to 10 items (via
the parameter element maxlen = 10
) and the time for checking subsets is
limited to 5 seconds (via maxtime = 5
). The output will show if you hit
these limits in the "checking subsets" line of the output. The time limit is
only checked when the subset size increases, so it may run significantly
longer than what you specify in maxtime. Setting maxtime = 0
disables
the time limit.
Interrupting execution with Control-C/Esc
is not recommended. Memory
cleanup will be prevented resulting in a memory leak. Also, interrupts are
only checked when the subset size increases, so it may take some time till
the execution actually stops.
Returns an object of class rules or itemsets.
Michael Hahsler and Bettina Gruen
R. Agrawal, T. Imielinski, and A. Swami (1993) Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington D.C. doi:10.1145/170035.170072
Christian Borgelt (2012) Frequent Item Set Mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(6):437-456. J. Wiley & Sons, Chichester, United Kingdom 2012. doi:10.1002/widm.1074
Christian Borgelt and Rudolf Kruse (2002) Induction of Association Rules: Apriori Implementation. 15th Conference on Computational Statistics (COMPSTAT 2002, Berlin, Germany) Physica Verlag, Heidelberg, Germany.
Christian Borgelt (2003) Efficient Implementations of Apriori and Eclat. Workshop of Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).
APRIORI Implementation: https://borgelt.net/apriori.html
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
ASparameter-classes
,
eclat()
,
fim4r()
,
ruleInduction()
,
weclat()
## Example 1: Create transaction data and mine association rules a_list <- list( c("a","b","c"), c("a","b"), c("a","b","d"), c("c","e"), c("a","b","d","e") ) ## Set transaction names names(a_list) <- paste("Tr",c(1:5), sep = "") a_list ## Use the constructor to create transactions trans1 <- transactions(a_list) trans1 rules <- apriori(trans1) inspect(rules) ## Example 2: Mine association rules from an existing transactions dataset ## using different minimum support and minimum confidence thresholds data("Adult") rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules")) summary(rules) # since ... gets automatically added to parameter, we can also write the # same call shorter: apriori(Adult, supp = 0.5, conf = 0.9, target = "rules")
## Example 1: Create transaction data and mine association rules a_list <- list( c("a","b","c"), c("a","b"), c("a","b","d"), c("c","e"), c("a","b","d","e") ) ## Set transaction names names(a_list) <- paste("Tr",c(1:5), sep = "") a_list ## Use the constructor to create transactions trans1 <- transactions(a_list) trans1 rules <- apriori(trans1) inspect(rules) ## Example 2: Mine association rules from an existing transactions dataset ## using different minimum support and minimum confidence thresholds data("Adult") rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules")) summary(rules) # since ... gets automatically added to parameter, we can also write the # same call shorter: apriori(Adult, supp = 0.5, conf = 0.9, target = "rules")
The AScontrol
class holds the algorithmic parameters for the used
mining algorithms. APcontrol
and ECcontrol
directly extend
AScontrol
with additional slots for parameters only suitable for the
algorithms Apriori (APcontrol
) and Eclat (ECcontrol
).
sort
an integer scalar indicating how to sort items with respect to their frequency: (default: 2)
1: ascending
-1: descending
0: do not sort
2: ascending
-2: descending with respect to transaction size sum
verbose
a logical indicating whether to report progress
filter
a numeric scalar indicating how to filter unused items from transactions (default: 0.1)
: do not filter items with respect to. usage in sets
: fraction of removed items for filtering
: take execution times ratio into account
tree
a logical indicating whether to
organize transactions as a prefix tree (default: TRUE
)
heap
a logical indicating whether to
use heapsort instead of quicksort to sort the transactions
(default: TRUE
)
memopt
a logical indicating whether to
minimize memory usage instead of maximize speed (default: FALSE
)
load
a logical indicating whether to
load transactions into memory (default: TRUE
)
sparse
a numeric value for the threshold for sparse representation (default: 7)
APcontrol
:
filter
, tree
, heap
, memopt
, load
, sort
, verbose
ECcontrol
:
sparse
, sort
, verbose
A suitable default control object will be
automatically created by the apriori()
or the
eclat()
function. By specifying a named list (names equal to
slots) as the control
argument for apriori()
or
eclat()
, default values can be replaced with the values
in the list.
Objects can also be created via coercion.
as("NULL", "APcontrol")
as("list", "APcontrol")
as("NULL", "ECcontrol")
as("list", "ECcontrol")
Michael Hahsler and Bettina Gruen
Christian Borgelt (2004) Apriori — Finding Association Rules/Hyperedges with the Apriori Algorithm. https://borgelt.net/apriori.html
Other mining algorithms:
APappearance-class
,
ASparameter-classes
,
apriori()
,
eclat()
,
fim4r()
,
ruleInduction()
,
weclat()
The ASparameter
class holds the mining parameters (e.g., minimum
support) for the used mining algorithms. APparameter
and
ECparameter
directly extend ASparameter
with additional slots
for parameters only suitable for apriori()
(APparameter
) or eclat()
(ECparameter
).
support
a numeric value for the
minimal support of an item set (default: )
minlen
an integer value for the minimal number of items per item set (default: 1 item)
maxlen
an integer value for the maximal number of items per item set (default: 10 items)
target
a character string indicating the type of association mined. Partial names are matched. Available targets are:
"frequent itemsets"
"maximally frequent itemsets"
"generator frequent itemsets"
"closed frequent itemsets"
"rules"
only available for apriori;
use ruleInduction for eclat.
"hyperedgesets"
only available for apriori;
see references for the definition of association hyperedgesets.
ext
a logical indicating whether to report coverage (i.e., LHS-support)
as an extended quality measure (default: TRUE
)
confidence
a numeric value for the
minimal confidence of rules/association hyperedges (default:
). For frequent itemsets it is set to
NA
.
smax
a numeric value for the maximal support of itemsets/rules/hyperedgesets (default: 1)
arem
a character string indicating the used additional rule
evaluation measure (default: "none"
) given by one of
"none"
: no additional evaluation measure
"diff"
: absolute confidence difference
"quot"
: difference of confidence quotient to 1
"aimp"
: absolute difference of improvement to 1
"info"
: information difference to prior
"chi2"
: normalized measure
Note: The measure is only reported if aval
is set to TRUE
.
Use minval
to set minimum thresholds on the measures.
aval
a logical indicating whether to
return the additional rule evaluation measure selected with arem
.
minval
a numeric value for the minimal value of additional
evaluation measure selected with arem
(default: )
originalSupport
a logical indicating whether to
use the original definition of minimum support
(support of the LHS and RHS of the rule). If set to FALSE
then the support of the LHS (also called coverage of the rule) is returned as support.
The minimum support threshold is applied to this support. (default: TRUE
)
maxtime
Time limit in seconds for checking subsets.
maxtime = 0
disables the time limit. (default: 5 seconds)
tidLists
a logical indicating whether eclat()
should
return also a list of supporting transactions IDs.
(default: FALSE
)
APparameter
:
confidence
, minval
, smax
, arem
, aval
, originalSupport
, maxtime
, support
, minlen
, maxlen
, target
, ext
ECparameter
:
tidLists
, support
, minlen
, maxlen
, target
, ext
A suitable default parameter object will be
automatically created by apriori()
or
eclat()
. By specifying a named list (names equal to
slots) as parameter
argument for apriori()
or
eclat()
, the default values can be replaced with the values
in the list.
Objects can also be created via coercion.
as("NULL", "APparameter")
as("list", "APparameter")
as("NULL", "ECparameter")
as("list", "ECparameter")
Michael Hahsler and Bettina Gruen
Christian Borgelt (2004) Apriori — Finding Association Rules/Hyperedges with the Apriori Algorithm. https://borgelt.net/apriori.html
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
apriori()
,
eclat()
,
fim4r()
,
ruleInduction()
,
weclat()
The associations
class is a virtual class which is extended to
represent mining result (e.g., sets of itemsets or
rules). The class defines some common methods for its subclasses.
## S4 method for signature 'associations' quality(x) ## S4 replacement method for signature 'associations' quality(x) <- value ## S4 method for signature 'associations' info(x) ## S4 replacement method for signature 'associations' info(x) <- value ## S4 method for signature 'associations' head(x, n = 6L, by = NULL, decreasing = TRUE, ...) ## S4 method for signature 'associations' tail(x, n = 6L, by = NULL, decreasing = TRUE, ...) ## S4 method for signature 'associations' items(x) ## S4 method for signature 'associations' length(x) ## S4 method for signature 'associations' labels(object)
## S4 method for signature 'associations' quality(x) ## S4 replacement method for signature 'associations' quality(x) <- value ## S4 method for signature 'associations' info(x) ## S4 replacement method for signature 'associations' info(x) <- value ## S4 method for signature 'associations' head(x, n = 6L, by = NULL, decreasing = TRUE, ...) ## S4 method for signature 'associations' tail(x, n = 6L, by = NULL, decreasing = TRUE, ...) ## S4 method for signature 'associations' items(x) ## S4 method for signature 'associations' length(x) ## S4 method for signature 'associations' labels(object)
x , object
|
the object. |
value |
the replacement value. |
n |
number of elements |
by |
sort by this interest measure |
decreasing |
sort in decreasing order? |
... |
further arguments. |
The implementations of associations
store itemsets (e.g., the LHS and
RHS of a rule) as objects of class itemMatrix (i.e., sparse
binary matrices). Quality measures (e.g., support) are stored in a
data.frame accessible via method quality()
.
See Sections Functions and See Also to see all available methods.
Note: Associations can store multisets with duplicated elements. Duplicated
elements can result from combining several sets of associations. Use
unique()
to remove duplicate associations.
quality(associations)
: returns the quality data.frame.
quality(associations) <- value
: replaces the quality data.frame. The lengths of the vectors in the data.frame have to equal the number of associations in the set.
info(associations)
: returns the info list.
info(associations) <- value
: replaces the info list.
head(associations)
: returns the first n associations.
tail(associations)
: returns the last n associations.
items(associations)
: dummy method. This method has to be implemented by all subclasses of associations and return the items which make up each association as an object of class itemMatrix.
length(associations)
: dummy method. This method has to be implemented by all subclasses of associations and return the number of elements in the association.
labels(associations)
: dummy method. This method has to be implemented by all subclasses of associations and return a vector of length(object) of labels for the elements in the association.
quality
a data.frame
info
a list
A virtual class: No objects may be created from it.
Michael Hahsler
Other associations functions:
abbreviate()
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Provides the methods to combine several associations or transactions objects into a single object.
## S4 method for signature 'itemMatrix' c(x, ..., recursive = FALSE) ## S4 method for signature 'transactions' c(x, ..., recursive = FALSE) ## S4 method for signature 'tidLists' c(x, ..., recursive = FALSE) ## S4 method for signature 'rules' c(x, ..., recursive = FALSE) ## S4 method for signature 'itemsets' c(x, ..., recursive = FALSE)
## S4 method for signature 'itemMatrix' c(x, ..., recursive = FALSE) ## S4 method for signature 'transactions' c(x, ..., recursive = FALSE) ## S4 method for signature 'tidLists' c(x, ..., recursive = FALSE) ## S4 method for signature 'rules' c(x, ..., recursive = FALSE) ## S4 method for signature 'itemsets' c(x, ..., recursive = FALSE)
x |
first object. |
... |
further objects of the same class as |
recursive |
a logical. If |
Combining arules objects is done by combining the rows of itemMatrix objects representing the associations or transactions.
Note that c()
can result in duplicates.
Use union()
rather than c()
to combine several mined
itemsets or rules into a single
set without duplicates.
An object of the same class as x
.
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") ## combine transactions a1 <- Adult[1:10] a2 <- Adult[101:110] aComb <- c(a1, a2) summary(aComb) ## combine rules (can contain the same rule multiple times) r1 <- apriori(Adult[1:1000]) r2 <- apriori(Adult[1001:2000]) rComb <- c(r1, r2) rComb ## union of rules (a set with only unique rules: same as unique(rComb)) rUnion <- union(r1,r2) rUnion
data("Adult") ## combine transactions a1 <- Adult[1:10] a2 <- Adult[101:110] aComb <- c(a1, a2) summary(aComb) ## combine rules (can contain the same rule multiple times) r1 <- apriori(Adult[1:1000]) r2 <- apriori(Adult[1001:2000]) rComb <- c(r1, r2) rComb ## union of rules (a set with only unique rules: same as unique(rComb)) rUnion <- union(r1,r2) rUnion
Defines a method to compute confidence intervals for interest measures for association rules.
## S3 method for class 'rules' confint( object, parm = "oddsRatio", level = 0.95, measure = NULL, side = c("two.sided", "lower", "upper"), method = NULL, replications = 1000, smoothCounts = 0, transactions = NULL, ... )
## S3 method for class 'rules' confint( object, parm = "oddsRatio", level = 0.95, measure = NULL, side = c("two.sided", "lower", "upper"), method = NULL, replications = 1000, smoothCounts = 0, transactions = NULL, ... )
object |
an object of class rules. |
parm , measure
|
name of the interest measures (see |
level |
the confidence level required. |
side |
Should a two-sided confidence interval or a one-sided limit be returned? Lower returns an interval with only a lower limit and upper returns an interval with only an upper limit. |
method |
method to construct the confidence interval. The available methods depends on the measure and the most common method is used by default. |
replications |
number of replications for method |
smoothCounts |
pseudo count for addaptive smoothing (Laplace smoothing). Often a pseudo counts of .5 is used for smoothing (see Detail Section). |
transactions |
if the rules object does not contain sufficient quality information, then a set of transactions to calculate the confidence interval for can be specified. |
... |
Additional parameters are ignored with a warning. |
This method creates a contingency table for each rule and then constructs a confidence interval for the specified measures.
Fast confidence interval approximations are currently available for the
measures "support"
, "count"
, "confidence"
, "lift"
, "oddsRatio"
, and "phi"
.
For all other measures, bootstrap sampling from a multinomial distribution
is used.
Haldan-Anscombe correction (Haldan, 1940; Anscombe, 1956) to avoids issues
with zero counts can be specified by smoothCounts = 0.5
. Here .5 is
added to each count in the contingency table.
Returns a matrix with with one row for each rule and the two columns
named "LL"
and "UL"
with the interval boundaries.
The matrix has the following additional attributes:
measure |
the interest measure. |
level |
the confidence level |
side |
the confidence level |
smoothCounts |
used count smoothing. |
method |
name of the method to create the interval |
desc |
description of the used method to calculate the confidence interval. The mentioned references can be found below. |
Michael Hahsler
Wilson, E. B. (1927). "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association, 22 (158): 209-212. doi:10.1080/01621459.1927.10502953
Clopper, C.; Pearson, E. S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial". Biometrika, 26 (4): 404-413. doi:10.1093/biomet/26.4.404
Doob, J. L. (1935). "The Limiting Distributions of Certain Statistics". Annals of Mathematical Statistics, 6: 160-169. doi:10.1214/aoms/1177732594
Fisher, R.A. (1962). "Confidence limits for a cross-product ratio". Australian Journal of Statistics, 4, 41.
Woolf, B. (1955). "On estimating the relation between blood group and diseases". Annals of Human Genetics, 19, 251-253.
Haldane, J.B.S. (1940). "The mean and variance of the moments of chi-squared when used as a test of homogeneity, when expectations are small". Biometrika, 29, 133-134.
Anscombe, F.J. (1956). "On estimating binomial response relations". Biometrika, 43, 461-464.
Other interest measures:
coverage()
,
interestMeasure()
,
is.redundant()
,
is.significant()
,
support()
data("Income") # mine some rules with the consequent "language in home=english" rules <- apriori(Income, parameter = list(support = 0.5), appearance = list(rhs = "language in home=english")) # calculate the confidence interval for the rules' odds ratios. # note that we use Haldane-Anscombe correction (with smoothCounts = .5) # to avoid issues with 0 counts in the contingency table. ci <- confint(rules, "oddsRatio", smoothCounts = .5) ci # We add the odds ratio (with Haldane-Anscombe correction) # and the confidence intervals to the quality slot of the rules. quality(rules) <- cbind( quality(rules), oddsRatio = interestMeasure(rules, "oddsRatio", smoothCounts = .5), oddsRatio = ci) rules <- sort(rules, by = "oddsRatio") inspect(rules) # use confidence intervals for lift to find rules with a lift significantly larger then 1. # We set the confidence level to 95%, create a one-sided interval and check # if the interval does not cover 1 (i.e., the lower limit is larger than 1). ci <- confint(rules, "lift", level = 0.95, side = "lower") ci inspect(rules[ci[, "LL"] > 1])
data("Income") # mine some rules with the consequent "language in home=english" rules <- apriori(Income, parameter = list(support = 0.5), appearance = list(rhs = "language in home=english")) # calculate the confidence interval for the rules' odds ratios. # note that we use Haldane-Anscombe correction (with smoothCounts = .5) # to avoid issues with 0 counts in the contingency table. ci <- confint(rules, "oddsRatio", smoothCounts = .5) ci # We add the odds ratio (with Haldane-Anscombe correction) # and the confidence intervals to the quality slot of the rules. quality(rules) <- cbind( quality(rules), oddsRatio = interestMeasure(rules, "oddsRatio", smoothCounts = .5), oddsRatio = ci) rules <- sort(rules, by = "oddsRatio") inspect(rules) # use confidence intervals for lift to find rules with a lift significantly larger then 1. # We set the confidence level to 95%, create a one-sided interval and check # if the interval does not cover 1 (i.e., the lower limit is larger than 1). ci <- confint(rules, "lift", level = 0.95, side = "lower") ci inspect(rules[ci[, "LL"] > 1])
Provides the generic function and a method to calculate the coverage (support of the left-hand-side) of rules.
coverage(x, transactions = NULL, reuse = TRUE) ## S4 method for signature 'rules' coverage(x, transactions = NULL, reuse = TRUE)
coverage(x, transactions = NULL, reuse = TRUE) ## S4 method for signature 'rules' coverage(x, transactions = NULL, reuse = TRUE)
x |
the set of rules. |
transactions |
the data set used to generate |
reuse |
reuse support and confidence stored in |
Coverage (also called cover or LHS-support) is the support of the
left-hand-side of the rule , i.e.,
. It represents a measure of
to how often the rule can be applied.
Coverage can be quickly calculated from the rule's quality measures (support and confidence) stored in the quality slot. If these values are not present, then the support of the LHS is counted using the data supplied in transactions.
Coverage is also one of the measures available via the function
interestMeasure()
.
A numeric vector of the same length as x
containing the
coverage values for the sets in x
.
Michael Hahsler
Other interest measures:
confint()
,
interestMeasure()
,
is.redundant()
,
is.significant()
,
support()
data("Income") ## find and some rules (we only use 5 rules here) and calculate coverage rules <- apriori(Income)[1:5] quality(rules) <- cbind(quality(rules), coverage = coverage(rules)) inspect(rules)
data("Income") ## find and some rules (we only use 5 rules here) and calculate coverage rules <- apriori(Income)[1:5] quality(rules) <- cbind(quality(rules), coverage = coverage(rules)) inspect(rules)
Provides the generic function crossTable()
and a method to
cross-tabulate joint occurrences across all pairs of items.
crossTable(x, ...) ## S4 method for signature 'itemMatrix' crossTable( x, measure = c("count", "support", "probability", "lift"), sort = FALSE )
crossTable(x, ...) ## S4 method for signature 'itemMatrix' crossTable( x, measure = c("count", "support", "probability", "lift"), sort = FALSE )
x |
object to be cross-tabulated (transactions or itemMatrix). |
... |
additional arguments. |
measure |
measure to return. Default is co-occurrence counts. |
sort |
sort the items by support. |
A symmetric matrix of n x n, where n is the number of items times
in x
. The matrix contains the co-occurrence counts between pairs of
items.
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Groceries") ct <- crossTable(Groceries, sort = TRUE) ct[1:5, 1:5] sp <- crossTable(Groceries, measure = "support", sort = TRUE) sp[1:5, 1:5] lift <- crossTable(Groceries, measure = "lift", sort = TRUE) lift[1:5, 1:5]
data("Groceries") ct <- crossTable(Groceries, sort = TRUE) ct[1:5, 1:5] sp <- crossTable(Groceries, measure = "support", sort = TRUE) sp[1:5, 1:5] lift <- crossTable(Groceries, measure = "lift", sort = TRUE) lift[1:5, 1:5]
Provides the generic function DATAFRAME()
and the methods to create
a data.frame representation from some arules objects.
These methods are used for the coercion to a
data.frame, but offer more control over the coercion process (item
separators, etc.).
DATAFRAME(from, ...) ## S4 method for signature 'rules' DATAFRAME(from, separate = TRUE, ...) ## S4 method for signature 'itemsets' DATAFRAME(from, ...) ## S4 method for signature 'itemMatrix' DATAFRAME(from, ...)
DATAFRAME(from, ...) ## S4 method for signature 'rules' DATAFRAME(from, separate = TRUE, ...) ## S4 method for signature 'itemsets' DATAFRAME(from, ...) ## S4 method for signature 'itemMatrix' DATAFRAME(from, ...)
from |
the object to be converted into a data.frame. |
... |
further arguments are passed on to the |
separate |
logical; separate LHS and RHS in separate columns? (only for rules) |
Using DATAFRAME()
is equivalent to the standard coercion
as(x, "data.frame")
. However, for rules, the argument separate = TRUE
will produce separate columns for the LHS and the RHS of the rule.
Furthermore, the arguments itemSep
, setStart
, setEnd
(and ruleSep
for separate = FALSE
) will be passed on to the
labels()
method for the object specified in from
.
a data.frame.
Michael Hahsler
Other import/export:
LIST()
,
pmml
,
read
,
write()
data(Adult) DATAFRAME(head(Adult)) DATAFRAME(head(Adult), setStart = '', itemSep = ' + ', setEnd = '') rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules")) rules <- head(rules, by = "conf") ### default coercions (same as as(rules, "data.frame")) DATAFRAME(rules) DATAFRAME(rules, separate = TRUE) DATAFRAME(rules, separate = TRUE, setStart = '', itemSep = ' + ', setEnd = '')
data(Adult) DATAFRAME(head(Adult)) DATAFRAME(head(Adult), setStart = '', itemSep = ' + ', setEnd = '') rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules")) rules <- head(rules, by = "conf") ### default coercions (same as as(rules, "data.frame")) DATAFRAME(rules) DATAFRAME(rules, separate = TRUE) DATAFRAME(rules, separate = TRUE, setStart = '', itemSep = ' + ', setEnd = '')
This function implements several basic unsupervised methods to convert a continuous variable into a categorical variable (factor) using different binning strategies. For convenience, a whole data.frame can be discretized (i.e., all numeric columns are discretized).
discretize( x, method = "frequency", breaks = 3, labels = NULL, include.lowest = TRUE, right = FALSE, dig.lab = 3, ordered_result = FALSE, infinity = FALSE, onlycuts = FALSE, categories = NULL, ... ) discretizeDF(df, methods = NULL, default = NULL)
discretize( x, method = "frequency", breaks = 3, labels = NULL, include.lowest = TRUE, right = FALSE, dig.lab = 3, ordered_result = FALSE, infinity = FALSE, onlycuts = FALSE, categories = NULL, ... ) discretizeDF(df, methods = NULL, default = NULL)
x |
a numeric vector (continuous variable). |
method |
discretization method. Available are: |
breaks , categories
|
either number of categories or a vector with boundaries for
discretization (all values outside the boundaries will be set to NA).
|
labels |
character vector; labels for the levels of the resulting
category. By default, labels are constructed using "(a,b]" interval
notation. If |
include.lowest |
logical; should the first interval be closed to the left? |
right |
logical; should the intervals be closed on the right (and open on the left) or vice versa? |
dig.lab |
integer; number of digits used to create labels. |
ordered_result |
logical; return a ordered factor? |
infinity |
logical; should the first/last break boundary changed to +/-Inf? |
onlycuts |
logical; return only computed interval boundaries? |
... |
for method "cluster" further arguments are passed on to
|
df |
data.frame; each numeric column in the data.frame is discretized. |
methods |
named list of lists or a data.frame; the named list contains
lists of discretization parameters (see parameters of |
default |
named list; parameters for |
Discretize calculates breaks between intervals using various methods and
then uses base::cut()
to convert the numeric values into intervals
represented as a factor.
Discretization may fail for several reasons. Some reasons are
A variable contains only a single value. In this case, the variable should be dropped or directly converted into a factor with a single level (see factor).
Some calculated breaks are not unique. This can happen for method frequency with very skewed data (e.g., a large portion of the values is 0). In this case, non-unique breaks are dropped with a warning. It would be probably better to look at the histogram of the data and decide on breaks for the method fixed.
discretize
only implements unsupervised discretization. See
arulesCBA::discretizeDF.supervised()
in package arulesCBA
for supervised discretization.
discretizeDF()
applies discretization to each numeric column.
Individual discretization parameters can be specified in the form:
methods = list(column_name1 = list(method = ,...), column_name2 = list(...))
.
If no discretization method is specified for a column, then the
discretization in default is applied (NULL
invokes the default method
in discretize()
). The special method "none"
can be specified
to suppress discretization for a column.
discretize()
returns a factor representing the
categorized continuous variable with
attribute "discretized:breaks"
indicating the used breaks or and
"discretized:method"
giving the used method. If onlycuts = TRUE
is used, a vector with the calculated interval boundaries is returned.
discretizeDF()
returns a discretized data.frame.
Michael Hahsler
base::cut()
,
arulesCBA::discretizeDF.supervised()
.
Other preprocessing:
hierarchy
,
itemCoding
,
merge()
,
sample()
data(iris) x <- iris[,1] ### look at the distribution before discretizing hist(x, breaks = 20, main = "Data") def.par <- par(no.readonly = TRUE) # save default layout(mat = rbind(1:2,3:4)) ### convert continuous variables into categories (there are 3 types of flowers) ### the default method is equal frequency table(discretize(x, breaks = 3)) hist(x, breaks = 20, main = "Equal Frequency") abline(v = discretize(x, breaks = 3, onlycuts = TRUE), col = "red") # Note: the frequencies are not exactly equal because of ties in the data ### equal interval width table(discretize(x, method = "interval", breaks = 3)) hist(x, breaks = 20, main = "Equal Interval length") abline(v = discretize(x, method = "interval", breaks = 3, onlycuts = TRUE), col = "red") ### k-means clustering table(discretize(x, method = "cluster", breaks = 3)) hist(x, breaks = 20, main = "K-Means") abline(v = discretize(x, method = "cluster", breaks = 3, onlycuts = TRUE), col = "red") ### user-specified (with labels) table(discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf), labels = c("small", "large"))) hist(x, breaks = 20, main = "Fixed") abline(v = discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf), onlycuts = TRUE), col = "red") par(def.par) # reset to default ### prepare the iris data set for association rule mining ### use default discretization irisDisc <- discretizeDF(iris) head(irisDisc) ### discretize all numeric columns differently irisDisc <- discretizeDF(iris, default = list(method = "interval", breaks = 2, labels = c("small", "large"))) head(irisDisc) ### specify discretization for the petal columns and don't discretize the others irisDisc <- discretizeDF(iris, methods = list( Petal.Length = list(method = "frequency", breaks = 3, labels = c("short", "medium", "long")), Petal.Width = list(method = "frequency", breaks = 2, labels = c("narrow", "wide")) ), default = list(method = "none") ) head(irisDisc) ### discretize new data using the same discretization scheme as the ### data.frame supplied in methods. Note: NAs may occure if a new ### value falls outside the range of values observed in the ### originally discretized table (use argument infinity = TRUE in ### discretize to prevent this case.) discretizeDF(iris[sample(1:nrow(iris), 5),], methods = irisDisc)
data(iris) x <- iris[,1] ### look at the distribution before discretizing hist(x, breaks = 20, main = "Data") def.par <- par(no.readonly = TRUE) # save default layout(mat = rbind(1:2,3:4)) ### convert continuous variables into categories (there are 3 types of flowers) ### the default method is equal frequency table(discretize(x, breaks = 3)) hist(x, breaks = 20, main = "Equal Frequency") abline(v = discretize(x, breaks = 3, onlycuts = TRUE), col = "red") # Note: the frequencies are not exactly equal because of ties in the data ### equal interval width table(discretize(x, method = "interval", breaks = 3)) hist(x, breaks = 20, main = "Equal Interval length") abline(v = discretize(x, method = "interval", breaks = 3, onlycuts = TRUE), col = "red") ### k-means clustering table(discretize(x, method = "cluster", breaks = 3)) hist(x, breaks = 20, main = "K-Means") abline(v = discretize(x, method = "cluster", breaks = 3, onlycuts = TRUE), col = "red") ### user-specified (with labels) table(discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf), labels = c("small", "large"))) hist(x, breaks = 20, main = "Fixed") abline(v = discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf), onlycuts = TRUE), col = "red") par(def.par) # reset to default ### prepare the iris data set for association rule mining ### use default discretization irisDisc <- discretizeDF(iris) head(irisDisc) ### discretize all numeric columns differently irisDisc <- discretizeDF(iris, default = list(method = "interval", breaks = 2, labels = c("small", "large"))) head(irisDisc) ### specify discretization for the petal columns and don't discretize the others irisDisc <- discretizeDF(iris, methods = list( Petal.Length = list(method = "frequency", breaks = 3, labels = c("short", "medium", "long")), Petal.Width = list(method = "frequency", breaks = 2, labels = c("narrow", "wide")) ), default = list(method = "none") ) head(irisDisc) ### discretize new data using the same discretization scheme as the ### data.frame supplied in methods. Note: NAs may occure if a new ### value falls outside the range of values observed in the ### originally discretized table (use argument infinity = TRUE in ### discretize to prevent this case.) discretizeDF(iris[sample(1:nrow(iris), 5),], methods = irisDisc)
Provides the generic function dissimilarity()
and the methods to
compute and returns distances for binary data in a matrix
,
transactions or associations which
can be used for grouping and clustering. See Hahsler (2016) for an
introduction to distance-based clustering of association rules.
dissimilarity(x, y = NULL, method = NULL, args = NULL, ...) ## S4 method for signature 'matrix' dissimilarity(x, y = NULL, method = NULL, args = NULL) ## S4 method for signature 'itemMatrix' dissimilarity(x, y = NULL, method = NULL, args = NULL, which = "transactions") ## S4 method for signature 'associations' dissimilarity(x, y = NULL, method = NULL, args = NULL, which = "associations")
dissimilarity(x, y = NULL, method = NULL, args = NULL, ...) ## S4 method for signature 'matrix' dissimilarity(x, y = NULL, method = NULL, args = NULL) ## S4 method for signature 'itemMatrix' dissimilarity(x, y = NULL, method = NULL, args = NULL, which = "transactions") ## S4 method for signature 'associations' dissimilarity(x, y = NULL, method = NULL, args = NULL, which = "associations")
x |
the set of elements (e.g., |
y |
|
method |
the distance measure to be used. Implemented measures are
(defaults to
|
args |
a list of additional arguments for the methods. |
... |
further arguments. |
which |
a character string indicating if the dissimilarity should be
calculated between transactions/associations (default) or items (use |
returns an object of class dist
.
Michael Hahsler
Aggarwal, C.C., Cecilia Procopiuc, and Philip S. Yu. (2002) Finding localized associations in market basket data. IEEE Trans. on Knowledge and Data Engineering 14(1):51–62.
Dice, L. R. (1945) Measures of the amount of ecologic association between species. Ecology 26, pages 297–302.
Gupta, G., Strehl, A., and Ghosh, J. (1999) Distance based clustering of association rules. In Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of ANNIE 1999), pages 759-764. ASME Press.
Hahsler, M. (2016) Grouping association rules using lift. In C. Iyigun, R. Moghaddess, and A. Oztekin, editors, 11th INFORMS Workshop on Data Mining and Decision Analytics (DM-DA 2016).
Sneath, P. H. A. (1957) Some thoughts on bacterial classification. Journal of General Microbiology 17, pages 184–200.
Sokal, R. R. and Michener, C. D. (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38, pages 1409–1438.
Toivonen, H., Klemettinen, M., Ronkainen, P., Hatonen, K. and Mannila H. (1995) Pruning and grouping discovered association rules. In Proceedings of KDD'95.
Other proximity classes and functions:
affinity()
,
predict()
,
proximity-classes
## cluster items in Groceries with support > 5% data("Groceries") s <- Groceries[, itemFrequency(Groceries) > 0.05] d_jaccard <- dissimilarity(s, which = "items") plot(hclust(d_jaccard, method = "ward.D2"), main = "Dendrogram for items") ## cluster transactions for a sample of Adult data("Adult") s <- sample(Adult, 500) ## calculate Jaccard distances and do hclust d_jaccard <- dissimilarity(s) hc <- hclust(d_jaccard, method = "ward.D2") plot(hc, labels = FALSE, main = "Dendrogram for Transactions (Jaccard)") ## get 20 clusters and look at the difference of the item frequencies (bars) ## for the top 20 items) in cluster 1 compared to the data (line) assign <- cutree(hc, 20) itemFrequencyPlot(s[assign == 1], population = s, topN = 20) ## calculate affinity-based distances between transactions and do hclust d_affinity <- dissimilarity(s, method = "affinity") hc <- hclust(d_affinity, method = "ward.D2") plot(hc, labels = FALSE, main = "Dendrogram for Transactions (Affinity)") ## cluster association rules rules <- apriori(Adult, parameter = list(support = 0.3)) rules <- subset(rules, subset = lift > 2) ## use affinity to cluster rules ## Note: we need to supply the transactions (or affinities) from the ## dataset (sample). d_affinity <- dissimilarity(rules, method = "affinity", args = list(transactions = s)) hc <- hclust(d_affinity, method = "ward.D2") plot(hc, main = "Dendrogram for Rules (Affinity)") ## create 4 groups and inspect the rules in the first group. assign <- cutree(hc, k = 3) inspect(rules[assign == 1])
## cluster items in Groceries with support > 5% data("Groceries") s <- Groceries[, itemFrequency(Groceries) > 0.05] d_jaccard <- dissimilarity(s, which = "items") plot(hclust(d_jaccard, method = "ward.D2"), main = "Dendrogram for items") ## cluster transactions for a sample of Adult data("Adult") s <- sample(Adult, 500) ## calculate Jaccard distances and do hclust d_jaccard <- dissimilarity(s) hc <- hclust(d_jaccard, method = "ward.D2") plot(hc, labels = FALSE, main = "Dendrogram for Transactions (Jaccard)") ## get 20 clusters and look at the difference of the item frequencies (bars) ## for the top 20 items) in cluster 1 compared to the data (line) assign <- cutree(hc, 20) itemFrequencyPlot(s[assign == 1], population = s, topN = 20) ## calculate affinity-based distances between transactions and do hclust d_affinity <- dissimilarity(s, method = "affinity") hc <- hclust(d_affinity, method = "ward.D2") plot(hc, labels = FALSE, main = "Dendrogram for Transactions (Affinity)") ## cluster association rules rules <- apriori(Adult, parameter = list(support = 0.3)) rules <- subset(rules, subset = lift > 2) ## use affinity to cluster rules ## Note: we need to supply the transactions (or affinities) from the ## dataset (sample). d_affinity <- dissimilarity(rules, method = "affinity", args = list(transactions = s)) hc <- hclust(d_affinity, method = "ward.D2") plot(hc, main = "Dendrogram for Rules (Affinity)") ## create 4 groups and inspect the rules in the first group. assign <- cutree(hc, k = 3) inspect(rules[assign == 1])
Provides the generic function duplicated()
and the methods to find
duplicated elements in
itemMatrix, associations and their subclasses.
duplicated(x, incomparables = FALSE, ...) ## S4 method for signature 'itemMatrix' duplicated(x, incomparables = FALSE) ## S4 method for signature 'rules' duplicated(x, incomparables = FALSE) ## S4 method for signature 'itemsets' duplicated(x, incomparables = FALSE)
x |
an object of class itemMatrix or associations. |
incomparables |
argument currently unused. |
... |
further arguments (currently unused). |
A logical vector indicating duplicated elements.
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") r1 <- apriori(Adult[1:1000], parameter = list(support = 0.5)) r2 <- apriori(Adult[1001:2000], parameter = list(support = 0.5)) ## Note this creates a collection of rules from two sets of rules r_comb <- c(r1, r2) duplicated(r_comb)
data("Adult") r1 <- apriori(Adult[1:1000], parameter = list(support = 0.5)) r2 <- apriori(Adult[1001:2000], parameter = list(support = 0.5)) ## Note this creates a collection of rules from two sets of rules r_comb <- c(r1, r2) duplicated(r_comb)
Mine frequent itemsets with the Eclat algorithm. This algorithm uses simple intersection operations for equivalence class clustering along with bottom-up lattice traversal.
eclat(data, parameter = NULL, control = NULL, ...)
data |
object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame). |
parameter |
object of class ECparameter or named list (default values are: support 0.1 and maxlen 5) |
control |
object of class ECcontrol or named list for algorithmic controls. |
... |
Additional arguments are added for convenience to the parameter list. |
Calls the C implementation of the Eclat algorithm by Christian Borgelt for mining frequent itemsets.
Eclat can also return the transaction IDs for each found itemset using
tidLists = TRUE
as a parameter and the result can be retrieved as a
tidLists object with method tidLists()
for class
itemsets. Note that storing transaction ID lists is
very memory intensive; creating transaction ID lists only works for minimum
support values which produce a relatively small number of itemsets. See also
supportingTransactions()
.
ruleInduction()
can be used to generate rules from the found
itemsets.
A weighted version of ECLAT is available as function weclat()
.
This version can be used to perform weighted association rule mining (WARM).
Returns an object of class itemsets.
Michael Hahsler and Bettina Gruen
Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. (1997) New algorithms for fast discovery of association rules. KDD'97: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, August 1997, Pages 283-286.
Christian Borgelt (2003) Efficient Implementations of Apriori and Eclat. Workshop of Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).
ECLAT Implementation: https://borgelt.net/eclat.html
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
ASparameter-classes
,
apriori()
,
fim4r()
,
ruleInduction()
,
weclat()
data("Adult") ## Mine itemsets with minimum support of 0.1 and 5 or less items itemsets <- eclat(Adult, parameter = list(supp = 0.1, maxlen = 5)) itemsets ## Create rules from the frequent itemsets rules <- ruleInduction(itemsets, confidence = .9) rules
data("Adult") ## Mine itemsets with minimum support of 0.1 and 5 or less items itemsets <- eclat(Adult, parameter = list(supp = 0.1, maxlen = 5)) itemsets ## Create rules from the frequent itemsets rules <- ruleInduction(itemsets, confidence = .9) rules
The Epub
data set contains the download history of documents from the
electronic publication platform of the Vienna University of Economics and
Business Administration. The data was recorded between Jan 2003 and Dec
2008.
Object of class transactions
with 15729 transactions
and 936 items.
Item labels are document IDs of the form "doc_11d"
.
Session IDs and time stamps for transactions are also provided as transaction information.
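For example, this extra information can be looked at with the transactionInfo() accessor:

data(Epub)
head(transactionInfo(Epub))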
Michael Hahsler
Provided by Michael Hahsler from the custom information system ePub-WU (which has since been replaced by eprint).
data(Epub) inspect(head(Epub))
Methods for "["
, i.e., extraction or subsetting for arules objects.
## S4 method for signature 'itemMatrix,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'transactions,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'tidLists,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'rules,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'itemsets,ANY,ANY,ANY' x[i, j, ..., drop = TRUE]
x |
an object of class itemMatrix, transactions or associations. |
i |
select rows/sets using an integer vector containing row numbers or a logical vector. |
j |
select columns/items using an integer vector containing column numbers (i.e., item IDs), a logical vector or a vector of strings containing parts of item labels. |
... |
further arguments are ignored. |
drop |
ignored. |
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data(Adult) Adult ## select first 10 transactions Adult[1:10] ## select first 10 items for first 100 transactions Adult[1:100, 1:10] ## select the first 100 transactions for the items containing ## "income" or "age=Young" in their labels Adult[1:100, c("income=small", "income=large" ,"age=Young")]
Interfaces the algorithms implemented in fim4r. The algorithms include: Apriori, Eclat, FPgrowth, Carpenter, IsTa, RElim and SaM.
fim4r( transactions, method = NULL, target = "frequent", support = 0.1, confidence = 0.8, originalSupport = TRUE, appear = NULL, report = NULL, verbose = TRUE, ... )
transactions |
a transactions object |
method |
the algorithm to be used. One of:
"apriori", "eclat", "fpgrowth", "carpenter", "ista", "relim" or "sam". |
target |
the target type. One of: "frequent", "closed", "maximal", "generators" or "rules". |
support |
a numeric value for the minimal support in the range [0, 1]. |
confidence |
a numeric value for the minimal confidence of rules in the range [0, 1]. |
originalSupport |
logical; use the support threshold on the support of the whole rule (LHS and RHS).
If FALSE, the support of the LHS (i.e., the rule coverage) is used instead. |
appear |
Specify item appearance in rules (only for apriori, eclat, fpgrowth
and the target
"rules"). A list of two vectors of equal length: item labels (the empty string "" sets the default for all unlisted items) and appearance modifiers ("-" = may not appear, "a" = antecedent only, "c" = consequent only); see the Examples. |
report |
cannot be used via the interface. |
verbose |
logical; print used parameters? |
... |
further arguments (e.g., zmin and zmax used in the Examples) are passed on to
the corresponding function in package fim4r. |
Installation:
The package fim4r is not available via CRAN. If needed,
the fim4r()
function downloads and installs the current version of the
package automatically. Your system needs to have build tools installed.
Build tools: You need to be able to install source packages. For Windows users this means that you need to install the RTools with a version matching your R version.
Additional Notes:
Support and confidence are specified here in the range [0, 1].
This is different from the use in the
fim4r
package, where supp
and conf
are given as percentages in the range [0, 100].
arules::fim4r()
automatically converts support and confidence internally.
fim4r
methods also return the empty itemset while arules
methods do not.
See ? fim4r::fim4r
for help on additional available arguments. This is only available
after package fim4r
is installed.
Algorithm descriptions and references can be found on the fim4r web page in the References Section.
An object of class itemsets or rules.
Christian Borgelt, fim4r: Frequent Item Set Mining and Association Rule Induction for R. https://borgelt.net/fim4r.html
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
ASparameter-classes
,
apriori()
,
eclat()
,
ruleInduction()
,
weclat()
## Not run: data(Adult) # list available algorithms fim4r() # mine association rules with FPgrowth r <- fim4r(Adult, method = "fpgrowth", target = "rules", supp = .7, conf = .8) r inspect(head(r, by = "lift")) # mine closed itemsets with Carpenter or IsTa fim4r(Adult, method = "carpenter", target = "closed", supp = .7) fim4r(Adult, method = "ista", target = "closed", supp = .7) # mine frequent itemset of length 2 (zmin and zmax = 2) freq_2 <- fim4r(Adult, method = "relim", target = "frequent", supp = .7, zmin = 2, zmax = 2) inspect(freq_2) # mine maximal frequent itemsets mfis <- fim4r(Adult, method = "sam", target = "maximal", supp = .7) inspect(mfis) # Examples for how to use item appearance with apriori, eclat, # fpgrowth in fim4r. We first mine all rules. inspect(fim4r(Adult, method = "fpgrowth", target = "rules", supp = .8)) # ignore item "capital-gain=None" inspect(fim4r(Adult, method = "fpgrowth", target = "rules", supp = .8, appear = list(c("capital-gain=None"), c("-")))) # "capital-gain=None" cannot appear in consequent (antecedent only) inspect(fim4r(Adult, method = "fpgrowth", target = "rules", supp = .8, appear = list(c("capital-gain=None"), c("a")))) # "capital-gain=None" cannot appear in the antecedent inspect(fim4r(Adult, method = "fpgrowth", target = "rules", supp = .8, appear = list(c("capital-gain=None"), c("c")))) # restrict the consequent to the item "capital-gain=None". # That is, "" = all items can only appear in the antecedent with the # exception that "capital-gain=None" can only appear in the consequent. inspect(fim4r(Adult, method = "fpgrowth", target = "rules", supp = .8, appear = list(c("", "capital-gain=None"), c("a", "c")))) ## End(Not run)
The Groceries
data set contains 1 month (30 days) of real-world
point-of-sale transaction data from a typical local grocery outlet. The
data set contains 9835 transactions and the items are aggregated to 169
categories.
Object of class transactions.
If you use this data set in your paper, please cite the paper in the References section.
Michael Hahsler
The data set is provided for arules by Michael Hahsler, Kurt Hornik and Thomas Reutterer.
Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.
Functions to use item hierarchies to aggregate items at different group levels, to perform multi-level transaction analysis.
addAggregate(x, by, postfix = "*") filterAggregate(x) aggregate(x, ...) ## S4 method for signature 'itemMatrix' aggregate(x, by) ## S4 method for signature 'itemsets' aggregate(x, by) ## S4 method for signature 'rules' aggregate(x, by)
x |
an transactions, itemsets or rules object. |
by |
name of a field (hierarchy level) available in
itemInfo of x, or a vector with one group label per item in x (see Example 2 in the Examples). |
postfix |
characters added to mark group-level items. |
... |
further arguments. |
Often an item hierarchy is available for transactions used for association rule mining. For example, in a supermarket dataset, items like "bread" and "bagel" might belong to the item group (category) "baked goods."
Transactions can store item hierarchies as additional columns in the
itemInfo data.frame ("labels"
cannot be used since it is reserved for
the item labels).
Aggregation: To perform analysis at a group level of the item
hierarchy, aggregate()
produces a new object with items aggregated to
a given group level. A group-level item is present if one or more of the
items in the group are present in the original object. If rules are
aggregated, and the aggregation would lead to the same aggregated group item
in the lhs and in the rhs, then that group item is removed from the lhs.
Rules or itemsets, which are not unique after the aggregation, are also
removed. Note also that the quality measures are not applicable to the new
rules and thus are removed. If these measures are required, then aggregate
the transactions before mining rules.
Multi-level analysis: To analyze relationships between individual
items and item groups at the same time, addAggregate()
can be used to
create a new transactions object which contains both, the original items and
group-level items (marked with a given postfix). In association rule mining,
all items are handled the same, which means that we will produce a large
number of rules of the type:
item A => group of item A
with a confidence of 1. This will also happen if you mine itemsets.
filterAggregate()
can be used to filter these spurious rules or
itemsets.
aggregate()
returns an object of the same class as x
encoded with a number of items equal to the number of unique values in
by
. Note that for associations (itemsets and rules) the number of
associations in the returned set will most likely be reduced since several
associations might map to the same aggregated association and aggregate
returns a unique set. If several associations map to a single aggregated
association, then the quality measures of one of the original associations are
chosen at random.
addAggregate()
returns a new transactions object with the original
items and the group-level items added. filterAggregate()
returns a new
set of rules (or itemsets) with the spurious associations removed.
Michael Hahsler
Other preprocessing:
discretize()
,
itemCoding
,
merge()
,
sample()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Groceries") Groceries ## Groceries contains a hierarchy stored in itemInfo head(itemInfo(Groceries)) ## Example 1: Aggregate items using an existing hierarchy stored in itemInfo. ## We aggregate to level2 stored in Groceries. All items with the same level2 label ## will become a single item with that name. ## Note that the number of items is therefore reduced to 55 Groceries_level2 <- aggregate(Groceries, by = "level2") Groceries_level2 head(itemInfo(Groceries_level2)) ## labels are alphabetically sorted! ## compare original and aggregated transactions inspect(head(Groceries, 2)) inspect(head(Groceries_level2, 2)) ## Example 2: Aggregate using a character vector. ## We create here labels manually to organize items by their first letter. mylevels <- toupper(substr(itemLabels(Groceries), 1, 1)) head(mylevels) Groceries_alpha <- aggregate(Groceries, by = mylevels) Groceries_alpha inspect(head(Groceries_alpha, 2)) ## Example 3: Aggregate rules ## Note: You could also directly mine rules from aggregated transactions to ## get support, lift and support rules <- apriori(Groceries, parameter = list(supp = 0.005, conf = 0.5)) rules inspect(rules[1]) rules_level2 <- aggregate(rules, by = "level2") inspect(rules_level2[1]) ## Example 4: Mine multi-level rules. ## (1) Add aggregate items. These items will have labels ending with a * Groceries_multilevel <- addAggregate(Groceries, "level2") summary(Groceries_multilevel) inspect(head(Groceries_multilevel)) rules <- apriori(Groceries_multilevel, parameter = list(support = 0.01, conf = .9)) inspect(head(rules, by = "lift")) ## Note that this contains many spurious rules of type 'item X => aggregate of item X' ## with a confidence of 1 and high lift. We can filter spurious rules resulting from ## the aggregation rules <- filterAggregate(rules) inspect(head(rules, by = "lift"))
data("Groceries") Groceries ## Groceries contains a hierarchy stored in itemInfo head(itemInfo(Groceries)) ## Example 1: Aggregate items using an existing hierarchy stored in itemInfo. ## We aggregate to level2 stored in Groceries. All items with the same level2 label ## will become a single item with that name. ## Note that the number of items is therefore reduced to 55 Groceries_level2 <- aggregate(Groceries, by = "level2") Groceries_level2 head(itemInfo(Groceries_level2)) ## labels are alphabetically sorted! ## compare original and aggregated transactions inspect(head(Groceries, 2)) inspect(head(Groceries_level2, 2)) ## Example 2: Aggregate using a character vector. ## We create here labels manually to organize items by their first letter. mylevels <- toupper(substr(itemLabels(Groceries), 1, 1)) head(mylevels) Groceries_alpha <- aggregate(Groceries, by = mylevels) Groceries_alpha inspect(head(Groceries_alpha, 2)) ## Example 3: Aggregate rules ## Note: You could also directly mine rules from aggregated transactions to ## get support, lift and support rules <- apriori(Groceries, parameter = list(supp = 0.005, conf = 0.5)) rules inspect(rules[1]) rules_level2 <- aggregate(rules, by = "level2") inspect(rules_level2[1]) ## Example 4: Mine multi-level rules. ## (1) Add aggregate items. These items will have labels ending with a * Groceries_multilevel <- addAggregate(Groceries, "level2") summary(Groceries_multilevel) inspect(head(Groceries_multilevel)) rules <- apriori(Groceries_multilevel, parameter = list(support = 0.01, conf = .9)) inspect(head(rules, by = "lift")) ## Note that this contains many spurious rules of type 'item X => aggregate of item X' ## with a confidence of 1 and high lift. We can filter spurious rules resulting from ## the aggregation rules <- filterAggregate(rules) inspect(head(rules, by = "lift"))
Compute the hub transaction weights for a collection of transactions using the HITS (hubs and authorities) algorithm.
hits( data, iter = 16L, tol = NULL, type = c("normed", "relative", "absolute"), verbose = FALSE )
data |
an object of or coercible to class transactions. |
iter |
an integer value specifying the maximum number of iterations to use. |
tol |
convergence tolerance (the default NULL uses the built-in tolerance FLT_EPSILON; see Details). |
type |
a string value specifying the norming of the hub weights. For
|
verbose |
a logical specifying if progress and runtime information should be displayed. |
Model a collection of transactions as a bipartite graph of hubs
(transactions) and authorities (items) with unit arcs and free node weights.
That is, a transaction weight is the sum of the (normalized) weights of the
items and vice versa. The weights are estimated by iterating the model to a
steady state, using a built-in convergence tolerance of FLT_EPSILON
for
the change in the norm of the vector of authorities.
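The following is a rough conceptual sketch of this iteration on a 0/1 incidence matrix. It is not the package's C implementation; the fixed iteration count, the final norming of the result, and the coercion via as(x, "matrix") are simplifying assumptions, so the values may differ slightly from hits().

hits_sketch <- function(M, iter = 16L) {
  auth <- rep(1, ncol(M))                  # item (authority) weights
  for (i in seq_len(iter)) {
    hub  <- as.numeric(M %*% auth)         # hub weight = sum of its items' weights
    auth <- as.numeric(crossprod(M, hub))  # authority weight = sum of hub weights
    auth <- auth / sqrt(sum(auth^2))       # keep the authority vector normed
  }
  hub / sum(hub)                           # illustrative norming: weights sum to one
}

data(SunBai)
M <- 1 * as(SunBai, "matrix")              # transactions as a 0/1 matrix
round(hits_sketch(M), 3)
round(hits(SunBai), 3)                     # compare with the built-in implementation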
A numeric
vector with transaction weights for data
.
Christian Buchta
K. Sun and F. Bai (2008). Mining Weighted Association Rules without Preassigned Weights. IEEE Transactions on Knowledge and Data Engineering, 20(4), 489–495.
Other weighted association mining functions:
SunBai
,
weclat()
data(SunBai) ## calculate transaction weigths w <- hits(SunBai) w ## add transaction weight to the dataset transactionInfo(SunBai)[["weight"]] <- w transactionInfo(SunBai) ## calulate regular item frequencies itemFrequency(SunBai, weighted = FALSE) ## calulate weighted item frequencies itemFrequency(SunBai, weighted = TRUE)
Provides image()
methods to generate level plots to visually
inspect binary incidence matrices, i.e., objects based on
itemMatrix (e.g., transactions, tidLists, items in
itemsets or rhs/lhs in rules). These plots can be used to identify problems
in a data set (e.g., recording problems with some transactions containing
all items).
## S4 method for signature 'itemMatrix' image(x, xlab = "Items (Columns)", ylab = "Elements (Rows)", ...) ## S4 method for signature 'transactions' image(x, xlab = "Items (Columns)", ylab = "Transactions (Rows)", ...) ## S4 method for signature 'tidLists' image(x, xlab = "Transactions (Columns)", ylab = "Items/itemsets (Rows)", ...)
x |
the object (itemMatrix, transactions or tidLists). |
xlab , ylab
|
labels for the plot. |
... |
further arguments passed on to image() in package Matrix. |
Michael Hahsler
image()
in package Matrix
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Epub") ## in this data set we can see that not all ## items were available from the beginning. image(Epub[1:1000])
data("Epub") ## in this data set we can see that not all ## items were available from the beginning. image(Epub[1:1000])
Survey example data from the book The Elements of Statistical Learning.
The data is provided in two formats:
Income
is an object of class transactions
with 6876 transactions (complete cases)
and 50 items. See below for details.
IncomeESL
is a data frame with 8993 observations on the
following 14 variables:
income: an ordered factor with levels [0,10) < [10,15) < [15,20) < [20,25) < [25,30) < [30,40) < [40,50) < [50,75) < 75+
sex: a factor with levels male, female
marital status: a factor with levels married, cohabitation, divorced, widowed, single
age: an ordered factor with levels 14-17 < 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < 65+
education: an ordered factor with levels grade <9 < grades 9-11 < high school graduate < college (1-3 years) < college graduate < graduate study
occupation: a factor with levels professional/managerial, sales, laborer, clerical/service, homemaker, student, military, retired, unemployed
years in bay area: an ordered factor with levels <1 < 1-3 < 4-6 < 7-10 < >10
dual incomes: a factor with levels not married, yes, no
number in household: an ordered factor with levels 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9+
number of children: an ordered factor with levels 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9+
householder status: a factor with levels own, rent, live with parents/family
type of home: a factor with levels house, condominium, apartment, mobile home, other
ethnic classification: a factor with levels american indian, asian, black, east indian, hispanic, pacific islander, white, other
language in home: a factor with levels english, spanish, other
The IncomeESL
data set originates from an example in the book
The Elements of Statistical Learning (see Section source). The
data set is an extract from this survey. It consists of 8993 instances
(obtained from the original data set with 9409 instances, by removing those
observations with the annual income missing) with 14 demographic attributes.
The data set is a good mixture of categorical and continuous variables with
a lot of missing data. This is characteristic of data mining applications.
The Income data set contains the data already prepared and coerced to
transactions.
To create transactions for Income, the original data frame
in IncomeESL
is prepared in a similar way as described in The
Elements of Statistical Learning. We removed cases with missing values and
cut each ordinal variable (age, education, income, years in bay area, number
in household, and number of children) at its median into two values (see
Section examples).
Michael Hahsler
Impact Resources, Inc., Columbus, OH (1987).
Obtained from the web site of the book: Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning. Springer-Verlag.
data("IncomeESL") IncomeESL[1:3, ] ## remove incomplete cases IncomeESL <- IncomeESL[complete.cases(IncomeESL), ] ## preparing the data set IncomeESL[["income"]] <- factor((as.numeric(IncomeESL[["income"]]) > 6) +1, levels = 1 : 2 , labels = c("$0-$40,000", "$40,000+")) IncomeESL[["age"]] <- factor((as.numeric(IncomeESL[["age"]]) > 3) +1, levels = 1 : 2 , labels = c("14-34", "35+")) IncomeESL[["education"]] <- factor((as.numeric(IncomeESL[["education"]]) > 4) +1, levels = 1 : 2 , labels = c("no college graduate", "college graduate")) IncomeESL[["years in bay area"]] <- factor( (as.numeric(IncomeESL[["years in bay area"]]) > 4) +1, levels = 1 : 2 , labels = c("1-9", "10+")) IncomeESL[["number in household"]] <- factor( (as.numeric(IncomeESL[["number in household"]]) > 3) +1, levels = 1 : 2 , labels = c("1", "2+")) IncomeESL[["number of children"]] <- factor( (as.numeric(IncomeESL[["number of children"]]) > 1) +0, levels = 0 : 1 , labels = c("0", "1+")) ## creating transactions Income <- transactions(IncomeESL) Income
data("IncomeESL") IncomeESL[1:3, ] ## remove incomplete cases IncomeESL <- IncomeESL[complete.cases(IncomeESL), ] ## preparing the data set IncomeESL[["income"]] <- factor((as.numeric(IncomeESL[["income"]]) > 6) +1, levels = 1 : 2 , labels = c("$0-$40,000", "$40,000+")) IncomeESL[["age"]] <- factor((as.numeric(IncomeESL[["age"]]) > 3) +1, levels = 1 : 2 , labels = c("14-34", "35+")) IncomeESL[["education"]] <- factor((as.numeric(IncomeESL[["education"]]) > 4) +1, levels = 1 : 2 , labels = c("no college graduate", "college graduate")) IncomeESL[["years in bay area"]] <- factor( (as.numeric(IncomeESL[["years in bay area"]]) > 4) +1, levels = 1 : 2 , labels = c("1-9", "10+")) IncomeESL[["number in household"]] <- factor( (as.numeric(IncomeESL[["number in household"]]) > 3) +1, levels = 1 : 2 , labels = c("1", "2+")) IncomeESL[["number of children"]] <- factor( (as.numeric(IncomeESL[["number of children"]]) > 1) +0, levels = 0 : 1 , labels = c("0", "1+")) ## creating transactions Income <- transactions(IncomeESL) Income
Provides the generic function inspect()
and methods to display
associations and transactions plus additional information formatted for
online inspection.
inspect(x, ...) ## S4 method for signature 'itemsets' inspect(x, itemSep = ", ", setStart = "{", setEnd = "}", linebreak = NULL, ...) ## S4 method for signature 'rules' inspect( x, itemSep = ", ", setStart = "{", setEnd = "}", ruleSep = "=>", linebreak = NULL, ... ) ## S4 method for signature 'transactions' inspect(x, itemSep = ", ", setStart = "{", setEnd = "}", linebreak = NULL, ...) ## S4 method for signature 'itemMatrix' inspect(x, itemSep = ", ", setStart = "{", setEnd = "}", linebreak = NULL, ...) ## S4 method for signature 'tidLists' inspect(x, ...)
x |
a set of associations or transactions or an itemMatrix. |
... |
additional arguments that can be used to customize the output: |
itemSep |
item separator |
setStart |
set start symbol |
setEnd |
set end symbol |
linebreak |
print only one element per line in case the output lines get very long? |
ruleSep |
rule separator |
inspect()
prints the results directly. If you need to create a data.frame
with a human readable version, then you can use DATAFRAME()
.
Nothing is returned (see the Details Section).
Michael Hahsler and Kurt Hornik
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") rules <- apriori(Adult) ## display some rules inspect(rules[1000:1001]) inspect(rules[1000:1001], ruleSep = "~~>", itemSep = " + ", setStart = "", setEnd = "", linebreak = FALSE) ## to get rules in readable format, use coercion or DATAFRAME with additional parameters. as(rules[1000:1001], "data.frame") DATAFRAME(rules[1000:1001]) DATAFRAME(rules[1000:1001], separate = TRUE, setStart = "", setEnd = "")
data("Adult") rules <- apriori(Adult) ## display some rules inspect(rules[1000:1001]) inspect(rules[1000:1001], ruleSep = "~~>", itemSep = " + ", setStart = "", setEnd = "", linebreak = FALSE) ## to get rules in readable format, use coercion or DATAFRAME with additional parameters. as(rules[1000:1001], "data.frame") DATAFRAME(rules[1000:1001]) DATAFRAME(rules[1000:1001], separate = TRUE, setStart = "", setEnd = "")
Provides the generic function interestMeasure()
and the
methods to calculate various additional interest measures for existing sets
of itemsets or rules.
interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...) ## S4 method for signature 'itemsets' interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...) ## S4 method for signature 'rules' interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...)
x |
a set of itemsets or rules. |
measure |
name or vector of names of the desired interest measures (see the Details section for available measures). If measure is missing then all available measures are calculated. |
transactions |
the transactions used to mine the associations
or a set of different transactions to calculate interest measures from
(Note: you need to set reuse = FALSE in this case). |
reuse |
logical indicating if information in the quality slot should be
reused for calculating the measures. This speeds up the process significantly
since only very little (or no) transaction counting is necessary if support,
confidence and lift are already available. Use reuse = FALSE to recalculate the measures from the transactions. |
... |
further arguments for the measure calculation. Many measures
are based on contingency table counts, and zero counts can produce NaN values (division by zero); additive smoothing via the additional smoothCounts argument can be used to avoid this. |
A searchable list of definitions, equations and references for all available interest measures can be found at https://mhahsler.github.io/arules/docs/measures. The descriptions are also linked in the list below.
The following measures are implemented for itemsets:
"support": Support.
"count": Support Count.
"allConfidence": All-Confidence.
"crossSupportRatio": Cross-Support Ratio.
"lift": Lift.
The following measures are implemented for rules:
"support": Support.
"confidence": Confidence.
"lift": Lift.
"count": Support Count.
"addedValue": Added Value.
"boost": Confidence Boost.
"casualConfidence": Casual Confidence.
"casualSupport": Casual Support.
"centeredConfidence": Centered Confidence.
"certainty": Certainty Factor.
"chiSquared": Chi-Squared. Additional parameters are: significance = TRUE
returns the p-value of the test for independence instead of the chi-squared statistic. For p-values, substitution effects (the occurrence of one item makes the occurrence of another item less likely) can be tested using the parameter complements = FALSE
. Note: Correction for multiple comparisons can be done using stats::p.adjust()
.
"collectiveStrength": Collective Strength.
"confirmedConfidence": Descriptive Confirmed Confidence.
"conviction": Conviction.
"cosine": Cosine.
"counterexample": Example and Counter-Example Rate.
"coverage": Coverage.
"doc": Difference of Confidence.
"fishersExactTest": Fisher's Exact Test. By default complementary effects are mined, substitutes can be found by using the parameter complements = FALSE
. Note that Fisher's exact test is equal to hyper-confidence with significance = TRUE
. Correction for multiple comparisons can be done using stats::p.adjust()
.
"gini": Gini Index.
"hyperConfidence": Hyper-Confidence. Reports the confidence level by default and the significance level if significance = TRUE
is used. By default complementary effects are mined, substitutes (too low co-occurrence counts) can be found by using the parameter complements = FALSE
.
"hyperLift": Hyper-Lift. The used quantile can be changed using parameter level
(default: level = 0.99
).
"imbalance": Imbalance Ratio.
"implicationIndex": Implication Index.
"importance": Importance.
"improvement": Improvement. The additional parameter improvementMeasure
(default: 'confidence'
) can be used to specify the measure used for the improvement calculation. See Generalized improvement.
"jaccard": Jaccard Coefficient.
"jMeasure": J-Measure.
"kappa": Kappa.
"kulczynski": Kulczynski.
"lambda": Lambda.
"laplace": Laplace Corrected Confidence. Parameter k
can be used to specify the number of classes (default is 2).
"leastContradiction": Least Contradiction.
"lerman": Lerman Similarity.
"leverage": Leverage.
"LIC": Lift Increase. The additional parameter improvementMeasure
(default: 'lift'
) can be used to specify the measure used for the increase calculation. See Generalized increase ratio.
"maxconfidence": MaxConfidence.
"mutualInformation": Mutual Information.
"oddsRatio": Odds Ratio.
"phi": Phi Correlation Coefficient.
"ralambondrainy": Ralambondrainy.
"relativeRisk": Relative Risk.
"rhsSupport": Right-Hand-Side Support.
"rulePowerFactor": Rule Power Factor.
"sebag": Sebag-Schoenauer.
"stdLift": Standardized Lift.
"table": Contingency Table. Returns the four counts for the contingency table. The entries are labeled n11
, n01
, n10
, and n00
(the first subscript is for X and the second for Y; 1 indicates presence and 0 indicates absence). If several measures are specified, then the counts have the prefix table.
"varyingLiaison": Varying Rates Liaison.
"yuleQ": Yule's Q.
"yuleY": Yule's Y.
If only one measure is used, the function returns a numeric vector
containing the values of the interest measure for each association in the
set of associations x
.
If more than one measure is specified, the result is a data.frame containing the different measures for each association as columns.
NA
is returned for rules/itemsets for which a certain measure is not
defined.
Michael Hahsler
Hahsler, Michael (2015). A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules, 2015, URL: https://mhahsler.github.io/arules/docs/measures.
Haldane, J.B.S. (1940). "The mean and variance of the moments of chi-squared when used as a test of homogeneity, when expectations are small". Biometrika, 29, 133-134.
Anscombe, F.J. (1956). "On estimating binomial response relations". Biometrika, 43, 461-464.
Other interest measures:
confint()
,
coverage()
,
is.redundant()
,
is.significant()
,
support()
data("Income") rules <- apriori(Income) ## calculate a single measure and add it to the quality slot quality(rules) <- cbind(quality(rules), hyperConfidence = interestMeasure(rules, measure = "hyperConfidence", transactions = Income)) inspect(head(rules, by = "hyperConfidence")) ## calculate several measures m <- interestMeasure(rules, c("confidence", "oddsRatio", "leverage"), transactions = Income) inspect(head(rules)) head(m) ## calculate all available measures for the first 5 rules and show them as a ## table with the measures as rows t(interestMeasure(head(rules, 5), transactions = Income)) ## calculate measures on a different set of transactions (I use a sample here) ## Note: reuse = TRUE (default) would just return the stored support on the ## data set used for mining newTrans <- sample(Income, 100) m2 <- interestMeasure(rules, "support", transactions = newTrans, reuse = FALSE) head(m2) ## calculate all available measures for the 5 frequent itemsets with highest support its <- apriori(Income, parameter = list(target = "frequent itemsets")) its <- head(its, 5, by = "support") inspect(its) interestMeasure(its, transactions = Income)
data("Income") rules <- apriori(Income) ## calculate a single measure and add it to the quality slot quality(rules) <- cbind(quality(rules), hyperConfidence = interestMeasure(rules, measure = "hyperConfidence", transactions = Income)) inspect(head(rules, by = "hyperConfidence")) ## calculate several measures m <- interestMeasure(rules, c("confidence", "oddsRatio", "leverage"), transactions = Income) inspect(head(rules)) head(m) ## calculate all available measures for the first 5 rules and show them as a ## table with the measures as rows t(interestMeasure(head(rules, 5), transactions = Income)) ## calculate measures on a different set of transactions (I use a sample here) ## Note: reuse = TRUE (default) would just return the stored support on the ## data set used for mining newTrans <- sample(Income, 100) m2 <- interestMeasure(rules, "support", transactions = newTrans, reuse = FALSE) head(m2) ## calculate all available measures for the 5 frequent itemsets with highest support its <- apriori(Income, parameter = list(target = "frequent itemsets")) its <- head(its, 5, by = "support") inspect(its) interestMeasure(its, transactions = Income)
Provides the generic function and the method is.closed()
for finding
closed itemsets.
Closed itemsets are used as a concise representation of
frequent itemsets. The closure of an itemset is its largest superset
with the same support (i.e., it is contained in exactly the same transactions).
An itemset is closed if it is its own closure, that is, if no proper superset has the same support (Pasquier et al. 1999).
is.closed(x) ## S4 method for signature 'itemsets' is.closed(x)
x |
a set of itemsets. |
Closed frequent itemsets can also be mined directly using
apriori()
or eclat()
with target "closed frequent itemsets"
.
a logical vector with the same length as x
indicating for
each element in x
if it is a closed itemset.
Michael Hahsler
Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal (1999). Discovering frequent closed itemsets for association rules. In Proceeding of the 7th International Conference on Database Theory, Lecture Notes In Computer Science (LNCS 1540), pages 398–416. Springer, 1999.
Other postprocessing:
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Provides the generic function and the method is.generator() for finding generator itemsets. Generators are part of concise representations for frequent itemsets. A generator in a set of itemsets is an itemset that has no subset with the same support (Liu et al, 2008). Note that the empty set is by definition a generator, but it is typically not stored in the itemsets in arules.
is.generator(x) ## S4 method for signature 'itemsets' is.generator(x)
x |
a set of itemsets. |
a logical vector with the same length as x
indicating for
each element in x
if it is a generator itemset.
Michael Hahsler
Yves Bastide, Nicolas Pasquier, Rafik Taouil, Gerd Stumme, Lotfi Lakhal (2000). Mining Minimal Non-redundant Association Rules Using Frequent Closed Itemsets. In International Conference on Computational Logic, Lecture Notes in Computer Science (LNCS 1861). pages 972–986. doi:10.1007/3-540-44957-4_65
Guimei Liu, Jinyan Li, Limsoon Wong (2008). A new concise representation of frequent itemsets using generators and a positive border. Knowledge and Information Systems 17(1):35-56. doi:10.1007/s10115-007-0111-5
Other postprocessing:
is.closed()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
# Example from Liu et al (2008) trans_list <- list( t1 = c("a","b","c"), t2 = c("a","b", "c", "d"), t3 = c("a","d"), t4 = c("a","c") ) trans <- transactions(trans_list) its <- apriori(trans, support = 1/4, target = "frequent itemsets") is.generator(its)
Provides the generic function is.maximal()
and methods for
finding maximal itemsets. Maximal frequent itemsets are used as a concise
representation of frequent itemsets. An itemset is maximal in a set if no
proper superset of the itemset is contained in the set (Zaki et al., 1997).
is.maximal(x, ...) ## S4 method for signature 'itemMatrix' is.maximal(x) ## S4 method for signature 'itemsets' is.maximal(x) ## S4 method for signature 'rules' is.maximal(x)
x |
the set of itemsets, rules or an itemMatrix object. |
... |
further arguments. |
Maximally frequent itemsets can also be mined directly using
apriori()
or eclat()
with target "maximally frequent
itemsets".
We define maximal rules here as the rules generated by maximal itemsets.
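A brief sketch of both approaches (the support threshold is only for illustration):

data("Adult")
## flag the maximal itemsets among all frequent itemsets
its <- eclat(Adult, parameter = list(supp = 0.4))
summary(is.maximal(its))
## or mine maximally frequent itemsets directly
eclat(Adult, parameter = list(supp = 0.4, target = "maximally frequent itemsets"))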
a logical vector with the same length as x
indicating for
each element in x
if it is a maximal itemset.
Michael Hahsler
Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li (1997). New algorithms for fast discovery of association rules. Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627.
Other postprocessing:
is.closed()
,
is.generator()
,
is.redundant()
,
is.significant()
,
is.superset()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Provides the generic function is.redundant()
and the method to find
redundant rules based on any interest measure.
is.redundant(x, ...) ## S4 method for signature 'rules' is.redundant( x, measure = "confidence", confint = FALSE, level = 0.95, smoothCounts = 1, ... )
x |
a set of rules. |
... |
additional arguments are passed on to
interestMeasure(). |
measure |
measure used to check for redundancy. |
confint |
should confidence intervals be used to the redundancy check? |
level |
confidence level for the confidence interval. Only used when
confint = TRUE. |
smoothCounts |
adds a "pseudo count" to each count in the used contingency table. This implements addaptive smoothing (Laplace smoothing) for counts and avoids zero counts. |
Simple improvement-based redundancy: (confint = FALSE
) A rule
can be defined as redundant if a more general rule with the same or a
higher confidence exists. That is, a more specific rule is redundant if it
is only equally or even less predictive than a more general rule. A rule is
more general if it has the same RHS but one or more items removed from the
LHS. Formally, a rule X => Y is redundant if there exists a more general rule X' => Y, where X' is a subset of X, with conf(X' => Y) >= conf(X => Y).
This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000).
The idea of improvement can be extended to other measures besides confidence.
Any other measure available for function interestMeasure()
(e.g.,
lift or the odds ratio) can be specified in measure
.
Confidence interval-based redundancy: (confint = TRUE
) Li et
al (2014) propose to use the confidence interval (CI) of the odds ratio (OR)
of rules to define redundancy. A more specific rule is redundant if it does
not provide a significantly higher OR than any more general rule. Using
confidence intervals as error bounds, a more specific rule is
defined as redundant if its OR CI overlaps with the CI of any more general
rule. This type of redundancy detection removes more rules
than improvement since it takes differences in counts due to randomness in
the dataset into account.
The odds ratio and the CI are based on counts, which can be zero and
lead to numerical problems. In addition to the method described by Li et al
(2014), we use additive smoothing (Laplace smoothing) to alleviate this
problem. The default setting adds 1 to each count (see
confint()
). A different pseudocount (smoothing parameter) can be
defined using the additional parameter smoothCounts
. Smoothing can be
disabled using smoothCounts = 0
.
Warning: This approach to redundancy checking is flawed: rules with non-overlapping CIs are clearly non-redundant (the same result as for a 2-sample t-test), but overlapping CIs do not automatically mean that there is no significant difference between the two measures; this leads to a higher type II error. At the same time, multiple comparisons are performed, leading to an increased type I error. If we are more worried about missing important rules, then the type II error is the greater concern.
Confidence interval-based redundancy checks can also be used for other
measures with a confidence interval like confidence (see
confint()
).
returns a logical vector indicating which rules are redundant.
Michael Hahsler and Christian Buchta
Bayardo, R. , R. Agrawal, and D. Gunopulos (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217–240.
Li, J., Jixue Liu, Hannu Toivonen, Kenji Satou, Youqiang Sun, and Bingyu Sun (2014). Discovering statistically non-redundant subgroups. Knowledge-Based Systems. 67 (September, 2014), 315–327. doi:10.1016/j.knosys.2014.04.030
Other postprocessing:
is.closed()
,
is.generator()
,
is.maximal()
,
is.significant()
,
is.superset()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other interest measures:
confint()
,
coverage()
,
interestMeasure()
,
is.significant()
,
support()
data("Income") ## mine some rules with the consequent "language in home=english" rules <- apriori(Income, parameter = list(support = 0.5), appearance = list(rhs = "language in home=english")) ## for better comparison we add Bayado's improvement and sort by improvement quality(rules)$improvement <- interestMeasure(rules, measure = "improvement") rules <- sort(rules, by = "improvement") inspect(rules) is.redundant(rules) ## find non-redundant rules using improvement of confidence ## Note: a few rules have a very small improvement over the rule {} => {language in home=english} rules_non_redundant <- rules[!is.redundant(rules)] inspect(rules_non_redundant) ## use non-overlapping confidence intervals for the confidence measure instead ## Note: fewer rules have a significantly higher confidence inspect(rules[!is.redundant(rules, measure = "confidence", confint = TRUE, level = 0.95)]) ## find non-redundant rules using improvement of the odds ratio. quality(rules)$oddsRatio <- interestMeasure(rules, measure = "oddsRatio", smoothCounts = .5) inspect(rules[!is.redundant(rules, measure = "oddsRatio")]) ## use the confidence interval for the odds ratio. ## We see that no rule has a significantly better odds ratio than the most general rule. inspect(rules[!is.redundant(rules, measure = "oddsRatio", confint = TRUE, level = 0.95)]) ## use the confidence interval for lift inspect(rules[!is.redundant(rules, measure = "lift", confint = TRUE, level = 0.95)])
data("Income") ## mine some rules with the consequent "language in home=english" rules <- apriori(Income, parameter = list(support = 0.5), appearance = list(rhs = "language in home=english")) ## for better comparison we add Bayado's improvement and sort by improvement quality(rules)$improvement <- interestMeasure(rules, measure = "improvement") rules <- sort(rules, by = "improvement") inspect(rules) is.redundant(rules) ## find non-redundant rules using improvement of confidence ## Note: a few rules have a very small improvement over the rule {} => {language in home=english} rules_non_redundant <- rules[!is.redundant(rules)] inspect(rules_non_redundant) ## use non-overlapping confidence intervals for the confidence measure instead ## Note: fewer rules have a significantly higher confidence inspect(rules[!is.redundant(rules, measure = "confidence", confint = TRUE, level = 0.95)]) ## find non-redundant rules using improvement of the odds ratio. quality(rules)$oddsRatio <- interestMeasure(rules, measure = "oddsRatio", smoothCounts = .5) inspect(rules[!is.redundant(rules, measure = "oddsRatio")]) ## use the confidence interval for the odds ratio. ## We see that no rule has a significantly better odds ratio than the most general rule. inspect(rules[!is.redundant(rules, measure = "oddsRatio", confint = TRUE, level = 0.95)]) ## use the confidence interval for lift inspect(rules[!is.redundant(rules, measure = "lift", confint = TRUE, level = 0.95)])
Provides the generic function is.significant() and the method to
and the method to
find significant rules.
is.significant(x, ...) ## S4 method for signature 'rules' is.significant( x, transactions = NULL, method = "fisher", alpha = 0.01, adjust = "none", reuse = TRUE, ... )
x |
a set of rules. |
... |
further arguments are passed on to interestMeasure(). |
transactions |
optional set of transactions. Only needed if the required
interest measures are not available in the quality slot of x. |
method |
test to use. Options are "fisher" (Fisher's exact test; default) and "chisq" (chi-squared test). |
alpha |
required significance level. |
adjust |
method to adjust for multiple comparisons. Some options are
"none" (default), "bonferroni", "holm" and "fdr"; see stats::p.adjust() for details. |
reuse |
logical indicating if information in the quality slot should be reused for calculating the measures. |
The implementation for association rules uses Fisher's exact test with correction for multiple comparisons to test the null hypothesis that the LHS and the RHS of the rule are independent. Significant rules have a p-value less than the specified significance level alpha (the null hypothesis of independence is rejected). See Hahsler and Hornik (2007) for details.
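The idea can be illustrated for a single rule by building its 2x2 contingency table with interestMeasure(measure = "table") and handing the counts to stats::fisher.test(). This is only a rough sketch; the exact sidedness of the test and other internal details of is.significant() are not reproduced here.

data("Income")
rules <- apriori(Income, support = 0.2)
## contingency table counts (n11, n01, n10, n00) for the first rule
ct <- interestMeasure(rules[1], measure = "table", transactions = Income)
ct
## one-sided Fisher's exact test for a positive association between LHS and RHS
fisher.test(matrix(unlist(ct), nrow = 2), alternative = "greater")$p.value
## compare with the flag returned by is.significant()
is.significant(rules[1], alpha = 0.01)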
returns a logical vector indicating which rules are significant.
Michael Hahsler
Hahsler, Michael and Kurt Hornik (2007). New probabilistic interest measures for association rules. Intelligent Data Analysis, 11(5):437–455. doi:10.3233/IDA-2007-11502
Other interest measures:
confint()
,
coverage()
,
interestMeasure()
,
is.redundant()
,
support()
Other postprocessing:
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.superset()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
data("Income") rules <- apriori(Income, support = 0.2) is.significant(rules) rules[is.significant(rules)] # Adjust P-values for multiple comparisons rules[is.significant(rules, adjust = "bonferroni")]
data("Income") rules <- apriori(Income, support = 0.2) is.significant(rules) rules[is.significant(rules)] # Adjust P-values for multiple comparisons rules[is.significant(rules, adjust = "bonferroni")]
Provides the generic functions is.subset()
and is.superset()
, and the methods
for finding super or subsets in associations and
itemMatrix objects.
is.superset(x, y = NULL, proper = FALSE, sparse = TRUE, ...) is.subset(x, y = NULL, proper = FALSE, sparse = TRUE, ...) ## S4 method for signature 'itemMatrix' is.superset(x, y = NULL, proper = FALSE, sparse = TRUE) ## S4 method for signature 'associations' is.superset(x, y = NULL, proper = FALSE, sparse = TRUE) ## S4 method for signature 'itemMatrix' is.subset(x, y = NULL, proper = FALSE, sparse = TRUE) ## S4 method for signature 'associations' is.subset(x, y = NULL, proper = FALSE, sparse = TRUE)
x , y
|
associations or itemMatrix objects. If |
proper |
a logical indicating whether all or only proper super- or subsets should be found. |
sparse |
a logical indicating if a sparse Matrix::ngCMatrix rather than a dense logical matrix should be returned. Sparse computation requires a significantly smaller amount of memory and is much faster for large sets. |
... |
currently unused. |
Determines for each element in x
which elements in y
are supersets
or subsets. Note that the method can be very slow and memory intensive if
x
and/or y
are very dense (contain many items).
For rules, the union of lhs and rhs is used as the set of items.
returns a logical matrix or a sparse Matrix::ngCMatrix
with length(x)
rows and length(y)
columns.
Each logical row vector represents which elements in y
are supersets
(subsets) of the corresponding element in x
. If either x
or
y
have length zero, NULL
is returned instead of a matrix.
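As a minimal sketch of how the sparse result can be inspected (an illustration only, using the same itemsets mined in the example below), the ngCMatrix can be coerced to a dense logical matrix:

data("Adult")
set <- eclat(Adult, parameter = list(supp = 0.8))
m <- is.superset(set, set, proper = TRUE)   ## sparse ngCMatrix
m_dense <- as(m, "matrix")                  ## dense logical matrix
which(m_dense, arr.ind = TRUE)              ## index pairs of proper supersets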
Michael Hahsler and Ian Johnson
Other postprocessing:
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") set <- eclat(Adult, parameter = list(supp = 0.8)) ### find the supersets of each itemset in set is.superset(set, set) is.superset(set, set, sparse = FALSE)
data("Adult") set <- eclat(Adult, parameter = list(supp = 0.8)) ### find the supersets of each itemset in set is.superset(set, set) is.superset(set, set, sparse = FALSE)
The order in which items are stored in an itemMatrix is called the item coding. The following generic functions and methods are used to translate between the representation in the itemMatrix format (used in transactions, rules and itemsets), item labels and numeric item IDs (i.e., the column numbers in the itemMatrix representation).
decode(x, ...) ## S4 method for signature 'numeric' decode(x, itemLabels) ## S4 method for signature 'list' decode(x, itemLabels) encode(x, ...) ## S4 method for signature 'character' encode(x, itemLabels, itemMatrix = TRUE) ## S4 method for signature 'numeric' encode(x, itemLabels, itemMatrix = TRUE) ## S4 method for signature 'list' encode(x, itemLabels, itemMatrix = TRUE) recode(x, ...) ## S4 method for signature 'itemMatrix' recode(x, itemLabels = NULL, match = NULL) ## S4 method for signature 'itemsets' recode(x, itemLabels = NULL, match = NULL) ## S4 method for signature 'rules' recode(x, itemLabels = NULL, match = NULL) compatible(x, y) ## S4 method for signature 'itemMatrix' compatible(x, y) ## S4 method for signature 'associations' compatible(x, y)
x |
a vector or a list of vectors of character strings (for
|
... |
further arguments. |
itemLabels |
a vector of character strings used for coding where the position of an item label in the vector gives the item's column ID. Alternatively, an itemMatrix, transactions or associations object can be specified and the item labels of these objects are used. |
itemMatrix |
return an object of class itemMatrix otherwise an
object of the same class as |
match |
deprecated: use |
y |
an object of class itemMatrix, transactions or
associations to compare item coding to |
Item coding compatibility: When working with several datasets or different
subsets of the same dataset, combining or comparing the found
itemsets or rules requires a compatible item coding.
That is, the sparse matrices representing the
items (the itemMatrix objects) have columns for the same items in exactly the
same order. The coercion to transactions with transactions()
or
as(x, "transactions")
will create the
item coding by adding items in the order they are encountered in the dataset. This
can lead to different item codings (different order, missing items) even for
only slightly different datasets or versions of a dataset.
Method compatible()
can be used to check if two sets have the same item coding.
Defining a common item coding:
When working with many sets, a common item
coding should first be defined by creating a vector with all possible item labels and then
specifying it as itemLabels
when creating transactions with transactions()
.
Compatible itemMatrix objects can be created using encode()
.
Recoding and Decoding:
Two incompatible objects can be made compatible using recode()
. Recode
one object by specifying the other object in itemLabels
.
decode()
converts from the column IDs used in the itemMatrix
representation to item labels. decode()
is used by LIST()
.
recode()
always returns an object of the same class as x
.
For encode()
with itemMatrix = TRUE
an object of class
itemMatrix is returned. Otherwise the result is of the same type as
x
, e.g., a list or a vector.
Michael Hahsler
LIST()
, associations, itemMatrix
Other preprocessing:
discretize()
,
hierarchy
,
merge()
,
sample()
data("Adult") ## Example 1: Manual decoding ## Extract the item coding as a vector of item labels. iLabels <- itemLabels(Adult) head(iLabels) ## get undecoded list (itemIDs) list <- LIST(Adult[1:5], decode = FALSE) list ## decode itemIDs by replacing them with the appropriate item label decode(list, itemLabels = iLabels) ## Example 2: Manually create an itemMatrix using iLabels as the common item coding data <- list( c("income=small", "age=Young"), c("income=large", "age=Middle-aged") ) # Option a: encode to match the item coding in Adult iM <- encode(data, itemLabels = Adult) iM inspect(iM) compatible(iM, Adult) # Option b: coercion plus recode to make it compatible to Adult # (note: the coding has 115 item columns after recode) iM <- as(data, "itemMatrix") iM compatible(iM, Adult) iM <- recode(iM, itemLabels = Adult) iM compatible(iM, Adult) ## Example 3: use recode to make itemMatrices compatible ## select first 100 transactions and all education-related items sub <- Adult[1:100, itemInfo(Adult)$variables == "education"] itemLabels(sub) image(sub) ## After choosing only a subset of items (columns), the item coding is now ## no longer compatible with the Adult dataset compatible(sub, Adult) ## recode to match Adult again sub.recoded <- recode(sub, itemLabels = Adult) image(sub.recoded) ## Example 4: manually create 2 new transaction for the Adult data set ## Note: check itemLabels(Adult) to see the available labels for items twoTransactions <- as( encode(list( c("age=Young", "relationship=Unmarried"), c("age=Senior") ), itemLabels = Adult), "transactions") twoTransactions inspect(twoTransactions) ## the same using the transactions constructor function instead twoTransactions <- transactions( list( c("age=Young", "relationship=Unmarried"), c("age=Senior") ), itemLabels = Adult) twoTransactions inspect(twoTransactions) ## Example 5: Use a common item coding # Creation of transactions separately will produce different item codings trans1 <- transactions( list( c("age=Young", "relationship=Unmarried"), c("age=Senior") )) trans1 trans2 <- transactions( list( c("age=Middle-aged", "relationship=Married"), c("relationship=Unmarried", "age=Young") )) trans2 compatible(trans1, trans2) # produce common item coding (all item labels in the two sets) commonItemLabels <- union(itemLabels(trans1), itemLabels(trans2)) commonItemLabels trans1 <- recode(trans1, itemLabels = commonItemLabels) trans1 trans2 <- recode(trans2, itemLabels = commonItemLabels) trans2 compatible(trans1, trans2) ## Example 6: manually create a rule using the item coding in Adult ## and calculate interest measures aRule <- new("rules", lhs = encode(list(c("age=Young", "relationship=Unmarried")), itemLabels = Adult), rhs = encode(list(c("income=small")), itemLabels = Adult) ) ## shorter version using the rules constructor aRule <- rules( lhs = list(c("age=Young", "relationship=Unmarried")), rhs = list(c("income=small")), itemLabels = Adult ) quality(aRule) <- interestMeasure(aRule, measure = c("support", "confidence", "lift"), transactions = Adult) inspect(aRule)
data("Adult") ## Example 1: Manual decoding ## Extract the item coding as a vector of item labels. iLabels <- itemLabels(Adult) head(iLabels) ## get undecoded list (itemIDs) list <- LIST(Adult[1:5], decode = FALSE) list ## decode itemIDs by replacing them with the appropriate item label decode(list, itemLabels = iLabels) ## Example 2: Manually create an itemMatrix using iLabels as the common item coding data <- list( c("income=small", "age=Young"), c("income=large", "age=Middle-aged") ) # Option a: encode to match the item coding in Adult iM <- encode(data, itemLabels = Adult) iM inspect(iM) compatible(iM, Adult) # Option b: coercion plus recode to make it compatible to Adult # (note: the coding has 115 item columns after recode) iM <- as(data, "itemMatrix") iM compatible(iM, Adult) iM <- recode(iM, itemLabels = Adult) iM compatible(iM, Adult) ## Example 3: use recode to make itemMatrices compatible ## select first 100 transactions and all education-related items sub <- Adult[1:100, itemInfo(Adult)$variables == "education"] itemLabels(sub) image(sub) ## After choosing only a subset of items (columns), the item coding is now ## no longer compatible with the Adult dataset compatible(sub, Adult) ## recode to match Adult again sub.recoded <- recode(sub, itemLabels = Adult) image(sub.recoded) ## Example 4: manually create 2 new transaction for the Adult data set ## Note: check itemLabels(Adult) to see the available labels for items twoTransactions <- as( encode(list( c("age=Young", "relationship=Unmarried"), c("age=Senior") ), itemLabels = Adult), "transactions") twoTransactions inspect(twoTransactions) ## the same using the transactions constructor function instead twoTransactions <- transactions( list( c("age=Young", "relationship=Unmarried"), c("age=Senior") ), itemLabels = Adult) twoTransactions inspect(twoTransactions) ## Example 5: Use a common item coding # Creation of transactions separately will produce different item codings trans1 <- transactions( list( c("age=Young", "relationship=Unmarried"), c("age=Senior") )) trans1 trans2 <- transactions( list( c("age=Middle-aged", "relationship=Married"), c("relationship=Unmarried", "age=Young") )) trans2 compatible(trans1, trans2) # produce common item coding (all item labels in the two sets) commonItemLabels <- union(itemLabels(trans1), itemLabels(trans2)) commonItemLabels trans1 <- recode(trans1, itemLabels = commonItemLabels) trans1 trans2 <- recode(trans2, itemLabels = commonItemLabels) trans2 compatible(trans1, trans2) ## Example 6: manually create a rule using the item coding in Adult ## and calculate interest measures aRule <- new("rules", lhs = encode(list(c("age=Young", "relationship=Unmarried")), itemLabels = Adult), rhs = encode(list(c("income=small")), itemLabels = Adult) ) ## shorter version using the rules constructor aRule <- rules( lhs = list(c("age=Young", "relationship=Unmarried")), rhs = list(c("income=small")), itemLabels = Adult ) quality(aRule) <- interestMeasure(aRule, measure = c("support", "confidence", "lift"), transactions = Adult) inspect(aRule)
Provides the generic function itemFrequency()
and methods to get the
frequency/support for all single items in objects based on
itemMatrix. For example, it is used to get the single
item support from an object of class transactions
without mining.
itemFrequency(x, ...) ## S4 method for signature 'itemMatrix' itemFrequency(x, type = c("relative", "absolute"), weighted = FALSE) ## S4 method for signature 'tidLists' itemFrequency(x, type = c("relative", "absolute"))
x |
an object of class itemMatrix or tidLists. |
... |
further arguments are passed on. |
type |
a character string specifying if |
weighted |
should support be weighted by transactions weights stored as
column |
itemFrequency
returns a named numeric vector. Each element
is the frequency/support of the corresponding item in object x
. The
items appear in the vector in the same order as in the binary matrix in
x
.
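As a small sketch of the relation between the two types (using the Adult data set from the example below): relative frequencies are simply the absolute counts divided by the number of transactions.

data("Adult")
abs_freq <- itemFrequency(Adult, type = "absolute")
rel_freq <- itemFrequency(Adult, type = "relative")
all.equal(abs_freq / length(Adult), rel_freq)   ## TRUE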
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") itemFrequency(Adult, type = "relative")
data("Adult") itemFrequency(Adult, type = "relative")
Provides the generic function itemFrequencyPlot()
and the method to
create an item frequency bar plot for inspecting the item frequency
distribution for objects based on itemMatrix (e.g.,
transactions, or items in itemsets
and rules).
itemFrequencyPlot(x, ...) ## S4 method for signature 'itemMatrix' itemFrequencyPlot( x, type = c("relative", "absolute"), weighted = FALSE, support = NULL, topN = NULL, population = NULL, popCol = "black", popLwd = 1, lift = FALSE, horiz = FALSE, names = TRUE, cex.names = graphics::par("cex.axis"), xlab = NULL, ylab = NULL, mai = NULL, ... )
x |
the object to be plotted. |
... |
further arguments are passed on (see
|
type |
a character string indicating whether item frequencies should be displayed as relative or absolute values. |
weighted |
should support be weighted by transactions weights stored as
column |
support |
a numeric value. Only display items which have a support of
at least |
topN |
an integer value. Only plot the |
population |
object of same class as |
popCol |
plotting color for population. |
popLwd |
line width for population. |
lift |
a logical indicating whether to plot the lift ratio
instead of frequencies. The lift ratio gives how many times an item is
more frequent in |
horiz |
a logical. If |
names |
a logical indicating if the names (bar labels) should be displayed? |
cex.names |
a numeric value for the expansion factor for axis names (bar labels). |
xlab |
a character string with the label for the x axis (use an empty string to force no label). |
ylab |
a character string with the label for the y axis (see xlab). |
mai |
a numerical vector giving the plot's margin sizes in inches (see ‘? par’). |
A numeric vector with the midpoints of the drawn bars; useful for adding to the graph.
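A small sketch of how the returned midpoints can be used to annotate the bars (only the first 10 items of the Adult data set are plotted; the annotation code is illustrative):

data("Adult")
freq <- itemFrequency(Adult)[1:10]
mids <- itemFrequencyPlot(Adult[, 1:10])
## add the frequency values on top of the bars
text(x = mids, y = freq, labels = round(freq, 2), pos = 3, cex = 0.7, xpd = TRUE)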
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data(Adult) ## the following example compares the item frequencies ## of people with a large income (boxes) with the average in the data set Adult.largeIncome <- Adult[Adult %in% "income=large"] ## simple plot itemFrequencyPlot(Adult.largeIncome) ## plot with the averages of the population plotted as a line ## (for first 72 variables/items) itemFrequencyPlot(Adult.largeIncome[, 1:72], population = Adult[, 1:72]) ## plot lift ratio (frequency in x / frequency in population) ## for items with a support of 20% in the population itemFrequencyPlot(Adult.largeIncome, population = Adult, support = 0.2, lift = TRUE, horiz = TRUE)
The itemMatrix
class is the basic building block for transactions,
and associations. The class contains a sparse
Matrix representation of a set of itemsets and the
corresponding item labels.
## S4 method for signature 'itemMatrix' summary(object, maxsum = 6, ...) ## S4 method for signature 'itemMatrix' dim(x) nitems(x, ...) ## S4 method for signature 'itemMatrix' nitems(x) ## S4 method for signature 'itemMatrix' length(x) toLongFormat(from, ...) ## S4 method for signature 'itemMatrix' toLongFormat(from, cols = c("ID", "item"), decode = TRUE) ## S4 method for signature 'itemMatrix' labels(object, itemSep = ",", setStart = "{", setEnd = "}") itemLabels(object, ...) itemLabels(object) <- value ## S4 method for signature 'itemMatrix' itemLabels(object) ## S4 replacement method for signature 'itemMatrix' itemLabels(object) <- value itemInfo(object) itemInfo(object) <- value ## S4 method for signature 'itemMatrix' itemInfo(object) ## S4 replacement method for signature 'itemMatrix' itemInfo(object) <- value itemsetInfo(object) itemsetInfo(object) <- value ## S4 method for signature 'itemMatrix' itemsetInfo(object) ## S4 replacement method for signature 'itemMatrix' itemsetInfo(object) <- value ## S4 method for signature 'itemMatrix' dimnames(x) ## S4 replacement method for signature 'itemMatrix,list' dimnames(x) <- value
object , x , from
|
the object. |
maxsum |
integer, how many items should be shown for the summary? |
... |
further parameters |
cols |
columns for the long format. |
decode |
decode item IDs to item labels. |
itemSep |
item separator symbol. |
setStart |
set start symbol. |
setEnd |
set end symbol. |
value |
replacement value |
Representation
Sets of itemsets are represented as a compressed sparse binary matrix. Conceptually, columns represent items and rows are the sets/transactions. In the compressed form, each itemset is a vector of column indices (called item IDs) representing the items.
Warning: Ideally, we would store the matrix as a row-oriented sparse
matrix (ngRMatrix
), but the Matrix package provides better support for
column-oriented sparse classes (Matrix::ngCMatrix). The matrix is therefore internally stored
in transposed form.
Working with several itemMatrix
objects
If you work with several itemMatrix
objects at the same time (e.g.,
several transaction sets, lhs and rhs of a rule, etc.), then the encoding
(itemLabels and order of the items in the binary matrix) in the different
itemMatrices is important and needs to conform. See itemCoding
to learn how to encode and recode itemMatrix
objects.
summary(itemMatrix)
: show a summary.
dim(itemMatrix)
: returns the number of rows
(itemsets) and columns (items in the encoding).
nitems(itemMatrix)
: returns the number of items in the encoding.
length(itemMatrix)
: returns the number of itemsets (rows) in the matrix.
toLongFormat(itemMatrix)
: convert the sets to long format
(a data.frame with two columns, ID and item). Column names
can be specified as a character vector of length 2 called cols
.
labels(itemMatrix)
: returns labels for the itemsets.
The following arguments can be used to customize the representation
of the labels: itemSep
, setStart
and setEnd
.
itemLabels(itemMatrix)
: returns the item labels used for encoding as a character vector.
itemLabels(itemMatrix) <- value
: replaces the item labels used for encoding.
itemInfo(itemMatrix)
: returns the whole item/column information data.frame including labels.
itemInfo(itemMatrix) <- value
: replaces the item/column info by a data.frame.
itemsetInfo(itemMatrix)
: returns the item set/row information data.frame.
itemsetInfo(itemMatrix) <- value
: replaces the item set/row info by a data.frame.
dimnames(itemMatrix)
: returns a list with the dimname vectors.
dimnames(x = itemMatrix) <- value
: replace the dimnames.
data
a sparse matrix of class Matrix::ngCMatrix representing the itemsets. Warning: the matrix is stored in transposed form for efficiency reasons!
itemInfo
a data.frame
itemsetInfo
a data.frame
Objects can be created by calls of the form
new("itemMatrix", ...)
. However, most of the time objects will be
created by coercion from a matrix, list or data.frame.
as("matrix", "itemMatrix")
as("itemMatrix", "matrix")
as("list", "itemMatrix")
as("itemMatrix", "list")
as("itemMatrix", "ngCMatrix")
as("ngCMatrix", "itemMatrix")
Warning: the ngCMatrix
representation is transposed!
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
set.seed(1234) ## Generate a logical matrix with 5000 random itemsets for 20 items m <- matrix(runif(5000 * 20) > 0.8, ncol = 20, dimnames = list(NULL, paste("item", c(1:20), sep = ""))) head(m) ## Coerce the logical matrix into an itemMatrix object imatrix <- as(m, "itemMatrix") imatrix ## An itemMatrix contains a set of itemsets (each row is an itemset). ## The length of the set is the number of rows. length(imatrix) ## The sparese matrix also has regular matrix dimensions. dim(imatrix) nrow(imatrix) ncol(imatrix) ## Subsetting: Get first 5 elements (rows) of the itemMatrix. This can be done in ## several ways. imatrix[1:5] ### get elements 1:5 imatrix[1:5, ] ### Matrix subsetting for rows 1:5 head(imatrix, n = 5) ### head() ## Get first 5 elements (rows) of the itemMatrix as list. as(imatrix[1:5], "list") ## Get first 5 elements (rows) of the itemMatrix as matrix. as(imatrix[1:5], "matrix") ## Get first 5 elements (rows) of the itemMatrix as sparse ngCMatrix. ## **Warning:** For efficiency reasons, the ngCMatrix is transposed! You ## can transpose it again to get the expected format. as(imatrix[1:5], "ngCMatrix") t(as(imatrix[1:5], "ngCMatrix")) ## Get labels for the first 5 itemsets (first default and then with ## custom formating) labels(imatrix[1:5]) labels(imatrix[1:5], itemSep = " + ", setStart = "", setEnd = "") ## Create itemsets manually from an itemMatrix. Itemsets contain items in the form of ## an itemMatrix and additional quality measures (not supplied in the example). is <- new("itemsets", items = imatrix) is inspect(head(is, n = 3)) ## Create rules manually. I use imatrix[4:6] for the lhs of the rules and ## imatrix[1:3] for the rhs. Rhs and lhs cannot share items so I use ## itemSetdiff here. I also assign missing values for the quality measures support ## and confidence. rules <- new("rules", lhs = itemSetdiff(imatrix[4:6], imatrix[1:3]), rhs = imatrix[1:3], quality = data.frame(support = c(NA, NA, NA), confidence = c(NA, NA, NA) )) rules inspect(rules) ## Manually create a itemMatrix with an item encoding that matches imatrix (20 items in order ## item1, item2, ..., item20) itemset_list <- list(c("item1","item2"), c("item3")) imatrix_new <- encode(itemset_list, itemLabels = imatrix) imatrix_new compatible(imatrix_new, imatrix)
The itemsets
class represents a set of itemsets and the associated
quality measures.
itemsets(items, itemLabels = NULL, quality = data.frame()) ## S4 method for signature 'itemsets' summary(object, ...) ## S4 method for signature 'itemsets' length(x) ## S4 method for signature 'itemsets' nitems(x) ## S4 method for signature 'itemsets' labels(object, ...) ## S4 method for signature 'itemsets' itemLabels(object) ## S4 replacement method for signature 'itemsets' itemLabels(object) <- value ## S4 method for signature 'itemsets' itemInfo(object) ## S4 method for signature 'itemsets' items(x) ## S4 replacement method for signature 'itemsets' items(x) <- value ## S4 method for signature 'itemsets' tidLists(x)
items |
an itemMatrix or an object that can be converted using |
itemLabels |
item labels used for |
quality |
a data.frame with quality information (one row per itemset). |
object , x
|
the object |
... |
further arguments |
value |
replacement value |
Itemsets are usually created by calling an association rule mining algorithm
like apriori()
.
To create itemsets manually, the itemMatrix for the items of the itemsets
can be created using itemCoding. An example is in the Example
section below.
Mined sets of itemsets contain several interest measures accessible
with the quality()
method. Additional measures can be
calculated via interestMeasure()
.
summary(itemsets)
: create a summary
length(itemsets)
: get the number of itemsets.
nitems(itemsets)
: get the number of items (columns) in the current encoding.
labels(itemsets)
: get the itemset labels.
itemLabels(itemsets)
: get the item labels.
itemLabels(itemsets) <- value
: replace the item labels.
itemInfo(itemsets)
: get item info data.frame.
items(itemsets)
: get items as an itemMatrix.
items(itemsets) <- value
: replaces the items with a different itemMatrix.
tidLists(itemsets)
: get tidLists stored in the object (if any).
items
an itemMatrix object representing the itemsets.
tidLists
a tidLists or NULL
.
quality
a data.frame with quality information
info
a list with mining information.
Objects are the result of calling the
functions apriori()
(e.g., with target = "frequent itemsets"
in the parameter list) or eclat()
.
Objects can also
be created by calls of the form new("itemsets", ...)
or by using the constructor function
itemsets()
.
as("itemsets", "data.frame")
Michael Hahsler
Superclass: associations
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
data("Adult") ## Mine frequent itemsets with Eclat. fsets <- eclat(Adult, parameter = list(supp = 0.5)) ## Display the 5 itemsets with the highest support. fsets.top5 <- sort(fsets)[1:5] inspect(fsets.top5) ## Get the itemsets as a list as(items(fsets.top5), "list") ## Get the itemsets as a binary matrix as(items(fsets.top5), "matrix") ## Get the itemsets as a sparse matrix, a ngCMatrix from package Matrix. ## Warning: for efficiency reasons, the ngCMatrix you get is transposed as(items(fsets.top5), "ngCMatrix") ## Manually create itemsets with the item coding in the Adult dataset ## and calculate some interest measures twoitemsets <- itemsets( items = list( c("age=Young", "relationship=Unmarried"), c("age=Old") ), itemLabels = Adult) quality(twoitemsets) <- data.frame(support = interestMeasure(twoitemsets, measure = c("support"), transactions = Adult)) inspect(twoitemsets)
data("Adult") ## Mine frequent itemsets with Eclat. fsets <- eclat(Adult, parameter = list(supp = 0.5)) ## Display the 5 itemsets with the highest support. fsets.top5 <- sort(fsets)[1:5] inspect(fsets.top5) ## Get the itemsets as a list as(items(fsets.top5), "list") ## Get the itemsets as a binary matrix as(items(fsets.top5), "matrix") ## Get the itemsets as a sparse matrix, a ngCMatrix from package Matrix. ## Warning: for efficiency reasons, the ngCMatrix you get is transposed as(items(fsets.top5), "ngCMatrix") ## Manually create itemsets with the item coding in the Adult dataset ## and calculate some interest measures twoitemsets <- itemsets( items = list( c("age=Young", "relationship=Unmarried"), c("age=Old") ), itemLabels = Adult) quality(twoitemsets) <- data.frame(support = interestMeasure(twoitemsets, measure = c("support"), transactions = Adult)) inspect(twoitemsets)
Provides the generic functions and the methods for itemwise set
operations on items in an itemMatrix. The regular set operations regard each
itemset in an itemMatrix
as an element. Itemwise operations regard each item
as an element and operate on the items of pairs of corresponding itemsets
(first itemset in x
with first itemset in y
, second with second, etc.).
itemUnion(x, y) itemSetdiff(x, y) itemIntersect(x, y) ## S4 method for signature 'itemMatrix,itemMatrix' itemUnion(x, y) ## S4 method for signature 'itemMatrix,itemMatrix' itemSetdiff(x, y) ## S4 method for signature 'itemMatrix,itemMatrix' itemIntersect(x, y)
x , y
|
two itemMatrix objects with the same number of rows (itemsets). |
An object of class itemMatrix is returned.
Michael Hahsler
data("Adult") fsets <- eclat(Adult, parameter = list(supp = 0.5)) inspect(fsets[1:4]) inspect(itemUnion(items(fsets[1:2]), items(fsets[3:4]))) inspect(itemSetdiff(items(fsets[1:2]), items(fsets[3:4]))) inspect(itemIntersect(items(fsets[1:2]), items(fsets[3:4])))
data("Adult") fsets <- eclat(Adult, parameter = list(supp = 0.5)) inspect(fsets[1:4]) inspect(itemUnion(items(fsets[1:2]), items(fsets[3:4]))) inspect(itemSetdiff(items(fsets[1:2]), items(fsets[3:4]))) inspect(itemIntersect(items(fsets[1:2]), items(fsets[3:4])))
Provides the generic function LIST()
and the methods to create a
list representation from objects of the classes itemMatrix,
transactions, and tidLists.
LIST(from, ...) ## S4 method for signature 'itemMatrix' LIST(from, decode = TRUE) ## S4 method for signature 'transactions' LIST(from, decode = TRUE) ## S4 method for signature 'tidLists' LIST(from, decode = TRUE)
from |
the object to be converted into a list. |
... |
further arguments. |
decode |
a logical controlling whether the items/transactions are
decoded from the column numbers internally used by
itemMatrix to the names stored in the object
|
Using LIST()
with decode = TRUE
is equivalent to the standard
coercion as(x, "list")
. LIST
returns the object from
as a list of vectors. Each vector represents one row of the
itemMatrix (e.g., items in a transaction).
a list primitive.
Michael Hahsler
Other import/export:
DATAFRAME()
,
pmml
,
read
,
write()
data(Adult) ### default coercion (same as as(Adult[1:5], "list")) LIST(Adult[1:5]) ### coercion without item decoding LIST(Adult[1:5], decode = FALSE)
Provides the generic function match()
and the methods for
associations, transactions and itemMatrix objects. match()
returns a vector
of the positions of (first) matches of its first argument in its second.
match(x, table, nomatch = NA_integer_, incomparables = NULL) ## S4 method for signature 'itemMatrix,itemMatrix' match(x, table, nomatch = NA_integer_, incomparables = NULL) ## S4 method for signature 'rules,rules' match(x, table, nomatch = NA_integer_, incomparables = NULL) ## S4 method for signature 'itemsets,itemsets' match(x, table, nomatch = NA_integer_, incomparables = NULL) ## S4 method for signature 'itemMatrix,itemMatrix' x %in% table ## S4 method for signature 'itemMatrix,character' x %in% table ## S4 method for signature 'associations,associations' x %in% table ## S4 method for signature 'itemMatrix,character' x %pin% table ## S4 method for signature 'itemMatrix,character' x %ain% table ## S4 method for signature 'itemMatrix,character' x %oin% table
x |
an object of class itemMatrix, transactions or associations. |
table |
a set of associations or transactions to be matched against. |
nomatch |
the value to be returned in the case when no match is found. |
incomparables |
not implemented. |
%in%
is a more intuitive interface as a binary operator, which
returns a logical vector indicating if there is a match or not for the items
in the itemsets (left operand) with the items in the table (right operand).
arules defines additional binary operators for matching itemsets:
%pin%
uses partial matching on the table; %ain%
itemsets have to match/include all items in the table; %oin%
itemsets can only match/include the items in the table. The binary
matching operators are often used in subset()
.
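A minimal sketch of this typical use inside subset() (the thresholds are only illustrative):

data("Adult")
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9))
## rules whose lhs contains both items
subset(rules, subset = lhs %ain% c("age=Middle-aged", "sex=Male"))
## rules whose rhs contains an item partially matching "income="
subset(rules, subset = rhs %pin% "income=")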
match
: An integer vector of the same length as x
giving the position in table
of the first match if there is a match,
otherwise nomatch
.
%in%
, %pin%
, %ain%
, %oin%
: A logical vector,
indicating if a match was located for each element of x
.
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") ## get unique transactions, count frequency of unique transactions ## and plot frequency of unique transactions vals <- unique(Adult) cnts <- tabulate(match(Adult, vals)) plot(sort(cnts, decreasing=TRUE)) ## find all transactions which are equal to transaction 10 in Adult which(Adult %in% Adult[10]) ## for transactions we can also match directly with itemLabels. ## Find in the first 10 transactions the ones which ## contain age=Middle-aged (see help page for class itemMatrix) Adult[1:10] %in% "age=Middle-aged" ## find all transactions which contain items that partially match "age=" (all here). Adult[1:10] %pin% "age=" ## find all transactions that only include the item "age=Middle-aged" (none here). Adult[1:10] %oin% "age=Middle-aged" ## find al transaction which contain both items "age=Middle-aged" and "sex=Male" Adult[1:10] %ain% c("age=Middle-aged", "sex=Male")
data("Adult") ## get unique transactions, count frequency of unique transactions ## and plot frequency of unique transactions vals <- unique(Adult) cnts <- tabulate(match(Adult, vals)) plot(sort(cnts, decreasing=TRUE)) ## find all transactions which are equal to transaction 10 in Adult which(Adult %in% Adult[10]) ## for transactions we can also match directly with itemLabels. ## Find in the first 10 transactions the ones which ## contain age=Middle-aged (see help page for class itemMatrix) Adult[1:10] %in% "age=Middle-aged" ## find all transactions which contain items that partially match "age=" (all here). Adult[1:10] %pin% "age=" ## find all transactions that only include the item "age=Middle-aged" (none here). Adult[1:10] %oin% "age=Middle-aged" ## find al transaction which contain both items "age=Middle-aged" and "sex=Male" Adult[1:10] %ain% c("age=Middle-aged", "sex=Male")
Provides the generic function merge()
and the methods for itemMatrix
and transactions to add new items to existing data.
merge(x, y, ...) ## S4 method for signature 'itemMatrix' merge(x, y, ...) ## S4 method for signature 'transactions' merge(x, y, ...)
x |
an object of class itemMatrix or transactions. |
y |
an object of the same class as |
... |
further arguments; unused. |
Returns a new object of the same class as x
with the items in y
added.
Michael Hahsler
Other preprocessing:
discretize()
,
hierarchy
,
itemCoding
,
sample()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Groceries") ## create a random item as a matrix randomItem <- sample(c(TRUE, FALSE), size = length(Groceries),replace = TRUE) randomItem <- as.matrix(randomItem) colnames(randomItem) <- "random item" head(randomItem, 3) ## add the random item to Groceries g2 <- merge(Groceries, randomItem) nitems(Groceries) nitems(g2) inspect(head(g2, 3))
data("Groceries") ## create a random item as a matrix randomItem <- sample(c(TRUE, FALSE), size = length(Groceries),replace = TRUE) randomItem <- as.matrix(randomItem) colnames(randomItem) <- "random item" head(randomItem, 3) ## add the random item to Groceries g2 <- merge(Groceries, randomItem) nitems(Groceries) nitems(g2) inspect(head(g2, 3))
The Mushroom
transactions data set includes descriptions of hypothetical samples
corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota
Family.
Object of class transactions with 8124 transactions and 114 items.
The transaction set contains information about 8124 mushrooms (transactions). 4208 (51.8%) are edible and 3916 (48.2%) are poisonous. The data contains 22 nominal features plus the class attribute (edible or not). These features were translated into 114 items.
Michael Hahsler
The data set was obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Mushroom.
Alfred A. Knopf (1981). Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms. G. H. Lincoff (Pres.), New York.
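A minimal usage sketch for loading and inspecting the data set:

data("Mushroom")
summary(Mushroom)
inspect(head(Mushroom, 2))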
This function reads and writes PMML representations (version 4.1) of associations (itemsets and rules). Write delegates to package pmml.
write.PMML(x, file) read.PMML(file)
x |
|
file |
name of the PMML file (for |
Michael Hahsler
PMML 4.4 - Association Rules. https://dmg.org/pmml/v4-4/AssociationRules.html
Other import/export:
DATAFRAME()
,
LIST()
,
read
,
write()
data("Groceries") rules <- apriori(Groceries, parameter = list(support = 0.001)) rules <- head(rules, by = "lift") rules ### save rules as PMML write.PMML(rules, file = "rules.xml") ### read rules back rules2 <- read.PMML("rules.xml") rules2 ### compare rules inspect(rules[1]) inspect(rules2[1]) ### clean up unlink("rules.xml")
data("Groceries") rules <- apriori(Groceries, parameter = list(support = 0.001)) rules <- head(rules, by = "lift") rules ### save rules as PMML write.PMML(rules, file = "rules.xml") ### read rules back rules2 <- read.PMML("rules.xml") rules2 ### compare rules inspect(rules[1]) inspect(rules2[1]) ### clean up unlink("rules.xml")
Provides the method predict()
for itemMatrix (e.g.,
transactions). Predicts the membership (nearest neighbor) of new data to
clusters represented by medoids or labeled examples.
predict(object, ...) ## S4 method for signature 'itemMatrix' predict(object, newdata, labels = NULL, blocksize = 200, ...)
object |
clustered examples as an itemMatrix with cluster label specified in |
... |
further arguments passed on to |
newdata |
an itemMatrix containing the objects to predict labels for. |
labels |
an integer vector containing the labels for the examples in
|
blocksize |
a numeric scalar indicating how much memory predict can use
for big |
An integer vector of the same length as newdata
containing
the predicted labels for each element.
Michael Hahsler
Other proximity classes and functions:
affinity()
,
dissimilarity()
,
proximity-classes
data("Adult") ## sample small <- sample(Adult, 500) large <- sample(Adult, 5000) ## cluster a small sample and extract the cluster lael vector d_jaccard <- dissimilarity(small) hc <- hclust(d_jaccard) l <- cutree(hc, k=4) ## predict labels for a larger sample labels <- predict(small, large, l) ## plot the profile of the 1. cluster itemFrequencyPlot(large[labels == 1, itemFrequency(large) > 0.1])
data("Adult") ## sample small <- sample(Adult, 500) large <- sample(Adult, 5000) ## cluster a small sample and extract the cluster lael vector d_jaccard <- dissimilarity(small) hc <- hclust(d_jaccard) l <- cutree(hc, k=4) ## predict labels for a larger sample labels <- predict(small, large, l) ## plot the profile of the 1. cluster itemFrequencyPlot(large[labels == 1, itemFrequency(large) > 0.1])
Simple classes to represent proximity matrices.
For compatibility with
clustering functions in R
, we represent dissimilarities as the
S3
class dist
. For cross-dissimilarities and similarities, we
provide the S4
classes ar_cross_dissimilarity
and
ar_similarity
.
dist
objects are the result of
calling the method dissimilarity()
with one argument or any
R
function returning a S3 dist
object.
ar_cross_dissimilarity
objects are the result of calling the method
dissimilarity()
with two arguments, by calls of the form
new("similarity", ...)
, or by coercion from matrix.
ar_similarity
objects are the result of calling the method
affinity()
, by calls of the form new("similarity", ...)
,
or by coercion from matrix.
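A minimal sketch of the classes produced by the proximity functions (a small sample of the Adult data set is used only to keep the computation quick):

data("Adult")
s <- sample(Adult, 50)
d <- dissimilarity(s)   ## S3 class "dist"
class(d)
a <- affinity(s)        ## S4 class "ar_similarity"
class(a)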
Michael Hahsler
Other proximity classes and functions:
affinity()
,
dissimilarity()
,
predict()
Simulate random transactions using different methods.
random.transactions( nItems, nTrans, method = "independent", ..., verbose = FALSE ) random.patterns( nItems, nPats = 2000, method = NULL, lPats = 4, corr = 0.5, cmean = 0.5, cvar = 0.1, iWeight = NULL, verbose = FALSE )
nItems |
an integer. Number of items to simulate |
nTrans |
an integer. Number of transactions to simulate |
method |
name of the simulation method used (see Details Section). |
... |
further arguments used for the specific simulation method (see details). |
verbose |
report progress? |
nPats |
number of patterns (potential maximal frequent itemsets) used. |
lPats |
average length of patterns. |
corr |
correlation between consecutive patterns. |
cmean |
mean of the corruption level (normal distribution). |
cvar |
variance of the corruption level. |
iWeight |
item selection weights to build patterns. |
Currently two simulation methods are implemented:
"independent"
(Hahsler et al, 2006): All items
are treated as independent. The transaction size is determined by
rpois(lambda - 1) + 1
, where lambda
can be specified (defaults to 3).
Note that one is subtracted from lambda and one is added to the size to avoid
empty transactions. The items in the transactions are randomly chosen using
the numeric probability vector iProb
of length nItems
(default: 0.01 for each item). A small sketch of this size distribution is given after the method descriptions below.
"agrawal"
(see Agrawal and Srikant, 1994): This
method creates transactions with correlated items using random.patterns()
.
The simulation is a two-stage process. First, a set of nPats
patterns
(potential maximal frequent itemsets) is generated. The length of the
patterns is Poisson distributed with mean lPats
and consecutive
patterns share some items controlled by the correlation parameter
corr
. For later use, for each pattern a pattern weight is generated
by drawing from an exponential distribution with a mean of 1 and a
corruption level is chosen from a normal distribution with mean cmean
and variance cvar
.
The function returns the patterns as an itemsets
object which can be
supplied to random.transactions()
as the argument patterns
. If
no argument patterns
is supplied, the default values given above are
used.
In the second step, the transactions are generated using the patterns. The
length the transactions follows a Poisson distribution with mean
lPats
. For each transaction, patterns are randomly chosen using the
pattern weights till the transaction length is reached. For each chosen
pattern, the associated corruption level is used to drop some items before
adding the pattern to the transaction.
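As referenced above, here is a small sketch of the transaction size distribution used by the "independent" method (lambda = 3 is the stated default):

lambda <- 3
sizes <- rpois(10000, lambda - 1) + 1   ## shifted Poisson, never zero
mean(sizes)                             ## close to lambda
table(sizes)[1:5]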
Returns a ntrans x nitems
transactions object.
Michael Hahsler
Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006). Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.
Rakesh Agrawal and Ramakrishnan Srikant (1994). Fast algorithms for mining association rules in large databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487–499, Santiago, Chile.
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
## generate random 1000 transactions for 200 items with ## a success probability decreasing from 0.2 to 0.0001 ## using the method described in Hahsler et al. (2006). trans <- random.transactions(nItems = 200, nTrans = 1000, lambda = 5, iProb = seq(0.2,0.0001, length=200)) ## size distribution summary(size(trans)) ## display random data set image(trans) ## use the method by Agrawal and Srikant (1994) to simulate transactions ## which contains correlated items. This should create data similar to ## T10I4D100K (we just create 100 transactions here to speed things up). patterns <- random.patterns(nItems = 1000) summary(patterns) trans2 <- random.transactions(nItems = 1000, nTrans = 100, method = "agrawal", patterns = patterns) image(trans2) ## plot data with items ordered by item frequency image(trans2[,order(itemFrequency(trans2), decreasing=TRUE)])
Reads transaction data from a file and creates a transactions object.
read.transactions( file, format = c("basket", "single"), header = FALSE, sep = "", cols = NULL, rm.duplicates = FALSE, quote = "\"'", skip = 0, encoding = "unknown" )
file |
the file name or a connection. |
format |
a character string indicating the format of the data set. One
of |
header |
a logical value indicating whether the file contains the names of the variables as its first line. |
sep |
a character string specifying how fields are separated in the
data file. The default ( |
cols |
For the single format, |
rm.duplicates |
a logical value specifying if duplicate items should be removed from the transactions. |
quote |
a list of characters used as quotes when reading. |
skip |
number of lines to skip in the file before start reading data. |
encoding |
character string indicating the encoding which is passed to
|
For basket format, each line in the transaction data file
represents a transaction where the items (item labels) are separated by the
characters specified by sep
. For single format, each line
corresponds to a single item, containing at least ids for the transaction
and the item.
Returns an object of class transactions.
Michael Hahsler and Kurt Hornik
Other import/export:
DATAFRAME()
,
LIST()
,
pmml
,
write()
## create a demo file using basket format for the example data <- paste( "# this is some test data", "item1, item2", "item1", "item2, item3", sep="\n") cat(data) write(data, file = "demo_basket.txt") ## read demo data (skip the comment in the first line) tr <- read.transactions("demo_basket.txt", format = "basket", sep = ",", skip = 1) inspect(tr) ## make always sure that the items were properly separated itemLabels(tr) ## create a demo file using single format for the example ## column 1 contains the transaction ID and column 2 contains one item data <- paste( "trans1 item1", "trans2 item1", "trans2 item2", sep ="\n") cat(data) write(data, file = "demo_single.txt") ## read demo data tr <- read.transactions("demo_single.txt", format = "single", cols = c(1,2)) inspect(tr) ## create a demo file using single format with column headers data <- paste( "item_id;trans_id", "item1;trans1", "item1;trans2", "item2;trans2", sep ="\n") cat(data) write(data, file = "demo_single.txt") ## read demo data tr <- read.transactions("demo_single.txt", format = "single", header = TRUE, sep = ";", cols = c("trans_id", "item_id")) inspect(tr) ## tidy up unlink("demo_basket.txt") unlink("demo_single.txt")
Provides the generic function ruleInduction()
and the method to induce all association
rules which can be generated by the given set of itemsets from a transactions
dataset.
ruleInduction(x, ...)

## S4 method for signature 'itemsets'
ruleInduction(
  x,
  transactions = NULL,
  confidence = 0.8,
  method = c("ptree", "apriori"),
  reduce = FALSE,
  verbose = FALSE,
  ...
)
x |
the set of itemsets from which rules will be induced. |
... |
further arguments. |
transactions |
the transactions used to mine the itemsets. Can be omitted for method "ptree" if x contains a complete set of frequent itemsets. |
confidence |
a numeric value between 0 and 1 giving the minimum confidence threshold for the rules. |
method |
"ptree" or "apriori" (see Details). |
reduce |
remove unused items to speed up the counting process? |
verbose |
report progress? |
All rules that can be created using the supplied itemsets and that surpass the
specified minimum confidence threshold are returned.
ruleInduction()
can be used to produce
closed association rules defined by Pei
et al. (2000) as rules X => Y
where both X
and Y
are
closed frequent itemsets. See the code example in the Example section.
Rule induction implements several induction methods. The default method is "ptree".
"ptree" method without transactions:
No transactions need to be specified if x contains a complete set of frequent itemsets. The itemsets' support counts are stored in a ptree and then retrieved to create rules and calculate rule confidence. This is very fast, but it fails because of missing support values if x is not a complete set of frequent itemsets.
"ptree" method with transactions:
If transactions are specified, then all transactions are counted into a prefix tree and later retrieved to create rules from the itemsets and to calculate confidence values. This is slower, but necessary if x is not a complete set of frequent itemsets. To improve speed, unused items are removed from the transaction data before creating the prefix tree (this behavior can be changed using the argument reduce). This removal might be slower for large transaction data sets; however, it is highly recommended since the items are also reordered to reduce the counting time.
"apriori"
method (always needs transactions):
All association rules are mined from the transactions data set using apriori()
with the
smallest support found in the itemsets. In a second step, all rules which cannot
be generated from one of the itemsets are removed. This procedure is very slow,
especially for itemsets with many elements or very low support.
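The following minimal sketch is not part of the original examples; it assumes the Adult data set shipped with arules. It induces rules from an incomplete set of itemsets with both induction methods, which should return the same rules (possibly in a different order).
data("Adult")
frequent_is <- eclat(Adult, support = 0.5)
## an incomplete set of itemsets, so transactions are required
some_is <- sample(frequent_is, 5)
r_ptree <- ruleInduction(some_is, transactions = Adult, method = "ptree")
r_apriori <- ruleInduction(some_is, transactions = Adult, method = "apriori")
r_ptree
r_apriori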
An object of class rules.
Christian Buchta and Michael Hahsler
Michael Hahsler, Christian Buchta, and Kurt Hornik. Selective association rule generation. Computational Statistics, 23(2):303-315, April 2008.
Jian Pei, Jiawei Han, Runying Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000).
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
ASparameter-classes
,
apriori()
,
eclat()
,
fim4r()
,
weclat()
data("Adult") ## find all closed frequent itemsets closed_is <- apriori(Adult, target = "closed frequent itemsets", support = 0.4) closed_is ## use rule induction to produce all closed association rules closed_rules <- ruleInduction(closed_is, transactions = Adult, verbose = TRUE) ## inspect the resulting closed rules summary(closed_rules) inspect(head(closed_rules, by = "lift")) ## get rules from frequent itemsets. Here, transactions does not need to be ## specified for rule induction. frequent_is <- eclat(Adult, support = 0.4) assoc_rules <- ruleInduction(frequent_is) assoc_rules inspect(head(assoc_rules)) ## for itemsets that are not a complete set of frequent itemsets, ## transactions need to be specified. some_is <- sample(frequent_is, 10) some_rules <- ruleInduction(some_is, transactions = Adult) some_rules
data("Adult") ## find all closed frequent itemsets closed_is <- apriori(Adult, target = "closed frequent itemsets", support = 0.4) closed_is ## use rule induction to produce all closed association rules closed_rules <- ruleInduction(closed_is, transactions = Adult, verbose = TRUE) ## inspect the resulting closed rules summary(closed_rules) inspect(head(closed_rules, by = "lift")) ## get rules from frequent itemsets. Here, transactions does not need to be ## specified for rule induction. frequent_is <- eclat(Adult, support = 0.4) assoc_rules <- ruleInduction(frequent_is) assoc_rules inspect(head(assoc_rules)) ## for itemsets that are not a complete set of frequent itemsets, ## transactions need to be specified. some_is <- sample(frequent_is, 10) some_rules <- ruleInduction(some_is, transactions = Adult) some_rules
Defines the rules
class to represent a set of association rules and methods to work
with rules
.
rules(rhs, lhs, itemLabels = NULL, quality = data.frame())

## S4 method for signature 'rules'
summary(object, ...)
## S4 method for signature 'rules'
length(x)
## S4 method for signature 'rules'
nitems(x)
## S4 method for signature 'rules'
labels(object, ruleSep = " => ", ...)
## S4 method for signature 'rules'
itemLabels(object)
## S4 replacement method for signature 'rules'
itemLabels(object) <- value
## S4 method for signature 'rules'
itemInfo(object)

lhs(x)
## S4 method for signature 'rules'
lhs(x)
lhs(x) <- value
## S4 replacement method for signature 'rules'
lhs(x) <- value

rhs(x)
rhs(x) <- value
## S4 replacement method for signature 'rules'
rhs(x) <- value
## S4 method for signature 'rules'
rhs(x)

## S4 method for signature 'rules'
items(x)

generatingItemsets(x)
## S4 method for signature 'rules'
generatingItemsets(x)
rhs , lhs
|
itemMatrix objects or objects that can be converted into an itemMatrix using encode(). |
itemLabels |
a vector of all possible item labels (character) or a transactions object to copy the item coding used for encode() (see itemCoding for details). |
quality |
a data.frame with quality information (one row per rule). |
object , x
|
the object |
... |
further arguments |
ruleSep |
rule separation symbol |
value |
replacement value |
Mined rule sets typically contain several interest measures accessible with
the quality()
method. Additional measures can be calculated via
interestMeasure()
.
To create rules manually, the itemMatrix for the LHS and the RHS of the rules need to be compatible. See itemCoding for details.
summary(rules)
: create a summary
length(rules)
: returns the number of rules.
nitems(rules)
: returns the number of items used in the current encoding.
labels(rules)
: labels for the rules.
itemLabels(rules)
: returns item labels for the current encoding.
itemLabels(rules) <- value
: change the item labels in the current encoding.
itemInfo(rules)
: returns the item info data.frame.
lhs(rules)
: returns the LHS of the rules as an itemMatrix.
lhs(rules) <- value
: replaces the LHS of the rules with an itemMatrix.
rhs(rules) <- value
: replaces the RHS of the rules with an itemMatrix.
rhs(rules)
: returns the RHS of the rules as an itemMatrix.
items(rules)
: returns all items in a rule (LHS and RHS) as an itemMatrix.
generatingItemsets(rules)
: returns a collection of the itemsets which generated the rules, one itemset for each rule. Note that the collection can be a multiset and contain duplicated elements. Use unique() to remove duplicates and obtain a proper set. This method produces the same result as calling items(), but wrapped into an itemsets object with support information.
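A small illustrative sketch of generatingItemsets() and unique(), assuming rules mined from the Adult data set shipped with arules:
data("Adult")
rules <- apriori(Adult, parameter = list(support = 0.6, confidence = 0.9))
## one generating itemset per rule; the collection may contain duplicates
gi <- generatingItemsets(rules)
gi
## remove duplicates to obtain a proper set of itemsets
inspect(head(unique(gi)))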
lhs,rhs
itemMatrix representing the left-hand-side and right-hand-side of the rules.
quality
the quality data.frame
info
a list with mining information.
Objects are the result of calling the
function apriori()
. Objects can also be created by calls of the
form new("rules", ...)
or by using the constructor function rules()
.
as("rules", "data.frame")
Michael Hahsler
Superclass: associations
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
sample()
,
sets
,
size()
,
sort()
,
unique()
data("Adult") ## Mine rules rules <- apriori(Adult, parameter = list(support = 0.3)) rules ## Select a subset of rules using partial matching on the items ## in the right-hand-side and a quality measure rules.sub <- subset(rules, subset = rhs %pin% "sex" & lift > 1.3) ## Display the top 3 support rules inspect(head(rules.sub, n = 3, by = "support")) ## Display the first 3 rules inspect(rules.sub[1:3]) ## Get labels for the first 3 rules labels(rules.sub[1:3]) labels(rules.sub[1:3], itemSep = " + ", setStart = "", setEnd="", ruleSep = " ---> ") ## Manually create rules using the item coding in Adult and calculate some interest measures twoRules <- rules( lhs = list( c("age=Young", "relationship=Unmarried"), c("age=Old") ), rhs = list( c("income=small"), c("income=large") ), itemLabels = Adult ) quality(twoRules) <- interestMeasure(twoRules, measure = c("support", "confidence", "lift"), transactions = Adult) inspect(twoRules)
data("Adult") ## Mine rules rules <- apriori(Adult, parameter = list(support = 0.3)) rules ## Select a subset of rules using partial matching on the items ## in the right-hand-side and a quality measure rules.sub <- subset(rules, subset = rhs %pin% "sex" & lift > 1.3) ## Display the top 3 support rules inspect(head(rules.sub, n = 3, by = "support")) ## Display the first 3 rules inspect(rules.sub[1:3]) ## Get labels for the first 3 rules labels(rules.sub[1:3]) labels(rules.sub[1:3], itemSep = " + ", setStart = "", setEnd="", ruleSep = " ---> ") ## Manually create rules using the item coding in Adult and calculate some interest measures twoRules <- rules( lhs = list( c("age=Young", "relationship=Unmarried"), c("age=Old") ), rhs = list( c("income=small"), c("income=large") ), itemLabels = Adult ) quality(twoRules) <- interestMeasure(twoRules, measure = c("support", "confidence", "lift"), transactions = Adult) inspect(twoRules)
Provides the generic function sample()
and methods to sample from transactions and
associations.
## S4 method for signature 'itemMatrix'
sample(x, size, replace = FALSE, prob = NULL, ...)

## S4 method for signature 'associations'
sample(x, size, replace = FALSE, prob = NULL, ...)
x |
object to be sampled from (a set of associations or transactions). |
size |
sample size. |
replace |
a logical. Sample with replacement? |
prob |
a numeric vector of probability weights. |
... |
further arguments. |
An object of the same class as x
.
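A hedged sketch of the prob argument, assuming the Adult data set shipped with arules; weighting transactions by their size is just an illustrative choice, not a recommendation:
data("Adult")
## illustrative weighting: longer transactions are more likely to be drawn
w <- size(Adult)
s <- sample(Adult, 100, replace = TRUE, prob = w / sum(w))
summary(size(s))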
Michael Hahsler
Other preprocessing:
discretize()
,
hierarchy
,
itemCoding
,
merge()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sets
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") ## sample with replacement s <- sample(Adult, 500, replace = TRUE) s
data("Adult") ## sample with replacement s <- sample(Adult, 500, replace = TRUE) s
Provides the generic functions and the methods for the set operations
union()
, intersect()
, setequal()
, setdiff()
and
is.element()
on sets of associations (e.g., rules, itemsets) and
itemMatrix.
## S3 method for class 'itemMatrix'
union(x, y, ...)
## S3 method for class 'associations'
union(x, y, ...)
## S4 method for signature 'associations'
union(x, y, ...)
## S4 method for signature 'itemMatrix'
union(x, y, ...)

## S3 method for class 'itemMatrix'
intersect(x, y, ...)
## S3 method for class 'associations'
intersect(x, y, ...)
## S4 method for signature 'associations'
intersect(x, y, ...)
## S4 method for signature 'itemMatrix'
intersect(x, y, ...)

## S3 method for class 'itemMatrix'
setequal(x, y, ...)
## S3 method for class 'associations'
setequal(x, y, ...)
## S4 method for signature 'associations'
setequal(x, y, ...)
## S4 method for signature 'itemMatrix'
setequal(x, y, ...)

## S3 method for class 'itemMatrix'
setdiff(x, y, ...)
## S3 method for class 'associations'
setdiff(x, y, ...)
## S4 method for signature 'associations'
setdiff(x, y, ...)
## S4 method for signature 'itemMatrix'
setdiff(x, y, ...)

## S3 method for class 'itemMatrix'
is.element(el, set, ...)
## S3 method for class 'associations'
is.element(el, set, ...)
## S4 method for signature 'associations'
is.element(el, set, ...)
## S4 method for signature 'itemMatrix'
is.element(el, set, ...)
x , y , el , set
|
sets of associations or itemMatrix objects. |
... |
Other arguments are unused. |
Technical note: All S4 methods for set operations are defined for the class name
"ANY"
in the signature, so they should work for all S4 classes for
which the following methods are available: match()
, length()
and
unique()
.
union()
, intersect()
, setequal()
and setdiff()
return an object of the same class as x
and y
.
is.element() returns a logical vector of the same length as el indicating for each element whether it is included in set.
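A short sketch of setdiff() and is.element(), assuming rules mined from the Adult data set as in the example at the end of this page:
data("Adult")
r <- apriori(Adult)
r1 <- r[1:10]
r2 <- r[6:15]
## rules in r1 that are not in r2
setdiff(r1, r2)
## for each rule in r1, is it contained in r2?
is.element(r1, r2)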
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
size()
,
sort()
,
unique()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
data("Adult") ## mine some rules r <- apriori(Adult) ## take 2 subsets r1 <- r[1:10] r2 <- r[6:15] union(r1, r2) intersect(r1, r2) setequal(r1, r2)
data("Adult") ## mine some rules r <- apriori(Adult) ## take 2 subsets r1 <- r[1:10] r2 <- r[6:15] union(r1, r2) intersect(r1, r2) setequal(r1, r2)
Provides the generic function size()
and methods to get the size of
each itemset in an itemMatrix or associations. For
example, size()
can be used to get a vector with the number of items in each
transaction.
size(x, ...)

## S4 method for signature 'itemMatrix'
size(x)
## S4 method for signature 'tidLists'
size(x)
## S4 method for signature 'itemsets'
size(x)
## S4 method for signature 'rules'
size(x)
x |
an object. |
... |
further (unused) arguments. |
returns a numeric vector of length length(x)
.
Each element is the size of the corresponding element (row in the itemMatrix) in
object x
. For rules, size()
returns the sum of the number of
items in the LHS and the RHS.
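A small sketch, assuming rules mined from the Adult data set, showing that the size of a rule is the item count of its LHS plus its RHS:
data("Adult")
rules <- apriori(Adult, parameter = list(support = 0.6))
## number of items per rule (LHS plus RHS)
summary(size(rules))
all(size(rules) == size(lhs(rules)) + size(rhs(rules)))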
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
supportingTransactions()
,
tidLists-class
,
transactions-class
,
unique()
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
sort()
,
unique()
data("Adult") summary(size(Adult))
data("Adult") summary(size(Adult))
Provides the method sort
to sort elements in class
associations (e.g., itemsets or rules) according to the
value of measures stored in the association's slot quality
(e.g.,
support).
## S4 method for signature 'associations'
sort(x, decreasing = TRUE, na.last = NA, by = "support", order = FALSE, ...)
x |
an object to be sorted. |
decreasing |
a logical. Should the sort be increasing or decreasing? (default is decreasing) |
na.last |
na.last is not supported for associations. NAs are always put last. |
by |
a character string specifying the quality measure stored in the quality slot of x used for sorting (e.g., "support", "confidence" or "lift"). |
order |
should an order vector (a permutation like the one returned by order()) be returned instead of the sorted associations? |
... |
Further arguments are ignored. |
sort
is relatively slow for large sets of associations since it has
to copy and rearrange a large data structure.
With order = TRUE
an integer vector with the
order is returned instead of the reordered associations.
If only the top n
associations are needed then head()
using
by
performs this faster than calling sort()
and then head()
since it does it without copying and rearranging all the data. tail()
works in the same way.
An object of the same class as x
or a permutation vector.
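A brief sketch of order = TRUE, assuming rules mined from the Adult data set as in the example below:
data("Adult")
rules <- apriori(Adult, parameter = list(supp = 0.6))
## return a permutation vector instead of rearranging the rules
o <- sort(rules, by = "lift", order = TRUE)
head(o)
## the permutation can be used to reorder the rules manually
inspect(head(rules[o]))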
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
unique()
data("Adult") ## Mine rules with Apriori rules <- apriori(Adult, parameter = list(supp = 0.6)) rules_by_lift <- sort(rules, by = "lift") inspect(head(rules)) inspect(head(rules_by_lift)) ## A faster/less memory consuming way to get the top 5 rules according to lift ## (see Details section) inspect(head(rules, n = 5, by = "lift"))
data("Adult") ## Mine rules with Apriori rules <- apriori(Adult, parameter = list(supp = 0.6)) rules_by_lift <- sort(rules, by = "lift") inspect(head(rules)) inspect(head(rules_by_lift)) ## A faster/less memory consuming way to get the top 5 rules according to lift ## (see Details section) inspect(head(rules, n = 5, by = "lift"))
Provides the generic function subset()
and methods to subset
associations or transactions (itemMatrix) which meet certain conditions
(e.g., contains certain items or satisfies a minimum lift).
subset(x, ...)

## S4 method for signature 'itemMatrix'
subset(x, subset, ...)
## S4 method for signature 'itemsets'
subset(x, subset, ...)
## S4 method for signature 'rules'
subset(x, subset, ...)
x |
object to be subsetted. |
... |
further arguments to be passed to or from other methods. |
subset |
logical expression indicating elements to keep. |
subset()
finds the rows/itemsets/rules of x
that match the expression
given in subset
. Parts of x
like items, lhs, rhs and the columns in the quality data.frame (e.g., support and lift) can be directly referred to by their names
in subset
.
Important operators to select itemsets containing items specified by their labels are
%in%: select itemsets matching any given item
%ain%: select only itemsets matching all given items
%oin%: select only itemsets matching only the given items
%pin%: %in%
with partial matching
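A short sketch of %ain% and %oin% on frequent itemsets; the item labels "capital-gain=None" and "capital-loss=None" are assumed to be part of the item coding of the Adult data set shipped with arules:
data("Adult")
its <- eclat(Adult, parameter = list(support = 0.5))
## itemsets made up only of (a subset of) the two given items
inspect(subset(its, subset = items %oin% c("capital-gain=None", "capital-loss=None")))
## itemsets containing both given items (and possibly others)
inspect(subset(its, subset = items %ain% c("capital-gain=None", "capital-loss=None")))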
An object of the same class as x
containing only the elements
which satisfy the conditions.
Michael Hahsler
data("Adult") rules <- apriori(Adult) ## select all rules with item "marital-status=Never-married" in ## the right-hand-side and lift > 2 rules.sub <- subset(rules, subset = rhs %in% "marital-status=Never-married" & lift > 2) ## use partial matching for all items corresponding to the variable ## "marital-status" rules.sub <- subset(rules, subset = rhs %pin% "marital-status=") ## select only rules with items "age=Young" and "workclass=Private" in ## the left-hand-side rules.sub <- subset(rules, subset = lhs %ain% c("age=Young", "workclass=Private"))
data("Adult") rules <- apriori(Adult) ## select all rules with item "marital-status=Never-married" in ## the right-hand-side and lift > 2 rules.sub <- subset(rules, subset = rhs %in% "marital-status=Never-married" & lift > 2) ## use partial matching for all items corresponding to the variable ## "marital-status" rules.sub <- subset(rules, subset = rhs %pin% "marital-status=") ## select only rules with items "age=Young" and "workclass=Private" in ## the left-hand-side rules.sub <- subset(rules, subset = lhs %ain% c("age=Young", "workclass=Private"))
A small example database for weighted association rule mining provided as an object of class transactions.
Object of class transactions with 6 transactions and 8 items. Weights are stored as transaction information.
The data set contains the example database described in the paper by K. Sun
and F. Bai for illustration of the concepts of weighted association rule
mining. weight
stored as transaction information denotes the
transaction weights obtained using the HITS algorithm.
K. Sun and F. Bai (2008). Mining Weighted Association Rules without Preassigned Weights. IEEE Transactions on Knowledge and Data Engineering, 4 (30), 489–495.
Other weighted association mining functions:
hits()
,
weclat()
data(SunBai)
summary(SunBai)
inspect(SunBai)
transactionInfo(SunBai)
Provides the generic function support()
and the methods to count support for
given itemMatrix and associations in a given transactions
data.
support(x, transactions, ...)

## S4 method for signature 'itemMatrix'
support(x, transactions, type = c("relative", "absolute"),
  method = c("ptree", "tidlists"), reduce = FALSE, weighted = FALSE,
  verbose = FALSE, ...)

## S4 method for signature 'associations'
support(x, transactions, type = c("relative", "absolute"),
  method = c("ptree", "tidlists"), reduce = FALSE, weighted = FALSE,
  verbose = FALSE, ...)
x |
the set of itemsets for which support should be counted. |
transactions |
the transaction data set used for mining. |
... |
further arguments. |
type |
a character string specifying if "relative" or "absolute" support values should be returned. |
method |
the counting method: "ptree" or "tidlists" (see Details). |
reduce |
should unused items be removed before counting? |
weighted |
should support be weighted by transaction weights stored as column "weight" in transactionInfo? |
verbose |
report progress? |
Normally, the support of frequent itemsets is counted very efficiently during the mining process using a set minimum support threshold.
However, if only the support for specific itemsets (maybe itemsets with very low support)
is needed, or the support of a set of itemsets needs to be recalculated on
different transactions than they were mined on, then support()
can be used.
Several methods for support counting are available:
"ptree"
(default method): The counters for the itemsets
are organized in a prefix tree. The transactions are sequentially processed
and the corresponding counters in the prefix tree are incremented (see
Hahsler et al, 2008). This method is used by default since it is typically
significantly faster than transaction ID list intersection.
"tidlists"
: support is counted using transaction ID list intersection, which is used by several fast mining algorithms (e.g., by Eclat). However, support is determined for each itemset individually, which is slow for a large number of long itemsets in dense data.
To speed up counting, reduce = TRUE can be specified; unused items are then removed from the transactions before counting.
A numeric vector of the same length as x
containing the
support values for the sets in x
.
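A small sketch, assuming the Adult data set shipped with arules, illustrating that both counting methods agree and how to request absolute counts:
data("Adult")
itsets <- eclat(Adult, parameter = list(support = 0.5))[1:5]
## both counting methods should return the same values
s_ptree <- support(items(itsets), Adult, method = "ptree")
s_tid <- support(items(itsets), Adult, method = "tidlists")
all.equal(s_ptree, s_tid)
## absolute counts instead of relative support
support(items(itsets), Adult, type = "absolute")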
Michael Hahsler and Christian Buchta
Michael Hahsler, Christian Buchta, and Kurt Hornik. Selective association rule generation. Computational Statistics, 23(2):303-315, April 2008.
Other interest measures:
confint()
,
coverage()
,
interestMeasure()
,
is.redundant()
,
is.significant()
data("Income") ## find and some frequent itemsets itemsets <- eclat(Income)[1:5] ## inspect the support returned by eclat inspect(itemsets) ## count support in the database support(items(itemsets), Income)
data("Income") ## find and some frequent itemsets itemsets <- eclat(Income)[1:5] ## inspect the support returned by eclat inspect(itemsets) ## count support in the database support(items(itemsets), Income)
Finds, for each itemset in an associations object, the transactions that support it (i.e., contain all items in the itemset). The information is returned as a tidLists object.
supportingTransactions(x, transactions, ...)

## S4 method for signature 'associations'
supportingTransactions(x, transactions)
x |
a set of associations (itemsets, rules, etc.) |
transactions |
an object of class transactions used to mine the
associations in |
... |
currently unused. |
An object of class tidLists containing one transaction ID
list per association in x
.
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
tidLists-class
,
transactions-class
,
unique()
data <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("b","e"),
  c("b","c","e"),
  c("a","d","e"),
  c("a","c"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)
data <- as(data, "transactions")

## mine itemsets
f <- eclat(data, parameter = list(support = .2, minlen = 3))
inspect(f)

## find supporting transactions
st <- supportingTransactions(f, data)
st

as(st, "list")
Class to represent transaction ID lists and associated methods.
tidLists(x)

## S4 method for signature 'tidLists'
summary(object, maxsum = 6, ...)
## S4 method for signature 'tidLists'
dim(x)
## S4 method for signature 'tidLists'
dimnames(x)
## S4 replacement method for signature 'tidLists,list'
dimnames(x) <- value
## S4 method for signature 'tidLists'
length(x)
## S4 method for signature 'tidLists'
t(x)
## S4 method for signature 'tidLists'
transactionInfo(x)
## S4 replacement method for signature 'tidLists'
transactionInfo(x) <- value
## S4 method for signature 'tidLists'
itemInfo(object)
## S4 replacement method for signature 'tidLists'
itemInfo(object) <- value
## S4 method for signature 'tidLists'
itemLabels(object)
## S4 method for signature 'tidLists'
labels(object)
x , object
|
the object |
maxsum |
maximum numbers of itemsets shown in the summary |
... |
further arguments |
value |
replacement value |
Transaction ID lists contain a set of lists. Each list is associated with an item/itemset and stores the IDs of the transactions which support the item/itemset.
tidLists
uses the class
Matrix::ngCMatrix to efficiently store the
transaction ID lists as a sparse matrix. Each column in the matrix
represents one transaction ID list.
tidLists
can be used for different purposes. For some operations
(e.g., support counting) it is efficient to coerce a
transactions database into tidLists
where each
list contains the transaction IDs for an item (and the support is given by
the length of the list).
The implementation of the Eclat mining algorithm (which uses transaction ID list intersection) can also produce transaction ID lists for the found itemsets as part of the returned itemsets object. These lists can then be used for further computation.
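A short sketch of this use, assuming the Adult data set shipped with arules; the list lengths recover the item supports:
data("Adult")
## one transaction ID list per item
tl <- as(Adult, "tidLists")
## list lengths are the absolute item supports
head(size(tl))
## relative item support; this should agree with itemFrequency() on the transactions
head(size(tl) / length(Adult))
head(itemFrequency(Adult))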
summary(tidLists)
: create a summary
dim(tidLists)
: get dimensions. The rows represent the itemsets and the
columns are the transactions.
dimnames(tidLists)
: get dimnames
dimnames(x = tidLists) <- value
: replace dimnames
length(tidLists)
: get the number of itemsets.
t(tidLists)
: this object is not transposable.
t()
results in an error.
transactionInfo(tidLists)
: get the transaction info data.frame
transactionInfo(tidLists) <- value
: replace the transaction info data.frame
itemInfo(tidLists)
: get the item info data.frame
itemInfo(tidLists) <- value
: replace the item info data.frame
itemLabels(tidLists)
: get the item labels
labels(tidLists)
: convert the tid lists into a text representation.
data
an object of class Matrix::ngCMatrix.
itemInfo
a data.frame
transactionInfo
a data.frame
Objects are created
as part of the
itemsets mined by eclat()
with tidLists = TRUE
in the
ECparameter object.
by coercion from an object of class transactions.
by calls of the form new("tidLists", ...)
.
as("tidLists", "list")
as("list", "tidLists")
as("tidLists", "ngCMatrix")
as("tidLists", "transactions")
as("transactions", "tidLists")
as("tidLists", "itemMatrix")
as("itemMatrix", "tidLists")
Michael Hahsler
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
transactions-class
,
unique()
## Create transaction data set.
data <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("b","e"),
  c("b","c","e"),
  c("a","d","e"),
  c("a","c"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)
data <- as(data, "transactions")
data

## convert transactions to transaction ID lists
tl <- as(data, "tidLists")
tl
inspect(tl)

dim(tl)
dimnames(tl)

## inspect visually
image(tl)

## mine itemsets with transaction ID lists
f <- eclat(data, parameter = list(support = 0, tidLists = TRUE))
tl2 <- tidLists(f)
inspect(tl2)
The transactions
class is a subclass of itemMatrix and
represents transaction data used for mining associations.
transactions(
  x,
  itemLabels = NULL,
  transactionInfo = NULL,
  format = "wide",
  cols = NULL
)

## S4 method for signature 'transactions'
summary(object)
## S4 method for signature 'transactions'
toLongFormat(from, cols = c("TID", "item"), decode = TRUE)
## S4 method for signature 'transactions'
items(x)

transactionInfo(x)
## S4 method for signature 'transactions'
transactionInfo(x)
transactionInfo(x) <- value
## S4 replacement method for signature 'transactions'
transactionInfo(x) <- value

## S4 method for signature 'transactions'
dimnames(x)
## S4 replacement method for signature 'transactions,list'
dimnames(x) <- value
x , object , from
|
the object |
itemLabels |
a vector with labels for the items |
transactionInfo |
a transaction information data.frame with one row per transaction. |
format |
"wide" (each row of x is one transaction) or "long" (each row is one transaction-ID/item pair; see cols and Example 4). |
cols |
a numeric or character vector of length two giving the index or names of the columns (fields) with the transaction and item ids in the long format. |
decode |
translate item IDs to item labels? |
value |
replacement value |
Transactions store the presence of items in each individual transaction as a binary matrix where rows represent the transactions and columns represent the items.
transactions directly extends class itemMatrix to store the sparse binary incidence matrix, item labels, and optionally transaction IDs and user IDs. If you work with several transaction sets at the same time, then the encoding (order of the items in the binary matrix) in the different sets is important. See itemCoding to learn how to encode and recode transaction sets.
Data Preparation
Data typically starts as a data.frame or a matrix and needs to be
prepared before it can be converted into transactions
(see coercion methods in
the Methods Section and the Example Section below for details on the needed
format).
Columns need to represent items; how a column is translated into items depends on its data type:
Continuous variables: Continuous variables cannot directly be represented as
items and need to be
discretized first. An item resulting from discretization might be
age>18
and the column contains only TRUE
or FALSE
.
Alternatively, it can be a factor with levels age<=18, 50>=age>18 and age>50. These will be automatically converted into 3 items, one for each level. Discretization is described in functions
discretize()
and discretizeDF()
.
Logical variables: A logical variable describing a person could be
tall
indicating if the person is tall using the values TRUE
and FALSE
. The fact that the person is tall would be encoded in the
transaction containing the item tall
while not tall persons would not
have this item. Therefore, for logical variables, the TRUE
value is
converted into an item with the name of the variable and for the
FALSE
values no item is created.
Factors: Columns with nominal values
(i.e., factor, ordered) are translated into a series of binary items (one for each level
constructed as variable name = level
). Items cannot represent order, so ordered factors lose their order information. Note that nominal variables
need to be encoded as factors (and not characters or numbers). This can be
done with
data[,"a_nominal_var"] <- factor(data[,"a_nominal_var"])
.
Complete examples for how to prepare data can be found in the man pages for Income and Adult.
summary(transactions)
: produce a summary
toLongFormat(transactions)
: convert the transactions to long format
(a data.frame with two columns, tid and item). Column names can
be specified as a character vector of length 2 called cols
.
items(transactions)
: get the transactions as an itemMatrix
transactionInfo(transactions)
: get the transaction info data.frame
transactionInfo(transactions) <- value
: replace the transaction info data.frame
dimnames(transactions)
: get the dimnames
dimnames(x = transactions) <- value
: set the dimnames
Slots are inherited from itemMatrix.
Objects are created by:
coercion from objects of other classes. itemLabels
and transactionInfo
are
by default created from information in x
(e.g., from row and column names).
the constructor function transactions()
by calling new("transactions", ...)
.
See Examples Section for creating transactions from data.
as("transactions", "matrix")
as("matrix", "transactions")
as("list", "transactions")
as("transactions", "list")
as("data.frame", "transactions")
as("transactions", "data.frame")
as("ngCMatrix", "transactions")
Michael Hahsler
Superclass: itemMatrix
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
unique()
## Example 1: creating transactions from a list (each element is a transaction)
a_list <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)

## Set transaction names
names(a_list) <- paste("Tr", c(1:5), sep = "")
a_list

## Use the constructor to create transactions
## Note: S4 coercion does the same
trans1 <- as(a_list, "transactions")
trans1 <- transactions(a_list)
trans1

## Analyze the transactions
summary(trans1)
image(trans1)

## Example 2: creating transactions from a 0-1 matrix with 5 transactions (rows) and
## 5 items (columns)
a_matrix <- matrix(
  c(1, 1, 1, 0, 0,
    1, 1, 0, 0, 0,
    1, 1, 0, 1, 0,
    0, 0, 1, 0, 1,
    1, 1, 0, 1, 1), ncol = 5)

## Set item names (columns) and transaction labels (rows)
colnames(a_matrix) <- c("a", "b", "c", "d", "e")
rownames(a_matrix) <- paste("Tr", c(1:5), sep = "")
a_matrix

## Create transactions
trans2 <- transactions(a_matrix)
trans2
inspect(trans2)

## Example 3: creating transactions from data.frame (wide format)
a_df <- data.frame(
  age   = as.factor(c(6, 8, NA, 9, 16)),
  grade = as.factor(c("A", "C", "F", NA, "C")),
  pass  = c(TRUE, TRUE, FALSE, TRUE, TRUE))

## Note: factors are translated differently than logicals and NAs are ignored
a_df

## Create transactions
trans3 <- transactions(a_df)
inspect(trans3)

## Note that coercing the transactions back to a data.frame does not recreate the
## original data.frame, but represents the transactions as sets of items
as(trans3, "data.frame")

## Example 4: creating transactions from a data.frame with
## transaction IDs and items (long format)
a_df3 <- data.frame(
  TID  = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b")
)
a_df3

trans4 <- transactions(a_df3, format = "long", cols = c("TID", "item"))
trans4
inspect(trans4)

## convert transactions back into long format.
toLongFormat(trans4)

## Example 5: create transactions from a dataset with numeric variables
## using discretization.
data(iris)
irisDisc <- discretizeDF(iris)
head(irisDisc)

trans5 <- transactions(irisDisc)
trans5
inspect(head(trans5))

## Note, creating transactions without discretizing numeric variables will apply the
## default discretization and also create a warning.

## Example 6: create transactions manually (with the same item coding as in trans5)
trans6 <- transactions(
  list(
    c("Sepal.Length=[4.3,5.4)", "Species=setosa"),
    c("Sepal.Length=[4.3,5.4)", "Species=setosa")
  ),
  itemLabels = trans5)
trans6
inspect(trans6)
Provides the generic function unique()
and the methods for
itemMatrix transactions, and associations.
unique(x, incomparables = FALSE, ...)

## S4 method for signature 'itemMatrix'
unique(x, incomparables = FALSE)

## S4 method for signature 'associations'
unique(x, incomparables = FALSE, ...)
x |
an object of class itemMatrix or associations. |
incomparables |
currently unused. |
... |
further arguments (currently unused). |
unique()
uses duplicated()
to return an
object with the duplicate elements removed.
An object of the same class as x
with duplicated elements
removed.
Michael Hahsler
Other associations functions:
abbreviate()
,
associations-class
,
c()
,
duplicated()
,
extract
,
inspect()
,
is.closed()
,
is.generator()
,
is.maximal()
,
is.redundant()
,
is.significant()
,
is.superset()
,
itemsets-class
,
match()
,
rules-class
,
sample()
,
sets
,
size()
,
sort()
Other itemMatrix and transactions functions:
abbreviate()
,
c()
,
crossTable()
,
duplicated()
,
extract
,
hierarchy
,
image()
,
inspect()
,
is.superset()
,
itemFrequency()
,
itemFrequencyPlot()
,
itemMatrix-class
,
match()
,
merge()
,
random.transactions()
,
sample()
,
sets
,
size()
,
supportingTransactions()
,
tidLists-class
,
transactions-class
data("Adult") r1 <- apriori(Adult[1:1000], parameter = list(support = 0.5)) r2 <- apriori(Adult[1001:2000], parameter = list(support = 0.5)) ## Note that this produces a collection of rules from two sets r_comb <- c(r1, r2) r_comb <- unique(r_comb) r_comb
data("Adult") r1 <- apriori(Adult[1:1000], parameter = list(support = 0.5)) r2 <- apriori(Adult[1001:2000], parameter = list(support = 0.5)) ## Note that this produces a collection of rules from two sets r_comb <- c(r1, r2) r_comb <- unique(r_comb) r_comb
Find frequent itemsets with the Eclat algorithm. This implementation uses optimized transaction ID list joins and transaction weights to implement weighted association rule mining (WARM).
weclat(data, parameter = NULL, control = NULL)
data |
an object that can be coerced into an object of class transactions. |
parameter |
an object of class ASparameter (or a list) specifying the mining parameters such as the minimum support. |
control |
an object of class AScontrol (or a list) specifying algorithmic controls such as verbose. |
Transaction weights are stored in the transactions as a column called
weight
in transactionInfo.
The weighted support of an itemset is the sum of the weights of the
transactions that contain the itemset. An itemset is frequent if its
weighted support is equal to or greater than the threshold specified by support (assuming that the weights sum to one).
Note that Eclat only mines (weighted) frequent itemsets. Weighted
association rules can be created using ruleInduction()
.
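A minimal sketch, using the SunBai data set, that recounts the weighted support of the mined itemsets with support(..., weighted = TRUE); the two sets of values should correspond to each other, possibly up to a normalization constant if the transaction weights do not sum to one.
data("SunBai")
s <- weclat(SunBai, parameter = list(support = 0.3))
## weighted support counted directly from the transactions
support(items(s), SunBai, weighted = TRUE)
## weighted support reported by weclat() in the quality slot
quality(s)$support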
Returns an object of class itemsets. Note that
weighted support is returned in quality as column support
.
The C code can be interrupted by CTRL-C
. This is convenient but comes
at the price that the code cannot clean up its internal memory.
Christian Buchta
G.D. Ramkumar, S. Ranka, and S. Tsur (1998). Weighted Association Rules: Model and Algorithm, Proceedings of ACM SIGKDD.
Other mining algorithms:
APappearance-class
,
AScontrol-classes
,
ASparameter-classes
,
apriori()
,
eclat()
,
fim4r()
,
ruleInduction()
Other weighted association mining functions:
SunBai
,
hits()
## Example 1: SunBai data
data(SunBai)
SunBai

## weights are stored in transactionInfo
transactionInfo(SunBai)

## mine weighted support itemsets using the transaction weights in SunBai
s <- weclat(SunBai, parameter = list(support = 0.3),
  control = list(verbose = TRUE))
inspect(sort(s))

## create rules using weighted support (satisfying a minimum
## weighted confidence of 90%).
r <- ruleInduction(s, confidence = .9)
inspect(r)

## Example 2: Find association rules in weighted data
trans <- list(
  c("A", "B", "C", "D", "E"),
  c("C", "F", "G"),
  c("A", "B"),
  c("A"),
  c("C", "F", "G", "H"),
  c("A", "G", "H")
)
weight <- c(5, 10, 6, 7, 5, 1)

## convert list to transactions
trans <- transactions(trans)

## add weight information
transactionInfo(trans) <- data.frame(weight = weight)
inspect(trans)

## mine weighted support itemsets
s <- weclat(trans, parameter = list(support = 0.3),
  control = list(verbose = TRUE))
inspect(sort(s))

## create association rules
r <- ruleInduction(s, confidence = .5)
inspect(r)
Provides the generic function write()
and the methods to write
transactions or associations to a file.
write(x, file = "", ...) ## S4 method for signature 'transactions' write( x, file = "", format = c("basket", "single"), sep = " ", quote = TRUE, ... ) ## S4 method for signature 'associations' write(x, file = "", sep = " ", quote = TRUE, ...)
write(x, file = "", ...) ## S4 method for signature 'transactions' write( x, file = "", format = c("basket", "single"), sep = " ", quote = TRUE, ... ) ## S4 method for signature 'associations' write(x, file = "", sep = " ", quote = TRUE, ...)
x |
the transactions or associations (rules, itemsets, etc.) object. |
file |
either a character string naming a file or a connection open for writing. '""' indicates output to the console. |
... |
further arguments passed on to |
format |
format to write transactions. |
sep |
the field separator string. Values within each row of x are
separated by this string. Use |
quote |
a logical value. Quote fields? |
For associations (rules and itemsets) write()
first uses coercion to
data.frame to obtain a printable form of x
and then uses
utils::write.table()
to write the data to disk. This is just a method to
export the rules in human-readable form. These exported associations cannot be
read back in as rules. To save and load associations in compact form, use save()
and
load()
from the base package. Alternatively, associations can be
written to disk in PMML (Predictive Model Markup Language) via
write.PMML()
. This requires package pmml.
Transactions can be saved in basket (one line per transaction) or in single (one line per item) format.
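As noted above, base R's save()/load() keep associations in compact form; a minimal sketch, reusing the Epub rules mined in the example below (saveRDS()/readRDS() work just as well for a single object):
data("Epub")
rules <- apriori(Epub, parameter = list(support = 0.0005, conf = 0.8))
## save and restore the rules object in compact binary form
save(rules, file = "rules.rda")
load("rules.rda")
rules
saveRDS(rules, "rules.rds")
rules2 <- readRDS("rules.rds")
rules2
unlink(c("rules.rda", "rules.rds")) # tidy up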
Michael Hahsler
Other import/export:
DATAFRAME()
,
LIST()
,
pmml
,
read
data("Epub") ## write the formated transactions to screen (basket format) write(head(Epub)) ## write the formated transactions to screen (single format) write(head(Epub), format="single") ## write the formated result to file in CSV format write(Epub, file = "data.csv", format = "single", sep = ",") ## write rules in CSV format rules <- apriori(Epub, parameter=list(support = 0.0005, conf = 0.8)) write(rules, file = "data.csv", sep = ",") unlink("data.csv") # tidy up
data("Epub") ## write the formated transactions to screen (basket format) write(head(Epub)) ## write the formated transactions to screen (single format) write(head(Epub), format="single") ## write the formated result to file in CSV format write(Epub, file = "data.csv", format = "single", sep = ",") ## write rules in CSV format rules <- apriori(Epub, parameter=list(support = 0.0005, conf = 0.8)) write(rules, file = "data.csv", sep = ",") unlink("data.csv") # tidy up