Uncertain Data Vldb2012

You might also like

Download as rtf, pdf, or txt
Download as rtf, pdf, or txt
You are on page 1of 12

Mining Frequent Itemsets over Uncertain Databases

Yongxin Tong † Lei Chen † Yurong Cheng ‡ Philip S. Yu §

† Hong Kong University of Science & Technology, Hong Kong, China


‡ Northeastern University, China

§ University of Illinois at Chicago, USA


{ yxtong, leichen } @cse.ust.hk, ‡ cyrneu@gmail.com, psyu@cs.uic.edu
§

mining frequent itemsets over uncertain databases has also


attracted much attention [4, 9, 10, 11, 17, 18, 22, 28, 30, 31,
In recent years, due to the wide applications of uncertain da-
33]. For example, with the popularization of wireless sen-
ta, mining frequent itemsets over uncertain databases has at-
sor networks, wireless sensor network systems collect huge
tracted much attention. In uncertain databases, the support
amount of data. However, due to the inherent uncertain-
of an itemset is a random variable instead of a fixed occur-
ty of sensors, the collected data are often inaccurate. For
rence counting of this itemset. Thus, unlike the correspond-
the probability-included uncertain data, how can we discov-
ing problem in deterministic databases where the frequent
er frequent patterns (itemsets) so that the users can under-
itemset has a unique definition, the frequent itemset under
stand the hidden rules in data? The inherent probability
uncertain environments has two di erent definitions so far.
property of data is ignored if we simply apply the tradition-
The first definition, referred as the expected support-based
al method of frequent itemset mining in deterministic data
frequent itemset, employs the expectation of the support
to uncertain data. Thus, it is necessary to design special-
of an itemset to measure whether this itemset is frequent.
ized algorithms for mining frequent itemsets over uncertain
The second definition, referred as the probabilistic frequent
databases.
itemset, uses the probability of the support of an itemset
Before finding frequent itemsets over uncertain databases,
to measure its frequency. Thus, existing work on mining
the definition of the frequent itemset is the most essential
frequent itemsets over uncertain databases is divided into
issue. In deterministic data, it is clear that an itemset is fre-
two di erent groups and no study is conducted to compre-
quent if and only if the support (frequency) of such itemset
hensively compare the two di erent definitions. In addition,
is not smaller than a specified minimum support, min sup
since no uniform experimental platform exists, current so-
[7, 8, 19, 32]. However, di erent from the deterministic case,
lutions for the same definition even generate inconsistent
the definition of a frequent itemset over uncertain data has
results. In this paper, we firstly aim to clarify the relation-
two di erent semantic explanations: expected support-based
ship between the two di erent definitions. Through exten-
frequent itemset [4, 18] and probabilistic frequent itemset [9].
sive experiments, we verify that the two definitions have a
Both of which consider the support of an itemset as a dis-
tight connection and can be unified together when the size of
crete random variable. However, the two definitions are
data is large enough. Secondly, we provide baseline imple-
di erent on using the random variable to define frequent
mentations of eight existing representative algorithms and
itemsets. In the definition of the expected support-based
test their performances with uniform measures fairly. Final-
frequent itemset, the expectation of the support of an item-
ly, according to the fair tests over many di erent benchmark
set is defined as the measurement, called as the expected
data sets, we clarify several existing inconsistent conclusions
support of this itemset. In this definition [4, 17, 18, 22], an
and discuss some new findings.
itemset is frequent if and only if the expected support of such
itemset is no less than a specified minimum expected sup-
port threshold, min esup. In the definition of probabilistic
Recently, with many new applications, such as sensor net- frequent itemset [9, 28, 31], the probability that an itemset
work monitoring [23, 24, 26], moving object search [13, 14, appears at least the minimum support (min sup) times is
15] and protein-protein interaction (PPI) network analysis defined as the measurement, called as the frequent proba-
[29], uncertain data mining has become a hot topic in data bility of an itemset, and an itemset is frequent if and only if
mining communities [3, 4, 5, 6, 20, 21]. Since the problem of the frequent probability of such itemset is larger than a given
frequent itemset mining is fundamental in data mining area, probabilistic threshold.
Permission
personal
not
bear
republish,
permission
their
August
Proceedings
Copyright
made
this
resultsat
27th
or
notice
or
to
2012
and/or
to
classroom
distributed
of
-post
The
make
31st
the
and
VLDB
on
a2012,
38th
VLDB
the
fee.
digital
servers
use
InternationalConference
full
Endowment
for
Articles
Istanbul,
is
Endowment,
or
profit
citation
or
granted
hard
tofrom
or
redistribute
Turkey.
copies
on
commercial
2150-8097/12/07...
without
this
the
Vol.
of
volume
first
5,
all
fee
toNo.
page.
or
lists,
on
advantage
provided
part
were
11
Very
To
requires
ofinvited
copy
Large
this
that $are
andotherwise,
work
prior
copies
Data
that
to present
for
specific
copies
Bases,
10.00.
to 1650
itemset
simply
as
uses
deterministic
The
two
ability
support
an the
definition
important
definitions
extension
includes
distribution.
expectation
of andata.
itemset.
of
the
statistic,
should
ofexpected
complete
the
The
to
Most
definition
measure
Although
be
definition
it cannot
prior
studied
support-based
probability
researches
the
of
the
show
of
respectively
uncertainty,
the
probabilistic
expectation
the
frequent
distribution
frequent
believe
complete
[9,
which
itemset
isfrequent
28,
that
itemset
known
of
prob-
31].
is
the
ina
However, we find that the two definitions have a rather ficiency, approximate probabilistic frequent itemset mining
close connection. Both definitions consider the support of an algorithms are also proposed [10, 31]. For comparing the re-
itemset as a random variable following the Poisson Binomial lationship between the two frequent itemset definitions, we
distribution [2], that is the expected support of an itemset use precision and recall as measures to evaluate the approxi-
equals to the expectation of the random variable. Conse- mation e ectiveness. Moreover, since the above inconsistent
quently, computing the frequent probability of an itemset is conclusions may be caused by the dependence on datasets,
equivalent to calculating the cumulative distribution func- in this work, we choose six di erent datasets, three dense
tion of this random variable. In addition, the existing math- ones and three sparse ones with di erent probability distri-
ematical theory shows that Poisson distribution and Normal butions (e.g. Normal distribution Vs. Zipf distribution or
distribution can approximate Poisson Binomial distribution High probability Vs. Low probability).
under high confidence [31, 10]. Based on the Lyapunov Cen- To sum up, we try to achieve the following goals:
tral Limit Theory [25], the Normal distribution converges to • Clarify the relationship of the existing two definitions
Poisson Binomial distribution with high probability. More- of frequent itemsets over uncertain databases. In fac-
over, the Poisson Binomial distribution has a sound prop- t, there is a mathematical correlation between them.
erty: the computation of the expectation and variance are Thus, the two definitions can be integrated together.
the same in terms of computational complexity. Therefore, Based on this relationship, instead of spending expen-
the frequent probability of an itemset can be directly com- sive computation cost to mine probabilistic frequent
puted as long as we know the expected value and variance itemsets, we can directly use the solutions for mining
of the support of such itemset when the number of trans- expected support-based itemsets as long as the size of
actions in the uncertain database is large enough [10] (due data is large enough.
to the requirement of the Lyapunov Central Limit Theo- • Verify the contradictory conclusions in the existing re-
ry). In other words, the second definition is identical to the search and summarize a series of fair results.
first definition if the first definition also considers the vari- • Provide uniform baseline implementations for all ex-
ance of the support at the same time. Moreover, another isting representative algorithms under two definition-
interesting result is that existing algorithms for mining ex- s. These implementations adopt common basic oper-
pected support-based frequent itemsets are applicable to the ations and o er a base for comparing with the future
problem of mining probabilistic frequent itemsets as long as work in this area. In addition, we also proposed a
they also calculate the variance of the support of each item- novel approximate probabilistic frequent itemset min-
set when they calculate each expected support. Thus, the ing algorithm, NDUH-Mine which is combined with t-
e ciency of mining probabilistic frequent itemsets can be wo existing classic algorithm: UH-Mine algorithm and
greatly improved due to the existence of many e cient ex- Normal distribution-based frequent itemset mining al-
pected support-based frequent itemset mining algorithms. gorithm.
In this paper, we verify the conclusion through extensive • Propose an objective and su cient experimental eval-
experimental comparisons. uation and test the performances of the existing rep-
Besides the overlooking of the hidden relationship between resentative algorithms over extensive benchmarks.
the two above definitions, existing research on the same def- The rest of the paper is organized as follows. In Sec-
inition also shows contradictory conclusions. For example, tion 2, we give some basic definitions about mining frequent
in the research of mining expected support-based frequent itemset over uncertain databases. Eight representative algo-
itemsets, [22] shows that UFP-growth algorithm always out- rithms are reviewed in Section 3. Section 4 presents all the
performs UApriori algorithm with respect to the running experimental comparisons and the performance evaluations.
time. However, [4] reports that UFP-growth algorithm is We conclude in Section 5.
always slower than UApriori algorithm. These inconsisten-
t conclusions make later researchers confused about which In this section, we give several basic definitions about min-
result is correct. ing frequent itemsets over uncertain databases.
The lacking of uniform baseline implementations is one Let = { }1 2be a set of distinct items. We name
of the factors causing the inconsistent conclusions. There- a non-empty subset, , of as an itemset. For brevity, we
fore, di erent experimental results originate from discrep- use = to denote itemset = { 1}.2
12
ancy among many implementation skills, blurring what are is a if it has items. Given an uncertain
the contributions of the algorithms. For instance, the imple- transaction database , each transaction is denoted as
mentation for UFP-growth algorithm uses the ” oat type” to a tuple where is the transaction identifier,
store each probability. While the implementation for UH- and = { ( 1 ) 2 ( 2) ( )}. contains m units.
1
Mine algorithm adopts the ”double type”. The di erence Each unit has an item and a probability, , denoting the
of their memory cost cannot re ect the e ectiveness of the possibility of item appearing in the tuple. The number
two algorithms
mentations
details
objective
portant
uncertain
amount
bility
is guaranteed.
Except
are
and
factors
ofbasic
and
data objectively.
uniform
data,
report
canscientific
In
mining
in
measures
eliminate
the
addition,
the
true
baseline
running
fair toThus,
algorithms
contributions
measures
when uniform
experimental
interferences
implementations,
time,
trade
theis
need
memory
also
ofthe
each
to baseline
ocorrectness
comparison.
from
one
accuracy
process
cost imple-
algorithm.
implementation
ofthe
of
the
and
algorithms
selection
afor
most
large
scala-
Because
ef-im-of 1651
based of transactions
Definition
denoted
are
nsaction
an
frequent
itemset
defined
1. as
(Expected
database
itemset
,(as
the containing
). follows.
Given
expected
and
which
Support) in Given
is (a )random
,probabilistic
theincludes
support
expected of variable,
transactions,
=frequent
an
support-
is:uncertain
itemsets
and
(tra-
)
=1
these algorithms need to spend at least O( ) compu-
Table 1: An Uncertain Database
tation cost for each itemset. Moreover, in order to avoid
TID Transactions
redundant processing, the Cherno bound-based pruning is
T1 A (0.8) B (0.2) C (0.9) D (0.7) F(0.8)
a way to reduce the running time of this group of algorithm-
T2 A (0.8) B (0.7) C (0.9) E (0.5)
s. The third group is the approximate probabilistic frequent
T3 A (0.5) C (0.8) E (0.8) F (0.3)
algorithms. Due to the sound properties of the Poisson Bi-
T4 B (0.5) D (0.5) F (0.7)
nomial distribution, this group of algorithms can obtain the
approximate frequent probability with high quality by on-
ly acquiring the first moment ( ) and the second
Table 2: The Probability Distribution of sup( )
moment ( ). Therefore, the third kind of algorithms
sup(A) 0 1 2 3
have the O( ) computation cost and return the complete
Probability 0.1 0.18 0.4 0.32
probability information when uncertain databases are large
enough. To sum up, the third kind of algorithms actually
build a bridge between two di erent definitions of frequent
Definition 2. (Expected-Support-based Frequent Itemset)
itemsets over uncertain databases.
Given an uncertain transaction database which in-
cludes transactions, and a minimum expected support
ratio, , an itemset is an expected support-based
frequent itemset if and only if ( ) ×
In this subsection, we summarize three the most represen-
Example 1. (Expected Support-based Frequent Itemset)
tative expected support-based frequent itemset mining algo-
Given an uncertain database in Table 1 and the minimum ex- rithms: [17, 18], [22],
pected support, =0.5, there are only two expected
[4]. The first algorithm is based on the generate-and-test
support-based frequent itemsets: (2 1) and (2 6) where
framework employing the breath-first search strategy. The
the number in each bracket is the expected support of the
other two algorithms are based on the divide-and-conquer
corresponding itemset.
framework which uses the depth-first search strategy. Al-
though Apriori algorithm is slower than the other two al-
Definition 3. (Frequent Probability) Given an uncertain
gorithms in deterministic databases, UApriori which is the
transaction database which includes transactions,
uncertain version of Apriori, actually performs rather well
a minimum support ratio , and an itemset , ’s
among the three algorithms and is usually the fastest one in
frequent probability, denoted as ( ), is shown as follows:
dense uncertain datasets based on our experimental result-
()={() ×} s in Section 4. We further explain three algorithms in the
following subsections and Section 4.
Definition 4. (Probabilistic Frequent Itemset) Given an
uncertain transaction database which includes trans-
actions, a minimum support ratio , and a probabilis-
tic frequent threshold , an itemset is a probabilistic The first expected support-based frequent itemset mining
frequent itemset if ’s frequent probability is larger than algorithm was proposed by Chui et al. in 2007 [18]. This
the probabilistic frequent threshold, namely, algorithm extends the well-known Apriori algorithm [17, 18]
to the uncertain environment and uses the generate-and-
()={() ×}
test framework to find all expected support-based frequent
itemsets. We generally introduced UApriori algorithm as
Example 2. (Probabilistic Frequent Itemset) Given an un-
follows. The algorithm first finds all the expected support-
certain database in Table 2, =0.5, and = 0.7,
based frequent items firstly. Then, it repeatedly joins all
the probability distribution of the support of is shown
expected support-based frequent -itemsets to produce +1-
in Table 2. So, the frequent probability of is: ( ) =
itemset candidates and test +1-itemset candidates to obtain
{ ( ) 4 × 0 5} = { ( ) 2} = { ( ) =
expected support-based frequent + 1-itemsets. Finally, it
2} + { ( ) = 3} = 0 4 + 0 32 0 7 = . Thus, { }
ends when no expected support-based frequent +1-itemsets
is a probabilistic frequent itemset.
are generated.
Fortunately, the well-known downward closure property
[8] still works in uncertain databases. Thus, the traditional
Apriori pruning can be used when we check whether an item-
We categorize the eight representative algorithms into three set is an expected support-based frequent itemset. In other
groups. The first group is the expected support-based fre- words, all supersets of this itemset must not be expected
quent
support-based
algorithms
its
support
transactions.
frequent
bilistic
bility
exact
frequency.
for
frequent
algorithms.
frequent
ofalgorithms.
each
an
only
The
itemset
The
itemset.
frequent
probability
itemsets
consider
second
These
complexity
These
isDue
itemsets.
O(algorithms
and
group
the
instead
),algorithms
toexpected
where
report
of
complexity
isFor
computing
the
ofaim
exact
isthe
each
exact
the
support
discover
to
simple
frequent
number
of
itemset,
find
probabilistic
the
computing
toall
expectation,
all
expected
measure
of
expected
these
proba-the can1652
support-based
mental
minimum
improving
pend
pruning
early
the
be used.
upper
on
pruning.
pruning
as the
method
possible.
expected
However,
the
bound
structure
frequent
emethods
ciency.
inof
Once
support,
UApriori
the
the
of
itemsets.
These
[17,
decremental
the
datasets,
expected
the
upper
18]
ismethods
traditional
still
In
were
thus,
support
bound
addition,
the
pruning
proposed
the
mainly
traditional
isApriori
of
most
lower
several
an
methods
aim
for
itemset
important
than
pruning
Apriori
further
to
decre-
find
de-
the
as
UFP-growth algorithm [22] was extended from the FP-
growth algorithm [19] which is one of the most well-known
pattern mining algorithms in deterministic databases. Sim-
ilar to the traditional FP-growth algorithm, UFP-growth
algorithm also firstly builds an index tree, called UFP-tree
to store all information of the uncertain database. Then,
based on the UFP-tree, the algorithm recursively builds con- Figure 2: UH-Struct Generated from Table 1
ditional subtrees and finds expected support-based frequent
itemsets. The UFP-tree for the UDB in Table 1 is shown in
Figure 1 when =0.25.

Figure 3: UH-Struct of Head Table of A

In this data structure, each item is assigned with its label,


Figure 1: UFP-Tree its appearing probability and a pointer. The UH-Struct of
Table 1 is shown in Figure 2. After building the global UH-
In the process of building the UFP-tree, it first finds all ex- Strut, the algorithm uses the depth-first strategy to build
pected support-based frequent items and orders these items the head table in Figure 3 where A is the prefix. Then, the
by their expected supports. For the uncertain database in algorithm recursively builds the head tables where di erent
Figure 1, the ordered item list is { :2.6, :2.1, :1.8, :1.4, itemsets are prefix and generates all expected support-based
:1.3, :1.2} where the real number following the colon is frequent itemsets.
the expected support for each item. Based on the list, the In frequent itemset mining over deterministic databases,
algorithm sorts each transaction and inserts the transaction H-Mine algorithm fails to compress the data structure and
into the UFP-tree. Each node includes three values in UFP- use the dynamic frequency order sub-structure, such as con-
tree. The first value is the label of the item; the second value ditional FP-trees, which only builds the head tables of di er-
is the appearance probability of this item; and the third val- ent itemsets recursively. Therefore, for the dense databases,
ue is the numbers that this node is shared from root to it. the FP-growth is superior because a larger number of items
Di erent from the traditional FP-tree, the compression of are stored in fewer shared paths. However, for the sparse
UFP-tree is substantially reduced because it is hard to take databases, H-Mine is faster when building head tables of all
the advantage of the shared prefix path in the FP-tree under levels is faster than building all conditional subtrees. Thus,
uncertain databases. In the UFP-tree, items may share one it is quite likely that H-Mine is better than FP-growth in
node only when their labels and appearance probabilities are sparse databases. As discussed in Section 3.1.2, uncertain
both same. Otherwise, items must be presented in two di er- databases are quite sparse databases, so UH-Mine extended
ent nodes. In fact, the probabilities in an uncertain database from H-Mine always has the good performance.
make the corresponding deterministic database become s- Comparison of Three Algorithm Frameworks: More-
parse due to fewer shared nodes and paths. Thus, uncertain over, The search strategies and data structures of three above
databases are often considered as the sparse databases. Giv- algorithms are shown in Table 3, respectively.
en an UDB, we have to build many conditional subtrees in
the corresponding UFP-tree, which leads much redundan-
t computation. That is also the reason why UFP-growth In this subsection, we summarize two existing representa-
cannot achieve the similar performance as FP-growth does. tive probabilistic frequent itemset mining algorithms:
(Dynamic Programming-based Apriori algorithm) and
(Divide-and-Conquer-based Apriori Algorithm). The exact
UH-Mine [4] is also based on the divide-and-conquer frame- probabilistic frequent itemset mining algorithms first cal-
work and the depth-first search strategy. The algorithm was culate or estimate the frequent probability of each itemset.
extended from the H-Mine algorithm [27] which is classical Then, only for itemsets whose frequent probabilities are larg-
algorithm
ticular,
scans
based
ble
s.
label
pointer
inserts
For
UH-Mine
which
of
the
each
frequent
all
H-Mine
domain.
this
uncertain
transactions
in
contains
item,
item,
deterministic
algorithm
items.
is
After
the
the
quite
all
database
head
expected
Then,
building
expected
into
can
suitable
table
frequent
the
be
the
and
outlined
data
the
support
stores
algorithm
support-based
for
finds
head
itemset
structure,UH-Struct.
sparse
three
all
of
as
table,
expected
such
follows.
builds
databases.
mining.
elements:
the
frequent
item,
aalgorithm
head
Firstly,
support-
Inand
the
par-
item-
ta-
a it 1653
erthe
with
than
frequent
their
the given
exact
Table
probability
frequent
probability
UU
U
FP3:
Me
HAp Expected
gr-Mi
thri
owth
od nseis
ori Br
Sprobabilities.
DD more
eeptthreshold
eept
arc
adt
h-hSupport-based
h-
h-fircomplicated
fir
St fir
srat
ststSea
eg
Sea yare
t Sea Because
rrDat
ch
ch returned
r ch
UF
Ua No
St
H- Algorithms
than
P-ru
ne
St
t re
crucomputing
tu calculat-
together
ectre
ing the expected support, a quick estimation about whether
Table 4: Comparison of Complexity about Deter-
an itemset is a probabilistic frequent itemset can improve
mining the Frequent Probability of An Itemset
the e ciency of the algorithms. Therefore, a probability tail
Metho ds Complexity Accuracy
inequality-based pruning technique, Cherno bound-based 2 × min sup) Exact
DP O(N
pruning technique, becomes a key tool to improve the e - DC O(N logN ) Exact
ciency of probabilistic frequent itemset mining algorithms. Cherno O(N) False Positive

Under the definition of the probabilistic frequent itemset,


probability for an itemset. However, the computation of the
it is critical to compute the frequent probability of an item-
set e ciently. [9] is the first work proposing the concept of frequent probability is redundant if an itemset is not a prob-
frequent probability of an itemset and designing a dynam- abilistic frequent itemset. Thus, for e ciency improvement,
ic programming-based algorithm to compute the frequent it is a key problem to address how to filter out unpromising
probability. For the sake of the following discussion, we de- probabilistic infrequent itemsets as early as possible. Be-
fine that 0
( ) denotes the probability that itemset cause the support of an itemset follows Poisson Binomial
appears at least times among the first transactions in the
distribution, Cherno bound [16] is a well-known tight up-
given uncertain database. ( ) is the probability of
itemset appears in the - transaction . is the num- per bound of the frequent probability. The Cherno bound-
ber of transactions in the uncertain database. Therefore, based pruning is shown as follows.
the recursive relationship is defined as follows: ()= Lemma 1. (Cherno Bound-based Pruning [28]) Given
Pr (X ) × P r(X T ) + Pr (X ) × (1 P r(X T ))
an uncertain transaction database , an itemset , a
1
minimum support threshold , a probabilistic frequen-
t threshold , the expected support of , = ( ), an
()=10
Boundary Case: 0
itemset X is a probabilistic infrequent itemset if,
()=0

Thus, the frequent probability equals ( ). Ac- 2 2 1


cording to the dynamic programming method, algorith- 2µ
4 02 1
m uses the Apriori framework to find all probabilistic fre-
quent itemsets. where = ( 1) and is the number of
Based on the definition of the probabilistic frequent item- transactions in .
set, the support of an itemset follows the Poisson Binomial The Cherno bound can be computed easily as long as
distribution, from which we can deduce that the frequent the expected support is given. The time complexity of com-
probability actually equals that one subtracts the probabili- puting the Cherno bound is ( ) where is the number
ty computed from the corresponding cumulative distribution of transactions.
function (CDF) of the support. Moreover, di erent from Time Complexity and Accuracy Analysis: The time
UApriori, algorithm computes the frequent probability complexity and the accuracy of di erent methods calcu-
instead of the expected support for each itemset. The time lating or estimating the frequent probability of an item-
complexity of the dynamic programming computation for set are shown in Table 4. We can find that, it is possible
each itemset is ( 2
× ). that algorithm is faster than DC algorithm if ( ×2
) ( ). The Cherno bound-based pruning
spends ( ) to test whether an itemset is not a probabilis-
Besides the dynamic programming-based algorithm, an- tic frequent itemset and hence it is the fastest. In addition,
other divide-and-conquer-based algorithm was proposed to with respect to the accuracy, itemsets must be probabilistic
compute the frequent probability [28]. Unlike algorith- frequent itemsets if they can pass the test of and .
m, DC divides an uncertain database, , into two sub However, for Cherno bound-based pruning, there may ex-
database: 1 and 2. Then, in two sub-databases, ist a few false positive results because the Cherno bound is
the algorithm recursively calls itself to divide the database only an upper bound of the frequent probability.
until only one transaction left. The algorithm stops to record
the probability distribution of the support of the itemset in
that transaction. Finally, through the conquering part, the
complete probability distribution of the itemset support is
In this subsection, we focus on three approximate proba-
obtained when the algorithm terminates. bilistic frequent algorithms. Because the support of an item-
If only involves the above divide-and-conquer process, set is considered as a random variable following Poisson Bi-
its time complexity of calculating the frequent probability of nomial distribution under both of definitions, the random
an itemset is ( 2
) where N is the number of transaction
variable, i.e., the support of an itemset, can be approxi-
in the
method
complexity
cases,
to
conquer-based
Bothalgorithm
uncertain
experimental
algorithm
to
thespeed
of
dynamic
DC
methods
database.
can
outperforms
upalgorithm
comparisons
the
use
programming-based
eaim
the
ciency.
However,
Fast
to
isalgorithm
calculate
( ).Fourier
Thus,
reported
Ininmost
the
the
according
the
Transform
and
in
conquering
practical
final
exact
Section
divide-and-
time
frequent
(FFT)
4.part, 1654
expected
turn
Moreover,
probabilistic
these
mated
tion
their
tion
frequent
erandom
probabilities
by
ectively
sets
and
support-based
the
for
with
Normal
probabilities
frequent
Poisson
random
variables
when
highifdistribution,
uncertain
confidence.
algorithms
the
variables
distribution
algorithms
areexpectations
ofknown.
all databases
probabilistic
following
have
we
and
Therefore,
can
the
also
andeare
Poisson
same
Normal
ciently
the
guarantee
frequent
large
approximate
variances
e ciency
calculate
distribu-
enough.
to
item-
re-
of
Table 5: Comparison of Approximate Probabilistic
In [31], the authors proposed the Poisson distribution- Algorithms
based approximate probabilistic frequent itemset mining al- M eth o ds F ra me wo rk A p pro xim a tio n Me tho ds
gorithm, called PDUApriori. Since we know that the sup- PDUA p rio ri UAp r i or i Po i ss on App r ox i ma ti o n
NDUA pri or i UAp r i or i No rm al A pp r ox i ma ti o n
port of an itemset follows Poisson Binomial distribution that NDUH- Mi ne UH-Mi n e No rm al A pp r ox i ma ti o n
can be approximated by the Poisson distribution [12], the
frequent probability of an itemset can be rewritten by the
cumulative distribution function (CDF) of the Poisson dis- Normal distribution approximation, we propose a novel al-
tribution as follows. gorithm, which integrates the framework
×
of UH-Mine and the Normal distribution approximation in
() 1 order to achieve a win-win partnership in sparse uncertain
!
=0
databases. Compared to UH-Mine, we calculate the vari-
where equals the expected support in the above formula ance of each itemset when UH-Mine obtains the expected
since the parameter in the Poisson distribution is the ex- support of each itemset. In Section 4, we can observe that
pectation. PDUApriori algorithm is implemented as follows. NDUH-Mine has a better performance than NDUApriori on
Firstly, based on the given probabilistic frequent thresh- large sparse uncertain data, which confirms our goal.
old , the algorithm computes the corresponding expected Therefore, the Normal distribution-based approximation
support . Then, the algorithm treats as the minimum algorithms build a bridge between the expected support-
expected support and runs the UApriori algorithm to find based frequent itemsets and the probabilistic frequent item-
all the expected support-based frequent itemsets as all the sets. In particular, existing e cient expected support-based
probabilistic frequent itemsets. mining algorithms can directly be reused in the problem
PDUApriori utilizes a sound property of the Poisson dis- of mining probabilistic frequent itemsets and keep their in-
tribution, namely the fact that parameter is the expec- trinsic properties. In other words, under the definition of
tation and variance of the random variable following the mining probabilistic frequent itemsets, NDUApriori is the
Poisson distribution. Because the cumulative distribution fastest algorithm in large enough dense uncertain database,
function (CDF) of Poisson distribution is monotonic with while NDUH-Mine requires reasonable memory space and
respect to , PDUApriori computes the corresponding of scales well to very large sparse uncertain databases.
the given and calls UApriori to find the results. How- Comparison of Algorithm Framework and Approx-
ever, this algorithm only approximately determines whether imation Methods: Di erent from the exact probabilis-
an itemset is probabilistic frequent itemset, and it cannot tic frequent itemset mining algorithms, the computational
return the frequent probability values. complexities of computing the frequent probability of each
itemset for di erent approximate probabilistic frequent algo-
rithms are the same, ( ) where is the number of trans-
The Normal distribution-based approximate probabilistic actions of the given uncertain database. Thus, we mainly
frequent itemset mining algorithm, NDUApriori, was pro- compare the di erent algorithm frameworks and approxi-
posed in [10]. According to the Lyapunov Central Lim- mation approaches for the three approximate probabilistic
it Theory, Poisson Binomial distribution converges to the frequent itemset mining algorithms in Table 5.
Normal Distribution with high probability [25]. Thus, the
frequent probability of an itemset can be rewritten by the
standard normal distribution formula in the following for-
mula.

() (× 05 () ) In this subsection, we introduce the experimental environ-


() ment, the implementations, and the evaluation methods.
Firstly, in order to conduct a fair comparison, we build
where ( ) is the cumulative distribution function of stan- a common implementation framework which provides com-
dard Normal distribution, and Var( ) is the variance of the mon data structures and subroutines for implementing all
support of . NDUApriori algorithm employs the Aprior- the algorithms. All the experiments are performed on an
i framework and uses the cumulative distribution function Intel(R) Core(TM) i7 3.40GHz PC with 4GB main memory,
of standard Normal distribution to calculate the frequent running on Microsoft Windows 7. Moreover, all algorithm-
probability. s were implemented and compiled using Microsoft’s Visual
Di erent from PDUApriori, NDUApriori algorithm can C++ 2010.
return frequent probabilities for all probabilistic frequent For each algorithm, we use the existing robust implemen-
itemsets. However, it is impractical to apply NDUApriori tation for our comparison. According to the discussion in
on According
the very
clude
ed
Moreover,
gorithm large
probability.
support-based
Apriori
that
can
the
the sparse
framework.
Due
acquire
UH-mine
to
Normal
the uncertain
toalgorithms
the
discussion
the databases
distribution-based
merits
usually
high in
quality
ofin
sparse since
outperforms
both
Section it we
approximate
the
uncertain
UH-mine employs
approximation
3.3.1,
otherdatabases.
frequent
expect-
and
can con-
al- 1655
ployed
Section
quent
UCFP-tree
rithms.
in
expected
timization
up
[17,
isthe
algorithms;
used
decremental
3,
18]
For
mining
we
support-based
to
implementation
expected
between
separate
test
implement
process.
approximation
UFP-growth
pruning
support
UFP-growth
comparisons
algorithms;
The
theand
[4]
UApriori
algorithms,
implementation
algorithm.
since
hashing
probabilistic
algorithm
into
there
exact
algorithm
the
techniques
we
We
isprobabilistic
three
and
no
use
frequent
do
based
obvious
the
UCFP-tree
categories:
not
which
toversion
on
use
speed
algo-
[22]
em-
fre-
op-
the
algorithm in terms of the running time and the memory
Table 6: Characteristics of Datasets
cost. The UH-Mine algorithm is implemented based on D at as et # of Tr a ns . # o f I te ms Av e. Le n. D en si t y
the version in [4]. For four exact probabilistic frequent al- Con n ect 67 55 7 1 29 43 0. 3 3
gorithms, DPNB (Dynamic Programming-based Algorithm Acci d en t 34 01 83 4 68 3 3. 8 0. 0 72
Ko sa r ak 99 00 02 41 27 0 8. 1 0 . 00 01 9
with No Bound) algorithm that does not include the Cher- G azel l e 59 60 1 4 98 2. 5 0. 0 05
no bound-based pruning technique is implemented based T25 I1 5D 32 0k 32 0, 0 00 9 94 25 0. 0 25

on the version in [9]. Correspondingly, DCNB (Divide-and-


Conquer-based Algorithm No Bound) algorithm is modified
based on the version in [28]. However, what is di erent is Table 7: Default Parameters of Datasets
in our algorithm, each item has their own probability, while Dataset Mean Var. min sup pft
in [28], all items in a transaction share the same appearance Connect 0.95 0.05 0.5 0.9
Accident 0.5 0.5 0.5 0.9
probability. Moreover, DPB (Dynamic Programming-based
Kosarak 0.5 0.5 0.0005 0.9
Algorithm with Bound) algorithm [9] and DCB (Divide-and-
Gazelle 0.95 0.05 0.025 0.9
Conquer-based Algorithm with Bound) algorithm [28] repre- T25I15D320k 0.9 0.1 0.1 0.9
sent the corresponding algorithms of DPNB and DCNB, but
include the Cherno bound-based pruning, respectively. For
three approximation mining algorithms, the implementation
we report the running time and the memory cost in two
of PDUApriori is based on [31] and integrates all optimized
dense datasets and two sparse datasets. Secondly, we present
pruning techniques. PDUApriori [10] and PDUH-Mine are
the scalability of the three algorithms. Finally, we study the
implemented based on the frameworks of UApriori and UH-
in uence of the skew in the Zipf distribution.
Mine, respectively. Hashing function is used in the two al-
Running Time. Figures 4(a) - 4(d) show the running
gorithms to compute the cumulative distribution function of
time of expected support-based algorithms w.r.t
Standard Normal Distribution e ciently.
in Connect, Accident, Kosarak, and Gazelle datasets. When
Based on the experimental comparisons of existing re-
decreases, we observe that the running time of all
searches, we choose five classical deterministic benchmarks
the algorithms goes up. Moreover, UFP-growth is always
from FIMI repository [1], and assign a probability gener-
the slowest in the above results. UApriori is faster than UH-
ated from Gaussian distribution to each item. Assigning
Mine in Figures 4(a) and 4(b), on the other hand, UH-Mine
probability to deterministic database to generate meaning-
is faster than UApriori in Figures 4(c) and 4(d).
ful uncertain test data is widely accepted by the curren-
It is reasonable because UApriori outperforms other algo-
t community [4, 9, 10, 11, 17, 18, 22, 28, 30, 31]. Five
rithms under the conditions that uncertain dataset is dense
datasets includes two dense datasets, Connect and Acciden-
and is high enough. These conditions cause the
t, two sparse datasets, Kosarak and Gazelle, and a very large
search space of mining algorithm relatively small. Under
synthetic dataset T25I15D320k which was used for testing
this case, the breath-first-search-based algorithms are faster
the scalability of uncertain frequent itemset mining algo-
than the depth-first-search-based algorithms. Thus, UApri-
rithms [4]. The characteristics of above datasets are shown
ori outperforms the other two depth-first-search-based algo-
in Table 6. In addition, to verify the in uence of uncer-
rithms in Figure 4(a) and 4(b). Otherwise, the depth-first-
tainty, we also test another probability distribution, Zipf
search-based algorithm, UH-Mine, is better, which is proved
distribution, instead of Gaussian distribution. Among the
by Figure 4(c) and 4(d). However, even UFP-growth using
cases that datasets following the Gaussian distribution, we
depth-first-search strategy, it does not perform well because
further categorize four scenarios. The first scenario is that
UFP-growth spends too much time on recursively construct-
a dense dataset with high mean and low variance, namely
ing many redundant conditional subtrees with limited shared
Connect with the mean (0.95) and the variance (0.05). The
paths. In addition, another interesting observation is that
second scenario is that a dense dataset with low mean and
slopes of curves in Figure 4(a) are larger than those in 4(c)
high variance, namely Accident with the mean (0.5) and the
even if the size of Connect is smaller than that of Kosarak.
variance (0.5). The third scenario is that a sparse dataset
This result makes sense because the slope of the curve de-
with high mean and low variance, namely Gazelle with the
pends on the density of a dataset.
mean (0.95) and the variance (0.05). The fourth scenario
Memory Cost. According to Figures 4(e) - 4(f ), UFP-
is that a sparse dataset with high mean and low variance,
growth spends memory the most among all the three algo-
namely Kosarak with the mean (0.5) and the variance (0.5).
rithms. Similar to the conclusion given in the above analysis
Moreover, in the case that dataset following the Zipf distri-
of Running Time, UApriori is superior to UH-Mine if and
bution, the only one scenario that a dense dataset following
only if the uncertain dataset is dense and is high
the Zipf distribution varying the skew from 0.8 to 2 is tested
enough, otherwise, UH-Mine is the winner.
because the sparse datasets followed Zipf distribution only
UApriori uses less memory when is high and
have
getIn
each athis
perform
addition,
any very
algorithms,
The
datasets small
probability
meaningful
10we
section,
runs size
UApriori,
do
arenot
per
shown
we ofinfrequent
parameters
results
report
experiment
compare
UFP-growth,
the
from
Table 7.itemsets,
running
and
three
and
these
their
For
and
expected
report thus
datasets.
time
all
default
UH-Mine.
the
the
over we did
weofnot
support-based
tests,
averages.
values
1 hour.
Firstly,
In 1656
itemsets
UH-Mine,
tables
UApriori
dataset
only
dant for
infrequent
spends
generated
ever,
of
Struct.
is the
has
di
dense
UApriori
erent
as
main
to
the
However,
decreasing
require
because
candidates.
which
limited
prefixes.
memory
changes
much
results
there
with
memory
and
Therefore,
cost
sharply
Thus,
more
the
are
in
dataset
issmall
increase
cost
used
the
memory
with
the
becoming
on
memory
search
to
number
decreasing
memory
building
initialize
of to
, UH-Struct
space.
store
usage
ofsparse,
the
usage
frequent
its
redun-
.How-
trend
For
head
UH-
of
3,6 00 3 ,60 0 36 00
3 ,60 0
U A pri ori U A pri ori U A pr ior i
1 ,00 0
U H M ine U A pr ior i U H M ine
U H Mi ne
U H M ine
U FP gr ow th U FP gr ow th U F P gr ow t h
U FP gr ow th
1 00 10 0 10 0 1 00

10 10

0 .9 0 .8 0.7 0 .6 0 .5 0.4 0.5 0 .4 0 .3 0. 2 0 .1 0.1 0. 05 0.0 1 0. 00 5 0 .0 02 5 0. 00 1 0. 1 0 .01 0 .0 01 1 .0E 4


m in_ esu p m in _e sup m in _e sup min _e su p

(a) C on n ect : min es u p vs . t ime


(b) A ccid en t: m in es up vs . t ime
(c) Ko sa r ak : min es u p vs . t ime
(d)
Gazelle: m in es u p vs . t ime

103 30 0 50 0 8 00
U A pr ior i U Apr ior i
U Apr ior i U Apr ior i 4 00
U H M ine 40 0
U H Mine U H Mine
U H Mi ne 25 0
U FP gr ow th U FP gr owt h
U FP gr ow th 30 0 U FP gr owt h
1 00
20 0
2 20 0
10

15 0
10 0
10
10 0 0
0 .9 0 .8 0.7 0 .6 0 .5 0.4 0.5 0 .4 0 .3 0. 2 0.1 0 .1 0.0 5 0 .0 1 0.0 05 0. 00 25 0 .0 01 0. 1 0 .01 0 .0 01 1 .0E 4
m in_ esu p m in _e sup m in_ es up min _e su p

(e) Co n nect : m in es up v s . mem or y


(f ) A ccid en t: m in es up vs . mem or y
(g) Ko s ar ak : min es u p vs . m emo ry
(h)
G azelle: m in es up v s . mem or y

55 0 9 00
U Apr ior i 35 0 12 20

U H Mi ne U Apr i ori
3 00
10 0 U FP gr ow th U H Mi ne U A pri ori
10 0
U FP gr ow th 5 00
1 00 U H M ine
U A pr ior i U FP gr ow th
10
U H M ine
U FP gr ow th
10
10
1 00
20 40 80 10 0 1 60 32 0 20 4 0 8 0 10 0 16 0 3 20 0 .8 1.2 1. 6 2 0 .8 1. 2 1 .6 2
Num b er o f T ra ns act ion s ( K) N um be r of Tr an sa ctio ns (K) Sk ew Ske w

(i) S cala bilit y vs . t ime


(j) Sca lab ility v s . mem or y
(k) Zip f: s kew v s. t im e
(l)
Zip f: s kew v s . mem or y

Figure 4: Performance of Expected Support-based Frequent Algorithms

UH-Mine increases smoothly. Moreover, similar to the dis- of dense datasets and higher , UApriori spends the
cussion of Running Time, UFP-growth is the most memory least time and memory. Otherwise, UH-Mine is the winner.
consuming one among three algorithms. Moreover, UFP-growth is often the slowest algorithm and
Scalability. We further analyze the scalability of three spends the largest memory cost since UFP-growth has only
expected support-based algorithms. In Figure 4(i), varying limited shared paths so that it has to spend too much time
the number of transactions in the dataset from 20k to 320k, and memory on redundant recursive computation.
we observe that the running time is linear. With the in- Finally, the in uence of the Zipf distribution is similar to
crease of the size of dataset, the time of UApriori is close to that of a very sparse dataset. Under the Zipf distribution,
that of UH-mine. This is reasonable because all the items UH-Mine algorithm usually performs very well.
in T25I15D30k have similar distributions. Therefore, with
the increase of transactions, the running time of algorithms
increase linearly. Figure 4(j) reports the memory usages of In this section, we compare four probabilistic frequent al-
three algorithms which demonstrate the linearity in terms gorithms: DPNB, DCNB, DPB and DCB. Firstly, we show
of the number of transactions. Moreover, we can find that the running time and the memory cost in terms of changing
the memory usage increase of UApriori is more steady than . Then, we present the in uence of on the run-
that of two other algorithms. This is because UApriori need- ning time and the memory cost. Moreover, the scalability
s not to build a special data structure to store the uncer- of the three algorithms is studied. Finally, we report the
tain database. However, the two other algorithms have to in uence of the skew in the Zipf distribution as well.
spend the extra memory cost for storing their data struc- E ect of min sup. Figures 5(a) and 5(c) show the run-
tures. Therefore, the curve of UApriori is more steady. ning time of four competitive algorithms w.r.t.
E ect of the Zipf distribution. For verifying the in- in Accident and Kosarak datasets, respectively. With the
uence of uncertainty under di erent distributions, Figures Cherno -bound-based pruning, we can see that DCB is al-
4(k) and 4(l) show the running time and the memory cost ways faster than DPB. However, without the Cherno -bound-
of three algorithms in terms of the skew parameter of Zipf based pruning, we can find that DCNB is always faster
distribution.
memory
ter.
assigned
parameter,
ically,
that
ed
among
Conclusions.
support-based
Due
UH-Mine
when
current
to
cost
thethe
which
the
We
zero
decrease
outperforms
property
proposed
skew
To
can
frequent
probability
results
sum
observe
parameter
with
ofup,
mining
inZipf
itemset,
fewer
UApriori
the
under
that
with
distribution,
increase
algorithms.
increases,
the
frequent
the
there
running
gradually.
increase
definition
of
is itemsets.
no
the
wemore
Intime
clear
can
skew
of
the
ofthe
items
and
observe
condition
expect-
winner
Specif-
parame-
skew
the
are 1657
based
algorithms,
than
ty
is
more
infrequent
faster
of
DPB
DPNB.
pruning
computing
divide-and-conquer-based
egorithms,
is
cient
than
faster
we
itemsets
This
quickly.
DCNB,
than
can
the
than
(isthat
find
reasonable
frequent
can
Moreover,
DPNB.
this
of
that
be
dynamic
is
×
2 ).
DCB
filtered
because
probability
These
Comparing
because
we
algorithms
isprogramming-based
by
faster
also
results
there
the
the
of
observe
the
than
are
Cherno
time
show
each
issame
only
(DCNB
),complexi-
itemset
that
which
abound-
type
smal-
DPB
most
and
al-
of
in
is
18 00 23 0 1,8 00 25 0
10 00 DPNB
DPNB
DPB
20 0 D PN B DPB
DCNB 20 0
1 00 1 00 DCNB
D PB DCB
DCNB DCB
D PN B
D PB DCB 15 0
10 10
D CN B
D CB
14 0 10 0
0. 9 0. 8 0. 7 0. 6 0. 5 0 .4 0 .9 0.8 0.7 0.6 0.5 0.4 0 .9 0 .8 0. 7 0. 6 0.5 0.4 0 .3 0 .2 0 .1 0. 9 0.8 0.7 0.
0 .3
.612
.5
.4
m in _su p min _s up m in _su p m in_ sup

(a) A ccid ent : min su p v s. tim e


(b) Ac ciden t : min s u p vs . m emo ry
(c) Kos ar ak : m in su p v s . tim e
(d)
Kos ar ak : min su p v s . mem or y

1 ,9 50 22 0 8 50 25 0

1 ,0 00

DPNB DPNB
DPNB 20 0
19 0 DPB DPB DPB
1 00 DCNB
1 00 DCNB DCNB
D PN B DCB DCB DCB
D PB 17 0
D CN B
15 0
D CB
10
10
15 0
0. 9 0. 8 0.7 0.6 0 .5 0 .4 0 .3 0 .2 0 .1 0 .9 0 .8 0 .7 0 .6 0 .5 0. 4 0. 3 0.2 0.1 0 .9 0 .8 0. 7 0. 6 0.5 0.4 0 .3 0 .2 0 .1 0. 9 0.8 0.7 0.
0 .3
.612
.5
.4
p ft pft p ft pf t

(e) A ccid en t: p ft v s . tim e


(f ) A ccid en t: p ft v s . mem or y (g) Kos ar ak : p ft vs . t ime
(h)
Ko s ar ak : pf t vs . mem or y

70 0 1 70 20 0 2 20
DPNB
D PN B
1 00 DPB
D PB 15 0 DP N B 2 00 D P NB
10 0 DCNB
DCNB DP B DPB
50 DCB
DCB DC N B DCNB
10 0
DC B DCB
1 70
10
50

10 0 1 40
20 40 80 10 0 1 60 32 0 20 4 0 8 0 10 0 16 0 3 20 0.8 1 .2 1 .6 2 0 .8 1. 2 1.6 2
Num b er o f T ra ns act ion s ( K) N um be r of Tr an sa ctio ns (K) Skew Sk ew

(i) S cala bilit y vs . t ime


(j) Sca lab ility v s . mem or y
(k) Zip f: s kew v s. t im e
(l)
Zip f: s kew v s . mem or y

Figure 5: Performance of Exact Probabilistic Frequent Algorithms

l number of frequent itemsets that need to compute their algorithm. In Figures 5(i), we can find that the trends of
frequent probabilities when is high, most of the in- running time of all algorithms are linear with the increase
frequent itemsets are already pruned by the Cherno bound. of the number of transactions. In particular, the trends of
In addition, according to Figures 5(b) and 5(d), this is both DC and DCNB are more smooth than those of DP and
very clear that DPB and DPNB require less memory than DPNB because the time complexities of computing frequent
DCB and DCNB. It is reasonable because both DCB and D- probability for DC and DCNB are both ( ) and bet-
CNB trade o the memory for the e ciency based on their ter than the time complexities of DP and DPNB. In Figures
the divide-and-conquer strategy. In addition, we can observe 5(j), we can observe that the memory cost of four algorithms
that the memory usage trend of DCNB changes sharply with linearly varies w.r.t. the number of transactions.
decreasing because there are a few frequent itemset- E ect of the Zipf distribution. Figures 5(k) and 5(l)
s when is high and most of infrequent itemsets are show the running time and the memory cost of four exact
filtered out by the Cherno bound-based pruning. In partic- probabilistic frequent mining algorithms in terms of the skew
ular, we can find that similar observations w.r.t are parameter of Zipf distribution. We can observe that the run-
shown in both the dense and the sparse datasets, which in- ning time and the memory cost decrease with the increase of
dicate that the density of the databases is not the key factor the skew parameter. We can find that, through varying the
a ecting the running time and the memory usage of exact skew parameter, the changing trends of the runing time and
probabilistic frequent algorithms. the memory cost are quite stable. Therefore, the skew pa-
E ect of pft. Figures 5(e) and 5(g) report the running rameter of Zipf distribution does not have significant impact
time w.r.t. . We can find that DCB is still the fastest to the running time and the memory cost.
algorithm and DPNB is the slowest one. Di erent from the Conclusions. First of all, among exact probabilistic fre-
results w.r.t. , DCNB is always faster than DPB quent itemsets mining algorithms, DCB algorithm is the
when varies. Additionally, Figures 5(f) and 5(h) show fastest algorithm in most cases. However, compared to DP-
the memory cost w.r.t. . The memory usages of both B, it has to spend more memory for the divide-and-conquer
DPB and DPNB are always significantly smaller than those processing.
of both
varying
significant
the
frequent
also
4.2,
ability
Scalability.
memory
four
we
further
of
DCB
still
,mining
probabilities
the
four
impact
explained
use
cost
changing
and
exact
Similar
algorithms.
the
are
DCNB.
toT25I15D320k
probabilistic
quite
the
of
in
to
trends
frequent
running
the
Furthermore,
stable.
This
next
scalability
of the
is
time
frequent
subsection.
Thus,
itemsets
dataset
reasonable
runing
and
we
analysis
does
to
itemset
are
the
time
find
test
not
one
memory
because
that,
and
in
have
the
mining
and
Section
scal-
byitof
most
is 1658
set
important
bility
that
In
mining
addition,
for
Cherno
computing
However,
and
time
each
tool
algorithms.
(ifto
the
an
bound-based
it speed
can
itemset,
Cherno
DC
Cherno
×2 )filter
and
Based
toupcalculate
respectively.
exact
DP
out
bound
bound-based
pruning
on
algorithms
some
probabilistic
computational
ofthe
infrequent
each
can
exact
Therefore,
pruning
reduce
itemset
have
frequent
frequent
to
itemsets.
analysis,
the
isis
spend
itthe
only
running
isproba-
item-
clear
most
(the
()).
3 ,60 0 4 00 1 70
2 40 0
D CB 3 80 D CB 1 50
1 ,00 0 PD U Ap r io ri PD U A pr ior i
3 40 DCB
N DU Ap r io r i 1 80 0
3 00 PD U A pr ior i N DU A pr ior i
N DU H Min e
10 0 N D UA pr ior i N DU H M ine 1 00
2 60
N D UH Mine 1 20 0
2 20 DCB
10 1 80 50 P D UA pr ior i
60 0 N D U Apr ior i
1 40
N D U H Mi ne
1 00
0.5 0 .4 0.3 0 .2 0.1 0. 01 0 .5 0 .4 0 .3 0. 2 0. 1 0.0 1 0 0
m in_ su p 0 .0 1 0.0 05 0.0 02 5 0. 00 15 0.0 01 0.0 1 000.00
.0
.0005
25
15
1
min _s up m in_ su p
m in_ su p

(a) A ccid ent : min su p v s. tim e


(b) Ac ciden t : min s u p vs . m emo ry
(c) Kos ar ak : m in su p v s . tim e
(d)
Kos ar ak : min su p v s . mem or y

2 50 21 0 1 80
2 40
DCB DCB
1 50
2 00 DCB P D U Apr ior i 2 10
18 0 PDUAp rio ri
P D U Apr ior i N D U A pri ori 1 20 NDUAp rio ri
1 50 N D U A pri ori 1 80 NDUH M ine
N D U H Mi ne
N D U H M ine 15 0 90
D CB
1 00 1 50
PD UAp rio ri
60
N DUAp rio ri
50 12 0 1 20
30 N DUH M ine

0 90
0. 9 0. 8 0.7 0.6 0 .5 0 .4 0 .3 0 .2 0 .1 90 0 0. 9 0. 8 0.7 00.6
0.4
0.5
.3
.2
.1
0.9 0. 8 0 .7 0 .6 0.5 0.4 0. 3 0 .2 0 .1 0 .9 0. 8 0. 7 0.6 0 .5 0 .4 0 .3 0 .2 0. 1
p ft pf t p ft
p ft

(e) A ccid en t: p ft v s . tim e


(f ) A ccid en t: p ft v s . mem or y (g) Kos ar ak : p ft vs . t ime
(h)
Ko s ar ak : pf t vs . mem or y

25 0 1 25
65 0 1 20
60 0 P DU A pr ior i PD U A pri ori
20 0 PDU Ap rio r i 1 20
N D U Apr ior i N DU A pr ior i
50 0 90 ND UAp ri or i
N D U H Mine N DU H M ine
40 0 15 0 1 15
ND UH Min e PD U A pr ior i
60
30 0 N D U Apr ior i
10 0 1 10
10 0 N D U H Mine
30
50 1 05
10 0

0 0 0 1 00
0 2 0 4 0 8 01 00 1 60 32 0 0 20 40 80 10 0 16 0 3 20 0 .8 1.2 1. 6 2 0 .8 1 .2 1 .6 2
Num b er of T ra ns act ion s ( K) N um be r of Tr an sa ctio ns (K) Sk ew Ske w

(i) S cala bilit y vs . t ime


(j) Sca lab ility v s . mem or y
(k) Zip f: s kew v s. t im e
(l)
Zip f: s kew v s . mem or y

Figure 6: Performance of Approximation Probabilistic Frequent Algorithms

Table 8: Accuracy in Accident


Mi n Su p P DU A pr i ori N DU A pr i ori N DU H -Mi ne
PRPRPR
In this section, we mainly compare three approximation 0. 2 0. 9 1 1 0. 95 1 0. 95 1
probabilistic frequent algorithms, PDUApriori, NDUApri- 0. 3 1 1 1 1 1 1
ori, and NDUH-Mine, and an exact probabilistic frequent 0. 4 1 1 1 1 1 1
0. 5 1 1 1 1 1 1
algorithm, DCB. Firstly, we report the running time and 0. 6 1 1 1 1 1 1
the memory cost in terms of . Then, we present the
running time and the memory cost when is changed. In
addition, we test the precsion and the recall to evaluate the
Table 9: Accuracy in Kosarak
approximation quality. Finally, we report the scalability as
Mi n S u p PDU A p ri or i N DU A p ri or i N DU H -Mi n e
well. PRPRPR
0. 002 5 0. 95 1 0 .9 5 1 0 .9 5 1
E ect of min sup. First of all, we test the running 0 . 005 0. 96 1 0 .9 6 1 0 .9 6 1
time and memory cost of four algorithms w.r.t. the min- 0. 01 0. 98 1 0 .9 8 1 0 .9 8 1
0. 05 1 1 1 1 1 1
imum support, shown in Figures 6(a) - 6(d). In 0. 1 1 1 1 1 1 1
Figure 6(a), both PDUApriori and NDUApriori are faster
than the other two. In Figure 6(c), NDUH-Mine is the
fastest. Moreover,DCB is the slowest algorithm among the
m. The results also confirm that the density of databases
four algorithms since it o ers exact answers. This is reason-
is the most important factor for the approximate algorithm
able because PDUApriori and NDUApriori are based on the
e ciency again. However, Figure 6(f) shows that the mem-
UApriori framework which performs best under the condi-
ory cost of all four algorithms is steady. Similar results are
tions that uncertain dataset is dense and is enough
also shown in Figure 6(h). Hence, varying almost does
high. Otherwise, NDUH-Mine is the best.
not in uence the memory cost of algorithms.
In addition, in Figures 6(b)and 6(d), it is very clear that P-
Precision and Recall. Besides o ering e cient running
DUApriori, NDUApriori, and NDUH-Mine require less mem-
oryEthan
divide-and-conquer
Figure
memory
Mine
varying
UApriori
However,
ect
spends
6(b),
.of
DCB.
because
We
are
inpft.
both
Figure
less
can
still
This
Figure
see
the
PDUApriori
memory
the6(g),
strategy
isdataset
that
fastest
6(e)
reasonable
NDUH-Mine
both
since
reports
algorithms
to
is and
of
dense.
obtain
this
PDUApriori
because
the
NDUApriori
dataset
In
the
time
is
inFigure
the
exact
Accident
DCB
inisfastest
terms
and
sparse.
results.
6(d),
uses
require
ND-
dataset.
of
algorith-
the
NDUH-
In
less to1659
time
tic
is exactandprobabilistic
result
afrequent
measure
quent
more
quent ealgorithm,
ective
which
test
generated memory
important
mining
itemset
theequals
accuracy
precision
algorithm.
from
mining
target
and cost,
frequent
the
ER
of
and
| for the
|algorithms.approximation
approximation
| is
| Please
the
algorithm.
the
the
and
recall
approximation
result
note
theWe
w.r.t
recall
Moreover,
generated
that
use accuracy
probabilistic
varying
AR
which
theprobabilis-
means
precision
from
we
be-
equals
only
fre-
the
|||
Table 10: Summary of Eight Representative Frequent Itemset Algorithms over Uncertain Databases
E xp ect ed S u p p o rt -b a se d Al g Ex a ct Pr o b. Fr eq . Al g A p pr ox . P r ob . Fr eq. A l g
UAp r i or i UH-M i n e UF P- gr owt h D P DC P DUA pr i o ri N DUAp r i or i NDU H-M i ne
Ti m e( D) ( min es up h i g h ) ( min esup l ow) ( min sup h i gh ) (m in sup h ig h ) ( min sup l ow)
Ti m e( S)
Mem or y ( D) ( min es up h i g h ) ( min esup l ow) ( min sup h i gh ) (m in sup h ig h ) ( min sup l ow)
M em or y (S )
Accu ra cy E x act Ex a ct E xa ct Ex a ct E x act A pp r ox . Ap p ro x. ( B et ter ) Ap p ro x. ( B et ter )

cause the in uence of is far less than the . Table • As observed in Table 10, under the definition of expect-
8 and Table 9 are shown the precisions and the recalls of two ed support-based frequent itemset, UApriori is usually
approximation probabilistic frequent algorithms in Accident the fastest algorithm with lower memory cost when the
and Kosarak, respectively. We can find that the precision database is dense and is high. On the con-
and the recall are almost 1 in Accident dataset which mean- trary, when the database is sparse or is low,
s there is almost no false positive and false negative. In UH-Mine often outperforms other algorithms in the
Kosarak, we also observe that there are a few false posi- running time and only spends limited memory cost.
tives with decreasing of . In addition, the Normal However, UFP-growth is almost the slowest algorithm
distribution-based approximation algorithms can get better with high memory cost.
approximation e ect than the Poisson distribution-based ap- • From Table 10, among exact probabilistic frequent item-
proximation algorithms. This is because the expectation and sets mining algorithms, DC algorithm is the fastest
the variance in the Poisson distribution is the same, which algorithm in most cases. However, it trades o the
is , but, in fact, the expected support and the variance of memory cost for the e ciency because it has to store
an itemset are usually unequal. recursive results for the processing of the divide-and-
Scalability. We further analyze the scalability of three conquer. In addition, when the condition is satisfied,
approximate probabilistic frequent mining algorithms. In DP algorithm is faster than DC algorithm.
Figure 6(i), varying the number of transactions in the dataset • Again from Table 10, both PDUApriori and NDU-
from 20k to 320k, we find that the running time is linear. Apriori is the winner in the running time and the
Figure 6(j) reports the memory cost of three algorithms memory cost when the database is dense and
which show the linearity in terms of the number of trans-
is high, otherwise, NDUH-Mine is the winner. The
actions. Therefore, NDUH-Mine performs best. main di erence between PDUApriori and NDUAprior-
E ect of the Zipf distribution. Figures 6(k) and 6(l)
i is that NDUApriori has better approximation when
show the running time and the memory cost of three ap- the database is large enough.
proximate algorithms in terms of the skew parameter of Zipf
Other than the result described in Table 10, we also find:
distribution. We can observe that the running time and the
memory cost decrease with the increase of the skew param- • Approximation probabilistic frequent itemset mining
eter. In particular, when the skew parameter increases, we algorithms usually get a high-quality approximation ef-
can observe that PDUApriori outperforms NDUApriori and fect in most cases. To our surprise, the frequent proba-
NDUH-Mine gradually. bilities of most probabilistic frequent itemsets are often
Conclusions. First of all, approximation probabilistic 1 when the uncertain databases are large enough such
frequent itemset mining algorithms can get high-quality ap- as the number of transaction is more than 10,000. It is
proximation when the uncertain database is large enough a reasonable result. On the one hand, Lyapunov Cen-
due to the requirement of CLT. In our experiments, the tral Limit Theory guarantees the high-quality approx-
datasets usually include more than 50,000 transactions. These imation. On the other hand, according to the cumu-
approximation algorithms almost have no false positive or lative distribution function (CDF) of the Poisson dis-
false negative. These results are reasonable because the Lya- tribution, we know that the frequent probability of an
punov CLT guarantees the approximation quality. itemset can be approximated as 1 ×
i

= 0!
In addition, in terms of the e ciency, approximation prob- where is the expected support of this itemset. When
abilistic frequent itemset mining algorithms are much better an uncertain database is large enough, the expected
any existing exact probabilistic frequent itemset mining al- support of this itemset is usually large if it is a proba-
gorithms. Moreover, Normal distribution-based algorithms bilistic frequent itemset. Thus, as a consequence, the
usually are faster than the Poisson distribution-based algo- frequent probability of this itemset equals 1.
rithm.
• Approximation probabilistic frequent itemset mining
Finally, similar to the case of expected support-based fre-
algorithms usually far outperform any existing exact
quent algorithms, NDUApriori is always the fastest algorith-
m Table
the
in
over,
set,
data
arein
We
similar.
best
and
set.
dense
‘time(D)’
summarize
‘time(S)’
algorithm
10
Theuncertain
where
meanings
means
means
experimental
‘in’ sparse
means
databases,
that
ofthat
‘memory(D)’
the
uncertain
the
the
time
winner
results
while
time
cost
cost
databases.
NDUH-Mine
under
in
and
inthat
in
the
‘memory(S)’
the
dicase.
dense
erent
sparse
More-
usually
data
cases probabilistic
1660
the
eder
cause
itemset
ciency
itemset
algorithm
• result
Cherno efrequent
theitdefinition
of
can
under
itemsets
if exact
we
bound
filter
be
ciency
compute
the itemset
obtained
probabilistic
as
of
out
is
definition
well.
and
expected
an mining
theimportant
the
the
by
infrequent
variance
the
memory
of algorithms
frequent
support-based
existing
probabilistic
tool
itemsets
ofcost.
to
algorithms
the in un-
solutions
improve the
Therefore,
support
frequent
quickly.
be-
the
of
[14] L. Chen, M. Ozsu,
T. ¨ and V. Oria. Robust and fast
In this paper, we conduct a comprehensive experimen- similarity search for moving object tra jectories. In
tal study of all the frequent itemset mining algorithms over SIGMOD, pages 491–502, 2005.
uncertain databases. Since there are two definitions of fre- [15] R. Cheng, D. V. Kalashnikov, and S. Prabhakar.
quent itemsets over uncertain data, most existing researches Querying imprecise data in moving object
are categorized into two directions. However, through our environments. IEEE Trans. Knowl. Data Eng.,
exploration, we firstly clarify that there is a close relation- 16(9):1112–1127, 2004.
ship between two di erent definitions of frequent itemsets [16] H. Cherno . A measure of asymptotic e ciency for
over uncertain data. Therefore, we need not use the current tests of a hypothesis based on the sum of observations.
solution for the second definition and replace them with ef- Ann. Math. Statist., 23(4):493–507, 1952.
ficient existing solution of first definition. Secondly, we pro- [17] C. K. Chui and B. Kao. A decremental approach for
vide baseline implementations of eight existing representa- mining frequent itemsets from uncertain data. In
tive algorithms and test their performances under a uniform PAKDD, pages 64–75, 2008.
measurement fairly. Finally, based on extensive experiments [18] C. K. Chui, B. Kao, and E. Hung. Mining frequent
over many di erent benchmarks, we verify several existing itemsets from uncertain data. In PAKDD, pages
inconsistent conclusions and find some new rules in this area. 47–58, 2007.
[19] J. Han, J. Pei, and Y. Yin. Mining frequent patterns
without candidate generation. In SIGMOD, pages
This work is supported in part by the Hong Kong RGC 1–12, 2000.
GRF Project No.611411, National Grand Fundamental Re- [20] B. Jiang and J. Pei. Outlier detection on uncertain
search 973 Program of China under Grant 2012-CB316200 data: Objects, instances, and inferences. In ICDE,
and 2011-CB302200-G, HP IRP Project 2011, Microsoft Re- pages 422–433, 2011.
search Asia Grant, MRA11EG05, the National Natural Sci- [21] B. Kao, S. D. Lee, F. K. F. Lee, D. W.-L. Cheung,
ence Foundation of China (Grant No.61025007, 60933001, and W.-S. Ho. Clustering uncertain data using voronoi
61100024), US NSF grants DBI-0960443, CNS-1115234, and diagrams and r-tree index. IEEE Trans. Knowl. Data
IIS-0914934, and Google Mobile 2014 Program. Eng., 22(9), 2010.
[22] C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk.
A tree-based approach for frequent pattern mining
from uncertain data. In PAKDD, pages 653–661, 2008.
[1] Frequent itemset mining implementations repository.
[23] M. Li and Y. Liu. Underground coal mine monitoring
http://fimi.us.ac.be.
with wireless sensor networks. TOSN, 5(2):10, 2009.
[2] Wikipedia of poisson binomial distribution.
[24] Y. Liu, K. Liu, and M. Li. Passive diagnosis for
[3] C. C. Aggarwal. Managing and Mining Uncertain
wireless sensor networks. IEEE/ACM Trans. Netw.,
Data. Kluwer Press, 2009.
18(4):1132–1144, 2010.
[4] C. C. Aggarwal, Y. Li, J. Wang, and J. Wang.
[25] M. Mitzenmacher and E. Upfal. Probability and
Frequent pattern mining with uncertain data. In
Computing: Randomized algorithm and probabilistic
KDD, pages 29–38, 2009. analysis. Cambridge University Press, 2005.
[5] C. C. Aggarwal and P. S. Yu. Outlier detection with
[26] L. Mo, Y. He, Y. Liu, J. Zhao, S. Tang, X.-Y. Li, and
uncertain data. In SDM, pages 483–493, 2008.
G. Dai. Canopy closure estimates with greenorbs:
[6] C. C. Aggarwal and P. S. Yu. A survey of uncertain
sustainable sensing in the forest. In SenSys, pages
data algorithms and applications. IEEE Trans. Knowl. 99–112, 2009.
Data Eng., 21(5):609–623, 2009.
[27] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang.
[7] R. Agrawal, T. Imielinski, and A. N. Swami. Mining H-mine: Hyper-structure mining of frequent patterns
association rules between sets of items in large
in large databases. In ICDM, pages 441–448, 2001.
databases. In SIGMOD, pages 207–216, 1993.
[28] L. Sun, R. Cheng, D. W. Cheung, and J. Cheng.
[8] R. Agrawal and R. Srikant. Fast algorithms for mining Mining uncertain data with probabilistic guarantees.
association rules in large databases. In VLDB, pages
In KDD, pages 273–282, 2010.
487–499, 1994.
[29] S. Suthram, T. Shlomi, E. Ruppin, R. Sharan, and
[9] T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and T. Ideker. A direct comparison of protein interaction
A. Z¨ u e. Probabilistic frequent itemset mining in confidence assignment schemes. BMC Bioinformatics,
uncertain databases. In KDD, pages 119–128, 2009.
7:360, 2006.
[10] T. Calders, C. Garboni, and B. Goethals. [30] Y. Tong, L. Chen, and B. Ding. Discovering
L.Approximation
[11] T.
[12]
[13] inCalders,
pattern
PAKDD,
binomial
10(4):1181–1197,
and
L.uncertain
Chen
Cam.
editand
mining
distance. of frequentness
distribution.
pages
C.
AnR.
data.
Garboni,
approximation
T.of
480–487,
Ng.
In
uncertain
1960.
ICDM,
VLDB,
On
Pacific
andthe
2010.
B.dataprobability
pages
marriage
pages
theorem
Goethals.
Journal
with
749–754,
792–803,
for of2010.
sampling.
of itemsets
ofEMathematics,
lp-norms
the
cient
poisson
2004.
In 1661
[32]
[31]
[33]
model-based
M.
probabilistic
IEEE
L.
Accelerating
Q.J.Wang,
probabilistic
Zhang,
Zaki.
Trans.
threshold-based
R.
Scalable
approach.
F.data.
Knowl.
Cheng,
Li,
probabilistic
data.
and
Inalgorithms
SIGMOD,
Data
S.
In
K.
InD.
CIKM,
frequent
ICDE,
Yi.
Eng.,
Lee,
frequent
Finding
for
and
pages
12(3):372–390,
closed
association
D.
itemset
frequent
429–438,
819–832,
270–281,
W.-L.
itemsets
mining:
Cheung.
items
mining.
2010.
2008.
2012.
2000.
over
ina

You might also like