Rough Sets Association Analysis


Approximate Boolean Reasoning 459

Let us note that any $i$-th degree surface in $\mathbb{R}^k$ can be defined as follows:
$$S = \left\{ (x_1, \ldots, x_k) \in \mathbb{R}^k : P(x_1, \ldots, x_k) = 0 \right\},$$
where $P(x_1, \ldots, x_k)$ is an arbitrary $i$-th degree polynomial over $k$ variables.
Any $i$-th degree polynomial is a linear combination of monomials, each of degree not greater than $i$. By $\mu(i, k)$ we denote the number of $k$-variable monomials of degree at most $i$. Then, instead of searching for $i$-th degree surfaces in the $k$-dimensional affine real space $\mathbb{R}^k$, one can search for hyperplanes in the space $\mathbb{R}^{\mu(i,k)}$.
It is easy to see that the number of $j$-th degree monomials built from $k$ variables is equal to $\binom{j+k-1}{k-1}$. Then we have
$$\mu(i, k) = \sum_{j=1}^{i} \binom{j+k-1}{k-1} = O\left(k^i\right). \qquad (58)$$
As we can see, by applying such surfaces we have a better chance to discern objects from different decision classes with a smaller number of cuts. This is because higher degree surfaces are more flexible than normal cuts. This fact can be shown by applying the VC (Vapnik-Chervonenkis) dimension to the corresponding set of functions [154].
To search for an optimal set of $i$-th degree surfaces discerning objects from different decision classes of a given decision table $\mathbb{S} = (U, A \cup \{d\})$, one can construct a new decision table $\mathbb{S}^i = (U, A^i \cup \{d\})$, where $A^i$ is the set of all monomials of degree at most $i$ built on attributes from $A$. Any hyperplane found for the decision table $\mathbb{S}^i$ is a surface in the original decision table $\mathbb{S}$. The cardinality of $A^i$ is estimated by formula (58).
Hence, for the better solution we must pay with an increase of space and time complexity.
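The trade-off above can be made concrete with a few lines of code. The sketch below (Python; the function names are ours, not from the paper) computes $\mu(i,k)$ according to formula (58) and maps a point of $\mathbb{R}^k$ to its monomial feature vector, in which any $i$-th degree surface becomes a hyperplane:

```python
from math import comb
from itertools import combinations_with_replacement

def num_monomials(i, k):
    """mu(i, k): number of k-variable monomials of degree 1..i, formula (58)."""
    return sum(comb(j + k - 1, k - 1) for j in range(1, i + 1))

def monomial_features(point, i):
    """Map a point of R^k to the values of all its monomials of degree <= i,
    so that an i-th degree surface in R^k is a hyperplane in R^mu(i,k)."""
    k = len(point)
    feats = []
    for j in range(1, i + 1):
        for idx in combinations_with_replacement(range(k), j):
            val = 1.0
            for t in idx:
                val *= point[t]
            feats.append(val)
    return feats
```

For example, in two variables there are $\mu(2,2) = 5$ monomials of degree at most 2 (namely $x$, $y$, $x^2$, $xy$, $y^2$), so conic sections in the plane become hyperplanes in $\mathbb{R}^5$.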
9 Rough Sets and Association Analysis

In this section, we consider a well-known and nowadays famous data mining technique, called association rules [3], used to discover useful patterns in transactional databases. The problem is to extract all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. Besides market basket data, association analysis is also applicable to other application domains such as customer relationship management (CRM), bioinformatics, medical diagnosis, Web mining, and scientific data analysis.
We will also point out the contribution of the rough sets and approximate Boolean reasoning approach to association analysis, as well as the correspondence between the problem of searching for approximate reducts and the problem of generating association rules from frequent item sets.
460 H.S. Nguyen
9.1 Approximate Reducts
Let $\mathbb{S} = (U, A \cup \{dec\})$ be a given decision table, where $U = \{u_1, u_2, \ldots, u_n\}$ and $A = \{a_1, \ldots, a_k\}$. The discernibility matrix of $\mathbb{S}$ was defined as the $(n \times n)$ matrix $M(\mathbb{S}) = [M_{i,j}]_{i,j=1}^{n}$ where
$$M_{i,j} = \begin{cases} \{a_m \in A : a_m(x_i) \neq a_m(x_j)\} & \text{if } dec(x_i) \neq dec(x_j), \\ \emptyset & \text{otherwise.} \end{cases} \qquad (59)$$
Let us recall that a set $B \subseteq A$ of attributes is consistent with $dec$ (or $dec$-consistent) if $B$ has a non-empty intersection with each non-empty set $M_{i,j}$, i.e.,
$$B \text{ is consistent with } dec \iff \forall_{i,j}\; (M_{i,j} = \emptyset) \vee (B \cap M_{i,j} \neq \emptyset).$$
Minimal (with respect to inclusion) $dec$-consistent sets of attributes are called decision reducts.
In some applications (see [138], [120]), instead of reducts we prefer to use their approximations called $\alpha$-reducts, where $\alpha \in [0, 1]$ is a real parameter. A set of attributes is called an $\alpha$-reduct if it is minimal (with respect to inclusion) among the sets of attributes $B$ such that
$$\frac{disc(B)}{conflict(\mathbb{S})} = \frac{|\{M_{i,j} : B \cap M_{i,j} \neq \emptyset\}|}{|\{M_{i,j} : M_{i,j} \neq \emptyset\}|} \geq \alpha.$$
If $\alpha = 1$, the notions of an $\alpha$-reduct and a (normal) reduct coincide. One can show that for a given $\alpha$, the problems of searching for shortest $\alpha$-reducts and for all $\alpha$-reducts are also NP-hard [96].
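For illustration, the non-empty entries of the discernibility matrix (59) and the consistency ratio used in the definition of $\alpha$-reducts can be sketched as follows (a toy sketch; objects are assumed to be attribute-value dictionaries):

```python
def discernibility_entries(objects, decisions, attributes):
    """Non-empty entries M_{i,j} of the discernibility matrix (59):
    attributes discerning pairs of objects from different decision classes."""
    entries = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if decisions[i] != decisions[j]:
                m = {a for a in attributes if objects[i][a] != objects[j][a]}
                if m:
                    entries.append(m)
    return entries

def is_alpha_consistent(B, entries, alpha):
    """Check disc(B)/conflict(S) >= alpha for a candidate attribute set B."""
    covered = sum(1 for m in entries if m & set(B))
    return covered >= alpha * len(entries)
```

An $\alpha$-reduct is then a minimal attribute set $B$ passing `is_alpha_consistent`; for $\alpha = 1$ this is the usual consistency test for decision reducts.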
9.2 From Templates to Optimal Association Rules

Let $\mathbb{A} = (U, A)$ be an information table. By descriptors (or simple descriptors) we mean terms of the form $(a = v)$, where $a \in A$ is an attribute and $v \in V_a$ is a value in the domain of $a$ (see [98]). By a template we mean a conjunction of descriptors:
$$\mathbf{T} = D_1 \wedge D_2 \wedge \ldots \wedge D_m,$$
where $D_1, \ldots, D_m$ are either simple or generalized descriptors. We denote by $length(\mathbf{T})$ the number of descriptors occurring in $\mathbf{T}$.
For a given template of length $m$:
$$\mathbf{T} = (a_{i_1} = v_1) \wedge \ldots \wedge (a_{i_m} = v_m),$$
an object $u \in U$ is said to satisfy the template $\mathbf{T}$ if and only if $\forall_j\; a_{i_j}(u) = v_j$. In this way the template $\mathbf{T}$ describes the set of objects having the common property: the values of attributes $a_{i_1}, \ldots, a_{i_m}$ are equal to $v_1, \ldots, v_m$, respectively. In this sense one can use templates to describe regularities in data, i.e., patterns in data mining or granules in soft computing.
Templates, besides length, are also characterized by their support. The support of a template $\mathbf{T}$ is defined by
$$support(\mathbf{T}) = |\{u \in U : u \text{ satisfies } \mathbf{T}\}|.$$
From the descriptive point of view, we prefer long templates with large support. The templates that are supported by a predefined number (say $min\_support$) of objects are called frequent templates. This notion corresponds exactly to the notion of frequent itemsets for transaction databases [1]. Many efficient algorithms for frequent itemset generation have been proposed in [1], [3], [2], [161], [44]. The problem of frequent template generation using rough set methods has also been investigated in [98], [105]. In Sect. 5.4 we considered a special kind of templates called decision templates or decision rules. Almost all objects satisfying a decision template should belong to one decision class.
Let us assume that the template $\mathbf{T}$, which is supported by at least $s$ objects, has been found (using one of the existing algorithms for frequent templates). We assume that $\mathbf{T}$ consists of $m$ descriptors, i.e.,
$$\mathbf{T} = D_1 \wedge D_2 \wedge \ldots \wedge D_m,$$
where $D_i$ (for $i = 1, \ldots, m$) is a descriptor of the form $(a_i = v_i)$ for some $a_i \in A$ and $v_i \in V_{a_i}$. We denote the set of all descriptors occurring in the template $\mathbf{T}$ by $DESC(\mathbf{T})$, i.e.,
$$DESC(\mathbf{T}) = \{D_1, D_2, \ldots, D_m\}.$$
Any set of descriptors $P \subseteq DESC(\mathbf{T})$ defines an association rule
$$\mathcal{R}_P \stackrel{def}{=} \left( \bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j \right).$$
The confidence factor of the association rule $\mathcal{R}_P$ can be defined as
$$confidence(\mathcal{R}_P) \stackrel{def}{=} \frac{support(\mathbf{T})}{support\left(\bigwedge_{D_i \in P} D_i\right)},$$
i.e., the ratio of the number of objects satisfying $\mathbf{T}$ to the number of objects satisfying all descriptors from $P$. The length of the association rule $\mathcal{R}_P$ is the number of descriptors from $P$.
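With descriptors modeled as (attribute, value) pairs, the confidence factor can be computed directly (a minimal sketch under our own data conventions, not code from the paper):

```python
def rule_confidence(table, template, P):
    """confidence(R_P) = support(T) / support(conjunction of descriptors in P)."""
    satisfies = lambda u, descs: all(u[a] == v for (a, v) in descs)
    sup_T = sum(satisfies(u, template) for u in table)
    sup_P = sum(satisfies(u, P) for u in table)
    return sup_T / sup_P
```

Note that enlarging $P$ can only shrink the denominator and thus increase the value returned, which is exactly property (60) below.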
In practice, we would like to find as many association rules with satisfactory confidence as possible (i.e., $confidence(\mathcal{R}_P) \geq c$ for a given $c \in (0; 1]$). The following property holds for the confidence of association rules:
$$P_1 \subseteq P_2 \Longrightarrow confidence(\mathcal{R}_{P_1}) \leq confidence(\mathcal{R}_{P_2}). \qquad (60)$$
This property says that if the association rule $\mathcal{R}_P$ generated from the descriptor set $P$ has satisfactory confidence, then the association rule generated from any superset of $P$ also has satisfactory confidence.
For a given confidence threshold $c \in (0; 1]$ and a given set of descriptors $P \subseteq DESC(\mathbf{T})$, the association rule $\mathcal{R}_P$ is called $c$-representative if
1. $confidence(\mathcal{R}_P) \geq c$;
2. for any proper subset $P' \subset P$ we have $confidence(\mathcal{R}_{P'}) < c$.
From Eqn. (60) one can see that instead of searching for all association rules, it is enough to find all $c$-representative rules. Moreover, every $c$-representative association rule covers a family of association rules. The shorter the association rule $\mathcal{R}$, the bigger the set of association rules covered by $\mathcal{R}$. First of all, we show the following theorem:

Theorem 24. For a fixed real number $c \in (0; 1]$ and a template $\mathbf{T}$, the Optimal $c$-Association Rules Problem, i.e., searching for the shortest $c$-representative association rule from $\mathbf{T}$ in a given table $\mathbb{A}$, is NP-hard.

Proof: Obviously, the Optimal $c$-Association Rules Problem belongs to NP. We show that the Minimal Vertex Covering Problem (which is NP-hard, see e.g. [35]) can be transformed to the Optimal $c$-Association Rules Problem.
Let the graph $G = (V, E)$ be an instance of the Minimal Vertex Cover Problem, where $V = \{v_1, v_2, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$. We assume that every edge $e_i$ is represented by a two-element set of vertices, i.e., $e_i = \{v_{i_1}, v_{i_2}\}$. We construct the corresponding information table (or transaction table) $\mathbb{A}(G) = (U, A)$ for the Optimal $c$-Association Rules Problem as follows:
1. The set $U$ consists of $m$ objects corresponding to the $m$ edges of the graph $G$ and $k + 1$ objects added for some technical purpose, i.e.,
$$U = \{x_1, x_2, \ldots, x_k\} \cup \{x^*\} \cup \{u_{e_1}, u_{e_2}, \ldots, u_{e_m}\},$$
where $k = \left\lfloor \frac{c}{1-c} \right\rfloor$ is a constant derived from $c$.
2. The set $A$ consists of $n$ attributes corresponding to the $n$ vertices of the graph $G$ and an attribute $a^*$ added for some technical purpose, i.e.,
$$A = \{a_{v_1}, a_{v_2}, \ldots, a_{v_n}\} \cup \{a^*\}.$$
The value of an attribute $a \in A$ on an object $u \in U$ is defined as follows:
(a) if $u \in \{x_1, x_2, \ldots, x_k\}$ then $a(x_i) = 1$ for any $a \in A$;
(b) if $u = x^*$ then for any $j \in \{1, \ldots, n\}$: $a_{v_j}(x^*) = 1$ and $a^*(x^*) = 0$;
(c) if $u \in \{u_{e_1}, u_{e_2}, \ldots, u_{e_m}\}$ then for any $j \in \{1, \ldots, n\}$:
$$a_{v_j}(u_{e_i}) = \begin{cases} 0 & \text{if } v_j \in e_i, \\ 1 & \text{otherwise,} \end{cases} \qquad \text{and} \quad a^*(u_{e_i}) = 1.$$
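The construction is mechanical and easy to script. The following sketch (our own helper, not from the paper; edges given as pairs of vertex indices) builds the rows of $\mathbb{A}(G)$ and, on the example graph of Fig. 34, reproduces the table shown there:

```python
from math import floor

def build_reduction_table(n_vertices, edges, c):
    """Rows of the table A(G) from the proof of Theorem 24:
    x_1..x_k (all ones), x* (zero only on a*), and one row u_e per edge
    with zeros exactly on the attributes of e's two endpoints."""
    k = floor(c / (1 - c))
    rows = {f"x{i}": [1] * (n_vertices + 1) for i in range(1, k + 1)}
    rows["x*"] = [1] * n_vertices + [0]
    for idx, (v1, v2) in enumerate(edges, start=1):
        row = [0 if v in (v1, v2) else 1 for v in range(1, n_vertices + 1)]
        rows[f"u_e{idx}"] = row + [1]
    return k, rows
```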
Example

Let us consider the Optimal $c$-Association Rules Problem for $c = 0.8$. We illustrate the proof of Theorem 24 by the graph $G = (V, E)$ with five vertices $V = \{v_1, v_2, v_3, v_4, v_5\}$ and six edges $E = \{e_1, e_2, e_3, e_4, e_5, e_6\}$. First we compute $k = \left\lfloor \frac{c}{1-c} \right\rfloor = 4$. Hence, the information table $\mathbb{A}(G)$ consists of six attributes $\{a_{v_1}, a_{v_2}, a_{v_3}, a_{v_4}, a_{v_5}, a^*\}$ and $(4 + 1) + 6 = 11$ objects $\{x_1, x_2, x_3, x_4, x^*, u_{e_1}, u_{e_2}, u_{e_3}, u_{e_4}, u_{e_5}, u_{e_6}\}$. The information table $\mathbb{A}(G)$ constructed from the graph $G$ is presented in the figure below.
[Graph $G$, as recoverable from the table: $e_1 = \{v_1, v_2\}$, $e_2 = \{v_1, v_4\}$, $e_3 = \{v_2, v_5\}$, $e_4 = \{v_2, v_4\}$, $e_5 = \{v_1, v_3\}$, $e_6 = \{v_3, v_5\}$.]

A(G)      a_{v_1}  a_{v_2}  a_{v_3}  a_{v_4}  a_{v_5}  a^*
x_1          1        1        1        1        1      1
x_2          1        1        1        1        1      1
x_3          1        1        1        1        1      1
x_4          1        1        1        1        1      1
x^*          1        1        1        1        1      0
u_{e_1}      0        0        1        1        1      1
u_{e_2}      0        1        1        0        1      1
u_{e_3}      1        0        1        1        0      1
u_{e_4}      1        0        1        0        1      1
u_{e_5}      0        1        0        1        1      1
u_{e_6}      1        1        0        1        0      1

Fig. 34. The construction of the information table A(G) from the graph G = (V, E) with five vertices and six edges for c = 0.8
The illustration of our construction is presented in Fig. 34.
We will show that a set of vertices $W \subseteq V$ is a minimal covering set for the graph $G$ if and only if the set of descriptors
$$P_W = \{(a_{v_j} = 1) : v_j \in W\}$$
defined by $W$ encodes the shortest $c$-representative association rule for $\mathbb{A}(G)$ from the template
$$\mathbf{T} = (a_{v_1} = 1) \wedge \ldots \wedge (a_{v_n} = 1) \wedge (a^* = 1).$$
The first implication ($\Rightarrow$) is obvious. We show that the implication ($\Leftarrow$) also holds. The only objects satisfying $\mathbf{T}$ are $x_1, \ldots, x_k$, hence we have $support(\mathbf{T}) = k$.
Let $P \Rightarrow Q$ be an optimal $c$-confidence association rule derived from $\mathbf{T}$. Then we have $\frac{support(\mathbf{T})}{support(P)} \geq c$, hence
$$support(P) \leq \frac{1}{c} \cdot support(\mathbf{T}) = \frac{1}{c} \cdot k = \frac{1}{c} \left\lfloor \frac{c}{1-c} \right\rfloor \leq \frac{1}{1-c} = \frac{c}{1-c} + 1.$$
Because $support(P)$ is an integer number, we have
$$support(P) \leq \left\lfloor \frac{c}{1-c} + 1 \right\rfloor = \left\lfloor \frac{c}{1-c} \right\rfloor + 1 = k + 1.$$
Thus, there is at most one object from the set $\{x^*\} \cup \{u_{e_1}, u_{e_2}, \ldots, u_{e_m}\}$ satisfying the template $P$. We consider two cases:
1. The object $x^*$ satisfies $P$: then the template $P$ cannot contain the descriptor $(a^* = 1)$, i.e.,
$$P = (a_{v_{i_1}} = 1) \wedge \ldots \wedge (a_{v_{i_t}} = 1),$$
and there is no object from $\{u_{e_1}, u_{e_2}, \ldots, u_{e_m}\}$ which satisfies $P$, i.e., for any edge $e_j \in E$ there exists a vertex $v_i \in \{v_{i_1}, \ldots, v_{i_t}\}$ such that $a_{v_i}(u_{e_j}) = 0$ (which means that $v_i \in e_j$). Hence, the set of vertices $W = \{v_{i_1}, \ldots, v_{i_t}\} \subseteq V$ is a solution of the Minimal Vertex Cover Problem.
2. An object $u_{e_j}$ satisfies $P$: then $P$ contains the descriptor $(a^* = 1)$; thus
$$P = (a_{v_{i_1}} = 1) \wedge \ldots \wedge (a_{v_{i_t}} = 1) \wedge (a^* = 1).$$
Let us assume that $e_j = \{v_{j_1}, v_{j_2}\}$. We consider two templates $P_1, P_2$ obtained from $P$ by replacing the last descriptor by $(a_{v_{j_1}} = 1)$ and $(a_{v_{j_2}} = 1)$, respectively, i.e.,
$$P_1 = (a_{v_{i_1}} = 1) \wedge \ldots \wedge (a_{v_{i_t}} = 1) \wedge (a_{v_{j_1}} = 1),$$
$$P_2 = (a_{v_{i_1}} = 1) \wedge \ldots \wedge (a_{v_{i_t}} = 1) \wedge (a_{v_{j_2}} = 1).$$
One can prove that both templates are supported by exactly $k + 1$ objects: $x_1, x_2, \ldots, x_k$ and $x^*$. Hence, similarly to the previous case, the two sets of vertices $W_1 = \{v_{i_1}, \ldots, v_{i_t}, v_{j_1}\}$ and $W_2 = \{v_{i_1}, \ldots, v_{i_t}, v_{j_2}\}$ establish solutions of the Minimal Vertex Cover Problem.
We have shown that any instance $I$ of the Minimal Vertex Cover Problem can be transformed in polynomial time to a corresponding instance $I'$ of the Optimal $c$-Association Rules Problem, and any solution of $I$ can be obtained from solutions of $I'$. Our reasoning shows that the Optimal $c$-Association Rules Problem is NP-hard. $\Box$

Since the problem of searching for the shortest representative association rule is NP-hard, the problem of searching for all association rules must be at least as hard, because it is a more complex problem: having all association rules, one can easily find the shortest representative association rule. Hence, we have the following:

Theorem 25. The problem of searching for all (representative) association rules from a given template is at least NP-hard unless P = NP.

The NP-hardness of the presented problems forces us to develop efficient approximate algorithms for solving them. In the next section we show that they can be developed using rough set methods.
9.3 Searching for Optimal Association Rules by Rough Set Methods

To solve the presented problem, we show that the problem of searching for optimal association rules from a given template is equivalent to the problem of searching for local $\alpha$-reducts of a decision table, which is a well-known problem in rough set theory. We propose the following Boolean reasoning scheme for association rule generation:

Association rule problem $(\mathbb{A}, \mathbf{T})$
  $\longrightarrow$ new decision table $\mathbb{A}|_\mathbf{T}$
  $\longrightarrow$ $\alpha$-reducts $P_1, \ldots, P_t$ of $\mathbb{A}|_\mathbf{T}$
  $\longrightarrow$ association rules $\mathcal{R}_{P_1}, \ldots, \mathcal{R}_{P_t}$

Fig. 35. The Boolean reasoning scheme for association rule generation
We construct a new decision table $\mathbb{A}|_\mathbf{T} = (U, A|_\mathbf{T} \cup \{d\})$ from the original information table $\mathbb{A}$ and the template $\mathbf{T}$ as follows:
- $A|_\mathbf{T} = \{a_{D_1}, a_{D_2}, \ldots, a_{D_m}\}$ is the set of attributes corresponding to the descriptors of the template $\mathbf{T}$:
$$a_{D_i}(u) = \begin{cases} 1 & \text{if the object } u \text{ satisfies } D_i, \\ 0 & \text{otherwise;} \end{cases} \qquad (61)$$
- the decision attribute $d$ determines whether a given object satisfies the template $\mathbf{T}$, i.e.,
$$d(u) = \begin{cases} 1 & \text{if the object } u \text{ satisfies } \mathbf{T}, \\ 0 & \text{otherwise.} \end{cases} \qquad (62)$$
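Formulas (61)-(62) translate into code directly (a minimal sketch under our own conventions: objects as attribute-value dictionaries, descriptors as (attribute, value) pairs):

```python
def decision_table_from_template(table, template):
    """Build A|_T: one binary attribute a_D per descriptor D of T (61),
    and the decision d = 1 iff the object satisfies the whole template (62)."""
    rows = []
    for u in table:
        bits = [1 if u[a] == v else 0 for (a, v) in template]
        rows.append((bits, 1 if all(bits) else 0))
    return rows
```

On the example of Sect. 9.3.1 below this reproduces Table 19; e.g., object $u_1$ yields the row $(1, 0, 1, 0, 0)$ with decision 0.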
The following theorems describe the relationship between the association rule problem and the reduct searching problem.

Theorem 26. For a given information table $\mathbb{A} = (U, A)$ and a template $\mathbf{T}$, the set of descriptors $P$ is a reduct in $\mathbb{A}|_\mathbf{T}$ if and only if the rule
$$\bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j$$
is a 100%-representative association rule from $\mathbf{T}$.

Proof: Any set of descriptors $P$ is a reduct in the decision table $\mathbb{A}|_\mathbf{T}$ if and only if every object $u$ with decision 0 is discerned from the objects with decision 1 by one of the descriptors from $P$ (i.e., there is at least one 0 in the information vector $inf_P(u)$). Thus $u$ does not satisfy the template $\bigwedge_{D_i \in P} D_i$. Hence
$$support\left(\bigwedge_{D_i \in P} D_i\right) = support(\mathbf{T}).$$
The last equality means that
$$\bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j$$
is a 100%-confidence association rule for the table $\mathbb{A}$. $\Box$
Analogously, one can show the following fact:

Theorem 27. For a given information table $\mathbb{A} = (U, A)$, a template $\mathbf{T}$, and a set of descriptors $P \subseteq DESC(\mathbf{T})$, the rule
$$\bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j$$
is a $c$-representative association rule obtained from $\mathbf{T}$ if and only if $P$ is an $\alpha$-reduct of $\mathbb{A}|_\mathbf{T}$, where
$$\alpha = 1 - \frac{\frac{1}{c} - 1}{\frac{n}{s} - 1},$$
$n$ is the total number of objects from $U$ and $s = support(\mathbf{T})$. In particular, the problem of searching for optimal association rules can be solved using methods for $\alpha$-reduct finding.
Proof: Assume that $support\left(\bigwedge_{D_i \in P} D_i\right) = s + e$, where $s = support(\mathbf{T})$. Then we have
$$confidence\left(\bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j\right) = \frac{s}{s + e} \geq c.$$
This condition is equivalent to
$$e \leq \left(\frac{1}{c} - 1\right) s.$$
Hence, one can evaluate the discernibility degree of $P$ by
$$disc\_degree(P) = \frac{e}{n - s} \leq \frac{\left(\frac{1}{c} - 1\right) s}{n - s} = \frac{\frac{1}{c} - 1}{\frac{n}{s} - 1} = 1 - \alpha.$$
Thus
$$\alpha = 1 - \frac{\frac{1}{c} - 1}{\frac{n}{s} - 1}. \qquad \Box$$
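The correspondence of Theorem 27 is easy to sanity-check numerically; for the example of Sect. 9.3.1 below ($c = 0.9$, $n = 18$, $s = 10$), the sketch yields $\alpha \approx 0.86$:

```python
def alpha_for_confidence(c, n, s):
    """alpha = 1 - (1/c - 1)/(n/s - 1): c-representative rules from a template
    with support s correspond to alpha-reducts of A|_T (Theorem 27)."""
    return 1 - (1 / c - 1) / (n / s - 1)
```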
Searching for minimal $\alpha$-reducts is a well-known problem in rough set theory. One can show that the problem of searching for shortest $\alpha$-reducts is NP-hard [96] and the problem of searching for all $\alpha$-reducts is at least NP-hard. However, there exist many approximate algorithms solving the following problems:
1. searching for a shortest reduct (see [143]);
2. searching for a number of short reducts (see, e.g., [158]);
3. searching for all reducts (see, e.g., [7]).
The algorithms for the first two problems are quite efficient from the computational complexity point of view. Moreover, in practical applications, the reducts generated by them are quite close to the optimal ones.
In Sect. 9.3.1, we present some heuristics for these problems in terms of association rule generation.
9.3.1 Example

The following example illustrates the main idea of our method. Let us consider the information table $\mathbb{A}$ (Table 18) with 18 objects and 9 attributes.
Assume that the template
$$\mathbf{T} = (a_1 = 0) \wedge (a_3 = 2) \wedge (a_4 = 1) \wedge (a_6 = 0) \wedge (a_8 = 1)$$
has been extracted from the information table $\mathbb{A}$. One can see that $support(\mathbf{T}) = 10$ and $length(\mathbf{T}) = 5$. The new decision table $\mathbb{A}|_\mathbf{T}$ is presented in Table 19.
The discernibility function for the decision table $\mathbb{A}|_\mathbf{T}$ is as follows:
$$f(D_1, D_2, D_3, D_4, D_5) = (D_2 \vee D_4 \vee D_5) \wedge (D_1 \vee D_3 \vee D_4) \wedge (D_2 \vee D_3 \vee D_4) \wedge (D_1 \vee D_2 \vee D_3 \vee D_4) \wedge (D_1 \vee D_3 \vee D_5) \wedge (D_2 \vee D_3 \vee D_5) \wedge (D_3 \vee D_4 \vee D_5) \wedge (D_1 \vee D_5).$$
Table 18. The example of information table A and template T with support 10

A       a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9
u_1      0    *    1    1    *    2    *    2    *
u_2      0    *    2    1    *    0    *    1    *
u_3      0    *    2    1    *    0    *    1    *
u_4      0    *    2    1    *    0    *    1    *
u_5      1    *    2    2    *    1    *    1    *
u_6      0    *    1    2    *    1    *    1    *
u_7      1    *    1    2    *    1    *    1    *
u_8      0    *    2    1    *    0    *    1    *
u_9      0    *    2    1    *    0    *    1    *
u_10     0    *    2    1    *    0    *    1    *
u_11     1    *    2    2    *    0    *    2    *
u_12     0    *    3    2    *    0    *    2    *
u_13     0    *    2    1    *    0    *    1    *
u_14     0    *    2    2    *    2    *    2    *
u_15     0    *    2    1    *    0    *    1    *
u_16     0    *    2    1    *    0    *    1    *
u_17     0    *    2    1    *    0    *    1    *
u_18     1    *    2    1    *    0    *    2    *
T        0    *    2    1    *    0    *    1    *
Table 19. The new decision table A|_T constructed from A and template T

A|_T     D_1      D_2      D_3      D_4      D_5      d
       a_1 = 0  a_3 = 2  a_4 = 1  a_6 = 0  a_8 = 1
u_1       1        0        1        0        0      0
u_2       1        1        1        1        1      1
u_3       1        1        1        1        1      1
u_4       1        1        1        1        1      1
u_5       0        1        0        0        1      0
u_6       1        0        0        0        1      0
u_7       0        0        0        0        1      0
u_8       1        1        1        1        1      1
u_9       1        1        1        1        1      1
u_10      1        1        1        1        1      1
u_11      0        1        0        1        0      0
u_12      1        0        0        1        0      0
u_13      1        1        1        1        1      1
u_14      1        1        0        0        0      0
u_15      1        1        1        1        1      1
u_16      1        1        1        1        1      1
u_17      1        1        1        1        1      1
u_18      0        1        1        1        0      0
After simplification of the condition presented in Table 20, we obtain six reducts for the decision table $\mathbb{A}|_\mathbf{T}$:
$$f(D_1, D_2, D_3, D_4, D_5) = (D_3 \wedge D_5) \vee (D_4 \wedge D_5) \vee (D_1 \wedge D_2 \wedge D_3) \vee (D_1 \wedge D_2 \wedge D_4) \vee (D_1 \wedge D_2 \wedge D_5) \vee (D_1 \wedge D_3 \wedge D_4).$$
Thus, we have found from the template $\mathbf{T}$ six association rules with 100% confidence (see Table 20).
For $c = 90\%$, we would like to find $\alpha$-reducts for the decision table $\mathbb{A}|_\mathbf{T}$, where
$$\alpha = 1 - \frac{\frac{1}{c} - 1}{\frac{n}{s} - 1} = 0.86.$$
Hence, we would like to search for a set of descriptors that covers at least
$$\lceil (n - s) \cdot \alpha \rceil = \lceil 8 \cdot 0.86 \rceil = 7$$
elements of the discernibility matrix $M(\mathbb{A}|_\mathbf{T})$. One can see that the following sets of descriptors:
$$\{D_1, D_2\},\; \{D_1, D_3\},\; \{D_1, D_4\},\; \{D_1, D_5\},\; \{D_2, D_3\},\; \{D_2, D_5\},\; \{D_3, D_4\}$$
have non-empty intersection with exactly 7 members of the discernibility matrix $M(\mathbb{A}|_\mathbf{T})$. Table 20 presents all association rules obtained from those sets.
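This coverage claim can be verified with a few lines (the entries below encode the rows of the simplified matrix from Table 20 as sets of descriptor indices):

```python
# rows of the simplified discernibility matrix M(A|_T) (Table 20)
ENTRIES = [{2, 4, 5}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4},
           {1, 3, 5}, {2, 3, 5}, {3, 4, 5}, {1, 5}]

def coverage(P):
    """Number of matrix entries having non-empty intersection with P."""
    return sum(1 for m in ENTRIES if m & set(P))
```

Checking a pair that is absent from the list, e.g. $\{D_2, D_4\}$, shows why: it covers only 6 of the 8 entries, while a full reduct such as $\{D_3, D_5\}$ covers all 8.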
Table 20. The simplified version of the discernibility matrix M(A|_T); representative association rules with (100%)-confidence and representative association rules with at least (90%)-confidence

M(A|_T)   u_2, u_3, u_4, u_8, u_9, u_10, u_13, u_15, u_16, u_17
u_1       D_2 ∨ D_4 ∨ D_5
u_5       D_1 ∨ D_3 ∨ D_4
u_6       D_2 ∨ D_3 ∨ D_4
u_7       D_1 ∨ D_2 ∨ D_3 ∨ D_4
u_11      D_1 ∨ D_3 ∨ D_5
u_12      D_2 ∨ D_3 ∨ D_5
u_14      D_3 ∨ D_4 ∨ D_5
u_18      D_1 ∨ D_5

100%-representative rules:
D_3 ∧ D_5 ⇒ D_1 ∧ D_2 ∧ D_4
D_4 ∧ D_5 ⇒ D_1 ∧ D_2 ∧ D_3
D_1 ∧ D_2 ∧ D_3 ⇒ D_4 ∧ D_5
D_1 ∧ D_2 ∧ D_4 ⇒ D_3 ∧ D_5
D_1 ∧ D_2 ∧ D_5 ⇒ D_3 ∧ D_4
D_1 ∧ D_3 ∧ D_4 ⇒ D_2 ∧ D_5

90%-representative rules:
D_1 ∧ D_2 ⇒ D_3 ∧ D_4 ∧ D_5
D_1 ∧ D_3 ⇒ D_2 ∧ D_4 ∧ D_5
D_1 ∧ D_4 ⇒ D_2 ∧ D_3 ∧ D_5
D_1 ∧ D_5 ⇒ D_2 ∧ D_3 ∧ D_4
D_2 ∧ D_3 ⇒ D_1 ∧ D_4 ∧ D_5
D_2 ∧ D_5 ⇒ D_1 ∧ D_3 ∧ D_4
D_3 ∧ D_4 ⇒ D_1 ∧ D_2 ∧ D_5
In Fig. 36, we present the set of all 100%-association rules (light gray region) and 90%-association rules (dark gray region). The corresponding representative association rules are marked with bold frames.
9.3.2 The Approximate Algorithms

From the previous example it follows that the problem of searching for representative association rules can be treated as a search problem in the lattice of attribute subsets (see Fig. 36). In general, there are two search strategies: bottom-up and top-down. The top-down strategy starts with the whole descriptor set and tries to go down through the lattice. In every step, we reduce the most superfluous subsets, keeping the subsets which most probably can be reduced in the next step. Almost all existing methods realize this strategy (e.g., the Apriori algorithm [2]). The advantages of these methods are as follows:
1. They generate all association rules during the search process.
2. They are easy to implement for either parallel or concurrent computing.
But this process can take a very long computation time because of the NP-hardness of the problem (see Theorem 25).
The rough set based method realizes the bottom-up strategy. We start with the empty set of descriptors. Here we describe a modified version of the greedy heuristic for the decision table $\mathbb{A}|_\mathbf{T}$. In practice, we do not construct this additional decision table. The main problem is to compute the number of occurrences of each descriptor in the discernibility matrix $M(\mathbb{A}|_\mathbf{T})$. For any descriptor $D$, this
470 H.S. Nguyen
Algorithm 8. Searching for a shortest representative association rule
Input: Information table A, template T, minimal confidence c.
Output: A short c-representative association rule.
1:  begin
2:    Set P := ∅; U_P := U;
3:    min_support := (1/c) · support(T);
4:    Select the descriptor D from DESC(T) \ P which is satisfied by the smallest number of objects from U_P;
5:    Set P := P ∪ {D};
6:    U_P := satisfy(P);   // i.e., the set of objects satisfying all descriptors from P
7:    if |U_P| > min_support then
8:      GOTO Step 4;
9:    else
10:     STOP;
11:   end
12: end
[Fig. 36 (not reproduced): the lattice of all subsets of {D_1, ..., D_5}, showing the region of association rules with confidence = 100%, the region of association rules with confidence > 90%, and the region of association rules with confidence < 90%; the representative association rules are marked with bold frames.]

Fig. 36. The illustration of 100% and 90% representative association rules
number is equal to the number of 0's occurring in the column $a_D$ represented by this descriptor, and it can be computed using simple SQL queries of the form
SELECT COUNT ... WHERE ...
We present two algorithms: the first (Algorithm 8) finds an almost shortest c-representative association rule. The presented algorithm does not guarantee that the descriptor set P is c-representative, but one can achieve this by removing from P (which is in general small) all unnecessary descriptors.
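Under the same data conventions as before (descriptors as (attribute, value) pairs), the greedy loop of Algorithm 8 can be sketched as follows; the stopping test support(P) <= support(T)/c stands in for the min_support bookkeeping of the pseudocode:

```python
def greedy_representative_rule(table, template, c):
    """Greedy sketch of Algorithm 8: repeatedly add the descriptor satisfied
    by the fewest objects still satisfying P, until support(P) <= support(T)/c,
    at which point confidence(R_P) = support(T)/support(P) >= c."""
    sat = lambda u, d: u[d[0]] == d[1]
    support_T = sum(all(sat(u, d) for d in template) for u in table)
    P, U_P, rest = [], list(table), list(template)
    while len(U_P) * c > support_T and rest:
        D = min(rest, key=lambda d: sum(sat(u, d) for u in U_P))
        rest.remove(D)
        P.append(D)
        U_P = [u for u in U_P if sat(u, D)]
    return P
```

On the data of Table 18 with $c = 1$, the sketch first picks $D_3$ and then $D_5$, i.e., it finds the shortest 100%-representative rule $D_3 \wedge D_5 \Rightarrow D_1 \wedge D_2 \wedge D_4$.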
The second algorithm (Algorithm 9) finds k short c-representative association rules, where k and c are parameters given by the user. This algorithm makes use of the beam search strategy, which evolves the k most promising nodes at each depth of the search tree.
Algorithm 9. Searching for k short representative association rules
Input: Information table A, template T, minimal confidence c, number of representative rules k ∈ N.
Output: k short c-representative association rules R_{P_1}, ..., R_{P_k}.
1:  begin
2:    for i := 1 to k do
3:      Set P_i := ∅;
4:      U_{P_i} := U;
5:    end
6:    Set min_support := (1/c) · support(T);
7:    Result_set := ∅;
8:    Working_set := {P_1, ..., P_k};
9:    Candidate_set := ∅;
10:   for each P_i ∈ Working_set do
11:     Select k descriptors D_{i,1}, ..., D_{i,k} from DESC(T) \ P_i which are satisfied by the smallest number of objects from U_{P_i};
12:     Insert P_i ∪ {D_{i,1}}, ..., P_i ∪ {D_{i,k}} into the Candidate_set;
13:   end
14:   Select k descriptor sets P'_1, ..., P'_k from the Candidate_set (if they exist) which are satisfied by the smallest number of objects from U;
15:   Set Working_set := {P'_1, ..., P'_k};
16:   for each P_i ∈ Working_set do
17:     Set U_{P_i} := satisfy(P_i);
18:     if |U_{P_i}| < min_support then
19:       Move P_i from the Working_set to the Result_set;
20:   end
21:   if (|Result_set| > k or Working_set is empty) then
22:     STOP;
23:   else
24:     GOTO Step 9;
25:   end
26: end
[Fig. 37 (not reproduced): the old working set P_1, ..., P_k is expanded by adding, for each P_i, the k best descriptors D^i_1, ..., D^i_k; from the resulting candidate set of sets P_i ∪ {D^i_j}, the k best subsets are selected as the new working set P'_1, ..., P'_k.]

Fig. 37. The illustration of the k short representative association rules algorithm
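The beam search of Algorithm 9 and Fig. 37 can be sketched in the same setting (a simplified sketch of ours: candidate sets are scored by plain support and duplicate sets are not merged):

```python
def beam_search_rules(table, template, c, k):
    """Beam-search sketch of Algorithm 9: keep the k candidate descriptor
    sets with the smallest support at each level; a set is finished when its
    support drops to support(T)/c or below (its rule has confidence >= c)."""
    sat = lambda u, d: u[d[0]] == d[1]
    sup = lambda P: sum(all(sat(u, d) for d in P) for u in table)
    threshold = sup(template) / c
    working, result = [[]], []
    while working and len(result) < k:
        candidates = [P + [D] for P in working for D in template if D not in P]
        candidates.sort(key=sup)          # the k most promising nodes survive
        working = []
        for P in candidates[:k]:
            (result if sup(P) <= threshold else working).append(P)
    return result[:k]
```

On the data of Table 18 with $c = 1$ and $k = 2$, the sketch returns the two shortest reducts $\{D_3, D_5\}$ and $\{D_4, D_5\}$ found in Sect. 9.3.1.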
10 Rough Set and Boolean Reasoning Approach to Mining Large Data Sets

Mining large data sets is one of the biggest challenges in KDD. In many practical applications, there is a need for data mining algorithms running on terminals of a client-server database system where the only access to the database (located on the server) is via SQL queries.
Unfortunately, the data mining methods based on rough sets and Boolean reasoning proposed so far are characterized by high computational complexity, and their straightforward implementations are not applicable to large data sets. The critical factor for the time complexity of algorithms solving the discussed problems is the number of simple SQL queries like
SELECT COUNT FROM aTable WHERE aCondition
In this section, we present some efficient modifications of these methods to solve this problem. We consider the following issues:
- searching for short reducts from large data sets;
- induction of rule-based rough classifiers from large data sets;
- searching for best partitions defined by cuts on continuous attributes;
- soft cuts: a new paradigm for the discretization problem.
10.1 Searching for Reducts

The application of the ABR approach to the reduct problem was described in Sect. 5. We have shown (see Algorithm 2 on page 389) that the greedy heuristic for the minimal reduct problem uses only two functions: