A New Rough Sets Model Based On Database Systems: Xiaohua Hu T. Y. Lin

Fundamenta Informaticae XX (2004) 118
IOS Press
A New Rough Sets Model Based on Database Systems

Xiaohua Hu C
College of Information Science
Drexel University
Philadelphia, PA 19104, USA
thu@cis.drexel.edu
T. Y. Lin
Department of Computer Science
San Jose State University
San Jose, CA 94403, USA
tylin@cs.sjsu.edu
Jianchao Han
Department of Computer Science
California State University Dominguez Hills
Carson, CA 90747, USA
jhan@csudh.edu
Abstract. Rough sets theory was proposed by Pawlak in the 1980s and has been applied successfully in a lot of domains. One of the major limitations of the traditional rough sets model in the
real applications is the inefficiency in the computation of core and reduct, because all the intensive
computational operations are performed in flat files. In order to improve the efficiency of computing
core attributes and reducts, many novel approaches have been developed, some of which attempt
to integrate database technologies. In this paper, we propose a new rough sets model and redefine
the core attributes and reducts based on relational algebra to take advantages of the very efficient
set-oriented database operations. With this new model and our new definitions, we present two new
algorithms to calculate core attributes and reducts for feature selections. Since relational algebra
operations have been efficiently implemented in most widely-used database systems, the algorithms
presented in this paper can be extensively applied to these database systems and adapted to a wide
range of real-life applications with very large data sets. Compared with the traditional rough set
models, our model is very efficient and scalable.
Keywords: Rough set, database systems, relational algebra, reduct, feature selection
C
Corresponding author
X. Hu, T. Lin, J. Han / A New Rough Set Model
1. Introduction
Rough sets theory was first introduced by Pawlak in the 1980s [14] and it has been applied in many
applications such as machine learning, knowledge discovery, and expert systems [9, 10, 15] since then.
Rough set is especially useful for domains where data collected are imprecise and/or incomplete about
the domain objects. It provides powerful tools for data analysis and data mining from imprecise and
ambiguous data. Many rough sets models have been developed in the rough set community in the last
decades [5, 14, 16, 17, 19, 20, 21], including Ziarkos VPRS [21] and Hus GRS [5], to name a few.
Some of them have been applied in the industry data mining projects such as stock market prediction,
patient symptom diagnosis, telecommunication churner prediction, and financial bank customer attrition
analysis to solve challenging business problems [9, 10]. These rough set models focus on the extension of
the original model proposed by Pawlak [14, 15] and attempt to deal with its limitations such as handling
statistical distribution or noisy data.
Our experience of applying the VPRS and GRS models in large data sets in data mining applications
has shown that one of the strong drawbacks of the traditional rough set model is the inefficiency of rough
set methods and algorithms of computing the core attributes and reducts and identifying the dispensable
attributes, which limits the suitability of the traditional rough set model in data mining applications.
Further investigation of the problem reveals that most existing rough set models do not integrate with
the relational database systems and a lot of computational intensive operations are performed in flat files
rather than utilizing the high performance database set operations. Moreover, not much attention and
attempt have been paid to design new rough sets model by effectively combining database technologies
to generate the core attributes and reducts so as to make their computations efficient and scalable in the
large data set.
To overcome the problem, some approaches to improving the efficiency of finding core attributes
and reducts have been developed [1], including the algorithms presented in [13], which largely improve
the generation of discernibility relation by sorting the objects. Some authors have proposed approaches
to reduce data size using relational database system techniques [8] and developed rough set-based data
mining systems that integrates RDBMS capabilities such as RSDM (Rough Set Data Miner) [3]. The
algorithms presented in [3] embed SQL queries to take advantage of database technologies.
In this paper, we attempt to redefine some concepts of rough set theory such as core attributes and
reducts by using relational algebra so that the computation of core attributes and reducts can be per

. The
formed with very efficient set-oriented database operations, such as
arguments behind our model based on database technologies come from the following two points:
Most existing RDBMS systems have implemented pre-processing and efficient organization of
the original data, such as sorting and indexing, where sorting is the main idea of the algorithms
proposed in [13], and
The efficient implementation of SQL queries in most RDBMS systems reduces the cost of accessing disks and is scalable to huge data sets [8, 3].
The rest of the paper is organized as follows: We give an overview of the rough set theory based on
the model proposed by Pawlak [14, 15] with some examples in Section 2. In Section 3, we redefine the
main concepts and methods of rough set theory based on the database systems set-oriented operations,
and illustrate these concepts with examples. With the new definition, we propose an efficient algorithm
to compute the core attributes and show the correctness of the algorithm as well as the equivalence of
our definition to the corresponding one defined in the traditional rough set theory in Section 4. Another
efficient algorithm for feature selection based on reduct generation is presented in Section 5. Finally, we
conclude with some discussions and our future works in Section 6.
2. Overview of Rough Sets Theory

In rough sets theory, the data is collected in a table, called decision table (or table in the database term).
Rows of the decision table correspond to objects, and columns correspond to attributes. In the data
set, we assume we are given a set of examples with a class label to indicate the class to which each
example belongs. We call the class label the decision attribute, the rest of the attributes the condition
attributes. We also assume that our data set is stored in a relational table with the form Table(conditionattributes, decision-attributes).
is used to denote the condition attributes,
for decision attributes,
where
, and denotes the -th tuple of the data table. Rough sets theory defines three re

gions based on the equivalent classes induced by the attribute values:

and
.
contains all the objects which are classified

surely based on the data collected, and
contains all the objects which can be

is the difference between the upper approximation and the lower
classified probably, while the
approximation.
We give the formal definitions as below.

"!

& &

$ # %
$#(' &*)+, %
-' &*)
&
' )*.0/ 21 434343. 25 !
687:9;E .' 7<)=::.F 7>G/ F 71 434343. 7@? !
Suppose
is a database table. We define two tuples and are in the same equivalent
class induced by attributes ( is a subset of or ) if
. (The tuples in the same equivalent
class have the same attribute values for all the attributes in ). Let
denote the
equivalent classes induced by , and
denotes the equivalent classes

induced by (
are also called
).
7 A
BC=D7@#
H
IKJELNM4OP QSR T(UGV of W

under 7X9Y
7Z#
H
.
IKJELNM4O P QSR T(UGV \[.7@# ] 7@# 9^W
(_`a$b434343E(cd!-3
For any object e#gfIKJELNM4O P QSR T(UGV , h# can be classified certainly to H
.
IKJELNM4O-P QSR TiP U*RG\[.IKJELNM4O-P QSR T(UGV ] W
fj' )k2a$b434343.$lm!-3
Definition 2.2. For a given set H
, the upper approximation nKo-opM4O P QSR T(UGV of W
under 7q9Y
union of those equivalent classes 7Z# , each of which has a non-empty intersection with 2
.
nKo-opM4O P QSR T(UGV \[.7@# ] 7@# rW
;s t(_ua$b434343E(cd!-3
For any object e#gfvnKo-opM4O P QSR T(UGV , h# can be classified probably to H
.
nKo-opM4O-P QSR TiP U*Rm\[nKo-opM4O-P QSR T(UGV ] W
fj' )k2a$b434343.$lm!-3
Definition 2.1. For a given set
, the lower approximation
union of all those equivalent classes , each of which is contained by
is the
is the
Table 1. 8 cars with attributes Door, Size, Cylinder, Mileage
Tuple id
Door
$/
e1
Size
Cylinder
Mileage
compact
high
sub
low
compact
high
compact
low
compact
low

compact
high

sub
low

sub
low
Definition 2.3. The boundary area between the upper approximation and the lower approximation is
defined as
J mC=DBO P QSR TiP U*RS nKo-opM4O-P QSR TiP U*R IKJELNM4O-P QSR TiP U*R%3
Example 2.1. Suppose we have a collection of 8 cars ( /( e1E434343.( ) with information about the rJJEO ,
&g_ M , _C=DM4O and _ %MB M , shown in Table 1. rJJEO , &g_ M and _C=DM4O are the condition attributes
and _ %MB M is the decision attribute. The attribute Tupel id is just for explanation purpose and can be
ignored.
With q
_ %MB M! and qq.rJJEOEi&g_ Mi _C=DM4O! , we can calculate the elementary sets,

lower and upper approximations as follows:
' )
=
_ %MB MZ _ _ %MB MZ JEL !
_ %MB MZ _
=
$/( E( E!
_ %MB MZ JEL
=
e1( ( ( E( .!
' )
=
$/4!- e1( !- ( E( E!- !- E!!
IKJELNM4O-P mR T # 4# = $/4!
IKJELNM4O-P mR T #
=
e1( ( E( E!
=
$/( e1E( ( E( .!
IKJELNM4O-P mR TiP U*R
nKo-opM4O P mR T # 4# = $/( E( ( E!
nKo-opM4O-P mR T #
=
e1( E( ( ( E( E( E!
nKo-opM4O-P mR TiP U*R
=
$/( e1E( ( ( E( ( E( .!
J mC=DBO P mR TiP U*R
=
( E( E!
Only 5 of 8 cars i/( e1E( ( E( belong to the lower approimation of based on , IKJELNM4O P mR TiP U*R ,
while 3 of 8 cars fall in the boundary area. This fact indicates that the information on rJJEOEi&g_ M and
_C=DM4O collected so far is not consistent, for it is only good enough to make a classification model
for the above five cars, but not enough to classify other three. In order to classify E( and , more

!"#$!%'& (#$&

!"#$!%' )*

!"#$!%'& (#$&

!"#$!%' )*

Table 2. 8 cars with attributes
Tuple id
Weight
$/
e1

"!
#$%&'
Door
Size
Cylinder
Mileage
low
compact
high
sub
low
medium
compact
high
high
compact
low
high
compact
low
low

low
compact
high

high
sub
low

low
sub
low
information is needed. Actually, one can easily verify that and are a pair of contradictory tuples, for
they have the same condition attributes values but different decision attribute values.
(:M4_
(:M4_
Example 2.2. To classify the tuples in the boundary area, we add a new condition attribute

of

for each car. The result is illustrated in Table 2.
cars and collect information on
With the condition attributes set

, we can calculate the lower,
upper approximations and boundary area as below:
\ )(:M4_ i rJJEOEi&g_ Mi _C=DM4O!

' )
=
$/4!- e1!- E!- !- E!- !- .!- !!
IKJELNM4O +* ' ) ,
=
$/( e1E( ( ( E( ( E( .!
nKo-opM4O -* ' ) ,
=
$/( e1E( ( ( E( ( E( .!

*
J mC=DBO ' ) , = t
From above, one can see, once the information on (:M4_ is added, a classification model for all 8

cars can be built.
One of the nice features of rough sets theory is that rough sets can tell whether the data is complete or
not based on the data itself. If the data is incomplete, it suggests more information about the objects need
to be collected in order to build a good classification model. On the other hand, if the data is complete,
rough sets theory can also determine whether there are more than enough or redundant information
in the data and find the minimum data needed for classification model. This property of rough sets
theory is very important for applications where domain knowledge is very limited or data collection is
very expensive/laborious because it makes sure the data collected is just good enough to build a good
classification model without sacrificing the accuracy of the classification model or wasting time and effort
to gather extra information about the objects. Furthermore, rough sets theory classifies all the attributes
into three categories: core attributes, reduct attributes and dispensable attributes. Core attributes have
the essential information to make correct classification for the data set and should be retained in the data
set; dispensable attributes are the redundant ones in the data set and should be eliminated; and reduct
attributes are in the middle between. Depending on the combination of the attributes, in some cases, a
reduct attribute is not necessary, while in other situations it is essential.
Definition 2.4. An Attribute
u
Zfd is a dispensable attribute in with respect to
IKJELNM4O P mR TiP U*R IKJELNM4O P V(R TiP U*R 3

if
One can verify that, in Table 2,
' (:M4_ ii&g_ Mi _C=DM4OE)

=
$/.( .!- e1( .!- !- !- E!- E!!-
IKJELNM4O P k# #
# R TiP # R = $/( e1( E( ( ( E( E( E!-3
Thus, IKJELNM4O P U k# #
# R TiP #
R ;IKJELNM4O P k# #
# R TiP # R , so rJJEO
a dispensable attribute in with respect to _ %MB M .
Definition 2.5. An Attribute u
Zfd is a core attribute in with respect to if
IKJELNM4O P mR TiP U*R s IKJELNM4O P V(R TiP U*R 3

! (#$&
!
))
! (#$&

!
!
!"#$!

!
!"#$!

!
! (#$&

!
!"#$!
is

Similarly, it can be shown that, in Table 2,
' rJJEOEi&g_ Mi _C=DM4OE)

=
$/4!- e1( .!- ( ( .!- !- !!-
IKJELNM4O P U #
# R TiP # R = $/( e1( ( E( E!-3
Thus, IKJELNM4O P U k# #
# R TiP #
R s IKJELNM4O P U #
# R TiP # R , so (:M4_ is
a core attribute in with respect to _ %MB M .

Definition 2.6. A subset
of the attributes set is a reduct of with respect to , if and only if the

))
!
))
! (#$&

!
!
!"#$!
!

!"#$!

!

!
))

!"#$!

following conditions are satisfied:
IKJELNM4O P R TiP U*R IKJELNM4O P mR TiP U*R

2. 6
IKJELNM4O-P R TiP U*Rus IKJELNM4O-P mR TiP U*R 3

1.
From Definition 2.6, one can see that a reduct is a minimum subset of the entire condition attributes
set that has the same classification capability as the original condition attributes set. One can easily show
that, in Table 2,
,
is a reduct, because
(:M4_ &g_ M
' (:M4_ ii&g_ M )

IKJELNM4O P k# # R TiP U*R
' (:M4_ k)
' &g_ M )
IKJELNM4O-P k# R TiP U*R
IKJELNM4O P # R TiP U*R

! (#$&
!

=
=
=
=
=
=
$/( E!- e1( .!- !- -( .!- E!!

$/( e1( E( ( ( E( E( E!
$/( e1( ( .!- !- -( E( .!!
$/( ( ( E( E!- e1( .( E!!
( ( E( .!
e1( E( .!

IKJELNM4O P mR TiP U*R

IKJELNM4O-P mR TiP U*R

s IKJELNM4O P mR TiP U*R
For a data set, there may exist more than one reduct. Actually, it can be easily verified that (:M4_ ,
_C=DM4O is another reduct of with respect to in Table 2, and (:M4_ , &g_ M and (:M4_ , _C=DM4O
are the only two reducts of with respect to . Unfortunately, finding all reducts and the minimum
! (#$&
!

s

reduct of the attributes set in a given data table is NP-hard [18].
Definition 2.7. An attribute
u
Zfd
is a reduct attribute if
is part of a reduct.

&g_ M
_C=DM4O
Every reduct must contain all the core attributes [14]. (We will explain the algorithm to find a reduct
in the later part of the paper). So ,
are reduct attributes. According to rough set theory,
based on the data in Table 2, in order to make good classification model for the attribute , we
need

information of the cars plus either or
but not necessary both at the same
time.
(:M4_
&g_ M
_C=DM4O
_ %MB M
3. A New Rough Sets Model Based on Database Systems

There exist some limitations of rough sets theory which restricts its suitability in practice [5, 8, 16, 17,
19, 20, 21], two of which are described as follows:
(1) Rough sets theory uses the strict set inclusion definition to define the lower approximation, which
does not consider the statistical distribution/noise of the data in the equivalence class. This drawback of the original rough set model has limited its applications in domains where data tends
to be noisy or dirty. For example, suppose

and

, based on the strict set inclusion definition,
, but
of
the elements in
are contained in
, and the only exception is , which may be an outlier
or a noise in the data set. Thus, it is reasonable to add
into the lower approximation of
,
though the traditional rough set model does not do so. Some new models have been proposed to
overcome this problem such as Ziarkos Varied Precise Rough Set Model (VPRS) [21] and our
previous research work on Generalized Rough Set Model (GRS Model) [5]. A detailed discussion of these new models is beyond the scope of our paper, for interested readers, please see the
references [5, 21].
7@# $/( e1E( 434343.( ( e1 i/4!

7#
A
>
$/( e1( E434343.( ( $/ ( $/ i/434343.( e1 E!
7W# s W
1 i/
7W#
H
(2) Another drawback of rough sets theory is the inefficiency in computation, which limits its suitability for large data sets in real-world applications. In order to find the reducts, core and dispensable
attributes, the rough sets model needs to construct all the equivalent classes based on the attribute
values of the condition and decision attributes. This process is very time-consuming, and thus the
model is very inefficient and infeasible, and doesnt scale for large data set, which is very common
in data mining applications [5, 8]. Some new algorithms to overcoming this inefficiency have been
developed [1, 13].
Our research investigation of the inefficiency problem of rough sets model finds out that most rough
set models do not integrate with the relational database systems and a lot of basic operations of these
computations are performed in flat files rather than utilizing the high performance database set operations. In considering of this and influenced by [8, 3], we borrow the main ideas of rough sets theory and
redefine them using the database theory to utilize the very efficient set-oriented database system operations. Almost all the operations in rough sets computation used in our method can be performed using

the database system operations such as
, etc. In this section, we will give our new
definitions of core attributes, dispensable attributes and reducts based on database operations. Two new
algorithms for finding core attributes and feature selection based on our new model will be presented in
the following sections.
As pointed out in [14], all the core attributes are indispensable part of every reduct. So it is very
important to have a very efficient way to find all the core attributes in order to get the reduct, the minimum
subset of the entire condition attributes set. In the traditional rough set model, a popular method to get

the core attributes is to construct a decision matrix first, then search all the entries in the decision matrix
to find all those entries with only one attribute. If the entry in the decision matrix contains only one
attribute, that attribute is a core attribute [2].
For example, the decision matrix generated from Table 2 is shown in Table 3. This decision matrix
contains all the condition attributes, which are not identical between two equivalent classes induced by
the decision attribute:

and
. In the
decision matrix, only

appears in the entries that have a single attribute. Thus, according to the
decision matrix in Table 3, the only core attribute in the condition attributes set for the data in Table 2 is
.
_ %MB MZ JEL 1E( ( ( .( E!
_ %MB MZ _ $/( E( E!
(:M4_
(:M4_
Table 3. Decision Matrix for Table 2
e1

$/ i&gi
(vi
( (vi&gi &gi
(vi&gi (vi (
(vi&gi
(vi&gi
&gi
(
(
(vi&gi
i&gi
_C=DM4O -rJJEO &&g_ M ( (:M4_

This method is very inefficient and it is not realistic to construct a decision matrix for millions of
tuples in the data table, which is a typical situation for data mining applications. Some authors have
presented approaches to finding core attributes and reducts without constructing the decision matrix.
For example, an approach developed in [13] first constructs the discernibility relation by sorting the
data tuples (objects) in the data table, then uses the discernibility relation to build the lower and upper
approximations, and finally applies the approximations to find a semi-minimal reduct. The algorithms
presented in [13] runs in time of at least
, where is the number of attributes and is the
number of tuples, because sorting tuples takes
time. In this section, however, we attempt
to propose a new approach to find core attributes and reducts without calculating the lower and upper
approximations, and our algorithm of finding core attributes takes only
time. For this purpose,
we will redefine the core attributes and reducts.

Following the relational algebra, in this paper, we use
to denote the
, and

for
.
* c C
KC ,
* c C
KC , c
*c C ,

u
Zfd is a core attribute if it satisfies the following condition:
ZBOD * *
j ,,\s ZBOD * *
,,$3
u
to denote `
! .
Definition 3.1. An attribute

For simplicity, we use

For example, in Table 2, it can be shown that
ZBOD * * rJJEOEi&g_ Mi _C=DM4OE _ %MB M,,

ZBOD * * rJJEOEi&g_ Mi _C=DM4O ,, 3

Therefore, (:M4_ is a core attribute in with respect to _ %MB M .
and

& ZI

;
We can check whether attribute

is a core attribute by using some
operations. We only need
to take two projections of the table: one on the attribute set

, and the other on
.
If the cardinality of the two projection tables is the same, then it means that no information is lost in
removing attributes , otherwise, it indicates that
is a core attribute. Put it in a more formal way,
using database terms, the cardinality of two projections being compared will be different if and only if
there exist at least two tuples and such that
;`
e5
6Hfd
( ' .)+, e5' .)( '
)u,s e5 '
i)-C=DZ ' )K,s e5' )3
In this case, a projection on
u
will have at least one fewer row than the projection on v
,
because and e5 being identical in
\`
are being combined in this projection. However, in the
projection "`
" , ( e5 are still distinguishable. So eliminating the attribute
will lose the ability
to distinguish tuple and e5 . Intuitively this means that some classification information is lost after
is eliminated.
For example, in Table 2, and have the same values on all the condition attributes except (:M4_ ;
the two tuples belong to different classes because they are different on the value on (:M4_ . Thus,
(:M4_ is the only attribute to distinguish between and . If (:M4_ is eliminated, then and
are indistinguishable. So (:M4_ is a core attribute for the table.
Definition 3.2. An attribute u
Zfd is a dispensable attribute in with respect to , if the classification
result of each tuple is not affected without using
, that is,
ZBOD * *
,,u\ZBOD * * '
,,$3

$

This definition characterizes that an attribute is dispensable if each tuple in the data table can be
classified in the same way no matter whether the attribute is present or not. We can check whether
is dispensable by using some
operations. We only need to take two projections of
attribute
the table: one on the attribute set

, and the other on

. If the cardinality of the
,
two projection tables is the same, then it means that no information is lost in removing attributes
otherwise, it indicates that
is relevant and should be reinstated. For example, in Table 2, one can
verify that
& ZI
K

ZBOD * * (:M4_ ii&g_ Mi _C=DM4OE _ %MB M,,@

and
ZBOD * * (:M4_ ii&g_ Mi _C=DM4O ,K3
Thus, rJJEO is a dispensable attribute in with respect to _ %MB M .

In order to find reducts of the condition attributes set with respect to the decision attribute set, we
define the degree of dependency of the condition attributes on the decision attributes.

"n
"n

"n 9X
Definition 3.3. Let

be a subset of the condition attributes set,

dependency between
and the decision attribute set
in the decision table

, is defined as
* "n ,
*

"n ,` ZBOD ZB* O* D * " n j, , , 3
* , , denoted
. The degree of
10
*
"n ,

The value
is the proportion of those tuples in the decision table that can be classified.
This value characterizes the ability to predict the class
and the complement
from tuples in the
decision table.
With the dependency measure between the condition attributes and the decision attributes, we can
redefine the reduct of the condition attributes set with respect to the decision attributes set in a data table
as follows.
"n * 9: ,

Definition 3.4. The subset of attributes

is a reduct of the condition attributes set with
respect to the decision attributes set , if it is a minimum subset of attributes that has the same classi
fication power as the entire collection of condition attributes. Formally,
satisfies the following
conditions:
*
"n
"n ,` * ,$
"n ,@ s * ,$h6 "n 3

Definition 3.5. The merit value of an attribute
in is defined as
M4O_ *
i ,Ka ZBZBOD * O D ** *
, , ,, 3
M4O_ *
i , reflects the degree of contribution made by the attribute
to the dependency only
between and . For example, in Table 2,
ZBOD * * rJJEOEi&g_ Mi _C=DM4OE _ %MB M ,,
ZBOD * * rJJEOE (:M4_ ii&g_ Mi _C=DM4OE _ %MB M , ,<
M4O_ * (:M4_ i rJJEOE (:M4_ ii&g_ Mi _C=DM4OG _ %MB M , a 3 b3
and
*

4.

A New Algorithm for Finding Core Attributes
In this section, we will propose a new algorithm based on the the concepts defined in previous section
and database operations to find all the core attributes of a decision table. The data table, however,
may contain inconsistent tuples. Two tuples are said to be inconsistent if they have the same values
on the condition attributes, but are labeled as different classes (having different values on the decision
attributes). Inconsistent tuples can not be classified. Thus, the inconsistent tuples should be eliminated
from the data table before the classification process proceeds. Here we assume the inconsistent tuples are
noisy data, otherwise more attributes and values for tuples should be collected further to ensure the data
table is consistent. Our new algorithm is based on the database systems operations without calculation
of the lower and upper approximations. Compared with the original rough set approach, our algorithm
is efficient and scalable.
For simplicity, we first assume that the data table is inconsistency-free, and then we will discuss how
efficiently to detect and eliminate the inconsistent tuples from the data table based on our new rough set
model.
11
The new algorithm for finding all core attributes based on the relational database system is described
in Algorithm 1.
Algorithm 1 consists of two steps. The first step is simply to initialize the core attributes set to be
empty. The second step checks all condition attributes one by one to see if they are core attributes. If
yes, they are added to the set of the core attributes. The following theorem ensures that the return
of Algorithm 1 contains all core attributes and only those attributes. Before we present and prove the
theorem, lets first present two facts.
ZJEO-M
is consistent if and only if

;IKJELNM4O-P mR TiP U*RG nKo-opM4O-P mR TiP U*R -C=D J mC=DBO P mR TiP U*RS;t3
Fact 4.1. The data table
One can find out that Table 1 is inconsistent, while Table 2 is consistent.
7 9
' 7<) .7H/ 71 434343 7@? ! and ' ) / 1 434343. ! are
7 and , respectively, then 6 # f ' )(_Aqa$b434343E(C , and
7
f ' 7<)k2a$b434343E(c
# 7
;t or # 9^7
. ' ) is said a refinement of ' 7<) .

Let ' )2 </ i 1 434343.iK? ! and ' ;7<)2 / i 1 434343.i 5 ! be the set of equivalent classes
induced by and
7 , respectively, and ' ) ."/ 21 434343. ! be the set of equivalent classes
induced by . We have the following two propositions.
Proposition 4.1. 67:fd , if IKJELNM4OP mR TiP U*R`
s IKJELNM4O-P mQSR TiP U*R , then
ZBOD * * 7 ,,ZBOD * * 7 ,,$3

. Assume
Fact 4.2. Let
the set of equivalent classes induced by

, either

a :C , we have IKJELNM4O P mQSR T(UG V ] 9 W

a

lm5! 6 ? fIKJELNM4O P mQSR T(UGV
a l such that fY and 9 W
. Because
; # */ # # */ K# , so
a vo cv( fd . Hence, we have fd " ;s t3
Proof:
According to Definition 2.1, for given

. Thus,

%
* ,
Algorithm 1: Core Attributes Algorithm

Input: a decision table
Output:
the core attribute of table
Method:
1. Set
2. For each attribute
If

Then

ZJEO-M

ZJEO-M\
7 fd;
ZBOD * * 7 ,,\s ZBOD * *
ZJEO-M\ZJEO-M r7

7 ,,
12
9 9 W
because 7 9 . Hence f
8K# ] K# 9W
a _ cd!" IKJELNM4O-P mR T(UGV . Therefore, IKJELNM4OP mQSR T(UGV 9 IKJELNM4O-P mR T(UGV and thus
IKJELNM4O-P mQSR TiP U*R`9 IKJELNM4O-P mR TiP U* R . Because IKJELNM4OP mQSR TiP U*R s IKJELNM4O-P mR TiP U*R from the given condition,
we must have IKJELNM4OP mQSR TiP U*R
IKJELNM4O-P mR TiP U*R 3

So it can be inferred that
Z such that fIKJELNM4O P mR TiP U*R and fIKJELNM4O P mQSR TiP U*R .
Thus,
@W
a j C , such that fIKJELNM4O P mR T(UGV , which means,

a vo cv( fd 9^W
.

And 6a _ C`( fIKJELNM4O-P mQSR T(
U , that is, f 8 ] 9^># a ,lm! .
On the other hand, Afd ] a$b434343E$lm! . Hence
a lG( Hfj , but 60a ;_
lGi 9s ># . It is known >f W
. Thus, we have
> >fv@( 2f "W
s t -C=DA 9:s W
which
, but f W
. Because 8.># ] _@ a$b434343E(C ! , so
a :C
means,
> Zf such that Zf
such that f"
8 s . Thus, f0 r 8 s 3
and f ; but ' ,7 ;) s
Therefore, we obtain ' 7<)WX 4' 7<) for f
4' 7 ) for fW
and f s .
From above, one can see that and are projected to be same by * \7 , but different by
* ;7 : , . Thus, we conclude that * ;7 : , has at least one more distinct tuple than
* 7 , , and the proposition holds.

Proposition 4.2. Assume is consistent. 687 fd , if
ZBOD * * 7 ,,ZBOD * * 7 j ,,$

By Fact 4.2, it can be easily seen that

IKJELNM4O-P mR TiP U*R`s IKJELNM4O-P mQSR TiP U*R 3
then
f
:

* A7 ,
* A7 > ,
4' A7<)+ ' A7<) -C=DN 4'
4' ) s ' )3
a
l
i rf
7 ) s ' 7 )3
a :_>:s :C
Zf 2#
f W
6 a ,o :C`i 9s
i f
6 a o `C ( i f IKJELNM4OP mQSR U
i f IKJELNM4OP mQSR TiP U*R 3

i f IKJELNM4O P mR TiP U*R
KI JELNM4O P mR TiP U*R s
IKJELNM4O P mQSR TiP U*R
Proof:
According to the given condition, it can be easily inferred that
and
such that and are
projected to be same by

but distinct by

, that is,

,
Hence, we have

Therefore,
such that

and
such that
and
. So
(otherwise
).

By Definition 2.1,

. Thus,

On the other
hand, because
is consistent, by Fact 4.1,
. Therefore,

.

By Proposition 4.1 and Proposition 4.2, we immediately have:
Theorem 4.1. If the data table
1.
is a core attribute in
is consistent, then
with respect to
ZBOD * *
687 fd
if and only if
7 ,,\s ZBOD * * 7 ,,

2. 7 is a dispensable attribute in with respect to if and only if
ZBOD * * 7 ,`\ZBOD * * 7 , ,$3

13
Theorem 4.1 reveals that the new definitions of the core and dispensable attributes in our new rough
set model are equivalent to the corresponding definitions defined in the traditional rough set model [14,
15].
Algorithm 1 can be implemented using
statements as follows.
& ZI
ZBOD * * ,, :
SELECT DISTINCT COUNT(*) FROM T
ZBOD * * ,, :

can be
i 7

, or
7

SELECT DISTINCT COUNT(*) FROM (SELECT X FROM T)
Theorem 4.2. Algorithm 1 can be implemented in

and is the number of tuples (rows).
* c C , time, where c
c
*c C ,
is the number of attributes
Proof:
The for loop is executed times, and inside each loop, finding the cardinality takes
fore, the total running time is
.
* C , time. There

From above theorem, it is clear that the time complexity of our algorithm based on SQL queries to
finding core attributes is linear with respect to the number of objects and attributes, which is better than
, the time complexity of the algorithm presented in [13]. We do not, however, include the
time complexity caused by the SQL queries, which forms one of our future works by experiments with
large data sets.
Furthermore, most real applications contain noise data, and some tuples in the data table may be
inconsistent. Our model based on the relational database operations also provides an efficient means of
detecting and eliminating the inconsistency of the data table.
By Fact 4.1, to detect the inconsistency of the data table, we only need to test whether
. If they are not equal, there exist inconsistent tuples in the data table. The inconsistent
tuples can be eliminated by the means of computing the following relational operation:
* c C
uC ,
ZBOD * * ,,u
ZBOD * * ,,
+U *

U U * * ,,,$
%
%

where the operators are explained as follows [4]:

Projection: Assume has columns , and

of but only contains columns in .
Rename:

JEC=D-_ h_kJEC
-Join:

* , is the same data table as

(# #

* ,

, but named as
is a data table that contains all tuples

.
)
)

is a new data table that consists of those tuples of
, where is the Cartesian product of and .
This operation can be implemented using the
&ZI
SELECT * FROM T U
WHERE EXISTS (SELECT * FROM T V
statement as follows:
satisfying the
14
WHERE (U.C=V.C)
(U.D

V.D))
If the database system implements the -Join by using Hash table (e.g. in Oracle 9i, you can set
the system variable HASH JOIN ENABLED to true), then detecting and eliminating inconsistent tuples
from the data table only take a linear time. Thus, we have,
*C ,
*C ,
Theorem 4.3. Based on our rough set model, the inconsistency of a data table can be detected in
time, and the inconsistent tuples can be eliminated in
time, where is the number of tuples in the
data table.
5.
Rough Set Based Feature Selection

Two kinds of attributes are generally perceived as being unnecessary: attributes that are irrelevant to the
target concept (like the tuple
, customer
), and attributes that are redundant given other attributes.
in Table 2. Either or
but not
For example, consider the attributes and
necessary both is needed at the same time for the classification purpose if the attribute
of the cars
appears. In actual applications, these two kinds of unnecessary attributes can exist at the same time but
the latter redundant attributes are more difficult to eliminate because of the interactions between them.
In order to reduce both kinds of unnecessary attributes to a minimum, we use feature selection. Feature
selection is a process we employ to choose a subset of attributes from the original attributes. Feature
selection has been studied intensively in the past decades [6, 7, 11, 12]. The purpose of feature selection
is to identify the significant features, eliminate the irrelevant or dispensable features to the learning task,
and build a good learning model. The benefits of feature selection are twofold: it considerably decreases
the running time of the induction algorithm, and increases the accuracy of the resulting model.
All feature selection algorithms fall into two categories: (1) the filter approach and (2) the wrapper
approach. In the filter approach, the feature selection is performed as a preprocessing step to induction.
Some of the well-known filter feature selection algorithms are RELIEF [7] and PRESET [12]. The filter
approach is ineffective in dealing with the feature redundancy. In the wrapper approach [6], the feature
selection is wrapped around an induction algorithm, so that the bias of the operators that define the
search and that of the induction algorithm interact mutually. Though the wrapper approach suffers less
from feature interaction, nonetheless, its running time would make the wrapper approach infeasible in
practice, especially if there are many features, because the wrapper approach keeps running the induction
algorithm on different subsets from the entire attributes set until a desirable subset is identified. We
intend to keep the algorithm bias as small as possible and would like to find a subset of attributes that can
generate good results by applying a suite of data mining algorithms. We focus on an induction algorithmindependent feature selection. Our goal is to construct a reasonably fast algorithm that can find a relevant
subset of attributes and eliminate the two kinds of unnecessary attributes effectively.
A decision table may have more than one reduct. Anyone of them can be used to replace the original
table. Finding all the reducts from a decision table is NP-Hard [9]. Fortunately, in many real applications,
it is usually not necessary to find all of them. One is sufficient. A natural question is which reduct is the
best if there exist more than one reduct. The selection depends on the optimality criterion associated with
the attributes. If it is possible to assign a cost function to attributes, then the selection can be naturally
based on the combined minimum cost criteria. In the absence of an attribute cost function, the only
source of information to select the reduct is the contents of the data table [12]. For simplicity, we adopt
&g_ M
_C=DM4O
&g_ M
_C=DM4O
(:M4_
* ,
15
Algorithm 2: Compute a minimal attribute subset (reduct)

Input: A decision table

Output: A set of minimum attribute subset (
)
Method:
1. Run Algorithm 1 to get the core attributes of the table

2.

3.

4. Compute the merit values for all attributes of

5. Sort attributes in
based on merit values in decreasing order
6. Choose an attribute
with the largest merit values (if there are several attributes
with the same merit value, choose the attribute which has the least number of

combinations with those attributes in
)

7.

8. If
, then terminate, otherwise go back to Step 4
"n\\
7 \
"n
7
"n
"n

!
"n\
"n
!- 7 7
* " n ,ua
the criteria that the best reduct is the one with the minimal number of attributes and that if there are two or
more reducts with the same number of attributes, then the reduct with the least number of combinations
of values of its attributes is selected.
With these considerations in mind, we propose a rough set based filter feature selection algorithm in
this section, which is illustrated in Algorithm 2.
Because all core attributes must be contained in all reducts, Algorithm 2 first calls Algorithm 1 to
find all core attributes and initializes the reduct with the complement of the core attributes set against the
condition attributes set. Then the algorithm ranks the attributes based on the attributes merit and adopts
the backward elimination approach to remove the redundant attributes. When two or more attributes have
the same merit values, the attribute with the least number of possible values is removed. This process is
repeated until a reduct is generated.
There are many algorithms developed to find reducts, but most of these algorithms suffer from the
performance problem because they are not integrated into the relational database systems and all the
related computation operations are performed on a flat file [9, 10]. In our algorithms presented above, all
the calculations such as
,
, attribute dependency, and merit values are utilizing the database
set operations. With this algorithm, we can get a reduct either

or

from the data in Table 2. For each reduct, we can derive a reduct table from the original table. For
example, the reduct table based on the reduct

, shown in Table 4, is generated by projecting
out the attributes

and from Table 2, which can still make a correct classification model.

is a minimum subset and cannot be reduced further without sacrificing the accuracy of the
classification model. If we create another table from Table 4 by removing Size, shown in Table 5, this
table cannot correctly distinguish between tuples
and as well as tuples
and , because these
tuples have the same

values but belong to different classes which are distinguishable in the
reduct table Table 4.
Summarily, our algorithm has many advantages over existing methods:
ZJEO-M O-MD i
(:M4_ +i&g_ M
(:M4_
(:M4_
&g_ M
(:M4_ ii&g_ M
(:M4_ ii&g_ M
/
(1
(:M4_ ii _C=DM4O
(1) it is effective and efficient in eliminating irrelevant and redundant features with strong interaction
by searching relevant and important features;
(2) it is feasible for applications with hundreds or thousands of features by using database systems
16
Table 4. Reduct Table for
-o %M _kD
$/4(
e1E(

(

(:M4_
EJ L
JEL
cMD
_
_

Table 5. Reduced Table for
(:M4_
EJ L
JEL
cMD
_

_
JEL
_
EJ L
_ %MB M

-o %M _kD
$/(
e1(

( E(

_
JEL
_
JEL
JEL
_ %MB M

&g_ M
JEc>oGB i

JEc>oGB i
JEc>oGB i

operations; and
(3) it is tolerant to inconsistency in the data table by considering the dependency between the condition
attributes and the decision attributes.
6. Conclusion
Rough sets theory has been applied successfully in many disciplines. One of the major limitations of
the traditional rough sets model in the real applications is the inefficiency in the computation of core
attributes and the generation of reducts. Most existing rough set models, do not integrate with database
systems and a lot of computational intensive operations such as discernibility relations computation,
core attributes search, reduct generation, and rule induction are performed on flat files, which limits their
applicability for large data sets in data mining applications.
In order to improve the efficiency of computing core attributes and reducts, many novel approaches
have been developed. In this paper, we proposed a new rough set model using relational algebra op
erations such as
and so on, by taking advantages of efficient data
organizations and SQL query algorithms developed and implemented in most relational database systems. We borrowed the main ideas of traditional rough sets theory and redefined them based on the
relational algebra and database systems to take their advantages of the very efficient set-oriented operations. Our model is based on the following ideas: effective objects pre-processing and organization such
as sorting and indexing [13] and efficient SQL queries implemented in most RDBMS systems [3, 8] help
improve the algorithm efficiency. With this new model, we presented two new algorithms to calculate
core attributes and reducts for feature section.
Our algorithm for finding core attributes is efficient and the outcome is proved to be correct based

17
on the definition of the traditional rough set model. We also discussed the detection and elimination of
inconsistent tuples from the data table. Our feature selection algorithm identifies a reduct efficiently and
reduces the data set significantly without losing essential information. Almost all the operations used in
generating core attributes and reducts in our method can be performed using the database systems setoriented operations. Since these relational algebra operations have been efficiently implemented in most
widely-used relational database systems like Oracle, DB2, Sybase, etc., the algorithms presented in this
paper can be extensively applied to relational database systems and adapted to a wide range of real-life
applications. Another merit of these algorithms is their scalability, because existing relational database
systems have demonstrated that their implementations of relational algebra operations are suitable to
process very large data sets. Thanks to these, our method is more efficient and scalable than the traditional
rough set based data mining approaches.
Our future work will be focusing on the experiments of this model with large data sets stored in
database systems, as well as applications of this model to feature selection and rule induction for knowledge discovery in very large data sets.
References
[1] Bazan, J., Nguyen, H., Nguyen, S., Synak, P., Wroblewski, J., Rough set algorithms in classification problems, Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information
Systems, L. Polkowski, T. Y. Lin, and S. Tsumoto (eds), 49-88, Physica-Verlag, Heidelberg, Germany, 2000
[2] Cercone N., Ziarko W., Hu X., Rule Discovery from Databases: A Decision Matrix Approach, Proc. of
ISMIS, Zakopane, Poland, 653-662, 1996
[3] Fernandez-Baizan, A., Ruiz, E., Sanchez, J., Integrating RDMS and Data Mining Capabilities Using Rough
Sets, Proc. IMPU, Granada, Spain, 1996
[4] Garcia-Molina, H., Ullman, J. D., Widom, J., Database System Implementation, Prentice Hall, 2000.
[5] Hu, X., Cercone N., Han, J., Ziarko, W, GRS: A Generalized Rough Sets Model, in Data Mining, Data
Mining, Rough Sets and Granular Computing, T.Y. Lin, Y.Y.Yao and L. Zadeh (eds), Physica-Verlag, 447460, 2002
[6] John,G., Kohavi, R., Pfleger,K., Irrelevant Features and the Subset Selection Problem, Proc. ICML, 121-129,
1994
[7] Kira,K., Rendell,L.A. The feature Selection Problem: Traditional Methods and a New Algorithm, Proc.
AAAI, MIT Press, 129-134, 1992
[8] Kumar A., New Techniques for Data Reduction in Database Systems for Knowledge Discovery Applications,
Journal of Intelligent Information Systems, 10(1), 31-48, 1998
[9] Lin T.Y., Cercone, N. (eds), Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic
Publisher, 1997
[10] Lin T.Y., Yao Y.Y. Zadeh L. (eds), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, 2002
[11] Liu, H., Motoda., H. (eds), Feature Extraction Construction and Selection: A Data Mining Perspective.
Kluwer Academic Publisher, 1998
[12] Modrzejewski, M., Feature Selection Using Rough Sets Theory, Proc. ECML, 213-226, 1993
[13] Nguyen, H., Nguyen, S., Some efficient algorithms for rough set methods, Proc. IPMU Granada, Spain,
1451-1456, 1996
18
[14] Pawlak Z., Rough Sets, International Journal of Information and Computer Science, 11(5), 341-356, 1982
[15] Pawlak Z., Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1992
[16] Polkowski, L., Skowron, A., Rough mereology, Proc. ISMIS, Charlotte, NC, 85-94, 1994
[17] Polkowski, L., Skowron, A., Rough mereology: A new paradigm for approximate reasoning, J. of Approximate Reasoning, 15(4), 333-365, 1996
[18] Skowron, A., Rauszer C., The Discernibility Matrices and Functions in Information Systems, Intelligent
Decision Support - Handbook of Applications and Advances of the Rough Sets Theory, K. Slowinski (ed),
Kluwer, Dordrecht, 331-362, 1992
[19] Skowron, A., Stepaniuk, J., Tolerance approximation spaces, Fundamenta Informaticae 27(2-3), 245-253,
1996
[20] Stepaniuk, J., Generalized approximation spaces, Proc. of the 3rd International Workshop on Rough Sets
and Soft Computing, San Jose State University, San Jose, California, 156-163, 1994
[21] Ziarko, W., Variable Precision Rough Set Model, Journal of Computer and System Sciences, 46(1), 39-59,
1993

A New Rough Sets Model Based On Database Systems: Xiaohua Hu T. Y. Lin

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A New Rough Sets Model Based On Database Systems: Xiaohua Hu T. Y. Lin

Uploaded by

Copyright:

Available Formats

Fundamenta Informaticae XX (2004) 118

A New Rough Sets Model Based on Database Systems

X. Hu, T. Lin, J. Han / A New Rough Set Model

X. Hu, T. Lin, J. Han / A New Rough Set Model

2. Overview of Rough Sets Theory

IKJELNM4OP QSR T(UGV of W

X. Hu, T. Lin, J. Han / A New Rough Set Model

Table 1. 8 cars with attributes Door, Size, Cylinder, Mileage

lower and upper approximations as follows:

X. Hu, T. Lin, J. Han / A New Rough Set Model

Table 2. 8 cars with attributes

\ )(:M4_ i rJJEOEi&g_ Mi _ C=DM4O!

cars can be built.

X. Hu, T. Lin, J. Han / A New Rough Set Model

Definition 2.4. An Attribute

One can verify that, in Table 2,

' (:M4_ ii&g_ Mi _ C=DM4OE)

Similarly, it can be shown that, in Table 2,

' rJJEOEi&g_ Mi _ C=DM4OE)

following conditions are satisfied:

IKJELNM4O P  R TiP U*R IKJELNM4O P mR TiP U*R 

' (:M4_ ii&g_ M )

 $/( E!-  e1( .!-  !-  -( .!-  E!!

IKJELNM4O P mR TiP U*R

IKJELNM4O-P mR TiP U*R

reduct of the attributes set in a given data table is NP-hard [18].

Definition 2.7. An attribute

X. Hu, T. Lin, J. Han / A New Rough Set Model

3. A New Rough Sets Model Based on Database Systems

7@#  $/( e1E( 434343.( ( e1 i/4!

X. Hu, T. Lin, J. Han / A New Rough Set Model

_ %MB MZ JEL  1E( ( ( .( E!

_ %MB MZ _  $/( E( E!

Table 3. Decision Matrix for Table 2

Definition 3.1. An attribute

For simplicity, we use

For example, in Table 2, it can be shown that

ZBOD * * rJJEOEi&g_ Mi _ C=DM4OE _ %MB M, , 

ZBOD * * rJJEOEi&g_ Mi _ C=DM4O , , 3

X. Hu, T. Lin, J. Han / A New Rough Set Model

We can check whether attribute

ZBOD * * (:M4_ ii&g_ Mi _ C=DM4OE _ %MB M, ,@

Definition 3.3. Let

X. Hu, T. Lin, J. Han / A New Rough Set Model

Definition 3.4. The subset of attributes

"n  ,`  *  ,$

"n  ,@ s  *    ,$h6     "n 3

A New Algorithm for Finding Core Attributes

X. Hu, T. Lin, J. Han / A New Rough Set Model

 is consistent if and only if

Fact 4.1. The data table

a : C , we have IKJELNM4O P mQSR T(UG V    ]   9 W

Algorithm 1: Core Attributes Algorithm

X. Hu, T. Lin, J. Han / A New Rough Set Model

By Fact 4.2, it can be easily seen that

IKJELNM4O-P mR TiP U*R`s IKJELNM4O-P mQSR TiP U*R 3

7 , ,\s ZBOD * * 7 , ,

X. Hu, T. Lin, J. Han / A New Rough Set Model

SELECT DISTINCT COUNT(*) FROM T

SELECT DISTINCT COUNT(*) FROM (SELECT X FROM T)

Theorem 4.2. Algorithm 1 can be implemented in

is the number of attributes

IKJELNM4OP QSR T(UGV of W

\ )(:M4_ i rJJEOEi&g_ Mi _C=DM4O!

' (:M4_ ii&g_ Mi _C=DM4OE)

' rJJEOEi&g_ Mi _C=DM4OE)

IKJELNM4O P R TiP UR IKJELNM4O P mR TiP UR

' (:M4_ ii&g_ M )

$/( E!- e1( .!- !- -( .!- E!!

7@# $/( e1E( 434343.( ( e1 i/4!

_ %MB MZ JEL 1E( ( ( .( E!

_ %MB MZ _ $/( E( E!

ZBOD * * rJJEOEi&g_ Mi _C=DM4OE _ %MB M,,

ZBOD * * rJJEOEi&g_ Mi _C=DM4O ,, 3

ZBOD * * (:M4_ ii&g_ Mi _C=DM4OE _ %MB M,,@

"n ,` * ,$

"n ,@ s * ,$h6 "n 3

is consistent if and only if

a :C , we have IKJELNM4O P mQSR T(UG V ] 9 W

IKJELNM4O-P mR TiP UR`s IKJELNM4O-P mQSR TiP UR 3

7 ,,\s ZBOD * * 7 ,,

(:M4_ ii _C=DM4O