A New Rough Sets Model Based on Database Systems
Xiaohua Hu
IOS Press
T. Y. Lin
Department of Computer Science
San Jose State University
San Jose, CA 94403, USA
tylin@cs.sjsu.edu
Jianchao Han
Department of Computer Science
California State University Dominguez Hills
Carson, CA 90747, USA
jhan@csudh.edu
Abstract. Rough sets theory was proposed by Pawlak in the 1980s and has been applied successfully in many domains. One of the major limitations of the traditional rough sets model in real applications is the inefficiency of the computation of core and reduct, because all the computationally intensive operations are performed on flat files. In order to improve the efficiency of computing core attributes and reducts, many novel approaches have been developed, some of which attempt to integrate database technologies. In this paper, we propose a new rough sets model and redefine the core attributes and reducts based on relational algebra to take advantage of the very efficient set-oriented database operations. With this new model and our new definitions, we present two new algorithms to calculate core attributes and reducts for feature selection. Since relational algebra operations have been efficiently implemented in most widely-used database systems, the algorithms presented in this paper can be extensively applied to these database systems and adapted to a wide range of real-life applications with very large data sets. Compared with the traditional rough set models, our model is very efficient and scalable.
Keywords: Rough set, database systems, relational algebra, reduct, feature selection
* Corresponding author
1. Introduction
Rough sets theory was first introduced by Pawlak in the 1980s [14] and has since been applied in many areas such as machine learning, knowledge discovery, and expert systems [9, 10, 15]. Rough set theory is especially useful for domains where the collected data are imprecise and/or incomplete about the domain objects. It provides powerful tools for data analysis and data mining from imprecise and ambiguous data. Many rough sets models have been developed in the rough set community in the last decades [5, 14, 16, 17, 19, 20, 21], including Ziarko's VPRS [21] and Hu's GRS [5], to name a few. Some of them have been applied in industrial data mining projects such as stock market prediction, patient symptom diagnosis, telecommunication churn prediction, and bank customer attrition analysis to solve challenging business problems [9, 10]. These rough set models focus on extensions of the original model proposed by Pawlak [14, 15] and attempt to deal with its limitations, such as handling statistical distributions or noisy data.
Our experience of applying the VPRS and GRS models to large data sets in data mining applications has shown that one of the major drawbacks of the traditional rough set model is the inefficiency of its methods and algorithms for computing the core attributes and reducts and identifying the dispensable attributes, which limits its suitability for data mining applications. Further investigation of the problem reveals that most existing rough set models do not integrate with relational database systems, and a lot of computationally intensive operations are performed in flat files rather than utilizing the high-performance database set operations. Moreover, little attention has been paid to designing a new rough sets model that effectively combines database technologies to generate the core attributes and reducts so as to make their computation efficient and scalable on large data sets.
To overcome the problem, some approaches to improving the efficiency of finding core attributes and reducts have been developed [1], including the algorithms presented in [13], which largely improve the generation of the discernibility relation by sorting the objects. Some authors have proposed approaches to reducing the data size using relational database system techniques [8] and have developed rough set-based data mining systems that integrate RDBMS capabilities, such as RSDM (Rough Set Data Miner) [3]. The algorithms presented in [3] embed SQL queries to take advantage of database technologies.
In this paper, we attempt to redefine some concepts of rough set theory, such as core attributes and reducts, by using relational algebra so that the computation of core attributes and reducts can be performed with very efficient set-oriented database operations such as projection, join, and count. The arguments behind our model based on database technologies come from the following two points:

- Most existing RDBMS systems have implemented pre-processing and efficient organization of the original data, such as sorting and indexing, where sorting is the main idea of the algorithms proposed in [13], and

- The efficient implementation of SQL queries in most RDBMS systems reduces the cost of accessing disks and is scalable to huge data sets [8, 3].
The rest of the paper is organized as follows: We give an overview of the rough set theory based on
the model proposed by Pawlak [14, 15] with some examples in Section 2. In Section 3, we redefine the
main concepts and methods of rough set theory based on the database systems set-oriented operations,
and illustrate these concepts with examples. With the new definition, we propose an efficient algorithm
to compute the core attributes and show the correctness of the algorithm as well as the equivalence of
our definition to the corresponding one defined in the traditional rough set theory in Section 4. Another
efficient algorithm for feature selection based on reduct generation is presented in Section 5. Finally, we
conclude with some discussion and our future work in Section 6.
2. Overview of Rough Sets Theory

Definition 2.1. Suppose T = (U, C ∪ D) is a database table, where U = {t1, t2, ..., tn} is the set of tuples, C is the set of condition attributes, and D is the set of decision attributes. Two tuples ti and tj are in the same equivalence class induced by an attribute subset B (B is a subset of C or D) if a(ti) = a(tj) for every a ∈ B. (The tuples in the same equivalence class have the same attribute values for all the attributes in B.) Let U/C denote the equivalence classes induced by C, and U/D denote the equivalence classes induced by D (the equivalence classes induced by C are also called elementary sets).

Definition 2.2. Given a subset X ⊆ U (typically a decision class in U/D), the lower approximation Lower_C(X) is the union of all elementary sets in U/C that are contained in X, and the upper approximation Upper_C(X) is the union of all elementary sets in U/C that have a non-empty intersection with X. For the decision attribute set D, Lower_C(D) and Upper_C(D) denote the unions of the lower and upper approximations, respectively, of all the classes in U/D.
Table 1. A collection of 8 cars. (The Door and Cylinder values are not recoverable from the source.)

Tuple id  Door  Size     Cylinder  Mileage
t1        -     compact  -         high
t2        -     sub      -         low
t3        -     compact  -         high
t4        -     compact  -         low
t5        -     compact  -         low
t6        -     compact  -         high
t7        -     sub      -         low
t8        -     sub      -         low
Definition 2.3. The boundary area between the upper approximation and the lower approximation is defined as

BND_C(D) = Upper_C(D) − Lower_C(D).
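As a concrete illustration of Definitions 2.1-2.3, the partition into elementary sets and the lower and upper approximations can be computed directly. The following is a minimal Python sketch over a small hypothetical table (four tuples, condition attributes a and b, decision attribute d; the names and data are our own, not the car data of Table 1):

```python
def partition(table, attrs):
    """Equivalence classes induced by `attrs`: tuples with equal values
    on every attribute in `attrs` fall into the same class."""
    classes = {}
    for tid, row in table.items():
        key = tuple(row[x] for x in attrs)
        classes.setdefault(key, set()).add(tid)
    return list(classes.values())

def approximations(table, cond_attrs, target):
    """Lower/upper approximation of the tuple-id set `target` with
    respect to the partition induced by `cond_attrs`."""
    lower, upper = set(), set()
    for cls in partition(table, cond_attrs):
        if cls <= target:        # elementary set entirely inside the target
            lower |= cls
        if cls & target:         # elementary set overlapping the target
            upper |= cls
    return lower, upper

# Hypothetical toy table: tuples 1 and 2 agree on a, b but disagree on d.
T = {1: {'a': 0, 'b': 0, 'd': 'y'},
     2: {'a': 0, 'b': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'd': 'y'},
     4: {'a': 1, 'b': 1, 'd': 'n'}}

target = {tid for tid, row in T.items() if row['d'] == 'y'}   # {1, 3}
lower, upper = approximations(T, ['a', 'b'], target)
print(sorted(lower), sorted(upper), sorted(upper - lower))
# [3] [1, 2, 3] [1, 2] -- tuples 1 and 2 form the boundary area
```

The boundary area is exactly the difference of the two approximations, mirroring Definition 2.3.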
Example 2.1. Suppose we have a collection of 8 cars (t1, t2, ..., t8) with information about Door, Size, Cylinder and Mileage, shown in Table 1. Door, Size and Cylinder are the condition attributes and Mileage is the decision attribute. The attribute Tuple id is just for explanation purposes and can be ignored.

With D = {Mileage} and C = {Door, Size, Cylinder}, we can calculate the elementary sets:

U/D = {[Mileage = high], [Mileage = low]}
[Mileage = high] = {t1, t3, t6}
[Mileage = low] = {t2, t4, t5, t7, t8}

U/C partitions the cars into five elementary sets; one of them contains three cars that do not all share the same Mileage value. Therefore Lower_C([Mileage = high]) and Lower_C([Mileage = low]) together cover only five cars, Upper_C(D) covers all eight cars, and the boundary area BND_C(D) consists of the three cars of the mixed elementary set.

Only 5 of the 8 cars belong to the lower approximation of D based on C, while 3 of the 8 cars fall in the boundary area. This fact indicates that the information on Door, Size and Cylinder collected so far is not consistent: it is only good enough to make a classification model for the five cars in the lower approximation, but not enough to classify the other three. In order to classify the three boundary cars, more
Table 2. The 8 cars with the new condition attribute Weight. (The Door and Cylinder values, and one Weight value, are not recoverable from the source.)

Tuple id  Weight  Door  Size     Cylinder  Mileage
t1        low     -     compact  -         high
t2        -       -     sub      -         low
t3        medium  -     compact  -         high
t4        high    -     compact  -         low
t5        high    -     compact  -         low
t6        low     -     compact  -         high
t7        high    -     sub      -         low
t8        low     -     sub      -         low
information is needed. Actually, one can easily verify that two of the boundary cars are a pair of contradictory tuples, for they have the same condition attribute values but different decision attribute values.

Example 2.2. To classify the tuples in the boundary area, we add a new condition attribute Weight of cars and collect its value for each car. The result is illustrated in Table 2. With the condition attributes set C = {Door, Size, Cylinder, Weight}, we can recalculate the lower and upper approximations and the boundary area: now every elementary set induced by C falls entirely within one Mileage class, so Lower_C(D) = Upper_C(D) = U and the boundary area is empty.
One of the nice features of rough sets theory is that it can tell whether the data is complete or not based on the data itself. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, rough sets theory can also determine whether there is redundant information in the data and find the minimum data needed for a classification model. This property of rough sets theory is very important for applications where domain knowledge is very limited or data collection is very expensive or laborious, because it makes sure the data collected is just good enough to build a good classification model, without sacrificing the accuracy of the model or wasting time and effort gathering extra information about the objects. Furthermore, rough sets theory classifies all the attributes into three categories: core attributes, reduct attributes and dispensable attributes. Core attributes carry the essential information needed to classify the data set correctly and should be retained; dispensable attributes are the redundant ones and should be eliminated; and reduct attributes are in between. Depending on the combination of the attributes, in some cases a reduct attribute is not necessary, while in other situations it is essential.
Definition 2.4. An attribute a ∈ C is a dispensable attribute in C with respect to D if

Lower_C(D) = Lower_{C − {a}}(D).
Definition 2.5. An attribute a ∈ C is an indispensable (core) attribute in C with respect to D if

Lower_C(D) ≠ Lower_{C − {a}}(D).

The set of all indispensable attributes of C with respect to D is called the core.

Definition 2.6. A subset R of C is a reduct of C with respect to D if

1. Lower_R(D) = Lower_C(D), and
2. no proper subset of R satisfies condition 1, i.e., R is minimal.
From Definition 2.6, one can see that a reduct is a minimal subset of the entire condition attributes set that has the same classification capability as the original condition attributes set. One can easily show that, in Table 2, {Weight, Size} is a reduct, because

Lower_{Weight, Size}(D) = Lower_C(D) = U,

and neither {Weight} nor {Size} alone preserves this equality.

Definition 2.7. An attribute a ∈ C is a reduct attribute if a is part of a reduct.

Every reduct must contain all the core attributes [14]. (We will explain the algorithm to find a reduct in the later part of the paper.) So Weight, Size and Cylinder are reduct attributes. According to rough set theory, based on the data in Table 2, in order to make a good classification model for the decision attribute Mileage, we need the Weight information of the cars plus either Size or Cylinder, but not necessarily both at the same time.
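To make Definitions 2.4-2.6 concrete, a reduct can be checked through the positive region, the union of the lower approximations of all decision classes. The following Python sketch uses a small hypothetical table (our own attribute names and values, not the car data):

```python
def partition(table, attrs):
    """Equivalence classes induced by the attribute list `attrs`."""
    classes = {}
    for tid, row in table.items():
        classes.setdefault(tuple(row[x] for x in attrs), set()).add(tid)
    return list(classes.values())

def pos_region(table, cond_attrs, dec_attrs):
    """Union of the lower approximations of all decision classes."""
    dec_classes = partition(table, dec_attrs)
    pos = set()
    for cls in partition(table, cond_attrs):
        if any(cls <= d for d in dec_classes):
            pos |= cls
    return pos

def is_reduct(table, subset, cond_attrs, dec_attrs):
    """Reduct: same positive region as the full condition set, and minimal."""
    full = pos_region(table, cond_attrs, dec_attrs)
    if pos_region(table, subset, dec_attrs) != full:
        return False
    return all(pos_region(table, [x for x in subset if x != a], dec_attrs) != full
               for a in subset)

# Toy table: d = 'y' iff a == b, and c always equals a (c is redundant given a).
T = {1: {'a': 0, 'b': 0, 'c': 0, 'd': 'y'},
     2: {'a': 0, 'b': 1, 'c': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'c': 1, 'd': 'n'},
     4: {'a': 1, 'b': 1, 'c': 1, 'd': 'y'}}

C, D = ['a', 'b', 'c'], ['d']
print(is_reduct(T, ['a', 'b'], C, D))   # True: {a, b} classifies and is minimal
print(is_reduct(T, ['c', 'b'], C, D))   # True: c carries the same information as a
print(is_reduct(T, C, C, D))            # False: the full set is not minimal
```

As in the car example, this toy table has two reducts, and their intersection {b} is the core.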
3. Redefinition of Rough Set Concepts Based on Database Operations

(1) One drawback of rough sets theory is that it is defined through the indiscernibility relation, a set of tuple pairs {(t1, t1), (t1, t2), ...} over U; explicitly constructing this relation requires examining every pair of tuples, which is prohibitively expensive for large tables.
(2) Another drawback of rough sets theory is the inefficiency in computation, which limits its suitability for large data sets in real-world applications. In order to find the reducts, core and dispensable attributes, the rough sets model needs to construct all the equivalence classes based on the attribute values of the condition and decision attributes. This process is very time-consuming, and thus the model is very inefficient and infeasible, and doesn't scale for large data sets, which are very common in data mining applications [5, 8]. Some new algorithms to overcome this inefficiency have been developed [1, 13].
Our research investigation of the inefficiency problem of the rough sets model finds that most rough set models do not integrate with relational database systems, and a lot of the basic operations of these computations are performed in flat files rather than utilizing the high-performance database set operations. Considering this, and influenced by [8, 3], we borrow the main ideas of rough sets theory and redefine them using database theory to utilize the very efficient set-oriented database system operations. Almost all the operations in rough sets computation used in our method can be performed using database system operations such as projection, join, and count. In this section, we will give our new definitions of core attributes, dispensable attributes and reducts based on database operations. Two new algorithms for finding core attributes and feature selection based on our new model will be presented in the following sections.
As pointed out in [14], all the core attributes are an indispensable part of every reduct. So it is very important to have an efficient way to find all the core attributes in order to get a reduct, a minimal subset of the entire condition attributes set. In the traditional rough set model, a popular method to get the core attributes is to construct a decision matrix first, then search all the entries in the decision matrix to find those entries with only one attribute. If an entry in the decision matrix contains only one attribute, that attribute is a core attribute [2].
For example, the decision matrix generated from Table 2 is shown in Table 3. This decision matrix contains, for each pair of equivalence classes induced by the decision attribute ([Mileage = high] versus [Mileage = low]), the condition attributes whose values are not identical between the two classes. In the decision matrix, only Weight appears in the entries that have a single attribute. Thus, according to the decision matrix in Table 3, the only core attribute in the condition attributes set for the data in Table 2 is Weight.
Table 3. The decision matrix generated from Table 2. Each entry lists the differing condition attributes (among Door, Size, Cylinder, Weight) for a pair of tuples from different Mileage classes; Weight is the only single-attribute entry. (The individual entries are not recoverable from the source.)
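The decision-matrix route to the core can be sketched as follows (again on a hypothetical toy table rather than Table 2): every pair of tuples from different decision classes contributes an entry listing the condition attributes on which they differ, and the attributes of single-attribute entries form the core.

```python
from itertools import combinations

def core_from_decision_matrix(table, cond_attrs, dec_attrs):
    """Traditional core computation: scan all tuple pairs with different
    decision values; a singleton matrix entry identifies a core attribute."""
    core = set()
    for i, j in combinations(table, 2):
        ri, rj = table[i], table[j]
        if all(ri[x] == rj[x] for x in dec_attrs):
            continue                        # same decision class: no entry
        entry = [x for x in cond_attrs if ri[x] != rj[x]]
        if len(entry) == 1:
            core.add(entry[0])              # only one attribute separates i and j
    return core

# Toy table (our own): d = 'y' iff a == b, and c duplicates a.
T = {1: {'a': 0, 'b': 0, 'c': 0, 'd': 'y'},
     2: {'a': 0, 'b': 1, 'c': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'c': 1, 'd': 'n'},
     4: {'a': 1, 'b': 1, 'c': 1, 'd': 'y'}}

print(core_from_decision_matrix(T, ['a', 'b', 'c'], ['d']))   # {'b'}
```

Note the quadratic number of pairs examined; this is precisely the cost the projection-based definitions below avoid.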
This method is very inefficient, and it is not realistic to construct a decision matrix for millions of tuples in the data table, which is a typical situation for data mining applications. Some authors have presented approaches to finding core attributes and reducts without constructing the decision matrix. For example, an approach developed in [13] first constructs the discernibility relation by sorting the data tuples (objects) in the data table, then uses the discernibility relation to build the lower and upper approximations, and finally applies the approximations to find a semi-minimal reduct. The algorithms presented in [13] run in at least O(n log n) time, where m is the number of attributes and n is the number of tuples, because sorting the tuples takes O(n log n) time. In this section, however, we attempt to propose a new approach to find core attributes and reducts without calculating the lower and upper approximations, and our algorithm for finding core attributes takes only O(mn) time. For this purpose, we will redefine the core attributes and reducts.
Following the relational algebra, in this paper we use Π_B(T) to denote the projection of the table T onto the attribute set B, and Card(X) for the cardinality of X, i.e., the number of distinct tuples. For example, Card(Π_{C ∪ D}(T)) is the number of distinct rows of T projected onto all condition and decision attributes.

Definition 3.1. An attribute a ∈ C is a core attribute if it satisfies the following condition:

Card(Π_{(C − {a}) ∪ D}(T)) ≠ Card(Π_{C − {a}}(T)).
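Definition 3.1 reduces the core test to comparing two projection cardinalities. A minimal Python sketch (toy data and attribute names of our own choosing):

```python
def card_proj(table, attrs):
    """Card(Pi_attrs(T)): number of distinct rows after projecting on attrs."""
    return len({tuple(row[x] for x in attrs) for row in table.values()})

def is_core(table, a, cond_attrs, dec_attrs):
    """Definition 3.1: a is core iff dropping it changes the relationship
    between the projection with and without the decision attributes."""
    rest = [x for x in cond_attrs if x != a]
    return card_proj(table, rest + dec_attrs) != card_proj(table, rest)

# Toy table: d = 'y' iff a == b, and c duplicates a.
T = {1: {'a': 0, 'b': 0, 'c': 0, 'd': 'y'},
     2: {'a': 0, 'b': 1, 'c': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'c': 1, 'd': 'n'},
     4: {'a': 1, 'b': 1, 'c': 1, 'd': 'y'}}

print([x for x in ['a', 'b', 'c'] if is_core(T, x, ['a', 'b', 'c'], ['d'])])
# ['b'] -- a and c can substitute for each other, so neither is core
```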
To see why this condition captures the core attributes, suppose a is a core attribute. Then there exist two tuples ti and tj such that

ti[C − {a}] = tj[C − {a}], ti[a] ≠ tj[a], and ti[D] ≠ tj[D].

In this case, a projection on C − {a} will have at least one fewer row than the projection on (C − {a}) ∪ D, because ti and tj, being identical on C − {a}, are combined in the former projection. However, in the projection on (C − {a}) ∪ D, ti and tj are still distinguishable, because they differ on D. So eliminating the attribute a will lose the ability to distinguish tuples ti and tj. Intuitively this means that some classification information is lost after a is eliminated.

For example, in Table 2, the two formerly contradictory cars have the same values on all the condition attributes except Weight; the two tuples belong to different classes because they differ on the value of Weight. Thus, Weight is the only attribute that distinguishes between them. If Weight is eliminated, the two tuples become indistinguishable. So Weight is a core attribute for the table.
Definition 3.2. An attribute a ∈ C is a dispensable attribute in C with respect to D if the classification result of each tuple is not affected without using a, that is,

Card(Π_{(C − {a}) ∪ D}(T)) = Card(Π_{C − {a}}(T)).

This definition characterizes that an attribute is dispensable if each tuple in the data table can be classified in the same way no matter whether the attribute is present or not. We can check whether an attribute a is dispensable by using some SQL operations. We only need to take two projections of the table: one on the attribute set (C − {a}) ∪ D, and the other on C − {a}. If the cardinality of the two projection tables is the same, then no information is lost in removing attribute a; otherwise, it indicates that a is relevant and should be reinstated. For example, in Table 2, one can verify in this way that Door is a dispensable attribute.
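The dispensability check is exactly two COUNT-over-DISTINCT queries. A sketch with SQLite (hypothetical table and column names, not the car data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a INT, b INT, c INT, d TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?)",
                 [(0, 0, 0, 'y'), (0, 1, 0, 'n'), (1, 0, 1, 'n'), (1, 1, 1, 'y')])

def card(cols):
    # Card(Pi_cols(T)) as a COUNT over a DISTINCT projection
    sql = f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM T)"
    return conn.execute(sql).fetchone()[0]

# c is dispensable: projecting it away merges no tuples that differ on d
print(card("a, b, d") == card("a, b"))   # True  -> c is dispensable
# b is not: without it, tuples merge on C-{b} but still split on d
print(card("a, c, d") == card("a, c"))   # False -> b is indispensable
```

Since the projections are computed inside the database engine, no tuple ever has to be materialized in a flat file.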
In order to find reducts of the condition attributes set with respect to the decision attributes set, we define the degree of dependency of the decision attributes on the condition attributes.

Definition 3.3. The degree of dependency K(C, D) of the decision attributes set D on the condition attributes set C is defined as

K(C, D) = Card(Π_C(T)) / Card(Π_{C ∪ D}(T)).

The value K(C, D) is the proportion of those tuples in the decision table that can be classified. This value characterizes the ability to predict the decision classes from the tuples in the decision table.

With the dependency measure between the condition attributes and the decision attributes, we can redefine the reduct of the condition attributes set with respect to the decision attributes set in a data table as follows.

Definition 3.4. A subset R of the condition attributes set C is a reduct of C with respect to D if K(R, D) = K(C, D) and no proper subset of R satisfies this equality.
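The degree of dependency is a single division of two projection cardinalities; a Python sketch (toy data of our own):

```python
def card_proj(table, attrs):
    """Number of distinct rows after projecting on attrs."""
    return len({tuple(row[x] for x in attrs) for row in table.values()})

def dependency(table, cond_attrs, dec_attrs):
    """K(C, D) = Card(Pi_C(T)) / Card(Pi_{C u D}(T)); equals 1.0 iff the
    table is consistent (every condition pattern has one decision value)."""
    return card_proj(table, cond_attrs) / card_proj(table, cond_attrs + dec_attrs)

# Tuples 1 and 2 are inconsistent: same condition values, different class.
T = {1: {'a': 0, 'd': 'y'},
     2: {'a': 0, 'd': 'n'},
     3: {'a': 1, 'd': 'y'}}

print(dependency(T, ['a'], ['d']))   # 2/3
```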
4. The Algorithm for Finding Core Attributes
In this section, we will propose a new algorithm based on the concepts defined in the previous section and database operations to find all the core attributes of a decision table. The data table, however, may contain inconsistent tuples. Two tuples are said to be inconsistent if they have the same values on the condition attributes but are labeled as different classes (having different values on the decision attributes). Inconsistent tuples cannot be classified. Thus, the inconsistent tuples should be eliminated from the data table before the classification process proceeds. Here we assume the inconsistent tuples are noisy data; otherwise, more attributes and values should be collected to ensure the data table is consistent. Our new algorithm is based on the database systems operations, without calculation of the lower and upper approximations. Compared with the original rough set approach, our algorithm is efficient and scalable.

For simplicity, we first assume that the data table is inconsistency-free, and then we will discuss how to efficiently detect and eliminate the inconsistent tuples from the data table based on our new rough set model.
The new algorithm for finding all core attributes based on the relational database system is described in Algorithm 1.

Algorithm 1 consists of two steps. The first step simply initializes the core attributes set Core to be empty. The second step checks all condition attributes one by one to see if they are core attributes according to Definition 3.1; if yes, they are added to Core. The following theorem ensures that the set Core returned by Algorithm 1 contains all core attributes and only those attributes. Before we present and prove the theorem, let's first present two facts.
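Algorithm 1 itself is a single loop over the condition attributes, applying the Definition 3.1 test to each. A Python sketch (toy data and names of our own, not the paper's car table):

```python
def find_core(table, cond_attrs, dec_attrs):
    """Algorithm 1 sketch: Core starts empty; attribute a is added when
    Card(Pi_{(C-{a}) u D}(T)) != Card(Pi_{C-{a}}(T))."""
    def card(attrs):
        return len({tuple(row[x] for x in attrs) for row in table.values()})
    core = []
    for a in cond_attrs:                         # one projection pair per attribute
        rest = [x for x in cond_attrs if x != a]
        if card(rest + dec_attrs) != card(rest):
            core.append(a)
    return core

# Toy table: d = 'y' iff a == b, and c duplicates a.
T = {1: {'a': 0, 'b': 0, 'c': 0, 'd': 'y'},
     2: {'a': 0, 'b': 1, 'c': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'c': 1, 'd': 'n'},
     4: {'a': 1, 'b': 1, 'c': 1, 'd': 'y'}}

print(find_core(T, ['a', 'b', 'c'], ['d']))   # ['b']
```

In a real deployment each `card` call would be a COUNT over a DISTINCT projection executed by the database engine, as discussed below.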
Fact 4.1. The data table T is consistent if and only if Card(Π_C(T)) = Card(Π_{C ∪ D}(T)).

One can find out that Table 1 is inconsistent, while Table 2 is consistent.

Fact 4.2. Suppose U/C = {C1, C2, ..., Cp} and U/(C − {a}) = {E1, E2, ..., Eq} are the sets of equivalence classes induced by C and C − {a}, respectively. Then for each Ci ∈ U/C and each Ej ∈ U/(C − {a}), either Ci ∩ Ej = ∅ or Ci ⊆ Ej. U/C is said to be a refinement of U/(C − {a}).

Let U/C = {C1, C2, ..., Cp} and U/(C − {a}) = {E1, E2, ..., Eq} be the sets of equivalence classes induced by C and C − {a}, respectively, and let U/D = {D1, D2, ..., Dr} be the set of equivalence classes induced by D. We have the following two propositions.

Proposition 4.1. ∀a ∈ C, if Lower_C(D) ≠ Lower_{C − {a}}(D), then

Card(Π_{(C − {a}) ∪ D}(T)) ≠ Card(Π_{C − {a}}(T)).

Proof:
By Fact 4.2, every equivalence class induced by C is contained in some equivalence class induced by C − {a}, so Lower_{C − {a}}(D) ⊆ Lower_C(D). Because Lower_C(D) ≠ Lower_{C − {a}}(D) by the given condition, there exists a tuple t such that t ∈ Lower_C(D) and t ∉ Lower_{C − {a}}(D). Hence the equivalence class of t induced by C − {a} is contained in no single decision class; that is, there exist tuples ti and tj in this class with ti[D] ≠ tj[D]. From the above, one can see that ti and tj are projected to be the same by Π_{C − {a}}(T), but different by Π_{(C − {a}) ∪ D}(T). Thus, we conclude that Π_{(C − {a}) ∪ D}(T) has at least one more distinct tuple than Π_{C − {a}}(T), and the proposition holds. □
Proposition 4.2. Assume T is consistent. ∀a ∈ C, if

Card(Π_{(C − {a}) ∪ D}(T)) ≠ Card(Π_{C − {a}}(T)),

then Lower_C(D) ≠ Lower_{C − {a}}(D).

Proof:
According to the given condition, there exist tuples ti and tj that are projected to be the same by Π_{C − {a}}(T) but distinct by Π_{(C − {a}) ∪ D}(T); that is,

ti[C − {a}] = tj[C − {a}] and ti[D] ≠ tj[D].

Hence ti and tj belong to the same equivalence class induced by C − {a} but to different decision classes, so this equivalence class is contained in no single decision class, and therefore ti ∉ Lower_{C − {a}}(D). On the other hand, because T is consistent, by Fact 4.1 every tuple belongs to the lower approximation of D with respect to C, so ti ∈ Lower_C(D). Therefore, Lower_C(D) ≠ Lower_{C − {a}}(D). □
By Proposition 4.1 and Proposition 4.2, we immediately have:

Theorem 4.1. If the data table T is consistent, then ∀a ∈ C, a is a core attribute in C with respect to D if and only if

Card(Π_{(C − {a}) ∪ D}(T)) ≠ Card(Π_{C − {a}}(T)).

Theorem 4.1 reveals that the new definitions of the core and dispensable attributes in our new rough set model are equivalent to the corresponding definitions in the traditional rough set model [14, 15].
Algorithm 1 can be implemented using SQL statements as follows: each of the two cardinalities

Card(Π_{(C − {a}) ∪ D}(T)) and Card(Π_{C − {a}}(T))

is obtained with a COUNT over the corresponding DISTINCT projection of T; the two counts are then compared, and a is reported as a core attribute when they differ.

Theorem 4.2. Algorithm 1 takes O(mn) time, where m is the number of condition attributes and n is the number of tuples in T.

Proof:
The for loop is executed m times, and inside each loop, finding the cardinality takes O(n) time. Therefore, the total running time is O(mn). □
From the above theorem, it is clear that the time complexity of our SQL-based algorithm for finding core attributes is linear with respect to the number of objects and attributes, which is better than the O(n log n) sorting cost underlying the algorithm presented in [13]. We do not, however, include the time complexity caused by the SQL queries themselves, which forms one of our future works, to be studied by experiments with large data sets.
Furthermore, most real applications contain noisy data, and some tuples in the data table may be inconsistent. Our model based on the relational database operations also provides an efficient means of detecting and eliminating the inconsistency of the data table.

By Fact 4.1, to detect the inconsistency of the data table, we only need to test whether Card(Π_C(T)) = Card(Π_{C ∪ D}(T)). If they are not equal, there exist inconsistent tuples in the data table. The inconsistent tuples can be identified by means of a Condition-Join of the table with itself.

Condition-Join: T1 ⋈_cond T2 is a new data table that consists of those tuples of T1 × T2 satisfying the join condition cond, where T1 × T2 is the Cartesian product of T1 and T2.

The inconsistent tuples of T are exactly those returned by the following SQL statement:

SELECT * FROM T U
WHERE EXISTS (SELECT * FROM T V
              WHERE (U.C = V.C) AND (U.D <> V.D))

If the database system implements the Condition-Join by using a hash table (e.g., in Oracle 9i, one can set the system variable HASH_JOIN_ENABLED to true), then detecting and eliminating inconsistent tuples from the data table only takes linear time. Thus, we have:
Theorem 4.3. Based on our rough set model, the inconsistency of a data table can be detected in O(n) time, and the inconsistent tuples can be eliminated in O(n) time, where n is the number of tuples in the data table.
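Both the Fact 4.1 test and the EXISTS-based elimination can be run as ordinary SQL. A SQLite sketch with a hypothetical two-condition-attribute table (our own names and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a INT, b INT, d TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)",
                 [(0, 0, 'y'), (0, 0, 'n'), (1, 0, 'y'), (1, 1, 'n')])

def card(cols):
    """Card(Pi_cols(T)) as a COUNT over a DISTINCT projection."""
    return conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM T)").fetchone()[0]

# Fact 4.1: T is consistent iff Card(Pi_C(T)) = Card(Pi_{C u D}(T))
print(card("a, b") == card("a, b, d"))   # False: (0,0,'y') vs (0,0,'n') conflict

# Remove the inconsistent tuples with the correlated EXISTS query
conn.execute("""DELETE FROM T
                WHERE EXISTS (SELECT * FROM T V
                              WHERE T.a = V.a AND T.b = V.b AND T.d <> V.d)""")
print(conn.execute("SELECT COUNT(*) FROM T").fetchone()[0])   # 2 rows remain
```

After the DELETE, the consistency test passes, so the cleaned table can be fed to Algorithm 1.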
5. Feature Selection Based on Reduct Generation
Two kinds of attributes are generally perceived as being unnecessary: attributes that are irrelevant to the target concept (like the Tuple id in our examples), and attributes that are redundant given other attributes. For example, consider the attributes Size and Cylinder in Table 2: either Size or Cylinder, but not necessarily both at the same time, is needed for the classification purpose once the attribute Weight of the cars is present. In actual applications, these two kinds of unnecessary attributes can exist at the same time, and the redundant attributes are more difficult to eliminate because of the interactions between them. In order to reduce both kinds of unnecessary attributes to a minimum, we use feature selection. Feature selection is a process we employ to choose a subset of attributes from the original attributes. Feature selection has been studied intensively in the past decades [6, 7, 11, 12]. The purpose of feature selection is to identify the significant features, eliminate the features that are irrelevant or dispensable to the learning task, and build a good learning model. The benefits of feature selection are twofold: it considerably decreases the running time of the induction algorithm, and it increases the accuracy of the resulting model.
All feature selection algorithms fall into two categories: (1) the filter approach and (2) the wrapper
approach. In the filter approach, the feature selection is performed as a preprocessing step to induction.
Some of the well-known filter feature selection algorithms are RELIEF [7] and PRESET [12]. The filter
approach is ineffective in dealing with the feature redundancy. In the wrapper approach [6], the feature
selection is wrapped around an induction algorithm, so that the bias of the operators that define the
search and that of the induction algorithm interact mutually. Though the wrapper approach suffers less
from feature interaction, nonetheless, its running time would make the wrapper approach infeasible in
practice, especially if there are many features, because the wrapper approach keeps running the induction
algorithm on different subsets from the entire attributes set until a desirable subset is identified. We
intend to keep the algorithm bias as small as possible and would like to find a subset of attributes that can generate good results by applying a suite of data mining algorithms. We focus on induction-algorithm-independent feature selection. Our goal is to construct a reasonably fast algorithm that can find a relevant subset of attributes and eliminate the two kinds of unnecessary attributes effectively.
A decision table may have more than one reduct, and any one of them can be used to replace the original table. Finding all the reducts of a decision table is NP-hard [9]. Fortunately, in many real applications it is usually not necessary to find all of them; one is sufficient. A natural question is which reduct is the best if there exists more than one. The selection depends on the optimality criterion associated with the attributes. If it is possible to assign a cost function to attributes, then the selection can be naturally based on the combined minimum cost criterion. In the absence of an attribute cost function, the only source of information to select the reduct is the contents of the data table [12]. For simplicity, we adopt the criteria that the best reduct is the one with the minimal number of attributes, and that if there are two or more reducts with the same number of attributes, then the reduct with the least number of combinations of values of its attributes is selected.
With these considerations in mind, we propose a rough set based filter feature selection algorithm in this section, which is illustrated in Algorithm 2.

Because all core attributes must be contained in every reduct, Algorithm 2 first calls Algorithm 1 to find all core attributes and initializes the set of removal candidates with the complement of the core attributes set against the condition attributes set. Then the algorithm ranks the attributes based on the attributes' merit and adopts the backward elimination approach to remove the redundant attributes. When two or more attributes have the same merit value, the attribute with the least number of possible values is removed. This process is repeated until a reduct is generated.
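Under the stated criteria, a backward-elimination sketch of Algorithm 2 can be written as follows. This is our own simplification (toy data; the merit ranking here is a plain distinct-value-count proxy, not the paper's exact merit measure):

```python
def card(table, attrs):
    return len({tuple(row[x] for x in attrs) for row in table.values()})

def dependency(table, cond, dec):
    """K(C, D) = Card(Pi_C) / Card(Pi_{C u D})."""
    return card(table, cond) / card(table, cond + dec)

def find_core(table, cond, dec):
    return {a for a in cond
            if card(table, [x for x in cond if x != a] + dec)
               != card(table, [x for x in cond if x != a])}

def select_features(table, cond, dec):
    """Algorithm 2 sketch: keep the core, then backward-eliminate the other
    attributes while K(R, D) stays equal to K(C, D)."""
    k_full = dependency(table, cond, dec)
    core = find_core(table, cond, dec)
    reduct = list(cond)
    # Rank non-core attributes (here: most distinct values first) and try
    # to drop each; keep it only if the dependency degree would decrease.
    for a in sorted((x for x in cond if x not in core),
                    key=lambda x: -card(table, [x])):
        trial = [x for x in reduct if x != a]
        if dependency(table, trial, dec) == k_full:
            reduct = trial            # a was redundant: eliminate it
    return reduct

# Toy table: d = 'y' iff a == b, and c duplicates a.
T = {1: {'a': 0, 'b': 0, 'c': 0, 'd': 'y'},
     2: {'a': 0, 'b': 1, 'c': 0, 'd': 'n'},
     3: {'a': 1, 'b': 0, 'c': 1, 'd': 'n'},
     4: {'a': 1, 'b': 1, 'c': 1, 'd': 'y'}}

print(select_features(T, ['a', 'b', 'c'], ['d']))   # ['b', 'c']
```

The result contains the core attribute b plus exactly one of the two interchangeable attributes, mirroring how the car example keeps Weight plus either Size or Cylinder.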
There are many algorithms developed to find reducts, but most of them suffer from performance problems because they are not integrated into relational database systems and all the related computation is performed on flat files [9, 10]. In our algorithms presented above, all the calculations, such as projection cardinalities, attribute dependency, and merit values, utilize the database set operations. With this algorithm, we can get a reduct, either {Weight, Size} or {Weight, Cylinder}, from the data in Table 2. For each reduct, we can derive a reduct table from the original table. For example, the reduct table based on the reduct {Weight, Size}, shown in Table 4, is generated by projecting out the attributes Door and Cylinder from Table 2; it can still produce a correct classification model. {Weight, Size} is a minimum subset and cannot be reduced further without sacrificing the accuracy of the classification model. If we create another table from Table 4 by removing Size, shown in Table 5, the resulting table can no longer distinguish some pairs of cars that have the same Weight value but belong to different Mileage classes, although they are distinguishable in the reduct table Table 4.
In summary, our algorithm has many advantages over existing methods:

(1) it is effective and efficient in eliminating irrelevant and redundant features with strong interaction by searching for relevant and important features;

(2) it is feasible for applications with hundreds or thousands of features by using database systems operations; and

(3) it is tolerant to inconsistency in the data table by considering the dependency between the condition attributes and the decision attributes.

[Table 4 (the reduct table with attributes Tuple id, Weight, Size and Mileage) and Table 5 (Table 4 with Size removed) are not recoverable from the source.]
6. Conclusion

Rough sets theory has been applied successfully in many disciplines. One of the major limitations of the traditional rough sets model in real applications is the inefficiency in the computation of core attributes and the generation of reducts. Most existing rough set models do not integrate with database systems, and a lot of computationally intensive operations, such as discernibility relation computation, core attribute search, reduct generation, and rule induction, are performed on flat files, which limits their applicability for large data sets in data mining applications.

In order to improve the efficiency of computing core attributes and reducts, many novel approaches have been developed. In this paper, we proposed a new rough set model using relational algebra operations such as projection, join, and count, taking advantage of the efficient data organizations and SQL query algorithms developed and implemented in most relational database systems. We borrowed the main ideas of traditional rough sets theory and redefined them based on relational algebra and database systems to take advantage of their very efficient set-oriented operations. Our model is based on the following ideas: effective object pre-processing and organization, such as sorting and indexing [13], and the efficient SQL queries implemented in most RDBMS systems [3, 8], help improve the algorithm efficiency. With this new model, we presented two new algorithms to calculate core attributes and reducts for feature selection.
Our algorithm for finding core attributes is efficient and the outcome is proved to be correct based
on the definition of the traditional rough set model. We also discussed the detection and elimination of
inconsistent tuples from the data table. Our feature selection algorithm identifies a reduct efficiently and
reduces the data set significantly without losing essential information. Almost all the operations used in generating core attributes and reducts in our method can be performed with the set-oriented operations of database systems. Since these relational algebra operations have been efficiently implemented in most
widely-used relational database systems like Oracle, DB2, Sybase, etc., the algorithms presented in this
paper can be extensively applied to relational database systems and adapted to a wide range of real-life
applications. Another merit of these algorithms is their scalability: existing relational database systems have demonstrated that their implementations of relational algebra operations are suitable for processing very large data sets. As a result, our method is more efficient and scalable than traditional rough set based data mining approaches.
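The detection and elimination of inconsistent tuples discussed above likewise maps to a single set-oriented statement. A hedged `sqlite3` sketch (table and column names are illustrative, not from the paper): two tuples conflict when they agree on every condition attribute but differ on the decision attribute.

```python
# Hedged sketch: removing inconsistent tuples with one correlated DELETE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT, d TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("x", "1", "yes"), ("x", "1", "no"),  # conflicting pair
                  ("y", "2", "yes")])                   # consistent tuple

# Delete every tuple that shares its condition values (a, b) with some
# tuple carrying a different decision value d.
conn.execute("""
    DELETE FROM t WHERE EXISTS (
        SELECT 1 FROM t AS t2
        WHERE t2.a = t.a AND t2.b = t.b AND t2.d <> t.d
    )
""")
remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # only ("y", "2", "yes") survives -> 1
```

Because the conflict test is a join on the condition attributes, the database optimizer can serve it from the same sort or index used for the projection-cardinality queries.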
Our future work will focus on experiments with this model on large data sets stored in database systems, as well as on applications of this model to feature selection and rule induction for knowledge discovery in very large data sets.
References
[1] Bazan, J., Nguyen, H., Nguyen, S., Synak, P., Wroblewski, J., Rough set algorithms in classification problems, Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information
Systems, L. Polkowski, T. Y. Lin, and S. Tsumoto (eds), 49-88, Physica-Verlag, Heidelberg, Germany, 2000
[2] Cercone, N., Ziarko, W., Hu, X., Rule Discovery from Databases: A Decision Matrix Approach, Proc. of ISMIS, Zakopane, Poland, 653-662, 1996
[3] Fernandez-Baizan, A., Ruiz, E., Sanchez, J., Integrating RDMS and Data Mining Capabilities Using Rough Sets, Proc. IPMU, Granada, Spain, 1996
[4] Garcia-Molina, H., Ullman, J. D., Widom, J., Database System Implementation, Prentice Hall, 2000.
[5] Hu, X., Cercone, N., Han, J., Ziarko, W., GRS: A Generalized Rough Sets Model, Data Mining, Rough Sets and Granular Computing, T. Y. Lin, Y. Y. Yao, and L. Zadeh (eds), Physica-Verlag, 447-460, 2002
[6] John, G., Kohavi, R., Pfleger, K., Irrelevant Features and the Subset Selection Problem, Proc. ICML, 121-129, 1994
[7] Kira, K., Rendell, L. A., The Feature Selection Problem: Traditional Methods and a New Algorithm, Proc. AAAI, MIT Press, 129-134, 1992
[8] Kumar A., New Techniques for Data Reduction in Database Systems for Knowledge Discovery Applications,
Journal of Intelligent Information Systems, 10(1), 31-48, 1998
[9] Lin, T. Y., Cercone, N. (eds), Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic Publishers, 1997
[10] Lin, T. Y., Yao, Y. Y., Zadeh, L. (eds), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, 2002
[11] Liu, H., Motoda, H. (eds), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic Publishers, 1998
[12] Modrzejewski, M., Feature Selection Using Rough Sets Theory, Proc. ECML, 213-226, 1993
[13] Nguyen, H., Nguyen, S., Some Efficient Algorithms for Rough Set Methods, Proc. IPMU, Granada, Spain, 1451-1456, 1996
[14] Pawlak, Z., Rough Sets, International Journal of Computer and Information Sciences, 11(5), 341-356, 1982
[15] Pawlak, Z., Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1992
[17] Polkowski, L., Skowron, A., Rough mereology: A new paradigm for approximate reasoning, International Journal of Approximate Reasoning, 15(4), 333-365, 1996
[18] Skowron, A., Rauszer C., The Discernibility Matrices and Functions in Information Systems, Intelligent
Decision Support - Handbook of Applications and Advances of the Rough Sets Theory, K. Slowinski (ed),
Kluwer, Dordrecht, 331-362, 1992
[19] Skowron, A., Stepaniuk, J., Tolerance approximation spaces, Fundamenta Informaticae 27(2-3), 245-253,
1996
[20] Stepaniuk, J., Generalized approximation spaces, Proc. of the 3rd International Workshop on Rough Sets
and Soft Computing, San Jose State University, San Jose, California, 156-163, 1994
[21] Ziarko, W., Variable Precision Rough Set Model, Journal of Computer and System Sciences, 46(1), 39-59,
1993