
Chapter 3

Rough Set Concepts

Rough set theory was developed by Zdzislaw Pawlak in the early 1980s. Since then there has been rapid growth of interest in rough set theory and its applications. It deals with the classificatory analysis of data tables, and the main goal of rough set analysis is to synthesize approximations of concepts from the acquired data. This chapter introduces the basic underlying concepts and terminology related to rough sets. It also explains the classical rough set method for mining rules from data sets.

3.1 Introduction
The philosophy of Rough Set (RS) theory is founded on the assumption that with every object of the universe of discourse we associate some information. For example, if objects are patients suffering from a certain disease, symptoms of the disease form information about the patients. In view of the available information, objects characterized by the same values of the corresponding attributes are indiscernible (similar). The indiscernibility relation generated in this way is the mathematical basis of rough set theory. Any set of all indiscernible objects is called an elementary set, and forms a basic granule of knowledge about the universe. Any union of some elementary sets is referred to as a crisp (precise) set; otherwise the set is rough (imprecise, vague). Consequently, each rough set has boundary-line cases, while crisp sets have no boundary-line elements at all. In the rough set approach, a vague concept is replaced by a pair of well defined concepts called the lower and the upper approximation of the vague concept. The lower approximation consists of all objects which surely belong to the concept and the upper approximation contains all objects which possibly belong to the concept.

3.2 Information System and Decision Table


In rough set theory, knowledge is a collection of facts expressed in terms of the values of attributes that describe the objects. These facts are represented in the form of a data table in which the entries of a row describe one object. A data table of this kind is called an information system, attribute-value table or information table. Formally, an information system S is a 4-tuple S = (U, Q, V, f), where U is a non-empty, finite set of objects called the universe; Q is a finite set of attributes; V = ∪ V_q, q ∈ Q, with V_q being the domain of attribute q; and f : U × Q → V is the information function assigning to every object in the universe U a value from the domain V_q of each attribute q.

In many applications, there is an outcome of classification that is known. This posterior knowledge is expressed by one distinguished attribute called the decision attribute. Information systems of this kind are called decision systems. Formally, a decision table is an information system in which Q = A ∪ D, where A is the set of condition attributes and D is a set of decision attributes. In RS, the decision table represents either a full or a partial dependency occurring in the data.

Example 3.1
Table 3.1 is an example of an information system containing data about six patients. Columns of the table are labelled by attributes (symptoms) and rows by objects (patients), whereas entries of the table are attribute values.

Patients p2, p3 and p5 are indiscernible with respect to the attribute Headache, patients p3 and p6 are indiscernible with respect to the attributes Muscle-pain, Temperature and Flu, and patients p2 and p5 are indiscernible with respect to the attributes Headache, Muscle-pain and Temperature. Hence, for example, the attribute Headache generates two elementary sets {p2, p3, p5} and {p1, p4, p6}. Considering Flu as the decision attribute and Headache, Muscle-pain and Temperature as condition attributes, the table may be called a decision system.

Table 3.1: Example of an Information System: Flu Dataset

Patient   Headache   Muscle-pain   Temperature   Flu

p1        no         yes           high          yes
p2        yes        no            high          yes
p3        yes        yes           very high     yes
p4        no         yes           normal        no
p5        yes        no            high          no
p6        no         yes           very high     yes

3.3 Indiscernibility Relation


The most basic concept in rough set theory is the indiscernibility relation, generated by information about the objects of interest. The indiscernibility relation is intended to express that, due to a lack of knowledge, we are unable to discern some objects employing the available information. It means that, in general, we are unable to deal with single objects but rather with granules of indiscernible objects, a fundamental concept of RS theory. Formally, for a subset P ⊆ Q of attributes of an information system S, a relation called the indiscernibility relation, denoted by IND, is defined as

IND_S(P) = {(x, y) ∈ U × U : f(x, a) = f(y, a) ∀ a ∈ P}

If (x, y) ∈ IND_S(P) then the objects x and y are called indiscernible with respect to P. Here f(x, a) and f(y, a) represent the value of the attribute a for the objects x and y respectively. The subscript S may be omitted in IND_S(P) if the information system is implied by the context. IND(P) is an equivalence relation that partitions U into equivalence classes, the sets of objects indiscernible with respect to P. The set of such partitions is denoted by U/IND(P). An equivalence class of IND(P), i.e., the block of the partition U/IND(P) containing object x, is denoted by P(x).

Example 3.2
We have seen in Example 3.1 that the attribute Headache generates two elementary sets, namely {p1, p4, p6} and {p2, p3, p5}.

Thus, U/IND({Headache}) = {{p1, p4, p6}, {p2, p3, p5}}.

Similarly, U/IND({Muscle-pain}) = {{p1, p3, p4, p6}, {p2, p5}},

U/IND({Headache, Muscle-pain}) = {{p1, p4, p6}, {p2, p5}, {p3}} and

U/IND({Flu}) = {{p1, p2, p3, p6}, {p4, p5}}.
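These partitions can be computed mechanically. The following Python sketch (not part of the original text; the names flu_table and partition are illustrative) groups the objects of Table 3.1 by their information signature to obtain U/IND(P):

    # A minimal sketch of computing U/IND(P) for the flu data of Table 3.1.
    from collections import defaultdict

    flu_table = {
        "p1": {"Headache": "no",  "Muscle-pain": "yes", "Temperature": "high",      "Flu": "yes"},
        "p2": {"Headache": "yes", "Muscle-pain": "no",  "Temperature": "high",      "Flu": "yes"},
        "p3": {"Headache": "yes", "Muscle-pain": "yes", "Temperature": "very high", "Flu": "yes"},
        "p4": {"Headache": "no",  "Muscle-pain": "yes", "Temperature": "normal",    "Flu": "no"},
        "p5": {"Headache": "yes", "Muscle-pain": "no",  "Temperature": "high",      "Flu": "no"},
        "p6": {"Headache": "no",  "Muscle-pain": "yes", "Temperature": "very high", "Flu": "yes"},
    }

    def partition(table, attributes):
        """Return U/IND(P): objects grouped by their values on the attributes in P."""
        classes = defaultdict(set)
        for obj, values in table.items():
            key = tuple(values[a] for a in attributes)   # information signature of obj on P
            classes[key].add(obj)
        return list(classes.values())

    print(partition(flu_table, ["Headache"]))                  # [{p1, p4, p6}, {p2, p3, p5}] (element order may vary)
    print(partition(flu_table, ["Headache", "Muscle-pain"]))   # [{p1, p4, p6}, {p2, p5}, {p3}]

The later sketches in this chapter reuse flu_table and partition.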

3.4 Approximation of Sets


We have seen that an equivalence relation induces a partitioning of the universe. These partitions can be used to build new subsets of the universe. Subsets that are most often of interest have the same value of the decision attribute. Let X ⊆ U be a desired subset of the universe. A description for X is sought that can determine the membership status of each object in U with respect to X. The indiscernibility relation is used for this purpose. If an equivalence class defined by IND(P) only partially overlaps with the set X, the membership of the objects in such an equivalence class cannot be determined without ambiguity, and consequently an exact description of such a set X may not be possible. Therefore, the description of X is given in terms of the P-lower approximation (denoted P̲X) and the P-upper approximation (denoted P̄X). For P ⊆ Q,

P̲X = ∪ {Y ∈ U/IND(P) : Y ⊆ X}        (3.1)

P̄X = ∪ {Y ∈ U/IND(P) : Y ∩ X ≠ ∅}        (3.2)

A set X for which P̲X = P̄X is called an exact set; otherwise it is called a rough set with respect to P.

The objects in P̲X can be classified with certainty as members of X on the basis of the knowledge in P, while the objects in P̄X can only be classified as possible members of X on the basis of the knowledge in P. The set

BN_P(X) = P̄X − P̲X

is called the P-boundary region of X, and thus consists of those objects that we cannot decisively classify into X on the basis of the knowledge in P. The set U − P̄X is called the P-outside region of X and consists of those objects which can be classified with certainty as not belonging to X on the basis of the knowledge in P. The boundary region is non-empty for a rough set and empty for a crisp set.

Example 3.3
For Table 3.1, patient p2 suffers from flu whereas patient p5 does not, and they are indiscernible with respect to the attributes Headache, Muscle-pain and Temperature; hence Flu cannot be characterized in terms of these attributes for p2 and p5. Therefore p2 and p5 are boundary-line cases which cannot be properly classified in view of the available knowledge. The remaining patients p1, p3 and p6 display symptoms which enable us to classify them with certainty as suffering from flu, p2 and p5 cannot be excluded as suffering from flu, and p4 for sure does not suffer from flu, in view of the displayed symptoms. Thus, if P = {Headache, Muscle-pain, Temperature} then

P̲(Flu = yes) = {p1, p3, p6}, P̄(Flu = yes) = {p1, p2, p3, p5, p6}, BN_P(Flu = yes) = {p2, p5}

P̲(Flu = no) = {p4}, P̄(Flu = no) = {p2, p4, p5}, BN_P(Flu = no) = {p2, p5}

These approximations are visualized in Figure 3.1.
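As an illustrative sketch (reusing the flu_table dictionary and the partition helper assumed in the earlier sketch), the approximations of Example 3.3 follow directly from definitions (3.1) and (3.2):

    def approximations(table, P, X):
        """Return the (lower, upper) approximations of an object set X w.r.t. attributes P."""
        lower, upper = set(), set()
        for eq_class in partition(table, P):    # equivalence classes Y in U/IND(P)
            if eq_class <= X:                   # Y is contained in X -> definition (3.1)
                lower |= eq_class
            if eq_class & X:                    # Y has elements in common with X -> definition (3.2)
                upper |= eq_class
        return lower, upper

    P = ["Headache", "Muscle-pain", "Temperature"]
    flu_yes = {o for o, v in flu_table.items() if v["Flu"] == "yes"}
    lower, upper = approximations(flu_table, P, flu_yes)
    print(lower)           # {'p1', 'p3', 'p6'}
    print(upper)           # {'p1', 'p2', 'p3', 'p5', 'p6'}
    print(upper - lower)   # boundary region: {'p2', 'p5'}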

3.4.1 Properties of Approximations


Following properties of approximations are easily observable:

(1) P̲(X) ⊆ X ⊆ P̄(X)

(2) P̲(∅) = P̄(∅) = ∅, P̲(U) = P̄(U) = U

(3) P̄(X ∪ Y) = P̄(X) ∪ P̄(Y)

(4) P̲(X ∩ Y) = P̲(X) ∩ P̲(Y)

(5) X ⊆ Y implies P̲(X) ⊆ P̲(Y) and P̄(X) ⊆ P̄(Y)

(6) P̲(X ∪ Y) ⊇ P̲(X) ∪ P̲(Y)

(7) P̄(X ∩ Y) ⊆ P̄(X) ∩ P̄(Y)

(8) P̲(−X) = −P̄(X)

(9) P̄(−X) = −P̲(X)

(10) P̲(P̲(X)) = P̄(P̲(X)) = P̲(X)

(11) P̄(P̄(X)) = P̲(P̄(X)) = P̄(X)

In the above properties −X denotes U − X.

Figure 3.1: Approximating the set of flu patients using attributes Headache, Muscle-pain and Temperature

[Figure: the universe is divided into the lower approximation of Flu = yes, containing the classes {p1}, {p3}, {p6}; the boundary region (Flu = yes / no), containing the class {p2, p5}; and the region Flu = no, containing the class {p4}.]

Notes: the equivalence classes contained in the corresponding regions are:
P̲(Flu = yes) = {p1, p3, p6}
P̄(Flu = yes) = {p1, p2, p3, p5, p6}
BN_P(Flu = yes) = {p2, p5}
P̲(Flu = no) = {p4}
P̄(Flu = no) = {p2, p4, p5}
BN_P(Flu = no) = {p2, p5}

3.4.2 Classes of Rough Sets
Four basic classes of rough sets can be defined as follows:

X is roughly P-definable, iff P̲(X) ≠ ∅ and P̄(X) ≠ U

X is internally P-undefinable, iff P̲(X) = ∅ and P̄(X) ≠ U

X is externally P-undefinable, iff P̲(X) ≠ ∅ and P̄(X) = U

X is totally P-undefinable, iff P̲(X) = ∅ and P̄(X) = U

3.4.3 Accuracy of Approximation


A rough set can also be characterized numerically by the coefficient α_P(X), called the accuracy of approximation and defined as follows:

α_P(X) = |P̲(X)| / |P̄(X)|

where |X| denotes the cardinality of X. If α_P(X) = 1 then X is crisp, while α_P(X) < 1 means X is rough with respect to P.
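For instance, with P = {Headache, Muscle-pain, Temperature} and the approximations obtained in Example 3.3, α_P(Flu = yes) = |{p1, p3, p6}| / |{p1, p2, p3, p5, p6}| = 3/5 = 0.6, so the concept Flu = yes is rough with respect to P.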

3.5 Dependency of Attributes


Another important issue in data analysis is discovering dependencies between attributes. A set of attributes P depends totally on a set of attributes R, denoted R ⇒ P, if all values of attributes in P are uniquely determined by values of attributes in R. In other words, P depends totally on R if there exists a functional dependency between the values of attributes in P and R. A more general concept of dependency of attributes, called a partial dependency of attributes, means that only some values of P are determined by the values of R. Rough set theory introduces a measure of dependency of two subsets of attributes P, R ⊆ Q. The measure is called the degree of dependency of P on R, denoted by γ_R(P). It is defined as

γ_R(P) = card(POS_R(P)) / card(U),  where  POS_R(P) = ∪ {R̲X : X ∈ U/IND(P)}        (3.3)

The set POS_R(P), the positive region, is the set of all the elements of U that can be uniquely classified into the partition U/IND(P) by means of R. The coefficient γ_R(P) represents the fraction of the objects in the universe which can be properly classified. If P depends totally on R then γ_R(P) = 1, else γ_R(P) < 1.

Example 3.4

To understand the usage of equation (3.3) for computing the dependency of Flu on Temperature (Table 3.1), we observe that the attribute Temperature determines uniquely only some values of the attribute Flu. That is, (Temperature, very high) implies (Flu, yes) and, similarly, (Temperature, normal) implies (Flu, no), but (Temperature, high) does not always imply (Flu, yes). Thus there exists a partial dependency between Temperature and Flu. To determine γ_Temperature(Flu) using equation (3.3):

U = {p1, p2, p3, p4, p5, p6} and U/IND(Flu) = {{p1, p2, p3, p6}, {p4, p5}}

POS_Temperature({Flu}) = {p3, p6} ∪ {p4} = {p3, p4, p6}

Thus, γ_Temperature(Flu) = 3/6 = 0.5.

Further, we compute that γ_Headache(Flu) = 0 and γ_Muscle-pain(Flu) = 0.
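The same computation can be expressed compactly in code. The sketch below (again an illustration, reusing flu_table, partition and approximations from the earlier sketches) builds the positive region of equation (3.3) from R-lower approximations of the decision classes:

    def positive_region(table, R, P):
        """POS_R(P): objects that R classifies unambiguously into the classes of U/IND(P)."""
        pos = set()
        for X in partition(table, P):               # each class X in U/IND(P)
            lower, _ = approximations(table, R, X)  # R-lower approximation of X
            pos |= lower
        return pos

    def dependency(table, R, P):
        """Degree of dependency gamma_R(P) = card(POS_R(P)) / card(U), equation (3.3)."""
        return len(positive_region(table, R, P)) / len(table)

    print(dependency(flu_table, ["Temperature"], ["Flu"]))   # 0.5
    print(dependency(flu_table, ["Headache"], ["Flu"]))      # 0.0
    print(dependency(flu_table, ["Muscle-pain"], ["Flu"]))   # 0.0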

3.6 Reduction of Attributes


One often faces the question of whether some data can be removed from a data table while preserving its basic properties, that is, whether a table contains superfluous data. One natural dimension of reducing data is the size of the dataset. This is achieved by identifying equivalence classes and considering only one element of an equivalence class to represent the entire class. The other dimension of reduction is the number of attributes, which is achieved by keeping only those attributes that preserve the indiscernibility relation and, consequently, the set approximations. The rejected attributes are redundant with respect to classification since their removal cannot deteriorate the classification.

3.6.1 Reducts
A minimal set of attributes that preserves the indiscernibility relation is called a reduct. In supervised learning, the reduct relative to the decision attribute is the useful one; hence examples are discussed for decision relative reducts only. The relative reduct of the attribute set P, P ⊆ Q, with respect to the dependency γ_P(Q) is defined as a subset RED(P, Q) ⊆ P such that:

γ_{RED(P,Q)}(Q) = γ_P(Q)

i.e. a relative reduct preserves the degree of inter-attribute dependency.

For any attribute a ∈ RED(P, Q), γ_{RED(P,Q)−{a}}(Q) < γ_P(Q), i.e. the relative reduct is a minimal subset of attributes preserving that degree of dependency.

Finding a minimal reduct (i.e. a reduct with a minimal cardinality of attributes among all reducts) is NP-hard [SR92]. The number of reducts of an information system with m attributes may be equal to the binomial coefficient C(m, ⌊m/2⌋). It appears that computing the reducts is a non-trivial task that cannot be solved simply by an increase of computational resources. It is, in fact, one of the bottlenecks of the rough set methodology. Fortunately, there exists an efficient algorithm to compute a single relative reduct in linear time [SR92, Joh74]. Genetic algorithms [Gol89] are also used for simultaneous computation of many reducts in often acceptable time, unless the number of attributes is very high [Wro95, Wro98, BK97].

3.6.2 Global and Local Reducts
A reduct, if not explicitly qualified as a local reduct, is called a global reduct. Local reducts are also called value reducts. They are based on the fact that we can often remove some values of the attributes without affecting the consistency of the information system. Like global reducts, decision relative local reducts are useful for supervised learning.

3.6.3 Core
The intersection of all relative reducts is called the relative core. In other words, each element of the core belongs to every reduct. In a sense, the core is the most important subset of attributes, for none of its elements can be removed without affecting the classification power of the attributes.

Example 3.5
For Table 3.1, we have two reducts, {Temperature, Headache} and {Temperature, Muscle-pain}, with respect to the attribute Flu. It means that either the attribute Headache or the attribute Muscle-pain can be eliminated from the table, and consequently instead of Table 3.1 we can use either Table 3.2 or Table 3.3.

The core of the dataset in Table 3.1 with respect to Flu is {Temperature}.

Table 3.2: Flu Data with Reduced Condition Attributes Headache and Temperature

Patient   Headache   Temperature   Flu

p1        no         high          yes
p2        yes        high          yes
p3        yes        very high     yes
p4        no         normal        no
p5        yes        high          no
p6        no         very high     yes

Table 3.3: Flu Data with Reduced Condition Attributes Muscle-Pain and Temperature

Patient   Muscle-pain   Temperature   Flu

p1        yes           high          yes
p2        no            high          yes
p3        yes           very high     yes
p4        yes           normal        no
p5        no            high          no
p6        yes           very high     yes

Example 3.6
By removing some values of the attributes Headache and Muscle-pain, Table 3.2 and Table 3.3 can be simplified as shown in Table 3.4 and Table 3.5 respectively.

Value reducts as observed from Table 3.4 are {(Headache, no), (Temperature, high)}, {(Headache, yes), (Temperature, high)}, {(Temperature, normal)} and {(Temperature, very high)}.

Value reducts as observed from Table 3.5 are {(Muscle-pain, yes), (Temperature, high)}, {(Muscle-pain, no), (Temperature, high)}, {(Temperature, normal)} and {(Temperature, very high)}.

Table 3.4: Flu Data with Reduced Values of Attribute Headache

Patient   Headache   Temperature   Flu

p1        no         high          yes
p2        yes        high          yes
p3        -          very high     yes
p4        -          normal        no
p5        yes        high          no
p6        -          very high     yes

Table 3.5: Flu Data with Reduced Values of Attribute Muscle-Pain

Patient   Muscle-pain   Temperature   Flu

p1        yes           high          yes
p2        no            high          yes
p3        -             very high     yes
p4        -             normal        no
p5        no            high          no
p6        -             very high     yes

3.6.4 Reduct Computation
To understand the methodology of reduct computation, we will use the discernibility matrix, which is defined next. Let S be an information system with n objects. The discernibility matrix of S is a symmetric n × n matrix with entries c_ij given by

c_ij = {a ∈ Q | a(x_i) ≠ a(x_j)}  for i, j = 1, ..., n        (3.4)

Each entry thus consists of the set of attributes upon which the objects x_i and x_j differ. A discernibility function f_S for an information system S is a Boolean function of m Boolean variables a*_1, ..., a*_m (corresponding to the attributes a_1, ..., a_m) defined as follows, where c*_ij = {a* : a ∈ c_ij}:

f_S(a*_1, ..., a*_m) = ∧ {∨ c*_ij : 1 ≤ j < i ≤ n, c_ij ≠ ∅}        (3.5)

A Boolean POS (product-of-sums) function as defined above can often be considerably simplified while fully preserving the function's semantics. First of all, duplicate sums can be eliminated, since Boolean algebras have the property of multiplicative idempotence, meaning that a·a = a for all members a. If the function has n product terms, this can be done by a simple scan-and-sort procedure which is bounded by the sorting step, typically O(n log n). Furthermore, a sum that includes ("is a superset of") another sum in the function can be safely eliminated, since in Boolean algebras a·(a + b) = a for all members a, b. This property is called absorption. Absorption can be carried out naively in O(n²) time, but sub-quadratic algorithms exist [Prit95]. The set of all prime implicants¹ of f_S determines the set of all reducts of S. In other words, the constituents in the minimal disjunctive normal form of the function f_S(a*_1, ..., a*_m) are exactly the reducts of S.

¹ An implicant of a Boolean function f is any conjunction of literals (variables or their negations) such that, if the values of these literals are true under an arbitrary valuation v of the variables, then the value of the function f under v is also true. A prime implicant is a minimal implicant. Here we are interested in implicants of monotone Boolean functions only, i.e. functions constructed without negation.

In order to compute the value core and the value reducts for an object x_i, the discernibility matrix as defined before is used and the discernibility function is restricted to the entries involving x_i:

f_S^{x_i}(a*_1, ..., a*_m) = ∧ {∨ c*_ij : 1 ≤ j ≤ n, j ≠ i, c_ij ≠ ∅}        (3.6)

Relative reducts and the core can also be computed using the discernibility matrix, which needs a slight modification:

c_ij = {a ∈ A | a(x_i) ≠ a(x_j)}  for i, j = 1, ..., n and w(x_i, x_j)        (3.7)

where w(x_i, x_j) holds iff

x_i ∈ POS_A(D) and x_j ∉ POS_A(D), or
x_i ∉ POS_A(D) and x_j ∈ POS_A(D), or
x_i, x_j ∈ POS_A(D) and (x_i, x_j) ∉ IND(D)

Recall that A is the set of condition attributes and D is the set of decision attributes. If the partition defined by D is definable by A, then the condition w(x_i, x_j) in the above definition reduces to (x_i, x_j) ∉ IND(D). Thus the entry c_ij is the set of all condition attributes which discern objects x_i and x_j that do not belong to the same equivalence class of the relation IND(D). The D-core (decision relative core) is the set of all single-element entries of the discernibility matrix, i.e.

CORE_D(A) = {a ∈ A : c_ij = {a}, for some x_i, x_j}.

Similarly, a D-reduct (decision relative reduct) is a minimal subset of attributes that discerns all equivalence classes of the relation IND_S(D) discernible by the whole set of attributes. Every modified discernibility matrix uniquely defines a discernibility function as before. The constituents in the minimal disjunctive normal form of this discernibility function are exactly the decision relative reducts of A. In this study, decision relative reducts are used extensively; hence the following example illustrates the steps of their computation.

Example 3.7
We compute the decision relative reducts and the core of Table 3.1. Using (3.7), we first construct the decision relative discernibility matrix shown below, where h, m and t abbreviate Headache, Muscle-pain and Temperature respectively. Note that objects having the same decision are not compared among themselves (blank entries). Also, the matrix is symmetric with respect to the diagonal; hence only the upper half of the matrix needs to be considered when defining the discernibility function using (3.5).

        p2        p3        p4        p5        p6
p1                          t         h,m
p2                          h,m,t     ∅
p3                          h,t       m,t
p4                                              t
p5                                              h,m,t

The empty entry for the pair (p2, p5) reflects the fact that these two objects are indiscernible on the condition attributes yet have different decisions, and therefore contributes no factor to the discernibility function.

f_A(D) = (t) (h+m+t) (h+t) (h+m) (m+t) (t) (h+m+t)

       = (t) (h+m+t) (h+t) (h+m) (m+t)        (idempotent law)

       = t (h+m)        (absorption law: "h+m+t", "m+t" and "h+t" are supersets of "t")

       = ht + mt

Thus the function in POS form is simplified to a function in SOP (sum-of-products) form. Each product term in this simplified form is a prime implicant of the function. Thus the decision relative global reducts are {h, t} and {m, t}, corresponding to Table 3.2 and Table 3.3 respectively. The decision relative core (the intersection of all the reducts) of Table 3.1 is {t}. Observing the single-element entries of the decision relative discernibility matrix also qualifies t as the core.
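The same result can be obtained programmatically. The sketch below (an illustration only, reusing the flu_table dictionary from the earlier sketches) builds the non-empty entries of the decision relative discernibility matrix and finds the reducts as the minimal attribute subsets that intersect every entry, rather than by symbolic Boolean simplification:

    from itertools import combinations

    def relative_discernibility_entries(table, condition, decision):
        """Non-empty entries c_ij of the decision relative discernibility matrix (3.7)."""
        entries = set()
        for x, y in combinations(table, 2):
            if table[x][decision] == table[y][decision]:
                continue                        # objects with the same decision are not compared
            diff = frozenset(a for a in condition if table[x][a] != table[y][a])
            if diff:                            # an empty entry (e.g. p2 vs p5) adds no constraint
                entries.add(diff)
        return entries

    def relative_reducts(table, condition, decision):
        """Minimal attribute subsets intersecting every discernibility entry."""
        entries = relative_discernibility_entries(table, condition, decision)
        hitting = [set(B) for r in range(1, len(condition) + 1)
                   for B in combinations(condition, r)
                   if all(set(B) & e for e in entries)]
        return [B for B in hitting if not any(other < B for other in hitting)]

    cond = ["Headache", "Muscle-pain", "Temperature"]
    print(relative_reducts(flu_table, cond, "Flu"))
    # [{'Headache', 'Temperature'}, {'Muscle-pain', 'Temperature'}]

This exhaustive search is adequate only for small attribute sets; as noted below, computing minimal reducts in general is NP-hard.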

The problem of computing a minimal reduct is NP-hard [SR92], but approximation algorithms can be used to obtain knowledge about reduct sets. Approximation algorithms do not give an optimal solution but have acceptable time complexity, e.g. algorithms based on simulated annealing and Boltzmann machines, genetic algorithms, and algorithms using neural networks. In this thesis we have adopted genetic algorithms for generating a large number of reducts and Johnson's algorithm for the computation of a single reduct.

3.7 Patterns and Rule Discovery


Rules can be perceived as data patterns or formulae that represent relationships between attribute values. The most primitive pattern, and hence the fundamental building block for generating rules, is called a selector. A selector is simply an expression a = v where a ∈ Q and v ∈ V_a. Patterns can be combined in a recursive manner to form more complex patterns by means of the propositional connectives {·, +, →, ¬}, denoting conjunction, disjunction, implication and negation respectively. The type of pattern most commonly considered in an information system S is a conjunction of selectors, formed by overlaying a set of attributes (here, a reduct) over an object x ∈ U and reading off the values of x for every a ∈ Reduct. Two numerical measures commonly associated with a pattern α are support and coverage. Support(α) refers to the number of objects in the information system that have the property described by the pattern α, and coverage(α) denotes the proportion of objects in U that match the description given by α.

Since a decision system is a specialized type of information system, a decision rule is similarly a specialized type of pattern that specifies a relationship, possibly a probabilistic one, between a set of conditions and a conclusion or decision. Let S denote a decision system, and let α denote a conjunction of selectors that only involve attributes in S. Furthermore, let β denote a selector d = v, where v is any allowed decision value. The decision rule read as "if α then β" is denoted by α → β. The pattern α is called the rule's antecedent, while the pattern β is called the rule's consequent. RS theory provides a mechanism to discover decision rules by reading the attribute values from the reduced decision table using the attributes in a reduct. In practical applications, where the rules are used to classify unseen objects, reduct approximations are typically employed instead of proper reducts. Like patterns, rules also have numerical measures such as accuracy, coverage and stability associated with them.

accuracy(α → β) = support(α · β) / support(α)        (3.8)

coverage(α → β) = support(α · β) / support(β)        (3.9)

A graphical display of the relationship between accuracy and coverage can be found in Figure 3.2. It is desirable for a rule to be accurate as well as to have a high degree of coverage, although one does not necessarily imply the other. Figure 3.3 shows graphically that as the antecedent of a decision rule grows longer, the coverage decreases while the accuracy increases. Defining a point that balances the trade-off between these two numerical rule measures can be difficult in practice, and is also a function of the application domain. Bazan explores the issue of the stability of a rule in detail [Baz98].

Figure 3.2: Description of four decision rules α_i → β_i over a binary decision domain, i.e., β_i = (d = 0) or β_i = (d = 1). Each dashed set represents support(α_i), the set of objects that match the rule's antecedent α_i.

Figure 3.3: Coverage and accuracy vs. |α|. As the length of the antecedent of a decision rule increases, the rule becomes more specific and less general. As a result, the coverage decreases while the accuracy increases. Finding a suitable balance in the trade-off between coverage and accuracy can be difficult in practice.

Example 3.8
To produce the decision rules for Table 3.1, we make use of the reducts obtained in Example 3.7. Using the reduct {Headache, Temperature}, the rules are:

1. If Headache = no and Temperature = high then Flu = yes (1, 0.25);
2. If Headache = yes and Temperature = high then Flu = yes (0.5, 0.25);
3. If Headache = yes and Temperature = high then Flu = no (0.5, 0.5);
4. If Headache = yes and Temperature = very high then Flu = yes (1, 0.25);
5. If Headache = no and Temperature = normal then Flu = no (1, 0.5);
6. If Headache = no and Temperature = very high then Flu = yes (1, 0.25);

Similarly, using the reduct {Muscle-pain, Temperature}, we get:

1. If Muscle-pain = yes and Temperature = high then Flu = yes (1, 0.25);
2. If Muscle-pain = no and Temperature = high then Flu = yes (0.5, 0.25);
3. If Muscle-pain = yes and Temperature = very high then Flu = yes (1, 0.5);
4. If Muscle-pain = yes and Temperature = normal then Flu = no (1, 0.5);
5. If Muscle-pain = no and Temperature = high then Flu = no (0.5, 0.5);

The figures in the parentheses refer to the accuracy and the coverage of the rule respectively. In subsequent chapters, decision rules based on rough set theory concepts are obtained using the methodology explained in Example 3.8. On closer observation of the rules generated above, there is scope for improvement in the number of selectors and the number of rules; e.g. rule 1 and rule 2 for both reducts can be replaced by the single rule "If Temperature = high then Flu = yes". This results in reducing the length of the rule as well as improving the accuracy and coverage. This issue is addressed in the next chapter.
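The rule-reading step of Example 3.8 can be sketched in code as follows (an illustration only, reusing the flu_table dictionary from the earlier sketches; the helper name rules_from_reduct is hypothetical). Each distinct combination of reduct values and decision value in the reduced table yields one rule, scored with equations (3.8) and (3.9):

    def rules_from_reduct(table, reduct, decision):
        """Read decision rules off the reduced table and attach (accuracy, coverage)."""
        seen = set()
        rules = []
        for values in table.values():
            antecedent = tuple((a, values[a]) for a in reduct)
            consequent = values[decision]
            if (antecedent, consequent) in seen:
                continue
            seen.add((antecedent, consequent))
            match_a = [v for v in table.values() if all(v[a] == val for a, val in antecedent)]
            match_ab = [v for v in match_a if v[decision] == consequent]
            support_b = sum(1 for v in table.values() if v[decision] == consequent)
            accuracy = len(match_ab) / len(match_a)      # equation (3.8)
            coverage = len(match_ab) / support_b         # equation (3.9)
            rules.append((antecedent, consequent, round(accuracy, 2), round(coverage, 2)))
        return rules

    for rule in rules_from_reduct(flu_table, ["Headache", "Temperature"], "Flu"):
        print(rule)
    # e.g. ((('Headache', 'no'), ('Temperature', 'high')), 'yes', 1.0, 0.25)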

3.8 Some Other Terms and Concepts
3.8.1 Rough Membership
The rough membership function, applied to an object x, quantifies the degree of relative overlap between the set X and the equivalence class B(x) to which x belongs. It is defined as follows:

μ_X^B(x) = |X ∩ B(x)| / |B(x)|

The rough membership function can be interpreted as a frequency-based estimate of Pr(x ∈ X | x, B), the conditional probability that object x belongs to the set X, given knowledge of the information signature of x with respect to the attributes B [PWZ88].
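A short sketch (again reusing flu_table and the partition helper from the earlier sketches) computes the rough membership of the flu patients in the concept Flu = yes:

    def rough_membership(table, B, X, x):
        """mu_X^B(x) = |X and B(x)| / |B(x)|, where B(x) is the equivalence class containing x."""
        block = next(c for c in partition(table, B) if x in c)   # the class B(x)
        return len(block & X) / len(block)

    B = ["Headache", "Muscle-pain", "Temperature"]
    flu_yes = {o for o, v in flu_table.items() if v["Flu"] == "yes"}
    print(rough_membership(flu_table, B, flu_yes, "p1"))   # 1.0: {p1} lies entirely inside Flu = yes
    print(rough_membership(flu_table, B, flu_yes, "p2"))   # 0.5: the class {p2, p5} overlaps Flu = yes in p2 only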

3.8.2 Variable Precision Rough Set Model


A generalized model of rough sets called the Variable Precision Rough Set (VPRS) model, aimed at modelling classification problems, is presented in [Zia93a, Zia93b]. These papers introduce the VPRS model and demonstrate how it can be used as a tool for data analysis. The primary advantage of the VPRS model is the ability to recognize the presence of data dependencies in situations where the data items are considered independent by the original rough set model.

The formulae for the lower and upper set approximations can be generalized to an arbitrary precision level π ∈ (0.5, 1] by means of the rough membership function, as shown below:

B̲_π X = {x | μ_X^B(x) ≥ π}

B̄_π X = {x | μ_X^B(x) > 1 − π}

Note that the lower and upper approximations as originally formulated are obtained as a special case with π = 1.0. A rough set X defined through the lower and upper approximations B̲_π X and B̄_π X is also referred to as a variable precision rough set and can be seen as a way of thinning the boundary region [Zia93a, Zia93b].
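A small sketch (reusing the rough_membership helper and the flu_table dictionary, B and flu_yes from the earlier sketches) implements the π-approximations exactly as defined above:

    def vprs_approximations(table, B, X, pi):
        """pi-lower and pi-upper approximations built from the rough membership function."""
        assert 0.5 < pi <= 1.0
        lower = {x for x in table if rough_membership(table, B, X, x) >= pi}
        upper = {x for x in table if rough_membership(table, B, X, x) > 1 - pi}
        return lower, upper

    # With pi = 1.0 the classical approximations of Example 3.3 are recovered.
    # In this tiny table the boundary {p2, p5} persists for every admissible pi,
    # because both objects have rough membership exactly 0.5 in Flu = yes.
    print(vprs_approximations(flu_table, B, flu_yes, 1.0))
    print(vprs_approximations(flu_table, B, flu_yes, 0.75))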

3.8.3 Approximate Reducts


Any subset B of A can be treated as an approximate reduct of A, and the number

ε_(A,D)(B) = (γ(A, D) − γ(B, D)) / γ(A, D) = 1 − γ(B, D) / γ(A, D)

denoted simply as ε(B), is called the error of reduct approximation. It expresses how exactly the set of attributes B approximates the set of condition attributes A (relative to the decision attributes D).

The concept of an approximate reduct is a generalization of the concept of a proper reduct and is useful in cases when a smaller number of condition attributes is preferred over accuracy of classification on the training data, which can allow increasing the classification accuracy on the testing data. The error level of reduct approximation should be tuned for a given data set to achieve this effect. One technique for reduct approximation is based on the approximation of the positive region. For R ∈ RED(A, D) and N equal to the number of objects in the decision system S, the following algorithm computes this kind of reduct approximation [SP97].

Algorithm ApproximateReduct

Step 1: Calculate the positive regions POS_{R−{a}} for all a ∈ R.

Step 2: Choose from the reduct R one attribute a0 satisfying the condition: ∀ a ∈ R, POS_{R−{a0}} ⊇ POS_{R−{a}}.

Step 3: If |POS_{R−{a0}}| > k·N (e.g. k = 0.9) then

Begin

R = R − {a0}, go to Step 1

End

Step 4: The new set of attributes R is called the approximate reduct.
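A possible reading of this procedure in code is sketched below (an illustration under the stated assumptions, reusing the positive_region helper and the flu_table dictionary from the earlier sketches; the guard that keeps at least one attribute is added for safety and is not part of the original pseudocode):

    def approximate_reduct(table, reduct, decision, k=0.9):
        """Drop attributes from the reduct R while the positive region keeps more than k*N objects."""
        R = list(reduct)
        N = len(table)
        while len(R) > 1:
            # Steps 1-2: the attribute whose removal keeps the largest positive region.
            best = max(R, key=lambda a: len(positive_region(table, [b for b in R if b != a], [decision])))
            pos = positive_region(table, [b for b in R if b != best], [decision])
            # Step 3: remove it only while the approximation quality stays above the threshold.
            if len(pos) > k * N:
                R.remove(best)
            else:
                break
        return R   # Step 4: the remaining attributes form the approximate reduct

    # With a deliberately low threshold the reduct {Headache, Temperature} shrinks to {Temperature}.
    print(approximate_reduct(flu_table, ["Headache", "Temperature"], "Flu", k=0.4))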

Approximate reducts can help to extract interesting rules from decision tables. Applying a reduct approximation instead of a reduct decreases the quality of the classification of objects from the training set, but it results in more general rules with a higher quality of classification for new objects.

3.9 Computational Complexity of Rough Set Tools


Any of the rough set tools described in this chapter, such as the lower approximation, the upper approximation, positive regions and reducts, can be computed straightforwardly from the discernibility matrix with space complexity O(kn²) and time complexity O(kn²), where n is the number of objects and k is the number of attributes of the data table. With this complexity, these methods are not feasible for large data sets. Hoa and Son [HS96] present efficient algorithms for computing the rough set tools; their implementation does not need to store the discernibility matrix. They propose algorithms for computing positive regions and lower and upper approximations in O(kn log n) time using O(n) space. The computation of a reduct can be implemented with O(kn log n) time complexity and O(kn) space complexity (Chapter 4). Their approach can be applied to the synthesis of efficient algorithms for the discretization of numeric attributes in large tables (Chapter 6).

3.10 Summary
In this chapter, the methodology of classical rough set theory for knowledge discovery or rule generation is sketched by introducing the relevant terms and definitions. A number of examples are presented using a very small hypothetical dataset. It has been shown that RS can be treated as a tool for data table analysis. A reduct is a minimal set of attributes that preserves the indiscernibility relation in an information system, and many algorithms are available for the computation of reducts. Some popular extensions of the classical rough set model, e.g. the variable precision rough set model and approximate reducts, are also introduced. Efficient heuristics to compute the rough set tools have been discussed, making them suitable for the analysis of large data tables.
