An Adaptation of Relief for Attribute Estimation in Regression
2 RReliefF

2.1 Relief and ReliefF for classification

The key idea of the original Relief algorithm (Kira and Rendell, 1992), given in Figure 1, is to estimate the quality of attributes according to how well their values distinguish between the instances that are near to each other. For that purpose, given a randomly selected instance R (line 3), Relief searches for its two nearest neighbors: one from the same class, called the nearest hit H, and the other from a different class, called the nearest miss M (line 4). It updates the quality estimation W[A] for all the attributes A depending on their values for R, M, and H (lines 5 and 6). The process is repeated m times, where m is a user-defined parameter.
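Algorithm Relief
Input: for each training instance a vector of attribute values and the class value
Output: the vector W of estimations of the qualities of attributes

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.    randomly select an instance R;
4.    find nearest hit H and nearest miss M;
5.    for A := 1 to #all attributes do
6.       W[A] := W[A] - diff(A, R, H)/m + diff(A, R, M)/m;
7. end;

Figure 1: The basic Relief algorithm.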
Function diff(A, I1, I2) calculates the difference between the values of attribute A for two instances I1 and I2. For discrete attributes it is defined as:

$$\mathit{diff}(A, I_1, I_2) = \begin{cases} 0; & \text{value}(A, I_1) = \text{value}(A, I_2) \\ 1; & \text{otherwise} \end{cases} \quad (1)$$
and for continuous attributes as:

$$\mathit{diff}(A, I_1, I_2) = \frac{|\text{value}(A, I_1) - \text{value}(A, I_2)|}{\max(A) - \min(A)} \quad (2)$$
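In Python, equations (1) and (2) amount to the following minimal sketch (the function and parameter names are ours, not the paper's):

def diff(v1, v2, is_discrete, a_min=0.0, a_max=1.0):
    # equation (1): categorical values either match or they do not
    if is_discrete:
        return 0.0 if v1 == v2 else 1.0
    # equation (2): absolute difference, normalized by the attribute's span
    return abs(v1 - v2) / (a_max - a_min)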
The diff function is used also for calculating the distance between instances to find the nearest neighbors. The total distance is simply the sum of the distances over all attributes.

Relief's estimate W[A] of the quality of attribute A is an approximation of the following difference of probabilities (Kononenko, 1994):

$$W[A] = P(\text{diff. value of } A \mid \text{nearest instance from diff. class}) - P(\text{diff. value of } A \mid \text{nearest instance from same class}) \quad (3)$$

The complexity of Relief for n training instances and a attributes is O(m·n·a). The original Relief can deal with discrete and continuous attributes. However, it cannot deal with incomplete data and is limited to two-class problems. Kononenko (1994) has shown that Relief's estimates are strongly related to impurity functions and developed an extension called ReliefF that is more robust, can tolerate incomplete and noisy data, and can manage multiclass problems. One difference from the original Relief, interesting also for regression, is that instead of one nearest hit and one nearest miss, ReliefF uses k nearest hits and misses and averages their contribution to W[A].

The power of Relief is its ability to exploit information locally, taking the context into account, but still to provide the global view.

2.2 RReliefF for regression

In regression problems the predicted value (class) is continuous, therefore the (nearest) hits and misses cannot be used. Instead of requiring the exact knowledge of whether two instances belong to the same class or not, we can introduce a kind of probability that the predicted values of two instances are different. This probability can be modeled with the relative distance between the predicted (class) values of the two instances.

Still, to estimate W[A] in (3), the information about the sign of each contributed term is missing. In the following derivation we reformulate (3) so that it can be directly evaluated using the probability of the predicted values of two instances being different. If we rewrite

$$P_{\mathit{diffA}} = P(\text{diff. value of } A \mid \text{nearest instances}) \quad (4)$$

$$P_{\mathit{diffC}} = P(\text{diff. prediction} \mid \text{nearest instances}) \quad (5)$$
and

$$P_{\mathit{diffC}|\mathit{diffA}} = P(\text{diff. prediction} \mid \text{diff. value of } A \text{ and nearest instances}) \quad (6)$$

we obtain from (3) using Bayes' rule:

$$W[A] = \frac{P_{\mathit{diffC}|\mathit{diffA}}\; P_{\mathit{diffA}}}{P_{\mathit{diffC}}} - \frac{(1 - P_{\mathit{diffC}|\mathit{diffA}})\; P_{\mathit{diffA}}}{1 - P_{\mathit{diffC}}} \quad (7)$$
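The step from (3) to (7) is Bayes' rule applied to both terms of (3), with the conditioning on the nearest instances left implicit:

$$P(\mathit{diffA} \mid \mathit{diffC}) = \frac{P(\mathit{diffC} \mid \mathit{diffA})\, P_{\mathit{diffA}}}{P_{\mathit{diffC}}}, \qquad P(\mathit{diffA} \mid \neg\mathit{diffC}) = \frac{\left(1 - P(\mathit{diffC} \mid \mathit{diffA})\right) P_{\mathit{diffA}}}{1 - P_{\mathit{diffC}}},$$

and (7) is exactly the difference of these two terms, i.e., of the two probabilities in (3).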
Therefore, we can estimate W[A] by approximating the terms defined by Equations 4, 5 and 6. This can be done by the algorithm in Figure 2. The weights for different prediction (class), different attribute, and different prediction and different attribute are collected in N_dC, N_dA[A], and N_dC&dA[A], respectively. The final estimate W[A] of each attribute (Equation 7) is computed in lines 14 and 15.
The term d(i, j) in Figure 2 (lines 6, 8 and 10) takes into account the distance between the two instances R_i and I_j. Closer instances should have greater influence, so we exponentially decrease the influence of instance I_j with its distance from the given instance R_i:
$$d(i, j) = \frac{d_1(i, j)}{\sum_{l=1}^{k} d_1(i, l)}, \qquad d_1(i, j) = e^{-\left(\frac{\mathrm{rank}(R_i, I_j)}{\sigma}\right)^2} \quad (8)$$

where rank(R_i, I_j) is the rank of instance I_j in a sequence of instances ordered by the distance from R_i, and σ is a user-defined parameter.
We also experimented with a constant influence of all k nearest instances around R_i by taking d(i, j) = 1/k, but the results did not differ significantly. Since we consider the former to be more general, we have chosen it for this presentation.
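A sketch of the weighting in Equation (8); sigma is the user-defined parameter, and the k weights sum to 1 (the identifiers are ours):

import math

def rank_weights(k, sigma):
    # d1(i, j) = exp(-(rank(R_i, I_j) / sigma)^2) for ranks 1..k
    d1 = [math.exp(-((rank / sigma) ** 2)) for rank in range(1, k + 1)]
    total = sum(d1)
    return [v / total for v in d1]   # d(i, j) = d1(i, j) / sum_l d1(i, l)

# The constant-influence variant discussed above is simply [1.0 / k] * k.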
Note that the time complexity of RReliefF is the same as that of the original Relief, i.e., O(m·n·a). The most complex operation within the main for loop is the selection of the k nearest instances I_j. For it we have to compute the distances from all the instances to R_i, which can be done in O(n·a) steps for n instances; this is the most complex operation. O(n) steps are needed to build a heap, from which the k nearest instances are extracted in O(k log n) steps, but this is less than O(n·a).
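As a sketch of this dominating step (assuming a distance function that sums diff over all a attributes; names are ours):

import heapq

def k_nearest(n, r, k, distance):
    # O(n * a): distance from every other instance to R_i
    dists = ((distance(r, j), j) for j in range(n) if j != r)
    # heap-based selection of the k nearest in O(n + k log n)
    return [j for _, j in heapq.nsmallest(k, dists)]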
3 Estimating the attributes

In this Section we examine the ability of RReliefF to recognize and rank important attributes, and in the next Section we use these abilities in learning regression trees.

We compare the estimates of RReliefF with the mean squared error (MSE) as a measure of the attribute's quality (Breiman et al., 1984). This measure is standard in regression tree systems. By this criterion the best attribute is the one which minimizes:

$$\mathit{MSE}(A) = p_L\, s_L^2 + p_R\, s_R^2 \quad (9)$$

where L and R are the subsets of cases that go left and right, respectively, by the split based on A, and p_L and p_R are the proportions of cases that go left and right. s_t is the standard deviation of the predicted values of the cases in subset t:

$$s_t = \sqrt{\frac{1}{n_t} \sum_{i \in t} \left(\tau(x_i) - \bar{\tau}_t\right)^2} \quad (10)$$

The minimum of (9) among all possible splits for an attribute is considered its quality estimate and is given in the results below. A small sketch of this criterion appears after the data set descriptions below.

We have used several families of artificial data sets to check the behavior of RReliefF in different circumstances.

FRACTION: each domain contains continuous attributes with values from 0 to 1. The predicted value is the fractional part of the sum of the important attributes: $C = \sum_j A_j - \lfloor \sum_j A_j \rfloor$, where the sum runs over the I important attributes. These domains are floating point generalizations of the parity concept of order I, i.e., domains with highly dependent pure continuous attributes.

MODULO-8: domains are described by a set of attributes; the value of each attribute is an integer in the range 0-7. Half of the attributes are treated as discrete and half as continuous; each continuous attribute is an exact match of one of the discrete attributes. The predicted value is the sum of the I important attributes modulo 8: $C = (\sum_j A_j) \bmod 8$. These domains are integer generalizations of the parity concept (which is the sum modulo 2) of order I. They shall show how well RReliefF recognizes highly dependent attributes and how it ranks discrete and continuous attributes of equal importance.

PARITY: each domain consists of discrete, Boolean attributes. The I informative attributes define a parity concept: if their parity bit is 0, the predicted value is set to a random number between 0 and 0.5, otherwise it is randomly chosen between 0.5 and 1:

$$C = \begin{cases} \mathrm{rand}(0,\, 0.5); & (\sum_j A_j) \bmod 2 = 0 \\ \mathrm{rand}(0.5,\, 1); & (\sum_j A_j) \bmod 2 = 1 \end{cases}$$
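A minimal sketch of the MSE criterion (9)-(10) for one split, together with a generator for the FRACTION domain (the identifiers are ours, not the authors' system):

import math
import random

def mse_of_split(left, right):
    # equation (9): p_L * s_L^2 + p_R * s_R^2 over the predicted values
    def var(vals):                       # s_t^2, from equation (10)
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    n = len(left) + len(right)
    return len(left) / n * var(left) + len(right) / n * var(right)

def gen_fraction(n, important, total):
    # FRACTION: the prediction is the fractional part of the sum of the
    # important attributes; the remaining attributes are random noise
    data = []
    for _ in range(n):
        x = [random.random() for _ in range(total)]
        s = sum(x[:important])
        data.append((x, s - math.floor(s)))
    return data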
Algorithm RReliefF
Input: for each training instance a vector of attribute values x and the predicted value τ(x)
Output: the vector W of estimations of the qualities of attributes

 1. set all N_dC, N_dA[A], N_dC&dA[A], W[A] to 0;
 2. for i := 1 to m do begin
 3.    randomly select instance R_i;
 4.    select k instances I_j nearest to R_i;
 5.    for j := 1 to k do begin
 6.       N_dC := N_dC + diff(τ(·), R_i, I_j) · d(i, j);
 7.       for A := 1 to #all attributes do begin
 8.          N_dA[A] := N_dA[A] + diff(A, R_i, I_j) · d(i, j);
 9.          N_dC&dA[A] := N_dC&dA[A] + diff(τ(·), R_i, I_j) ·
10.                        diff(A, R_i, I_j) · d(i, j);
11.       end;
12.    end;
13. end;
14. for A := 1 to #all attributes do
15.    W[A] := N_dC&dA[A]/N_dC − (N_dA[A] − N_dC&dA[A])/(m − N_dC);

Figure 2: Pseudo code of the RReliefF algorithm.
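A compact Python sketch of Figure 2 follows. It uses the Manhattan distance built from diff, the rank-based weights of Equation (8), and the relative diff on the continuous prediction; all identifiers are ours, and this is an illustration, not the authors' implementation:

import heapq
import math
import random

def rrelieff(X, y, m, k, sigma, is_discrete):
    # X: list of attribute vectors, y: predicted values tau(x)
    a, n = len(X[0]), len(X)
    spans = [max(x[i] for x in X) - min(x[i] for x in X) for i in range(a)]
    y_span = max(y) - min(y)

    def diff_attr(idx, i, j):            # equations (1) and (2)
        if is_discrete[idx]:
            return 0.0 if X[i][idx] == X[j][idx] else 1.0
        return abs(X[i][idx] - X[j][idx]) / spans[idx]

    def diff_pred(i, j):                 # relative distance of predictions
        return abs(y[i] - y[j]) / y_span

    def distance(i, j):                  # total distance: sum over attributes
        return sum(diff_attr(idx, i, j) for idx in range(a))

    # weights d(i, j) of equation (8) depend only on the rank 1..k
    d1 = [math.exp(-((r / sigma) ** 2)) for r in range(1, k + 1)]
    d = [v / sum(d1) for v in d1]

    N_dC, N_dA, N_dCdA = 0.0, [0.0] * a, [0.0] * a
    for _ in range(m):                   # line 2
        i = random.randrange(n)          # line 3
        nearest = heapq.nsmallest(       # line 4: k nearest instances I_j
            k, ((distance(i, j), j) for j in range(n) if j != i))
        for rank, (_, j) in enumerate(nearest):      # lines 5-12
            N_dC += diff_pred(i, j) * d[rank]        # line 6
            for idx in range(a):
                N_dA[idx] += diff_attr(idx, i, j) * d[rank]    # line 8
                N_dCdA[idx] += (diff_pred(i, j) *              # lines 9-10
                                diff_attr(idx, i, j) * d[rank])
    # lines 14-15: W[A] = N_dC&dA/N_dC - (N_dA - N_dC&dA)/(m - N_dC)
    return [N_dCdA[idx] / N_dC - (N_dA[idx] - N_dCdA[idx]) / (m - N_dC)
            for idx in range(a)]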
[Figure 3: two panels plotting the estimates of I1, I2 and the best random attribute against the number of examples (0-300); the lower panel, titled "MSE - number of examples - FRACTION-2", shows the MSE estimate (0.18-0.32), the upper panel the corresponding RReliefF estimates.]

Figure 3: Varying the number of examples in FRACTION domain with 2 important and 8 random attributes. Note that RReliefF gives higher scores to better attributes, while the MSE does the opposite.

Table 1: The limiting number of examples needed for the estimates of the worst (W) and the best (B) important attribute to become significantly better than the estimate of the best random attribute.

                    RReliefF        MSE
domain         I     W     B     W     B
FRACTION       3   300   220     -     -
               4   950   690     -     -
MODULO-8     2+2    80    70     -     -
             3+3   370   230     -     -
             4+4     -     -     -     -
PARITY         2    50    50     -     -
               3   100   100     -     -
               4   280   220     -     -
LINEAR         4     -    10   340    20
COSINUS        3     -    50   490    90
With MSE the best random attribute was even estimated as better than I1 and I2. The behaviors of RReliefF and MSE are similar on the other FRACTION, MODULO and PARITY data sets. Due to the lack of space we omit the other graphs and only comment on the summary of the results presented in Table 1. The two numbers given are the limiting numbers of examples that were needed so that the estimates of the important attributes estimated as the worst (W) and the best (B) among the important attributes were significantly better than the estimate of the attribute estimated as the best among the random attributes. The '-' sign means that the estimator did not succeed in significantly distinguishing between the two groups of attributes.

We can observe that the number of examples needed increases with the complexity (the number of important attributes) of each problem. While PARITY, FRACTION and MODULO-8 are solvable for RReliefF, MSE is completely lost there. MODULO-8-4 (with 4 important attributes) is too difficult even for RReliefF. It seems that 1000 examples is not enough for a problem of such complexity; the complexity grows exponentially here, as the number of peaks in the instance space of the MODULO-m-p domain is m^p. This was confirmed by an additional experiment with 8000 examples, where RReliefF succeeded in separating the two groups of attributes.

Also interesting in the MODULO problem is that the important discrete attributes are considered better than their continuous counterparts. We can understand this if we consider the behavior of the diff function (see (1) and (2)). Let us take two instances with 2 and 5 as their values of attribute A, respectively. If A is the discrete attribute, diff(A, I1, I2) = 1, since the two categorical values are different. If A is the continuous attribute, diff(A, I1, I2) = |2 − 5|/7 ≈ 0.43. So, with this form of diff, continuous attributes are underestimated. We can overcome this problem with the ramp function proposed by Hong (1994). It can be defined as a generalization of the diff function for continuous attributes:

$$\mathit{diff}(A, I_1, I_2) = \begin{cases} 0; & d \le t_{eq} \\ 1; & d > t_{diff} \\ \dfrac{d - t_{eq}}{t_{diff} - t_{eq}}; & t_{eq} < d \le t_{diff} \end{cases} \quad (11)$$

where $d = |\text{value}(A, I_1) - \text{value}(A, I_2)|$ is the distance between the attribute values of the two instances, and t_eq and t_diff are two user-definable threshold values; t_eq is the maximum distance between two attribute values at which they are still considered equal, and t_diff is the minimum distance between attribute values at which they are considered different. We have omitted the use of the ramp function from this presentation, as it complicates the basic idea, but tests with sensible default thresholds have shown that continuous attributes are then no longer underestimated.
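In code, the ramp function (11) reads (a sketch; the threshold names follow the text):

def ramp_diff(v1, v2, t_eq, t_diff):
    # equation (11): 0 up to t_eq, 1 beyond t_diff, linear in between
    d = abs(v1 - v2)
    if d <= t_eq:
        return 0.0
    if d > t_diff:
        return 1.0
    return (d - t_eq) / (t_diff - t_eq)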
The results for the LINEAR and COSINUS domains show that separating the worst important attribute from the best random attribute is easy neither for RReliefF nor for MSE, but MSE was better. RReliefF could distinguish the two attributes significantly with more than 100 examples, but occasionally a peak in the estimates of the random attributes caused the t-value to fall below the significance threshold. MSE could also mostly distinguish the two groups, and its lower variation gives it a slight advantage. When separating the best important attribute from the random attributes, both RReliefF and MSE were successful, but RReliefF needed fewer examples. This probably compensates for the negative result for RReliefF on the worst important attribute (e.g., in regression tree learning a single best attribute is selected in each node). The estimators also differ in ranking the attributes by importance in the COSINUS domain: the correct decreasing order is replicated by RReliefF, while MSE orders the attributes incorrectly, as it is unable to detect the dependencies.
We can see that RReliefF is robust, as it can significantly distinguish the worst important attribute from the best random one even with 50% of corrupted prediction values. Table 2 summarizes the results for all the domains. The two columns give the maximal percentage of corrupted prediction values at which the estimates of the worst important (W) and the best important (B) attribute, respectively, were still significantly better than the estimates of the best random attribute. The '-' sign means that the estimator did not succeed in significantly distinguishing between the two groups even without any noise.

[Figure 4: RReliefF's estimates plotted against the percentage of corrupted prediction values (0-100).]

Figure 4: RReliefF's estimates when noise is added to the predicted value, in the FRACTION domain with 2 important attributes.

Table 2: Results of adding noise to the predicted values. The numbers tell what percentage of the predicted values could be corrupted while still obtaining significant differences in the estimates.
                    RReliefF        MSE
domain         I     W     B     W     B
FRACTION       2    53    59     -     -
               3    16    35     -     -
               4     3    14     -     -
MODULO-8     2+2    64    75     -     -
             3+3    52    70     -     -
             4+4     -     -     -     -
PARITY         2    66    70     -     -
               3    60    71     -     -
               4    50    67     -     -
LINEAR         4     -    66    50    85
COSINUS        3     -    46    36    63
4 Learning regression trees

In each node of the tree the attributes are estimated either by RReliefF or by MSE (9). These estimates are computed on the subset of the examples that reach the current node. Such use of ReliefF on classification problems was shown to be sensible and can significantly outperform impurity measures (Kononenko et al., 1997).

The accuracy of the induced trees is measured with the relative error (RE) of the predictions: the error of the evaluated predictor is normalized by the error of the default predictor, which always returns the mean of the prediction values; x denotes the vector of attribute values and τ(x) the value predicted for it. Sensible predictors have RE < 1.

Besides the error we also include a measure of the complexity of the tree: C is the number of all the occurrences of all the attributes anywhere in the tree, plus the constant terms in the leaves.

We have run our system with two sets of parameters and procedures. They are named according to the type of models used in the leaves (linear and point; see Table 4).
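A sketch of the relative error described above; the squared-error form of the normalization is our assumption, not quoted from the paper:

def relative_error(y_true, y_pred):
    # error of the predictor, normalized by the error of the default
    # predictor that always returns the mean of the prediction values
    mean = sum(y_true) / len(y_true)
    sse_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sse_mean = sum((t - mean) ** 2 for t in y_true)
    return sse_model / sse_mean      # sensible predictors give RE < 1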
Table 4: Relative error and complexity of the regression and model trees with RReliefF and MSE as the estimators of the quality of the attributes.

                      linear                           point
              RReliefF      MSE               RReliefF      MSE
domain        RE     C    RE     C    S       RE     C    RE     C    S
Fraction-2   .34   112   .86   268    +      .52    87  1.17   160    +
Fraction-3   .73   285  1.08   440    +     1.05   174  1.62   240    +
Fraction-4  1.05   387  1.10   392    0     1.51   250  1.65   219    0
Modulo-8-2   .22    98   .77   329    +      .06    58   .81   195    +
Modulo-8-3   .58   345  1.08   436    +      .59   166  1.58   251    +
Modulo-8-4  1.05   380  1.07   439    0     1.52   253  1.52   259    0
Parity-2     .28   125   .55   208    +      .27     7   .38   103    0
Parity-3     .31    94   .82   236    +      .27    15   .61   213    +
Parity-4     .35   138   .96   283    +      .25    31   .88   284    +
Linear       .02     4   .02     4    0      .19    59   .19    55    0
Cosinus      .27   334   .41   364    +      .36   105   .29    91    0
Auto-mpg     .13    97   .14   102    0      .21    27   .21    19    0
Auto-price   .14    48   .12    53    0      .28    16   .17    10    0
CPU          .12    33   .15    47    0      .42    16   .31    10    0
Housing      .17   129   .15   177    0      .31    33   .23    23    0
Servo        .25    53   .28    55    0      .24     7   .33     9    0
The column labeled S presents the significance of the differences between RReliefF and MSE, computed with a paired t-test at the 0.05 level of significance: 0 indicates that the differences are not significant at the 0.05 level, '+' means that the predictor with RReliefF is significantly better, and '-' implies a significantly better score by MSE.

In the linear mode (left side of Table 4) the predictor generated with RReliefF is mostly significantly better than the predictor generated with MSE on the artificial data sets. On the UCI databases the predictors are comparable, RReliefF being better on 3 data sets and MSE on 2, but with insignificant differences. The complexity of the models induced by RReliefF is considerably smaller in most cases on both the artificial and the UCI data sets, which indicates that RReliefF was more successful at detecting the dependencies. With RReliefF the strong dependencies were mostly detected and expressed through the selection of the appropriate attributes in the upper part of the tree; the remaining dependencies were incorporated in the models used in the leaves of the tree. MSE was blind to strong dependencies and split the examples solely to minimize their impurity (mean squared error), which prevented it from successfully modeling the weaker dependencies with linear formulas in the leaves of the tree.

In the point mode RReliefF produces smaller and more accurate trees on the problems with strong dependencies (FRACTION, MODULO-8 and PARITY), while on the UCI databases MSE was better on 3 data sets and RReliefF on one, but the differences are not significant. MSE produced less complex trees on LINEAR, COSINUS and the UCI data sets. The reason for this is similar to the opposite effect observed in the linear mode. Since there are probably no strong dependencies in these domains, the ability to detect them did not bring any advantage to RReliefF. MSE, which minimizes the squared error, was favored by the stopping criterion (a percentage of the standard deviation) and by the pruning procedure, which both rely on the estimation of the error. A similar effect was observed in classification (Brodley, 1995; Lubinsky, 1995), where it was empirically shown that the classification error is the most appropriate criterion for selecting the splits near the leaves of the decision tree. This effect is hidden in the linear mode, where the trees are smaller (but not less complex, as can be seen from Table 4) and the linear models in the leaves take over part of the role of minimizing the error.

5 Conclusions

Our experiments show that RReliefF can discover strong dependencies between attributes, while in domains without such dependencies it performs the same as the mean squared error. It is also robust and noise tolerant. Its intrinsic contextual nature allows it to recognize contextual attributes. From our experimental results we can conclude that learning regression trees with RReliefF is promising, especially in combination with linear models in the leaves of the tree.
Both RReliefF in regression and ReliefF in classification (Kononenko, 1994) are estimators of (7), which gives a unified view on the estimation of the quality of attributes for classification and regression. RReliefF's good performance and robustness indicate its appropriateness for feature selection.

As further work we are planning to use RReliefF for the detection of context switches in incremental learning, and to guide the constructive induction process.

References

Atkeson, C. G., Moore, A. W., and Schaal, S. (1996). Locally weighted learning. Technical report, Georgia Institute of Technology.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth Inc., Belmont, California.

Brodley, C. E. (1995). Automatic selection of split criterion during tree growing based on node location. In Proceedings of the XII International Conference on Machine Learning. Morgan Kaufmann.

Domingos, P. (1997). Context-sensitive feature selection for lazy learners. Artificial Intelligence Review. (to appear).

Elomaa, T. and Ukkonen, E. (1994). A geometric approach to feature selection. In De Raedt, L. and Bergadano, F., editors, Proceedings of the European Conference on Machine Learning, pages 351–354. Springer Verlag.

Hong, S. J. (1994). Use of contextual information for feature ranking and discretization. Technical Report RC19664, IBM.

Kira, K. and Rendell, L. A. (1992). A practical approach to feature selection. In Sleeman, D. and Edwards, P., editors, Proceedings of the International Conference on Machine Learning, pages 249–256. Morgan Kaufmann.

Kononenko, I. (1994). Estimating attributes: analysis and extensions of Relief. In De Raedt, L. and Bergadano, F., editors, Machine Learning: ECML-94, pages 171–182. Springer Verlag.

Kononenko, I., Šimec, E., and Robnik-Šikonja, M. (1997). Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence, 7:39–55.

Lubinsky, D. J. (1995). Increasing the performance and consistency of classification trees by using the accuracy criterion at the leaves. In Proceedings of the XII International Conference on Machine Learning. Morgan Kaufmann.

Mantaras, R. (1989). ID3 revisited: A distance based criterion for attribute selection. In Proceedings of the Int. Symp. on Methodologies for Intelligent Systems, Charlotte, North Carolina, USA.

Murphy, P. and Aha, D. (1995). UCI repository of machine learning databases. (http://www.ics.uci.edu/~mlearn/MLRepository.html).

Quinlan, J. R. (1993). Combining instance-based and model-based learning. In Proceedings of the X International Conference on Machine Learning, pages 236–243. Morgan Kaufmann.

Smyth, P. and Goodman, R. (1990). Rule induction using information theory. In Piatetsky-Shapiro, G. and Frawley, W., editors, Knowledge Discovery in Databases. MIT Press.