
Appendix D

Statistics for Performance Evaluation


D.1 Single-Valued Summary Statistics
The mean, or average value, is computed by summing the data and dividing the sum
by the number of data items. Specifically,

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (D.1)$$

where
μ is the mean value
n is the number of data items
x_i is the ith data item
The formula for computing variance is shown in Equation D.2:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \qquad (D.2)$$

where
σ² is the variance
μ is the population mean
n is the number of data items
x_i is the ith data item
When calculating the variance for a sampling of data, a better result is obtained by
dividing the sum of squares by n − 1 rather than by n. The proof of this is beyond the
scope of this book; however, it suffices to say that when the division is by n − 1, the
sample variance is an unbiased estimator of the population variance. An unbiased
estimator has the characteristic that the average value of the estimator taken over all
possible samples is equal to the parameter being estimated.
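To make these definitions concrete, the short Python sketch below computes the mean, the population variance, and the unbiased sample variance; the function names are our own illustrative choices, not part of any particular library.

```python
def mean(data):
    # Equation D.1: sum the data and divide by the number of data items.
    return sum(data) / len(data)

def population_variance(data):
    # Equation D.2: mean squared deviation from the mean (divides by n).
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    # Unbiased estimator: divide the sum of squares by n - 1 rather than n.
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / (len(data) - 1)

scores = [0.2, 0.4, 0.6, 0.8]
print(mean(scores), population_variance(scores), sample_variance(scores))
```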
D.2 The Normal Distribution
The equation for the normal, or bell-shaped, curve is shown in Equation D.3:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-(x-\mu)^2/(2\sigma^2)} \qquad (D.3)$$

where
f(x) is the height of the curve corresponding to values of x
e is the base of natural logarithms, approximated by 2.718282
μ is the arithmetic mean for the data
σ is the standard deviation
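For illustration, Equation D.3 can be evaluated directly with a few lines of Python using only the standard library; the function name is ours.

```python
import math

def normal_height(x, mu, sigma):
    # Equation D.3: height of the normal curve at x, given the mean mu
    # and standard deviation sigma.
    coefficient = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The curve peaks at the mean; for the standard normal the height there
# is 1 / sqrt(2 * pi), approximately 0.3989.
print(normal_height(0.0, mu=0.0, sigma=1.0))
```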
D.3 Comparing Supervised Learner Models
In Chapter 7 we described a general technique for comparing two supervised learner
models using the same test dataset. Here we provide two additional techniques for
comparing supervised models. In both cases, model test set error rate is treated as a
sample mean.
Comparing Models with Independent Test Data
With two independent test sets, we simply compute the variance for each model and
apply the classical hypothesis testing procedure. An outline of the technique follows.
1. Given
• Two models, M1 and M2, built with the same training data
• Two independent test sets, set A containing n1 elements and set B with n2 elements
• Error rate E1 and variance v1 for model M1 on test set A
• Error rate E2 and variance v2 for model M2 on test set B
2. Compute

$$P = \frac{|E_1 - E_2|}{\sqrt{v_1/n_1 + v_2/n_2}} \qquad (D.4)$$
3. Conclude
• If P ≥ 2, the difference in the test set performance of model M1 and model M2 is significant.
Let's look at an example. Suppose we wish to compare the test set performance of
learner models M1 and M2. We test M1 on test set A and M2 on test set B. Each test set
contains 100 instances. M1 achieves an 80% classification accuracy with set A, and M2
obtains a 70% accuracy with test set B. We wish to know if model M1 has performed
significantly better than model M2.

For model M1:
E1 = 0.20
v1 = 0.2(1 − 0.2) = 0.16

For model M2:
E2 = 0.30
v2 = 0.3(1 − 0.3) = 0.21

The computation for P is:

$$P = \frac{|0.20 - 0.30|}{\sqrt{0.16/100 + 0.21/100}} \approx 1.64$$
As P ≈ 1.64 < 2, the difference in model performance is not considered to be
significant. We can increase our confidence in the result by switching the two test sets
and repeating the experiment. This is especially important if a significant difference is
seen with the initial test set selection. The average of the two values for P is then
used for the significance test.
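The entire procedure takes only a few lines of Python. The sketch below is our own illustration with a hypothetical function name; it reproduces the example, treating each error rate as a sample mean with variance E(1 − E).

```python
import math

def independent_test_P(e1, n1, e2, n2):
    # Equation D.4: compare two error rates measured on independent test sets.
    v1 = e1 * (1 - e1)   # variance for model M1
    v2 = e2 * (1 - e2)   # variance for model M2
    return abs(e1 - e2) / math.sqrt(v1 / n1 + v2 / n2)

# The example above: E1 = 0.20, E2 = 0.30, 100 instances in each test set.
P = independent_test_P(0.20, 100, 0.30, 100)
print(round(P, 2), "significant" if P >= 2 else "not significant")
```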
Pairwise Comparison with a Single Test Set
When the same test set is applied to the data, one option is to perform an instance-
by-instance pairwise matching of the test set results. With an instance-based
comparison, a single variance score based on pairwise differences is computed. The
formula for calculating the joint variance is shown in Equation D.5:

$$V_{12} = \frac{1}{n-1}\sum_{i=1}^{n}\left[(e_{1i} - e_{2i}) - (E_1 - E_2)\right]^2 \qquad (D.5)$$

where
V12 is the joint variance
e1i is the classifier error on the ith instance for learner model M1
e2i is the classifier error on the ith instance for learner model M2
E1 − E2 is the overall classifier error rate for model M1 minus the classifier error rate for model M2
n is the total number of test set instances
When test set error rate is the measure by which two models are compared, the
output attribute is categorical. Therefore, for any instance i contained in class j, e_ij is 0 if
the classification is correct and 1 if the classification is in error. When the output
attribute is numeric, e_ij represents the absolute difference between the computed and
actual output value.
With the revised formula for computing joint variance, the equation
to test for a significant difference in model performance becomes

$$P = \frac{|E_1 - E_2|}{\sqrt{V_{12}/n}} \qquad (D.6)$$

Once again, a 95% confidence level for a significant difference in model test set
performance is seen if P ≥ 2. This technique is appropriate only if an instance-based
pairwise comparison of model performance is possible. In the next section we address
the case where an instance-based comparison is not possible.
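A minimal sketch of the paired procedure follows, assuming the per-instance errors of both models on the shared test set are available as two equal-length lists; the names are illustrative.

```python
import math

def pairwise_P(errors1, errors2):
    # errors1[i] and errors2[i] hold the error of each model on test
    # instance i: 0 or 1 for categorical output, an absolute difference
    # for numeric output.
    n = len(errors1)
    E1 = sum(errors1) / n
    E2 = sum(errors2) / n
    # Equation D.5: joint variance of the pairwise differences.
    v12 = sum(((a - b) - (E1 - E2)) ** 2
              for a, b in zip(errors1, errors2)) / (n - 1)
    # Equation D.6: significance score for the paired comparison.
    return abs(E1 - E2) / math.sqrt(v12 / n)

# Hypothetical 0/1 error vectors for two models on the same ten instances.
m1 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
m2 = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
print(pairwise_P(m1, m2))
```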
D.4 Confidence Intervals for Numeric Output
Just as when the output was categorical, we are interested in computing confidence
intervals for one or more numeric measures. For purposes of illustration, we use mean
absolute error. As with classifier error rate, mean absolute error is treated as a sample mean.
The sample variance is given by the formula:

$$\text{variance}(mae) = \frac{1}{n-1}\sum_{i=1}^{n}(e_i - mae)^2 \qquad (D.7)$$

where
e_i is the absolute error for the ith instance
n is the number of instances
Let's look at an example using the data in Table 7.2. To determine a confidence
interval for the mae computed for the data in Table 7.2, we first calculate the variance.
Specifically,

$$\text{variance}(0.0604) = \frac{(0.024 - 0.0604)^2 + (0.002 - 0.0604)^2 + \cdots + (0.001 - 0.0604)^2}{14} \approx 0.0092$$
Next, as with the classifier error rate, we compute the standard error for the mae
as the square root of the variance divided by the number of sample instances:

$$SE = \sqrt{0.0092/15} \approx 0.0248$$
Finally, we calculate the 95% confidence interval by respectively subtracting and
adding two standard errors to the computed mae. This tells us that we can be 95%
confident that the actual mae falls somewhere between 0.0108 and 0.1100.
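The same steps can be scripted. The sketch below is ours; it accepts the fifteen absolute errors from Table 7.2, which are not reproduced here, and returns the 95% confidence interval.

```python
import math

def mae_confidence_interval(errors):
    # errors holds the absolute error for each test set instance.
    n = len(errors)
    mae = sum(errors) / n
    # Equation D.7: sample variance of the absolute errors (divides by n - 1).
    variance = sum((e - mae) ** 2 for e in errors) / (n - 1)
    # Standard error: square root of (variance / n).
    se = math.sqrt(variance / n)
    # 95% confidence interval: two standard errors on either side of the mae.
    return mae - 2 * se, mae + 2 * se
```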
D.5 Comparing Models with Numeric Output
The procedure for comparing models giving numeric output is identical to that for
models with categorical output. In the case where two independent test sets are
available and mae measures model performance, the classical hypothesis testing model
takes the form:

$$P = \frac{|mae_1 - mae_2|}{\sqrt{v_1/n_1 + v_2/n_2}} \qquad (D.8)$$
where
mae_1 is the mean absolute error for model M1
mae_2 is the mean absolute error for model M2
v_1 and v_2 are the variance scores associated with M1 and M2
n_1 and n_2 are the number of instances within each respective test set
When the models are tested on the same data and a pairwise comparison is possible,
we use the formula:

$$P = \frac{|mae_1 - mae_2|}{\sqrt{v_{12}/n}} \qquad (D.9)$$
where
mae_1 is the mean absolute error for model M1
mae_2 is the mean absolute error for model M2
v_12 is the joint variance computed with the formula defined in Equation D.5
n is the number of test set instances
When the same test data is applied but a pairwise comparison is not possible, the
most straightforward approach is to compute the variance associated with the mae for
each model using the equation:

$$\text{variance}(mae_j) = \frac{1}{n-1}\sum_{i=1}^{n}(e_i - mae_j)^2 \qquad (D.10)$$
where
mae_j is the mean absolute error for model j
e_i is the absolute value of the computed value minus the actual value for instance i
n is the number of test set instances
The hypothesis of no significant difference is then tested with Equation D.11:

$$P = \frac{|mae_1 - mae_2|}{\sqrt{2(v/n)}} \qquad (D.11)$$
where
v is either the average or the larger of the variance scores for each model
n is the total number of test set instances
As is the case when the output attribute is categorical, using the larger of the two
variance scores is the stronger test.
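A sketch of this last case follows, assuming each model's absolute errors on the shared test set are available; the names are ours. Equations D.8 and D.9 can be scripted in the same way as their categorical counterparts.

```python
import math

def mae_and_variance(errors):
    # errors holds |computed - actual| for each test set instance.
    n = len(errors)
    mae = sum(errors) / n
    # Equation D.10: variance associated with the mae for one model.
    return mae, sum((e - mae) ** 2 for e in errors) / (n - 1)

def unpaired_numeric_P(errors1, errors2, use_larger=True):
    # Equation D.11: same test data, but no instance-by-instance pairing.
    mae1, v1 = mae_and_variance(errors1)
    mae2, v2 = mae_and_variance(errors2)
    # Use the larger variance score for the stronger test, or the average.
    v = max(v1, v2) if use_larger else (v1 + v2) / 2
    n = len(errors1)  # number of test set instances
    return abs(mae1 - mae2) / math.sqrt(2 * (v / n))
```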