Lecture20 Meta
Algorithm Independent Topics
Jason Corso
SUNY at Buffalo
Introduction
Now that we've built an intuition for some pattern classifiers, let's
take a step back and look at some of the more foundational
underpinnings.
Algorithm-independent means:
those mathematical foundations that do not depend upon the
particular classifier or learning algorithm used;
techniques that can be used in conjunction with different learning
algorithms, or provide guidance in their use.
Thus, we will use the off-training set error, which is the error on
points not in the training set.
If the training set is very large, then the maximum size of the
off-training set is necessarily small.
Let E be the error for our cost function (zero-one or some general
loss function).
We cannot compute the error directly (i.e., based on the distance of h to
the unknown target function F).
So, what we can do is compute the expected value of the error given
our dataset, which requires us to marginalize over all possible
targets:

E[E | D] = \sum_{h,F} \sum_{x \notin D} P(x) [1 - \delta(F(x), h(x))] P(h|D) P(F|D)    (1)
An NFLT Example

x     F    h1   h2
000   1    1    1
001  -1   -1   -1
010   1    1    1
011  -1    1   -1
100   1    1   -1
101  -1    1   -1
110   1    1   -1
111   1    1   -1
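The off-training-set error in the table can be computed directly. A small sketch; the split into training and off-training points is an illustrative assumption (the first three patterns taken as the training set D), not something the table itself states:

```python
# Off-training-set zero-one error for the NFLT example table.
# Assumption (for illustration only): the first three patterns form the
# training set D; the remaining five are the off-training set.
patterns = ["000", "001", "010", "011", "100", "101", "110", "111"]
F  = [ 1, -1,  1, -1,  1, -1,  1,  1]   # target function
h1 = [ 1, -1,  1,  1,  1,  1,  1,  1]   # hypothesis 1
h2 = [ 1, -1,  1, -1, -1, -1, -1, -1]   # hypothesis 2

n_train = 3  # hypothetical split for illustration

def off_training_error(h, target, n_train):
    """Fraction of off-training-set points where h disagrees with the target."""
    off = list(zip(h, target))[n_train:]
    return sum(hx != fx for hx, fx in off) / len(off)

print(off_training_error(h1, F, n_train))  # 0.4
print(off_training_error(h2, F, n_train))  # 0.6
```

Both hypotheses agree with F on all three training points, yet they differ sharply off the training set; which one generalizes better depends entirely on the unknown target F.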
[FIGURE 9.1: The No Free Lunch Theorem shows the generalization performance on the off-training-set data that can be achieved (top row, "possible learning systems") and that cannot be achieved (bottom row, "impossible learning systems"). Each square represents all possible classification problems consistent with the training data; this is the problem space, not the familiar feature space. A + indicates that the classification algorithm has generalization higher than average, a - indicates lower than average, and a 0 indicates average performance.]

There is a conservation idea one can take from this: for every possible
learning algorithm for binary classification, the sum of
performance over all possible target functions is exactly zero.
Feature Representation
We can use logical expressions, or predicates, to describe a pattern
(we will get to this later in non-metric methods).
Denote a binary feature attribute by f_i; then a particular pattern
might be described by the predicate f_1 AND f_2.
We could also define such predicates on the data itself: x_1 OR x_2.
Define the rank of a predicate to be the number of elements it
contains.
[Figure: three Venn diagrams, (a)-(c), showing patterns x_1 through x_6 covered in different ways by the feature predicates f_1, f_2, and f_3.]
One can always find multiple ways of representing the same features.
For example, we might use f_1' and f_2' to represent "blind in right eye"
and "same in both eyes", respectively. Then we would have the following four
types of people (in both cases):

       f_1   f_2   f_1'  f_2'
x_1     0     0     0     1
x_2     0     1     0     0
x_3     1     0     1     0
x_4     1     1     1     1
Bias and Variance are two measures of how well a learning algorithm
matches a classification problem.
Bias measures the accuracy or quality of the match: high bias is a
poor match.
Variance measures the precision or specificity of the match: a high
variance implies a weak match.
One can generally adjust the bias and the variance, but they are not
independent.
Regression
Suppose we estimate the unknown target F(x) by a regression function
g(x; D) learned from a dataset D. The mean-square error at x
decomposes into a bias term and a variance term:

E_D[(g(x; D) - F(x))^2] = (E_D[g(x; D)] - F(x))^2 + E_D[(g(x; D) - E_D[g(x; D)])^2]

i.e., bias^2 + variance.
Bias is the difference between the expected value and the true
(generally unknown) value.
A low bias means that on average we will accurately estimate F from
D.
A low variance means that the estimate of F does not change much
as the training set varies.
Note that even if an estimator is unbiased, there can nevertheless be
a large mean-square error arising from a large variance term.
The bias-variance dilemma describes the trade-off between the two
terms above: procedures with increased flexibility to adapt to the
training data tend to have lower bias but higher variance.
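To make the dilemma concrete, bias and variance can be measured by simulation over many training sets. A minimal sketch (the Gaussian setup and the "shrunk" estimator are illustrative assumptions, not from the lecture): the sample mean is unbiased but has higher variance, while shrinking it toward zero cuts variance at the cost of bias.

```python
# Bias-variance trade-off by simulation: estimate the mean mu of N(2, 1)
# from many simulated training sets of size n, comparing the sample mean
# with a shrunk estimator 0.5 * sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 2.0, 10, 20000

means = rng.normal(mu, 1.0, size=(trials, n)).mean(axis=1)
shrunk = 0.5 * means  # shrinking toward 0 adds bias but cuts variance 4x

def bias_and_variance(estimates, truth):
    """Empirical bias and variance of an estimator over many datasets."""
    return estimates.mean() - truth, estimates.var()

b1, v1 = bias_and_variance(means, mu)    # bias ~ 0,  variance ~ 1/n
b2, v2 = bias_and_variance(shrunk, mu)   # bias ~ -1, variance ~ 1/(4n)
print(b1, v1)
print(b2, v2)
```

Tuning the shrinkage factor trades one term against the other, which is exactly the dilemma: neither extreme minimizes the mean-square error in general.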
[Figure: bias and variance in regression. Panels (a)-(d) show models of increasing flexibility (a fixed g(x); a different fixed g(x); a learned linear g(x) = a_0 + a_1 x; a fully learned g(x)), each fit against the truth F(x) on three datasets D1-D3. The bottom row shows histograms p of the resulting bias and variance.]
Classification
where \Phi(t) = \frac{1}{\sqrt{2\pi}} \int_t^{\infty} e^{-u^2/2} \, du.

You can visualize the boundary bias by imagining taking the spatial
average of decision boundaries obtained by running the learning
algorithm on all possible data sets.
Hence, the effect of the variance term on the boundary error is highly
nonlinear and depends on the value of the boundary bias.
[Figure: bias and variance in classification. Two-dimensional Gaussian classifiers of differing flexibility are trained on three datasets D1-D3 against the truth: a rigid model (e.g., identity covariance, \Sigma_i = I) gives high bias and low variance, while a flexible model (learned covariances \Sigma_i) gives low bias and high variance. Panels show (a) the class-conditional distributions, (b) the boundary distributions, and (c) the error histograms with boundary error E_B.]
Resampling Statistics
The results from the discussion of bias and variance suggest a way to
estimate these two values (and hence a way to quantify the match
between a method and a problem).
What is it?
Resampling: take multiple datasets, perform the estimation on each,
and evaluate the resulting boundary distributions and error histograms.
Let's make these ideas clear by presenting two methods for
resampling:
1. Jackknife
2. Bootstrap
Resampling Statistics
Jackknife
Let's first attempt to demonstrate how resampling can be used to
yield a more informative estimate of a general statistic.
Suppose we have a set D of n points x_i sampled from a
one-dimensional distribution.
The familiar estimate of the mean is

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i    (19)

And the estimate of the accuracy of the mean is the sample variance,
given by

\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2    (20)
Now, suppose we were instead interested in the median (the point for
which half the distribution is higher, half lower).
Of course, we can determine the median explicitly.
But how can we compute the accuracy of the median, or a measure
of its error or spread? This problem translates to many
other statistics we could compute, such as the mode.
Resampling will help us here.
First, define some new notation for the leave-one-out statistic, in which
the statistic, such as the mean, is computed by leaving a particular
datum out of the estimate. For the mean, we have

\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j = \frac{n\hat{\mu} - x_i}{n-1}    (21)

Next, we can define the jackknife estimate of the mean, that is, the
mean of the leave-one-out means:

\mu_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \mu_{(i)}    (22)
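The identity in Eq. (21), and the fact that the jackknife mean of Eq. (22) coincides with the ordinary sample mean, can be checked numerically. A small sketch with synthetic data:

```python
# Check Eq. (21): each leave-one-out mean equals (n*mu_hat - x_i)/(n-1),
# and the average of the leave-one-out means recovers the sample mean.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)
n = len(x)
mu_hat = x.mean()

loo = np.array([np.delete(x, i).mean() for i in range(n)])  # mu_(i)
identity = (n * mu_hat - x) / (n - 1)                       # Eq. (21)

print(np.allclose(loo, identity))      # True
print(np.isclose(loo.mean(), mu_hat))  # True: jackknife mean == sample mean
```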
And, we can show that the jackknife estimate of the mean is the same as the
traditional mean.
The variance of the jackknife estimate is given by

Var[\hat{\mu}] = \frac{n-1}{n} \sum_{i=1}^{n} (\mu_{(i)} - \mu_{(\cdot)})^2    (23)

This generalizes to the jackknife estimate of the variance of an
arbitrary statistic \hat{\theta}:

Var_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2    (25)
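Applying Eq. (25) to the motivating question gives a jackknife estimate of the spread of the median. A sketch with synthetic data (the distribution parameters are illustrative choices):

```python
# Jackknife estimate of the variance (spread) of the median, Eq. (25):
# compute the leave-one-out medians, their mean, and the scaled spread.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=51)
n = len(x)

theta = np.median(x)                                          # the statistic
loo = np.array([np.median(np.delete(x, i)) for i in range(n)])  # theta_(i)
theta_dot = loo.mean()                                        # theta_(.)

var_jack = (n - 1) / n * np.sum((loo - theta_dot) ** 2)
print(theta, var_jack)
```

Unlike the sample variance of the mean, this requires no closed-form accuracy formula for the median; the same recipe works for the mode or any other statistic.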
The bias of an estimator \hat{\theta} is the difference between its
expected value and the true value:

bias[\hat{\theta}] = E[\hat{\theta}] - \theta    (26)

And, we can use the jackknife method to estimate such a bias:

bias_{jack}[\hat{\theta}] = (n - 1)(\hat{\theta}_{(\cdot)} - \hat{\theta})    (27)

which gives the bias-corrected estimate

\tilde{\theta} = \hat{\theta} - bias_{jack}[\hat{\theta}] = n\hat{\theta} - (n - 1)\hat{\theta}_{(\cdot)}    (28)
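A sketch of the jackknife bias estimate in action: for the plug-in variance (1/n) \sum (x_i - \bar{x})^2, whose true bias is -\sigma^2/n, subtracting the jackknife bias estimate recovers exactly the familiar sample variance with the 1/(n-1) denominator.

```python
# Jackknife bias correction, Eqs. (26)-(28), applied to the (biased)
# plug-in variance estimator. The corrected estimate equals the
# unbiased 1/(n-1) sample variance exactly.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20)
n = len(x)

def plug_in_var(v):
    """MLE-style variance with the 1/n denominator (biased)."""
    return np.mean((v - v.mean()) ** 2)

theta = plug_in_var(x)
loo = np.array([plug_in_var(np.delete(x, i)) for i in range(n)])
bias_jack = (n - 1) * (loo.mean() - theta)   # Eq. (27)

corrected = theta - bias_jack                # Eq. (28)
print(np.isclose(corrected, np.var(x, ddof=1)))  # True
```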
[Figure: the probability that a leave-one-out replication has a particular accuracy, shown for two classifiers C1 and C2 over accuracy (%) from 70 to 100.]
Resampling Statistics
Bootstrap
A bootstrap dataset is one created by randomly selecting n points
from the training set D with replacement.
Because D itself contains n points, there is nearly always duplication of
individual points in a bootstrap dataset.
In bootstrap estimation, this selection is repeated B times to give B
bootstrap datasets, and the bootstrap estimate of a statistic \theta is
the mean of the B estimates:

\hat{\theta}^{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)}    (29)
The bootstrap estimate of the bias is

bias_{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} - \hat{\theta} = \hat{\theta}^{(\cdot)} - \hat{\theta}    (30)
The bootstrap estimate of the variance is

Var_{boot}[\hat{\theta}] = \frac{1}{B} \sum_{b=1}^{B} (\hat{\theta}^{(b)} - \hat{\theta}^{(\cdot)})^2    (31)
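A sketch of Eqs. (29)-(31) for the median; the data distribution and the number of bootstrap datasets B are illustrative choices:

```python
# Bootstrap estimates of a statistic (the median), its bias, and its
# variance: each bootstrap dataset is n draws from D *with replacement*.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=50)
n, B = len(x), 2000

theta = np.median(x)
boots = np.array([np.median(rng.choice(x, size=n, replace=True))
                  for _ in range(B)])

theta_boot = boots.mean()                      # Eq. (29)
bias_boot = theta_boot - theta                 # Eq. (30)
var_boot = np.mean((boots - theta_boot) ** 2)  # Eq. (31)
print(theta_boot, bias_boot, var_boot)
```

Compared with the jackknife, the number of replications B is chosen by the user rather than fixed at n, so accuracy can be traded directly against computation.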
Resampling Statistics
Cross-Validation
[FIGURE 9.9: In validation, the data set D is split into two parts. The first is used as a standard training set for setting free parameters in the classifier; the second is used to validate. Training and validation error are plotted against the amount of training / parameter adjustment, and training is stopped at the minimum of the validation error.]
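A minimal sketch of the validation idea in the figure: split D into a training part and a validation part, adjust a model parameter on the training part, and keep the setting with the smallest validation error. The polynomial-degree setup, split sizes, and synthetic data here are illustrative assumptions, not from the lecture:

```python
# Holdout validation for model selection: fit polynomials of increasing
# degree on the training part and pick the degree that minimizes error
# on the held-out validation part ("stop" at the validation minimum).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)  # hypothetical data

split = 80  # e.g., first 80 patterns train, remaining 40 validate
xt, yt, xv, yv = x[:split], y[:split], x[split:], y[split:]

def val_error(degree):
    """Mean-square error on the validation part for a degree-d fit."""
    coeffs = np.polyfit(xt, yt, degree)   # parameters set on training part
    return np.mean((np.polyval(coeffs, xv) - yv) ** 2)

errors = {d: val_error(d) for d in range(1, 10)}
best = min(errors, key=errors.get)
print(best, errors[best])
```

m-fold cross-validation repeats this with m different train/validation splits and averages the validation errors, making better use of a small D.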
Bagging