
Machine Learning

Computational
Learning Theory


Introduction

 Computational learning theory


 Is it possible to identify classes of learning problems that are inherently easy or difficult?
 Can we characterize the number of training examples necessary or sufficient to assure
successful learning?
How is this number affected if the learner is allowed to pose queries to the trainer?
 Can we characterize the number of mistakes that a learner will make before learning
the target function?
 Can we characterize the inherent computational complexity of classes of learning
problems?

 General answers to all these questions are not yet known.


 This chapter
 Sample complexity
 Computational complexity
 Mistake bound


Probably Approximately Correct (PAC) Learning [1]



 Problem Setting for concept learning:


 X : Set of all possible instances over which target functions may be defined.
Training and testing instances are generated from X according to some unknown distribution D.
 We assume that D is stationary.

 C : Set of target concepts that our learner might be called upon to learn.
Each target concept is a Boolean function c : X → {0,1}.

 H : Set of all possible hypotheses.


Goal : Produce a hypothesis h ∈ H that is an estimate of c.

Evaluation : The performance of h is measured over new instances drawn at random according to distribution D.

Error of a Hypothesis :
The training error (sample error) of hypothesis h with respect to target concept c and data sample S of size n is
errorS(h) ≡ (1/n) Σx∈S [c(x) ≠ h(x)]
The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
errorD(h) ≡ Prx~D [c(x) ≠ h(x)]

(Figure: instance space X with instances drawn according to D(x); the shaded region marks where h(x) ≠ c(x), i.e. where c and h disagree, and its probability under D is errorD(h).)

errorD(h) depends strongly on D; D(x) is the probability of presenting instance x.

h is approximately correct if errorD(h) ≤ ε.
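As a minimal sketch (not from the slides; names are illustrative), the sample error above is just the fraction of instances in S on which h and c disagree:

def sample_error(h, c, S):
    # Fraction of instances in S that h misclassifies relative to target c.
    return sum(h(x) != c(x) for x in S) / len(S)

# e.g. with h and c as boolean functions over instances x:
# sample_error(lambda x: x > 0.5, lambda x: x > 0.6, [0.1, 0.55, 0.7, 0.9]) == 0.25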


Probably Approximately Correct (PAC) Learning [2]



We are trying to characterize the number of training examples needed to learn a hypothesis h for which errorD(h) = 0.
Problems :
There may be multiple hypotheses consistent with the training data, and the learner cannot tell which one is the target concept.
Since the training set is chosen randomly, a hypothesis consistent with it may still have nonzero true error.

To accommodate these difficulties, we weaken our demands in two ways :
We do not require zero true error; we only require that the true error be bounded by some constant ε (the accuracy parameter).
We do not require that the learner succeed on every training sequence; we only require that its probability of failure be bounded by some constant δ (the confidence parameter).

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

PAC Learnability :
Definition : C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, every ε with 0 < ε < 1/2, and every δ with 0 < δ < 1/2, learner L will, with probability at least (1 - δ), output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and |C|.
If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
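A compact restatement of the "probably approximately correct" requirement, written here in LaTeX (h_S denotes the hypothesis L outputs after seeing a sample S of m examples drawn i.i.d. from D):

\Pr_{S \sim D^m}\big[\, \mathrm{error}_D(h_S) \le \epsilon \,\big] \;\ge\; 1 - \delta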


Probably Approximately Correct (PAC) Learning [3]



Sample Complexity : The growth in the number of required training examples with problem size.
The most limiting factor for the success of a learner is the limited availability of training data.

Consistent learner : A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.
Our Concern : Can we bound the true error of h, given the training error of h?
Definition : The version space VSH,D is said to be ε-exhausted with respect to c and D if every h ∈ VSH,D has true error less than ε with respect to c and D (∀h ∈ VSH,D : errorD(h) < ε).
(Figure: hypothesis space H containing the version space VSH,D; each hypothesis is labeled with its training error r and its true error, e.g. error = 0.2, r = 0.0 inside VSH,D and error = 0.3, r = 0.4 outside. r = training error, error = true error.)


Probably Approximately Correct (PAC) Learning [4]



Theorem [Haussler, 1988]
If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to |H| e^(-εm).

Important Result!
It bounds the probability that any consistent learner will output a hypothesis h with errorD(h) ≥ ε.
We want this probability to be below a specified threshold δ:
|H| e^(-εm) ≤ δ
To achieve this, solve the inequality for m:
m ≥ (1/ε) (ln |H| + ln (1/δ))
We need to see at least this many training examples.
Note that the bound is not tight: it is possible that |H| e^(-εm) > 1.
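A minimal Python sketch (not part of the original slides; the function name is illustrative) that evaluates this bound for a finite hypothesis space:

import math

def sample_complexity(h_size, epsilon, delta):
    # Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Conjunctions of n boolean literals have |H| = 3**n (each attribute is required
# true, required false, or absent), so the bound grows only linearly in n:
for n in (5, 10, 20):
    print(n, sample_complexity(3 ** n, epsilon=0.1, delta=0.05))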


Probably Approximately Correct (PAC) Learning [5]



Example :
H: conjunctions of constraints on up to n boolean attributes (n boolean literals)
|H| = 3^n, so m ≥ (1/ε) (ln 3^n + ln (1/δ)) = (1/ε) (n ln 3 + ln (1/δ))

How About EnjoySport?
H as given in EnjoySport (conjunctive concepts with don't cares)
|H| = 973
m ≥ (1/ε) (ln |H| + ln (1/δ))
Example goal: with probability at least 1 - δ = 95%, output a hypothesis with errorD(h) < 0.1 (so δ = 0.05, ε = 0.1)
m ≥ (1/0.1) (ln 973 + ln (1/0.05)) ≈ 98.8

Example  Sky    Air Temp  Humidity  Wind    Water  Forecast  Enjoy Sport
0        Sunny  Warm      Normal    Strong  Warm   Same      Yes
1        Sunny  Warm      High      Strong  Warm   Same      Yes
2        Rainy  Cold      High      Strong  Warm   Change    No
3        Sunny  Warm      High      Strong  Cool   Change    Yes
Probably Approximately Correct (PAC) Learning [6]



Unbiased Learner
Recall: sample complexity bound m ≥ (1/ε) (ln |H| + ln (1/δ))
Sample complexity is not always polynomial
Example: for an unbiased learner, |H| = 2^|X|
Suppose X consists of n booleans (binary-valued attributes):
|X| = 2^n, |H| = 2^(2^n)
m ≥ (1/ε) (2^n ln 2 + ln (1/δ))
Sample complexity for this H is exponential in n

Agnostic Learner : A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error.

How Hard Is This?
Sample complexity: m ≥ (1/(2ε^2)) (ln |H| + ln (1/δ))
Derived from Hoeffding bounds: Pr[errorD(h) > errorS(h) + ε] ≤ e^(-2mε^2)
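For contrast, a sketch of this agnostic (Hoeffding-based) bound, again with illustrative names:

import math

def agnostic_sample_complexity(h_size, epsilon, delta):
    # m >= (1 / (2 * epsilon**2)) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

# Dropping the assumption that c is representable in H costs roughly a factor of
# 1/(2*epsilon): for EnjoySport with epsilon = 0.1, delta = 0.05 the bound rises
# from about 99 to about 494 examples.
print(agnostic_sample_complexity(973, epsilon=0.1, delta=0.05))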


Probably Approximately Correct (PAC) Learning [7]



Drawbacks of the sample complexity bound
The bound is not tight: when |H| is large, the bound |H| e^(-εm) can exceed 1.
It cannot be applied at all when |H| is infinite.

Vapnik-Chervonenkis dimension (VC (H))


 VC-dimension measures complexity of hypothesis space H, not by the number of distinct hypotheses |H|,
but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies: A dichotomy (concept) of a set S is a partition of S into two subsets S1 and S2


 Shattering
 A set of instances S is shattered by hypothesis space H if and only if for every dichotomy
(concept) of S, there exists a hypothesis in H consistent with this dichotomy
 Intuition: a rich set of functions shatters a larger instance space


Vapnik-Chervonenkis Dimension [1]



From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X.
An unbiased hypothesis space H can therefore shatter the instance space X.
If H cannot shatter X, but can shatter some large subset S of X, what happens? This is what VC (H) measures.
Vapnik-Chervonenkis dimension (VC (H))
VC (H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC (H) = ∞.
For any finite H, VC (H) ≤ log2 |H|, since shattering d instances requires at least 2^d distinct hypotheses, so 2^d ≤ |H|.


Vapnik-Chervonenkis Dimension [2]



 Example :
 X = R : The set of real numbers.
H : The set of intervals on the real line of the form a < x < b, where a and b may be any real constants.

 What is VC (H)?
Let S = {3.1, 5.7}; can S be shattered by H?
1 < x < 2 realizes the dichotomy containing neither point
1 < x < 4 realizes {3.1}
4 < x < 7 realizes {5.7}
1 < x < 7 realizes {3.1, 5.7}
Every dichotomy of S is realized by some interval, so S is shattered and VC (H) ≥ 2.

Let S = {x0, x1, x2} (x0 < x1 < x2); can S be shattered by H?
The dichotomy that includes x0 and x2 but not x1 cannot be represented by any interval, so S is not shattered.

 Thus VC (H) = 2.
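A brute-force sketch (illustrative code, not from the slides) that checks the argument above: for each dichotomy of S it tries the tightest interval around the chosen positives, which suffices because any interval covering the positives also covers that range.

from itertools import combinations

def shattered_by_intervals(points):
    # True iff every dichotomy of `points` is realized by some interval a < x < b.
    for k in range(len(points) + 1):
        for positives in combinations(points, k):
            if not positives:
                continue  # an empty interval (b <= a) realizes the all-negative dichotomy
            a, b = min(positives) - 1e-9, max(positives) + 1e-9
            if any((a < x < b) != (x in positives) for x in points):
                return False
    return True

print(shattered_by_intervals([3.1, 5.7]))       # True  -> S = {3.1, 5.7} is shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False -> x0 < x1 < x2 cannot be shattered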


Vapnik-Chervonenkis Dimension [3]



Example :
X : The set of instances corresponding to points in the x-y plane.
H : The set of all linear decision surfaces in the plane (such as the decision function of a perceptron).

What is VC (H)?
Collinear points cannot be shattered!
So is VC (H) equal to 2, to 3, or to something larger? (It is enough that some set of the given size be shattered.)
VC (H) ≥ 3, since three non-collinear points can be shattered. To show VC (H) < d, we must show that no set of size d can be shattered.
In this example, no set of size 4 can be shattered, hence VC (H) = 3.
It can be shown that the VC-dimension of linear decision surfaces in r-dimensional space is r + 1.

 Example :
X : The set of instances corresponding to points in the x-y plane.
 H : The set of all axis-aligned rectangles in two dimensions.

 What is VC (H)?

Vapnik-Chervonenkis Dimension [4]



Example : Feedforward neural networks with N free parameters
The VC-dimension of a network with linear activation functions is O(N).
The VC-dimension of a network with threshold activation functions is O(N log N).
The VC-dimension of a network with sigmoid activation functions is O(N^2).


Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?
Example : Find-S
Suppose H is the set of conjunctions of up to n Boolean literals and their negations
Find-S :

Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln

For each positive training instance x do

remove from h any literal that is not satisfied by x

Output hypothesis h

How Many Mistakes before Converging to the Correct h?
Once a literal is removed, it is never put back
No false positives (we started with the most restrictive h), so we only count false negatives
The first positive example removes n of the 2n candidate literals (and costs one mistake)
Worst case: every remaining literal is also removed, incurring one mistake each
Find-S makes at most n + 1 mistakes
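A minimal Python sketch of Find-S over boolean literals (the representation and the example sequence are illustrative), run on a worst case that incurs the full n + 1 mistakes:

def find_s(examples, n):
    # h is a set of literals (i, v) meaning "attribute i must equal v".
    # The initial h contains every literal and its negation (the most specific hypothesis).
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != label:
            mistakes += 1          # Find-S only errs on false negatives
        if label:                  # positive example: drop literals that x does not satisfy
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes

# Worst case for n = 3: the first positive example removes n literals (one mistake),
# and each later positive example removes just one more literal (one mistake each).
examples = [((True, True, True), True),
            ((False, True, True), True),
            ((True, False, True), True),
            ((True, True, False), True)]
print(find_s(examples, 3))  # mistake count is 4 = n + 1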
