
Machine Learning

Computational
Learning Theory


Introduction

 Computational learning theory


 Is it possible to identify classes of learning problems that are inherently easy or difficult?
 Can we characterize the number of training examples necessary or sufficient to assure
successful learning?
How is this number affected if the learner is allowed to pose queries to the trainer?
 Can we characterize the number of mistakes that a learner will make before learning
the target function?
 Can we characterize the inherent computational complexity of classes of learning
problems?

 General answers to all these questions are not yet known.


 This chapter
 Sample complexity
 Computational complexity
 Mistake bound


Probably Approximately Correct (PAC) Learning [1]



 Problem Setting for concept learning:


 X : Set of all possible instances over which target functions may be defined.
Training and testing instances are generated from X according to some unknown distribution D.
 We assume that D is stationary.

 C : Set of target concepts that our learner might be called upon to learn.
Each target concept is a Boolean function c : X → {0,1}.

 H : Set of all possible hypotheses.


Goal : Produce a hypothesis h ∈ H that is an estimate of c.

Evaluation : The performance of h is measured over new instances drawn at random according to distribution D.

Error of a Hypothesis :
The training error (sample error) of hypothesis h with respect to target concept c and data sample S of size n is
errorS(h) ≡ (1/n) Σx∈S [c(x) ≠ h(x)]
The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
errorD(h) ≡ Prx~D [c(x) ≠ h(x)]

(Figure: instance space X with instances drawn according to D(x); the shaded region marks where h(x) ≠ c(x), i.e. where c and h disagree, and its probability under D is errorD(h).)

errorD(h) depends strongly on D; D(x) is the probability of presenting instance x.

h is approximately correct if errorD(h) ≤ ε.
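As a minimal sketch (not from the slides; names are illustrative), the sample error above is just the fraction of instances in S on which h and c disagree:

def sample_error(h, c, S):
    # Fraction of instances in S that h misclassifies relative to target c.
    return sum(h(x) != c(x) for x in S) / len(S)

# e.g. with h and c as boolean functions over instances x:
# sample_error(lambda x: x > 0.5, lambda x: x > 0.6, [0.1, 0.55, 0.7, 0.9]) == 0.25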


Probably Approximately Correct (PAC) Learning [2]



We are trying to characterize the number of training examples needed to learn a hypothesis h for which errorD(h) = 0.
Problems :
There may be multiple hypotheses consistent with the training data, and the learner cannot tell which one is the target concept.
Since the training set is chosen randomly, a hypothesis consistent with it may still have nonzero true error.

To accommodate these difficulties, we weaken our demands in two ways :
We do not require zero true error; we only require that the true error be bounded by some constant ε (the accuracy parameter).
We do not require that the learner succeed on every training sequence; we only require that its probability of failure be bounded by some constant δ (the confidence parameter).

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

PAC Learnability :
Definition : C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, every ε with 0 < ε < 1/2, and every δ with 0 < δ < 1/2, learner L will, with probability at least (1 - δ), output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and |C|.
If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
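A compact restatement of the "probably approximately correct" requirement, written here in LaTeX (h_S denotes the hypothesis L outputs after seeing a sample S of m examples drawn i.i.d. from D):

\Pr_{S \sim D^m}\big[\, \mathrm{error}_D(h_S) \le \epsilon \,\big] \;\ge\; 1 - \delta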


Probably Approximately Correct (PAC) Learning [3]



Sample Complexity : The growth in the number of required training examples with problem size.
The most limiting factor for the success of a learner is the limited availability of training data.

Consistent learner : A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.
Our Concern : Can we bound the true error of h, given the training error of h?
Definition : The version space VSH,D is said to be ε-exhausted with respect to c and D if every h ∈ VSH,D has true error less than ε with respect to c and D (∀h ∈ VSH,D : errorD(h) < ε).
(Figure: hypothesis space H containing the version space VSH,D; each hypothesis is labeled with its training error r and its true error, e.g. error = 0.2, r = 0.0 inside VSH,D and error = 0.3, r = 0.4 outside. r = training error, error = true error.)


Probably Approximately Correct (PAC) Learning [4]



Theorem [Haussler, 1988]
If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to |H| e^(-εm).

Important Result!
It bounds the probability that any consistent learner will output a hypothesis h with errorD(h) ≥ ε.
We want this probability to be below a specified threshold δ:
|H| e^(-εm) ≤ δ
To achieve this, solve the inequality for m:
m ≥ (1/ε) (ln |H| + ln (1/δ))
We need to see at least this many training examples.
Note that the bound is not tight: it is possible that |H| e^(-εm) > 1.
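A minimal Python sketch (not part of the original slides; the function name is illustrative) that evaluates this bound for a finite hypothesis space:

import math

def sample_complexity(h_size, epsilon, delta):
    # Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Conjunctions of n boolean literals have |H| = 3**n (each attribute is required
# true, required false, or absent), so the bound grows only linearly in n:
for n in (5, 10, 20):
    print(n, sample_complexity(3 ** n, epsilon=0.1, delta=0.05))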


Probably Approximately Correct (PAC) Learning [5]



Example :
H: conjunctions of constraints on up to n boolean attributes (n boolean literals)
|H| = 3^n, so m ≥ (1/ε) (ln 3^n + ln (1/δ)) = (1/ε) (n ln 3 + ln (1/δ))

How About EnjoySport?
H as given in EnjoySport (conjunctive concepts with don't cares)
|H| = 973
m ≥ (1/ε) (ln |H| + ln (1/δ))
Example goal: with probability at least 1 - δ = 95%, output a hypothesis with errorD(h) < 0.1 (so δ = 0.05, ε = 0.1)
m ≥ (1/0.1) (ln 973 + ln (1/0.05)) ≈ 98.8

Example  Sky    Air Temp  Humidity  Wind    Water  Forecast  Enjoy Sport
0        Sunny  Warm      Normal    Strong  Warm   Same      Yes
1        Sunny  Warm      High      Strong  Warm   Same      Yes
2        Rainy  Cold      High      Strong  Warm   Change    No
3        Sunny  Warm      High      Strong  Cool   Change    Yes
Probably Approximately Correct (PAC) Learning [6]



Unbiased Learner
Recall: sample complexity bound m ≥ (1/ε) (ln |H| + ln (1/δ))
Sample complexity is not always polynomial
Example: for an unbiased learner, |H| = 2^|X|
Suppose X consists of n booleans (binary-valued attributes):
|X| = 2^n, |H| = 2^(2^n)
m ≥ (1/ε) (2^n ln 2 + ln (1/δ))
Sample complexity for this H is exponential in n

Agnostic Learner : A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error.

How Hard Is This?
Sample complexity: m ≥ (1/(2ε^2)) (ln |H| + ln (1/δ))
Derived from Hoeffding bounds: Pr[errorD(h) > errorS(h) + ε] ≤ e^(-2mε^2)
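For contrast, a sketch of this agnostic (Hoeffding-based) bound, again with illustrative names:

import math

def agnostic_sample_complexity(h_size, epsilon, delta):
    # m >= (1 / (2 * epsilon**2)) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

# Dropping the assumption that c is representable in H costs roughly a factor of
# 1/(2*epsilon): for EnjoySport with epsilon = 0.1, delta = 0.05 the bound rises
# from about 99 to about 494 examples.
print(agnostic_sample_complexity(973, epsilon=0.1, delta=0.05))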


Probably Approximately Correct (PAC) Learning [7]



Drawbacks of the sample complexity bound
The bound is not tight: when |H| is large, the bound |H| e^(-εm) can exceed 1.
It cannot be applied at all when |H| is infinite.

Vapnik-Chervonenkis dimension (VC (H))


 VC-dimension measures complexity of hypothesis space H, not by the number of distinct hypotheses |H|,
but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies: A dichotomy (concept) of a set S is a partition of S into two subsets S1 and S2


 Shattering
 A set of instances S is shattered by hypothesis space H if and only if for every dichotomy
(concept) of S, there exists a hypothesis in H consistent with this dichotomy
 Intuition: a rich set of functions shatters a larger instance space


Vapnik-Chervonenkis Dimension [1]



From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X.
An unbiased hypothesis space H can therefore shatter the instance space X.
If H cannot shatter X, but can shatter some large subset S of X, what happens? This is what VC (H) measures.
Vapnik-Chervonenkis dimension (VC (H))
VC (H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC (H) = ∞.
For any finite H, VC (H) ≤ log2 |H|, since shattering d instances requires at least 2^d distinct hypotheses, so 2^d ≤ |H|.


Vapnik-Chervonenkis Dimension [2]



 Example :
 X = R : The set of real numbers.
H : The set of intervals on the real line of the form a < x < b, where a and b may be any real constants.

 What is VC (H)?
Let S = {3.1, 5.7}; can S be shattered by H?
1 < x < 2 realizes the dichotomy containing neither point
1 < x < 4 realizes {3.1}
4 < x < 7 realizes {5.7}
1 < x < 7 realizes {3.1, 5.7}
Every dichotomy of S is realized by some interval, so S is shattered and VC (H) ≥ 2.

Let S = {x0, x1, x2} (x0 < x1 < x2); can S be shattered by H?
The dichotomy that includes x0 and x2 but not x1 cannot be represented by any interval, so S is not shattered.

 Thus VC (H) = 2.
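A brute-force sketch (illustrative code, not from the slides) that checks the argument above: for each dichotomy of S it tries the tightest interval around the chosen positives, which suffices because any interval covering the positives also covers that range.

from itertools import combinations

def shattered_by_intervals(points):
    # True iff every dichotomy of `points` is realized by some interval a < x < b.
    for k in range(len(points) + 1):
        for positives in combinations(points, k):
            if not positives:
                continue  # an empty interval (b <= a) realizes the all-negative dichotomy
            a, b = min(positives) - 1e-9, max(positives) + 1e-9
            if any((a < x < b) != (x in positives) for x in points):
                return False
    return True

print(shattered_by_intervals([3.1, 5.7]))       # True  -> S = {3.1, 5.7} is shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False -> x0 < x1 < x2 cannot be shattered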


Vapnik-Chervonenkis Dimension [3]



Example :
X : The set of instances corresponding to points in the x-y plane.
H : The set of all linear decision surfaces in the plane (such as the decision function of a perceptron).

What is VC (H)?
Collinear points cannot be shattered!
So is VC (H) equal to 2, to 3, or to something larger? (It is enough that some set of the given size be shattered.)
VC (H) ≥ 3, since three non-collinear points can be shattered. To show VC (H) < d, we must show that no set of size d can be shattered.
In this example, no set of size 4 can be shattered, hence VC (H) = 3.
It can be shown that the VC-dimension of linear decision surfaces in r-dimensional space is r + 1.

 Example :
X : The set of instances corresponding to points in the x-y plane.
 H : The set of all axis-aligned rectangles in two dimensions.

 What is VC (H)?

Vapnik-Chervonenkis Dimension [4]



Example : Feedforward neural networks with N free parameters
The VC-dimension of a network with linear activation functions is O(N).
The VC-dimension of a network with threshold activation functions is O(N log N).
The VC-dimension of a network with sigmoid activation functions is O(N^2).


Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?
Example : Find-S
Suppose H is the set of conjunctions of up to n Boolean literals and their negations
Find-S :

Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln

For each positive training instance x do

remove from h any literal that is not satisfied by x

Output hypothesis h

How Many Mistakes before Converging to the Correct h?
Once a literal is removed, it is never put back
No false positives (we started with the most restrictive h), so we only count false negatives
The first positive example removes n of the 2n candidate literals (and costs one mistake)
Worst case: every remaining literal is also removed, incurring one mistake each
Find-S makes at most n + 1 mistakes
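A minimal Python sketch of Find-S over boolean literals (the representation and the example sequence are illustrative), run on a worst case that incurs the full n + 1 mistakes:

def find_s(examples, n):
    # h is a set of literals (i, v) meaning "attribute i must equal v".
    # The initial h contains every literal and its negation (the most specific hypothesis).
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != label:
            mistakes += 1          # Find-S only errs on false negatives
        if label:                  # positive example: drop literals that x does not satisfy
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes

# Worst case for n = 3: the first positive example removes n literals (one mistake),
# and each later positive example removes just one more literal (one mistake each).
examples = [((True, True, True), True),
            ((False, True, True), True),
            ((True, False, True), True),
            ((True, True, False), True)]
print(find_s(examples, 3))  # mistake count is 4 = n + 1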
