
Concept Learning

Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.

Courtesy: Prof. Pabitra Mitra, CSE, IIT Kharagpur


References

1. Chapters 2 and 7 of "Machine Learning" by Tom M. Mitchell.
2. Chapter 2 of "Introduction to Machine Learning" by Ethem Alpaydin.
Which days does one come out to enjoy sports?

Attributes of a day (each takes on values):
- Sky condition: Rainy / Cloudy / Sunny
- Temperature: Warm / Cold
- Humidity: High / Normal
- Wind: Strong / Weak
- Water: Warm / Cool
- Forecast: Same / Change

Enjoy sports (?): Yes / No
Learning Task
- To make a hypothesis about the days on which a person comes out to enjoy sports,
  - in the form of a Boolean function on the attributes of the day.
- Find the right hypothesis/function from historical data,
  - the Training Examples (TEs).
Training Examples for EnjoySport
c(⟨Sunny, Warm, Normal, Strong, Warm, Same⟩) = 1
c(⟨Sunny, Warm, High, Strong, Warm, Same⟩) = 1
c(⟨Rainy, Cold, High, Strong, Warm, Change⟩) = 0
c(⟨Sunny, Warm, High, Strong, Cool, Change⟩) = 1

c is the target concept.
- Negative and positive training examples.
- Goal: to learn the target concept c, a Boolean function.
Concept learning
- To derive a Boolean function from training examples.
  - There are many "hypothetical" Boolean functions.
  - Find h such that h = c.
- Generate hypotheses for the concept from the TEs.
Representing a Hypothesis
- A hypothesis is a conjunction of constraints.
- Each constraint: a Boolean condition on an attribute's value.
- Three forms:
  - Specific value: Water = Warm
  - Don't-care value: Water = ?
    - Any value satisfies the condition.
  - No value allowed: Water = ∅
    - i.e., no permissible value, given the values of the other attributes.
- Represented in the form of a vector.
Example of a hypothesis
- Represented in the form of a vector:
  ⟨sky, temp, humid, wind, water, forecast⟩
- h = ⟨Sunny ? ? Strong ? Same⟩
  h(x) = 1 if h is true on x
       = 0 otherwise
- x is also represented as a vector, an element of the 6-D instance space.
  - x = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩: h(x) = 1
  - x = ⟨Sunny, Warm, Normal, Strong, Warm, Change⟩: h(x) = 0
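A minimal Python sketch (not from the slides; the '?' / None encoding and names are illustrative) of evaluating such a conjunctive hypothesis on an instance:

```python
# Evaluate a conjunctive hypothesis h on an instance x.
# '?' encodes a don't-care constraint; None encodes the empty constraint (no value allowed).

def h_of_x(h, x):
    """Return 1 if every attribute constraint in h is satisfied by x, else 0."""
    for constraint, value in zip(h, x):
        if constraint is None:                 # empty constraint: nothing satisfies it
            return 0
        if constraint != '?' and constraint != value:
            return 0
    return 1

h = ('Sunny', '?', '?', 'Strong', '?', 'Same')
print(h_of_x(h, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')))    # 1
print(h_of_x(h, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Change')))  # 0
```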
Space of Hypotheses
- H: the set of all possible hypotheses
  - a finite number of combinations.
- Size of the input space X
  - X = Sky × Temp × Humid × Wind × Water × Forecast
  - |X| = 3 × 2 × 2 × 2 × 2 × 2 = 96
- Size of H
  - Each attribute A can take |A| + |{∅, ?}| constraint values.
    - E.g., for Sky: 3 + 2 = 5
  - |H| = 5 × 4 × 4 × 4 × 4 × 4 = 5120
  - But every h containing ∅ denotes the empty set of instances (all negatives).
  - No. of semantically distinct hypotheses: 1 + 4 × 3 × 3 × 3 × 3 × 3 = 973
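As an illustrative cross-check (assumed code, not from the slides), the same counts can be computed directly:

```python
# Domain sizes: Sky has 3 values, the remaining five attributes have 2 each.
domain_sizes = [3, 2, 2, 2, 2, 2]

size_X = 1
for d in domain_sizes:
    size_X *= d                      # |X| = 3 * 2^5 = 96

size_H = 1
for d in domain_sizes:
    size_H *= d + 2                  # each attribute also admits '?' and the empty constraint

# Hypotheses containing an empty constraint all denote the same all-negative concept,
# so count them once; otherwise each attribute admits its |A| values plus '?'.
distinct_H = 1
for d in domain_sizes:
    distinct_H *= d + 1
distinct_H += 1

print(size_X, size_H, distinct_H)    # 96 5120 973
```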
Concept Learning: Task
TASK T: predicting when a person will enjoy sport
- Target function c: EnjoySport : X → {0, 1}
- We cannot, in general, know the target function c.
  - Adopt hypotheses H about c.
- Form of the hypotheses H:
  - conjunctions of literals, e.g., ⟨?, Cold, High, ?, ?, ?⟩
Concept Learning: Experience
EXPERIENCE E
- Instances X: possible days described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast.
- Training examples D: positive/negative examples of the target function, {⟨x1, c(x1)⟩, ..., ⟨xm, c(xm)⟩}.
Concept Learning: Performance Measure
PERFORMANCE MEASURE P:
A hypothesis h in H such that h(x) = c(x) for all x in D (the training examples).
- Several alternative hypotheses may exist that fit the examples.
- Inductive learning assumes that h also holds for unseen examples.
Inductive Learning Hypothesis
- Any hypothesis h found to approximate the target function c well over a sufficiently large set of training examples D will also approximate the target function well over other unobserved examples (i.e., over the population distribution 𝒟).

  ∀x ∈ D, h(x) ≈ c(x)  ⇒  ∀x ∈ 𝒟, h(x) ≈ c(x)
Ordering on Hypotheses
Instances X (specific → general)              Hypotheses H (specific → general)
x1 = ⟨Sunny Warm High Strong Cool Same⟩       h1 = ⟨Sunny ? ? Strong ? ?⟩
x2 = ⟨Sunny Warm High Light Warm Same⟩        h2 = ⟨Sunny ? ? ? ? ?⟩
                                              h3 = ⟨Sunny ? ? ? Cool ?⟩

- h is more general than h′ (h ≥_g h′) if for each instance x, h′(x) = 1 → h(x) = 1.
- Which is the most general / most specific hypothesis?
Learning as a search problem
- Search for a hypothesis h in the space H that best fits the examples.
- If the examples are error free, h should satisfy all of them.
  - Not unique: several alternative hypotheses may fit the examples.
  - A solution may not exist at all!
    - i.e., no h satisfies all positive and negative examples.
  - Constraints may have other forms,
    - e.g., the Sky condition could be (Rainy OR Cloudy), but that is not admitted in H.
Approaches to learning algorithms
- Approach 1: search based on an ordering of hypotheses.
- Approach 2: search based on finding all possible hypotheses, using a good representation of the hypothesis space.
  - All hypotheses that fit the data.
- The choice of the hypothesis space reduces the number of hypotheses.
Find-S Algorithm

Assumes:
- There is a hypothesis h in H describing the target function c.
- There are no errors in the TEs.
- Everything except the positive examples is negative.

1. Initialize h to the most specific hypothesis in H (what is this?).
2. For each positive training instance x:
     For each attribute constraint a_i in h:
       If the constraint a_i in h is satisfied by x,
         do nothing;
       else
         replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

Since a negative example causes no change, it is ignored.
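A minimal Python sketch of Find-S under the stated assumptions (the encoding of ⟨∅, …, ∅⟩ as a list of None values and the function name are illustrative, not from the slides):

```python
# Find-S for conjunctive hypotheses: None = empty constraint, '?' = don't-care.

def find_s(examples, n_attributes):
    h = [None] * n_attributes                  # most specific hypothesis ⟨∅, ..., ∅⟩
    for x, label in examples:
        if label != 1:                         # negative examples cause no change
            continue
        for i, value in enumerate(x):
            if h[i] is None:                   # first positive example copies its values
                h[i] = value
            elif h[i] != value:                # conflicting value: generalize to '?'
                h[i] = '?'
    return h

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]
print(find_s(examples, 6))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

On the EnjoySport examples this reproduces h4 from the next slide.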
Example of Find-S
Instances X (specific → general)                Hypotheses H (specific → general)
                                                h0 = ⟨∅ ∅ ∅ ∅ ∅ ∅⟩
x1 = ⟨Sunny Warm Normal Strong Warm Same⟩ +     h1 = ⟨Sunny Warm Normal Strong Warm Same⟩
x2 = ⟨Sunny Warm High Strong Warm Same⟩ +       h2 = ⟨Sunny Warm ? Strong Warm Same⟩
x3 = ⟨Rainy Cold High Strong Warm Change⟩ -     h3 = ⟨Sunny Warm ? Strong Warm Same⟩
x4 = ⟨Sunny Warm High Strong Cool Change⟩ +     h4 = ⟨Sunny Warm ? Strong ? ?⟩
Problems with Find-S
- Problems:
  - Throws away information!
    - Negative examples are ignored.
  - Can't tell whether it has learned the concept.
    - Depending on H, there might be several h's that fit the TEs!
    - It picks a maximally specific h (why?).
  - Can't tell when the training data is inconsistent,
    - since it ignores negative TEs.
- Advantages:
  - Simple.
  - Outcome is independent of the order of examples.
    - Why?
- Any alternative?
  - Keep all consistent hypotheses!
    - The candidate elimination algorithm.
Consistent Hypothesis
- h is consistent with a set of training examples D of the target concept c
  - if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.
  - Note that consistency is with respect to the specific D.
- Notation:
  Consistent(h, D) ≡ ∀⟨x, c(x)⟩ ∈ D : h(x) = c(x)
- Agnostic hypothesis: may label a training sample erroneously.
  Agnostic(h, D) ≡ ∃⟨x, c(x)⟩ ∈ D : h(x) ≠ c(x)
Version Space
- VS_{H,D}: the subset of hypotheses from H consistent with D,
  - with respect to hypothesis space H and training examples D.
- Notation:
  VS_{H,D} = {h | h ∈ H ∧ Consistent(h, D)}
List-Then-Eliminate Algorithm
1. VersionSpace ← a list of all hypotheses in H.
2. For each training example ⟨x, c(x)⟩,
   remove from VersionSpace any hypothesis h for which h(x) ≠ c(x).
3. Output the list of hypotheses in VersionSpace.

Essentially a brute-force procedure.
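A brute-force sketch for the EnjoySport space (assumed code, not from the slides; it enumerates the 972 conjunctive hypotheses without the empty constraint and filters them):

```python
from itertools import product

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]

def h_of_x(h, x):
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

# Every conjunctive hypothesis without the empty constraint: 4*3*3*3*3*3 = 972.
all_hypotheses = list(product(*[d + ('?',) for d in domains]))

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]

version_space = [h for h in all_hypotheses
                 if all(h_of_x(h, x) == c for x, c in examples)]
print(len(version_space))    # 6 hypotheses remain consistent with these TEs
for h in version_space:
    print(h)
```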
Example of Find-S, Revisited
Instances X (specific → general)                Hypotheses H (specific → general)
                                                h0 = ⟨∅ ∅ ∅ ∅ ∅ ∅⟩
x1 = ⟨Sunny Warm Normal Strong Warm Same⟩ +     h1 = ⟨Sunny Warm Normal Strong Warm Same⟩
x2 = ⟨Sunny Warm High Strong Warm Same⟩ +       h2 = ⟨Sunny Warm ? Strong Warm Same⟩
x3 = ⟨Rainy Cold High Strong Warm Change⟩ -     h3 = ⟨Sunny Warm ? Strong Warm Same⟩
x4 = ⟨Sunny Warm High Strong Cool Change⟩ +     h4 = ⟨Sunny Warm ? Strong ? ?⟩

h4: the least general consistent hypothesis.
The most general hypothesis is restricted by looking at the negative sample!
Version Space for this Example

S: {⟨Sunny Warm ? Strong ? ?⟩}

⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩

G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}
Compact Representation of the Version Space
- Store the most and the least general boundaries of the space.
- Generalize from the most specific boundary
  - using positive samples.
- Specialize from the most general boundary
  - using negative samples.
- All intermediate h's in the VS can be generated:
  - any h in the VS must be consistent with all TEs.
Compact Representation of the Version Space VS_{H,D}
- The general boundary G:
  - the set of its maximally general members consistent with D.
  - Summarizes the negative examples:
    - anything more general wrongly covers a negative TE.
- The specific boundary S:
  - the set of its maximally specific members consistent with D.
  - Summarizes the positive examples:
    - anything more specific fails to cover a positive TE.
Theorem
Every member of the version space lies between the S and G boundaries:
VS_{H,D} = {h | h ∈ H ∧ ∃s ∈ S, ∃g ∈ G : g ≥_g h ≥_g s}

- Must prove:
  1) every h satisfying the RHS is in VS_{H,D};
  2) every member of VS_{H,D} satisfies the RHS.
Proof
Every member of the version space lies between the S and G boundaries:
VS_{H,D} = {h | h ∈ H ∧ ∃s ∈ S, ∃g ∈ G : g ≥_g h ≥_g s}

- For 1), let g, h, s be arbitrary members of G, H, S respectively with g ≥_g h ≥_g s; we prove that h is consistent.
  - s is satisfied by all positive TEs, and so is h, because it is more general;
  - g is not satisfied by any negative TE, and so neither is h;
  - hence h is in VS_{H,D}, since it is satisfied by all positive TEs and by no negative TEs.
- For 2),
  - since h satisfies all positive TEs and no negative TEs, there exist s ∈ S and g ∈ G with h ≥_g s and g ≥_g h.
Candidate Elimination Algorithm
G ← the maximally general hypotheses in H
S ← the maximally specific hypotheses in H
For each training example d, do:
- If d is positive:
  - Remove from G every hypothesis inconsistent with d.
  - For each hypothesis s in S that is inconsistent with d:
    - Remove s from S.
    - Add to S all minimal generalizations h of s such that
      1. h is consistent with d, and
      2. some member of G is more general than h.
    - Remove from S every hypothesis that is more general than another hypothesis in S.
Candidate Elimination Algorithm (cont.)
- If d is negative:
  - Remove from S every hypothesis inconsistent with d.
  - For each hypothesis g in G that is inconsistent with d:
    - Remove g from G.
    - Add to G all minimal specializations h of g such that
      1. h is consistent with d, and
      2. some member of S is more specific than h.
    - Remove from G every hypothesis that is less general than another hypothesis in G.

- Essentially, use
  - positive TEs to generalize S,
  - negative TEs to specialize G.
- The result is independent of the order of the TEs.
- Convergence is guaranteed if:
  - there are no errors, and
  - there is an h in H describing c.
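A hedged Python sketch of candidate elimination for conjunctive hypotheses (the helper names and the None/'?' encoding are assumptions, not from the slides):

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c is not None and (c == '?' or c == v) for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if every instance covered by h2 is also covered by h1 (h1 >=_g h2)."""
    if any(c is None for c in h2):
        return True                              # h2 covers no instance at all
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]                      # maximally specific boundary
    G = [tuple(['?'] * n)]                       # maximally general boundary
    for x, label in examples:
        if label == 1:                           # positive example: generalize S
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                    continue
                # The minimal generalization of a conjunction covering x is unique.
                h = tuple(v if c is None else (c if c == v else '?')
                          for c, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:                                    # negative example: specialize G
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimal specializations: replace one '?' by any value that excludes x.
                for i, values in enumerate(domains):
                    if g[i] != '?':
                        continue
                    for v in values:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.append(h)
            G = [g for g in new_G
                 if not any(g != h and more_general_or_equal(h, g) for h in new_G)]
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]
S, G = candidate_elimination(examples, domains)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

On the EnjoySport data this reproduces the S and G boundaries derived on the following slides.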
Example

Recall: if d is positive,
- remove from G every hypothesis inconsistent with d;
- for each hypothesis s in S that is inconsistent with d:
  - remove s from S;
  - add to S all minimal generalizations h of s that are specializations of a hypothesis in G;
  - remove from S every hypothesis that is more general than another hypothesis in S.

S0: {⟨∅ ∅ ∅ ∅ ∅ ∅⟩}
G0: {⟨? ? ? ? ? ?⟩}

⟨Sunny Warm Normal Strong Warm Same⟩ +

S1: {⟨Sunny Warm Normal Strong Warm Same⟩}
G1: {⟨? ? ? ? ? ?⟩}
Example (contd.)

S1: {⟨Sunny Warm Normal Strong Warm Same⟩}
G1: {⟨? ? ? ? ? ?⟩}

⟨Sunny Warm High Strong Warm Same⟩ +

S2: {⟨Sunny Warm ? Strong Warm Same⟩}
G2: {⟨? ? ? ? ? ?⟩}
Example (contd.)

Recall: if d is negative,
- remove from S every hypothesis inconsistent with d;
- for each hypothesis g in G that is inconsistent with d:
  - remove g from G;
  - add to G all minimal specializations h of g that generalize some hypothesis in S;
  - remove from G every hypothesis that is less general than another hypothesis in G.

S2: {⟨Sunny Warm ? Strong Warm Same⟩}
G2: {⟨? ? ? ? ? ?⟩}

⟨Rainy Cold High Strong Warm Change⟩ -

The current G boundary is incorrect, so it needs to be made more specific.

S3: {⟨Sunny Warm ? Strong Warm Same⟩}
G3: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩, ⟨? ? ? ? ? Same⟩}
Example (contd.)
- Why are there no hypotheses left relating to ⟨Cloudy ? ? ? ? ?⟩?
  - It is inconsistent with S.
- The specialization using the third attribute value, ⟨? ? Normal ? ? ?⟩, is not more general than the specific boundary {⟨Sunny Warm ? Strong Warm Same⟩}.
- The specializations ⟨? ? ? Weak ? ?⟩ and ⟨? ? ? ? Cool ?⟩ are likewise inconsistent with S.
Example (contd.)

S3: {⟨Sunny Warm ? Strong Warm Same⟩}
G3: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩, ⟨? ? ? ? ? Same⟩}

⟨Sunny Warm High Strong Cool Change⟩ +

S4: {⟨Sunny Warm ? Strong ? ?⟩}
G4: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}
Example (contd.)
⟨Sunny Warm High Strong Cool Change⟩ +
- Why does this example remove the hypothesis ⟨? ? ? ? ? Same⟩ from G?
- This hypothesis
  - cannot be specialized, since it would then not cover the new TE;
  - cannot be generalized, because a more general form would cover the negative TE;
  - hence it must be dropped.
Version Space of the Example

S: {⟨Sunny Warm ? Strong ? ?⟩}

⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩

G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

The version space lies between S and G.
Convergence of algorithm
- Convergence is guaranteed if:
  - there are no errors, and
  - there is an h in H describing c.
- Ambiguity is removed from the VS when S = G,
  - containing a single h,
  - once enough TEs have been seen.
- For any incorrectly labeled (e.g., false negative) TE, the algorithm will remove every h consistent with that TE, and hence may remove the correct target concept from the VS.
  - Given enough TEs, the S and G boundaries then converge to an empty VS.
Which Next Training Example?

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

Assume the learner can choose the next TE.
- It should choose d such that the number of hypotheses in the VS is reduced maximally.
  - The best TE satisfies precisely 50% of the hypotheses;
  - this can't always be done.
- Example: ⟨Sunny Warm Normal Weak Warm Same⟩?
  - If positive, it generalizes S.
  - If negative, it specializes G.
Which Next Training Example?

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

The order of examples matters for the intermediate sizes of S and G, but not for the final S and G.
Classifying new cases using VS

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

- Use a voting procedure on the following examples:
  - ⟨Sunny Warm Normal Strong Cool Change⟩
  - ⟨Rainy Cool Normal Weak Warm Same⟩
  - ⟨Sunny Warm Normal Weak Warm Same⟩
  - ⟨Sunny Cold Normal Strong Warm Same⟩
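A small illustration (assumed code, not from the slides) of the voting idea over the six version-space members listed above:

```python
def h_of_x(h, x):
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

# The six members of the version space found for EnjoySport.
version_space = [
    ('Sunny', 'Warm', '?', 'Strong', '?', '?'),
    ('Sunny', '?', '?', 'Strong', '?', '?'),
    ('Sunny', 'Warm', '?', '?', '?', '?'),
    ('?', 'Warm', '?', 'Strong', '?', '?'),
    ('Sunny', '?', '?', '?', '?', '?'),
    ('?', 'Warm', '?', '?', '?', '?'),
]

def vote(x):
    positives = sum(h_of_x(h, x) for h in version_space)
    return positives, len(version_space) - positives      # (+ votes, - votes)

for x in [('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change'),   # (6, 0): unanimous +
          ('Rainy', 'Cool', 'Normal', 'Weak', 'Warm', 'Same'),       # (0, 6): unanimous -
          ('Sunny', 'Warm', 'Normal', 'Weak', 'Warm', 'Same'),       # (3, 3): split
          ('Sunny', 'Cold', 'Normal', 'Strong', 'Warm', 'Same')]:    # (2, 4): majority -
    print(x, vote(x))
```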


Effect of incomplete hypothesis space
- The preceding algorithms work if the target function is in H.
- They will generally not work if the target function is not in H.
- Consider the following examples, which represent the target function "sky = sunny or sky = cloudy":
  - ⟨Sunny Warm Normal Strong Cool Change⟩  Y
  - ⟨Cloudy Warm Normal Strong Cool Change⟩  Y
  - ⟨Rainy Warm Normal Strong Cool Change⟩  N
Effect of incomplete hypothesis space
"sky = sunny or sky = cloudy":
⟨Sunny Warm Normal Strong Cool Change⟩  Y
⟨Cloudy Warm Normal Strong Cool Change⟩  Y
⟨Rainy Warm Normal Strong Cool Change⟩  N

- If we apply the CE algorithm as before, we end up with an empty VS.
- After the first two TEs,
  - S = {⟨? Warm Normal Strong Cool Change⟩}
  - The new hypothesis is overly general: it covers the third (negative) TE!
- Our H does not include the appropriate c.
- We need more expressive hypotheses.
Unbiased Learners
- If there are no limits on the representation of hypotheses (i.e., a full logical representation with and, or, not), the learner can only memorize the examples; no generalization is possible!
- Say there are 5 TEs {x1, x2, x3, x4, x5}, with x4, x5 negative.
- Apply the CE algorithm:
  - S: the disjunction of the positive examples, S = {x1 OR x2 OR x3}
  - G: the negation of the disjunction of the negative examples, G = {NOT (x4 OR x5)}
- The learner would need to see all instances to learn the concept!
- It cannot predict usefully:
  - the TEs get a unanimous vote, but every other x gets a 50/50 vote!
  - For every h in H that predicts +, there is another that predicts -.
Inductive Bias
- Constraints on the representation of hypotheses,
  - e.g., limiting the connectives to conjunctions.
- Allows learning of generalized hypotheses.
- Introduces a bias that depends on the hypothesis representation.
- We need a formal definition of the inductive bias of a learning algorithm.
Inductive system as an equivalent deductive system
- The inductive bias is made explicit in an equivalent deductive system:
  - a logically represented system that produces the same outputs (classifications) from the inputs (TEs, instance x, bias B),
  - e.g., the CE procedure.
Equivalent deductive system
- Inductive bias (IB) of a learning algorithm L:
  - any minimal set of assertions B
  - such that for any target concept c and training examples D, the value c(x) of any instance x can be logically inferred from B, D, and x.
  - For a rote learner, B = {}, and there is no IB.
- Difficult to apply in many cases, but a useful guide.
Inductive bias and specific learning algorithms
- Rote learner:
  - no IB.
- Version space candidate elimination (CE) algorithm:
  - the target concept c can be represented in H.
- Find-S:
  - the target concept c can be represented in H;
  - all instances that are not positive are negative.
Computational Complexity of VS
- The S set for conjunctive feature vectors:
  - linear in the number of features and the number of training examples.
- The G set for conjunctive feature vectors:
  - exponential in the number of training examples.
- In more expressive languages,
  - both S and G can grow exponentially.
- The order of processing examples significantly affects the computational complexity.
Size of S and G?
- n Boolean attributes.
- 1 positive example: (T, T, ..., T)
- n/2 negative examples:
  - (F, F, T, ..., T)
  - (T, T, F, F, T, ..., T)
  - (T, T, T, T, F, F, T, ..., T)
  - ...
  - (T, ..., T, F, F)
- G0: (?, ?, ..., ?)
- G1: (T, ?, ..., ?), (?, T, ?, ..., ?)
- G2: (T, ?, T, ?, ..., ?), (T, ?, ?, T, ..., ?), (?, T, T, ?, ..., ?), (?, T, ?, T, ..., ?)
- |S| = 1
- |G| = 2^(n/2)
Probably Approximately Correct (PAC) learning model
- A consistent hypothesis h of concept c:
  - training error: 0
  - true error ≠ 0
    - error_D(h) = P(c(x) ≠ h(x))
    - D: the population distribution
(Figure: the region where c(x) ≠ h(x), i.e., where h and c disagree on + and - labels.)
Probably Approximately Correct (PAC) learning model
- C: a concept class defined over a set of instances X of length n.
- L: a learner using hypothesis space H.
- C is PAC learnable
  - if for all c ∈ C, distributions D over X, 0 < ε < 1/2 and 0 < δ < 1/2,
  - learner L, with probability at least (1 - δ), outputs a hypothesis h ∈ H such that error_D(h) < ε,
  - in time that is polynomial in 1/ε, 1/δ, n, and size(c).
Sample complexity of a learner
- How many training samples are required to get a successful hypothesis with high probability?
  - The number of samples for a PAC learner?
- Equivalently, how many samples are required so that every consistent hypothesis in the version space is bounded by an error ε?
  - A VS_{H,D} for a target concept c with this property is called ε-exhausted.
Theorem of ε-exhausting the VS
- Given a finite H, a target concept c, and m training samples drawn independently at random to form the data D, for any 0 ≤ ε ≤ 1,
  P(VS_{H,D} is not ε-exhausted) ≤ |H| e^(-εm)

Proof:
Any h with error > ε appears consistent only if all m samples are labeled consistently with it, which happens with probability at most (1 - ε)^m.
Let there be k such hypotheses with error > ε.
The probability that at least one of them is consistent is at most k(1 - ε)^m.
Since 1 - ε ≤ e^(-ε), we get k(1 - ε)^m ≤ k e^(-εm) ≤ |H| e^(-εm).
Sample complexity of a PAC learner
- Bound the probability of producing a consistent hypothesis with error > ε by δ.
- Hence, for a PAC learner,
  |H| e^(-εm) ≤ δ
  ⇒ m ≥ (1/ε) (ln |H| + ln(1/δ))
- This is an overestimate, as the size of the version space is much smaller than |H|.
- m grows linearly with 1/ε and logarithmically with 1/δ.
- It also grows logarithmically with |H|.
Example
- C = target functions of n Boolean attributes in conjunctive form.
  - Each literal can take three states: required true, required false, or ignored (always satisfied).
- H = C
  - A hypothesis is in the same conjunctive form over the literals.
- |H| = ?
  - 3^n
- Hence, m ≥ (1/ε) (n ln 3 + ln(1/δ)).
- Suppose n = 10, ε = 0.1 and δ = 5%:
  - m ≥ (1/0.1) (10 ln 3 + ln(1/0.05)) = 139.82, i.e., 140 samples.
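A quick numerical check of this bound (illustrative code, not from the slides):

```python
import math

def pac_sample_bound(ln_H, eps, delta):
    """Smallest integer m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (ln_H + math.log(1.0 / delta)))

n, eps, delta = 10, 0.1, 0.05
print(pac_sample_bound(n * math.log(3), eps, delta))   # 140
```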
Sample complexity of infinite H
- Dichotomy of a set of instances S:
  - a partitioning into two sets (positive and negative examples).
  - No. of all possible dichotomies: 2^|S|
- Shattering by a hypothesis space H:
  - S is shattered by H if there exists a consistent h for every dichotomy of S.
- Vapnik-Chervonenkis (VC) dimension:
  - the size of the largest finite subset of X shattered by H.
  - VC(H) ≤ log2 |H|
    - Proof: H requires 2^d distinct hypotheses to shatter d instances,
      ⇒ 2^d ≤ |H|. Hence, d = VC(H) ≤ log2 |H|.
A few examples
- H = the set of intervals [a, b] on the real axis.
  - h(x): 1 if x is in [a, b], else 0.
- H shatters any pair of distinct points p < q,
  - e.g., [p-2, p-1] (neither), [p-ε, p+ε] (only p), [q-ε, q+ε] (only q), [p-ε, q+ε] (both).
  - The existence of some hypothesis for each dichotomy is sufficient.
  - The existence of some pair of points that is shattered is sufficient.
- For three points p < q < r:
  - no h realizes the dichotomy with p, r positive and q negative.
- VC(H) = 2.
A few examples (contd.)
- H = the set of straight lines in a plane.
  - h(x): 1 if x lies in the right half-plane, else 0.
- H shatters any pair of distinct points:
  - both points in one half, or the two points in different halves.
- H shatters three non-collinear distinct points:
  - all three in one half, or two in one half and the other point in the opposite half.
  - This need not hold for every set of three points (e.g., collinear points).
- No set of 4 distinct points can be shattered:
  - there exists a dichotomy, with each side containing a pair of points, that is not separable by a straight line.
- VC(H) = 3.
- In general, for (d-1)-dimensional hyperplanes in a d-dimensional space, VC(H) = d + 1.
Another example
- H = all axis-aligned rectangles.
- H shatters at most 4 points.
  - VC(H) = 4.
- It is enough to find some set of 4 points that is shattered;
  - it is not required for all sets of 4 points in the space.
  - 4 points on a straight line are not shattered.
- It is not possible to place 5 points anywhere such that they are shattered.

Courtesy: "Introduction to Machine Learning" by Ethem Alpaydin (Chapter 2, Fig. 2.6)
VC-dimension: Significance
- A measure of model capacity and complexity.
  - Higher dimension: higher capacity, more complex.
  - A look-up table (LUT) has infinite capacity.
- A somewhat pessimistic measure:
  - it does not consider the probability distribution in the feature space.
  - A simple model may discern the classes (when the number of data points is much larger than the VC dimension).
Handling noise in data
- Three major sources:
  - imprecision in the measurement of features;
  - errors in labeling (teacher noise);
  - missing additional attributes in the representation (hidden or latent attributes).
- With noise, a consistent hypothesis may not exist.
- Tolerate training error within a limit, so as to use a simpler model.
Occam's razor
- Given comparable empirical error, a simple (but not too simple) model will generalize better than a complex model.
  - Simpler explanations are more plausible, and any unnecessary complexity should be shaved off.
Inductive bias in data-driven models
- The training data is a small segment of the input space.
  - The smaller the proportion, the greater the inductive bias.
  - A low training error may still give high errors on unseen inputs:
    - the generalization error.
  - The extent to which a model trained on the training set predicts the correct output for new instances is called generalization.
- The higher the proportion of training samples in the input space, the better the model fits and the lower the generalization error.
Matching complexities
- Of the model and of the underlying process generating the data.
  - A less complex model → higher training and generalization error.
    - Underfitting.
  - A more complex model → low training error, but possibly high generalization error.
    - Overfitting.
  - Even when the chosen complexity matches, model fitting requires more data points.
Triple trade-off
- A trade-off between three factors in any data-driven learning algorithm:
  - the complexity of the hypothesis,
  - the amount of training data, and
  - the generalization error on new examples.

Dietterich, T. G. 2003. "Machine Learning." In Nature Encyclopedia of Cognitive Science. London: Macmillan.
Model selection
- Empirical choice of model complexity,
  - e.g., the number of parameters, or the degree of a polynomial for regression.
- Divide the input into 3 sets:
  - training, validation and test.
- Increase model complexity while keeping the training and validation errors low.
  - Cross-validation may be adopted.
- Check the generalization error on the test set.
- Other information-theoretic / likelihood-ratio based approaches also exist.
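A hedged sketch of this procedure on toy data (polynomial regression; the data-generating process and split sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)        # assumed toy process

# Divide the input into training, validation and test sets.
x_tr, y_tr = x[:60], y[:60]
x_va, y_va = x[60:90], y[60:90]
x_te, y_te = x[90:], y[90:]

def validation_mse(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)                   # fit on the training set only
    return np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

# Model selection: pick the complexity (degree) with the lowest validation error.
best_degree = min(range(1, 10), key=validation_mse)
coeffs = np.polyfit(x_tr, y_tr, best_degree)
test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)    # indicates generalization error
print(best_degree, test_mse)
```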
Summary
- Concept learning as search through H.
- General-to-specific ordering over H.
- Version space candidate elimination algorithm.
  - The S and G boundaries characterize the learner's uncertainty.
  - The learner can generate useful queries.
- Inductive leaps are possible only if the learner is biased!
- Inductive learners can be modeled as equivalent deductive systems.
- Concept learning algorithms are unable to handle data with errors.
  - Allow forming hypotheses with low training error,
    - e.g., learning decision trees.


Summary
- Learning allowing low error:
  - supervised learning of a model given labelled data;
  - classification and regression.
- Triple trade-off of learning:
  - model capacity and complexity
    - VC-dimension: the maximum number of points shattered by a hypothesis space;
  - the number of labelled samples;
  - the generalization error.
- Empirical choice of model:
  - training, validation and test sets;
  - three types of errors;
  - choose the model by keeping the training and validation errors low;
  - the generalization error is indicated by the test error.
