
Concept Learning

Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.

Courtesy: Prof. Pabitra Mitra, CSE, IIT Kharagpur


References

1. Chapters 2 and 7 of "Machine Learning" by Tom M. Mitchell.
2. Chapter 2 of "Introduction to Machine Learning" by Ethem Alpaydin.
Which days does one come out to enjoy sports?

Attributes of a day (each takes on values):
- Sky condition: Rainy / Cloudy / Sunny
- Temperature: Warm / Cold
- Humidity: High / Normal
- Wind: Strong / Weak
- Water: Warm / Cool
- Forecast: Same / Change

Enjoy sports (?): Yes / No
Learning Task
- To make a hypothesis about the days on which a person comes out to enjoy sports,
  - in the form of a Boolean function on the attributes of the day.
- Find the right hypothesis/function from historical data,
  - the Training Examples (TEs).
Training Examples for EnjoySport
c(⟨Sunny, Warm, Normal, Strong, Warm, Same⟩) = 1
c(⟨Sunny, Warm, High, Strong, Warm, Same⟩) = 1
c(⟨Rainy, Cold, High, Strong, Warm, Change⟩) = 0
c(⟨Sunny, Warm, High, Strong, Cool, Change⟩) = 1

c is the target concept.
- Negative and positive training examples.
- Goal: to learn the target concept c, a Boolean function.
Concept learning
- To derive a Boolean function from training examples.
  - There are many "hypothetical" Boolean functions.
  - Find h such that h = c.
- Generate hypotheses for the concept from the TEs.
Representing a Hypothesis
- A hypothesis is a conjunction of constraints.
- Each constraint: a Boolean condition on an attribute's value.
- Three forms:
  - Specific value: Water = Warm
  - Don't-care value: Water = ?
    - Any value satisfies the condition.
  - No value allowed: Water = ∅
    - i.e., no permissible value, given the values of the other attributes.
- Represented in the form of a vector.
Example of a hypothesis
- Represented in the form of a vector:
  ⟨sky, temp, humid, wind, water, forecast⟩
- h = ⟨Sunny ? ? Strong ? Same⟩
  h(x) = 1 if h is true on x
       = 0 otherwise
- x is also represented as a vector, an element of the 6-D instance space.
  - x = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩: h(x) = 1
  - x = ⟨Sunny, Warm, Normal, Strong, Warm, Change⟩: h(x) = 0
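A minimal Python sketch (not from the slides; the '?' / None encoding and names are illustrative) of evaluating such a conjunctive hypothesis on an instance:

```python
# Evaluate a conjunctive hypothesis h on an instance x.
# '?' encodes a don't-care constraint; None encodes the empty constraint (no value allowed).

def h_of_x(h, x):
    """Return 1 if every attribute constraint in h is satisfied by x, else 0."""
    for constraint, value in zip(h, x):
        if constraint is None:                 # empty constraint: nothing satisfies it
            return 0
        if constraint != '?' and constraint != value:
            return 0
    return 1

h = ('Sunny', '?', '?', 'Strong', '?', 'Same')
print(h_of_x(h, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')))    # 1
print(h_of_x(h, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Change')))  # 0
```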
Space of Hypotheses
- H: the set of all possible hypotheses
  - a finite number of combinations.
- Size of the input space X
  - X = Sky × Temp × Humid × Wind × Water × Forecast
  - |X| = 3 × 2 × 2 × 2 × 2 × 2 = 96
- Size of H
  - Each attribute A can take |A| + |{∅, ?}| constraint values.
    - E.g., for Sky: 3 + 2 = 5
  - |H| = 5 × 4 × 4 × 4 × 4 × 4 = 5120
  - But every h containing ∅ denotes the empty set of instances (all negatives).
  - No. of semantically distinct hypotheses: 1 + 4 × 3 × 3 × 3 × 3 × 3 = 973
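As an illustrative cross-check (assumed code, not from the slides), the same counts can be computed directly:

```python
# Domain sizes: Sky has 3 values, the remaining five attributes have 2 each.
domain_sizes = [3, 2, 2, 2, 2, 2]

size_X = 1
for d in domain_sizes:
    size_X *= d                      # |X| = 3 * 2^5 = 96

size_H = 1
for d in domain_sizes:
    size_H *= d + 2                  # each attribute also admits '?' and the empty constraint

# Hypotheses containing an empty constraint all denote the same all-negative concept,
# so count them once; otherwise each attribute admits its |A| values plus '?'.
distinct_H = 1
for d in domain_sizes:
    distinct_H *= d + 1
distinct_H += 1

print(size_X, size_H, distinct_H)    # 96 5120 973
```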
Concept Learning: Task
TASK T: predicting when a person will enjoy sport
- Target function c: EnjoySport : X → {0, 1}
- We cannot, in general, know the target function c.
  - Adopt hypotheses H about c.
- Form of the hypotheses H:
  - conjunctions of literals, e.g., ⟨?, Cold, High, ?, ?, ?⟩
Concept Learning: Experience
EXPERIENCE E
- Instances X: possible days described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast.
- Training examples D: positive/negative examples of the target function, {⟨x1, c(x1)⟩, ..., ⟨xm, c(xm)⟩}.
Concept Learning: Performance Measure
PERFORMANCE MEASURE P:
A hypothesis h in H such that h(x) = c(x) for all x in D (the training examples).
- Several alternative hypotheses may exist that fit the examples.
- Inductive learning assumes that h also holds for unseen examples.
Inductive Learning Hypothesis
- Any hypothesis h found to approximate the target function c well over a sufficiently large set of training examples D will also approximate the target function well over other unobserved examples (i.e., over the population distribution 𝒟).

  ∀x ∈ D, h(x) ≈ c(x)  ⇒  ∀x ∈ 𝒟, h(x) ≈ c(x)
Ordering on Hypotheses
Instances X (specific → general)              Hypotheses H (specific → general)
x1 = ⟨Sunny Warm High Strong Cool Same⟩       h1 = ⟨Sunny ? ? Strong ? ?⟩
x2 = ⟨Sunny Warm High Light Warm Same⟩        h2 = ⟨Sunny ? ? ? ? ?⟩
                                              h3 = ⟨Sunny ? ? ? Cool ?⟩

- h is more general than h′ (h ≥_g h′) if for each instance x, h′(x) = 1 → h(x) = 1.
- Which is the most general / most specific hypothesis?
Learning as a search problem
- Search for a hypothesis h in the space H that best fits the examples.
- If the examples are error free, h should satisfy all of them.
  - Not unique: several alternative hypotheses may fit the examples.
  - A solution may not exist at all!
    - i.e., no h satisfies all positive and negative examples.
  - Constraints may have other forms,
    - e.g., the Sky condition could be (Rainy OR Cloudy), but that is not admitted in H.
Approaches to learning algorithms
- Approach 1: search based on an ordering of hypotheses.
- Approach 2: search based on finding all possible hypotheses, using a good representation of the hypothesis space.
  - All hypotheses that fit the data.
- The choice of the hypothesis space reduces the number of hypotheses.
Find-S Algorithm

Assumes:
- There is a hypothesis h in H describing the target function c.
- There are no errors in the TEs.
- Everything except the positive examples is negative.

1. Initialize h to the most specific hypothesis in H (what is this?).
2. For each positive training instance x:
     For each attribute constraint a_i in h:
       If the constraint a_i in h is satisfied by x,
         do nothing;
       else
         replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

Since a negative example causes no change, it is ignored.
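A minimal Python sketch of Find-S under the stated assumptions (the encoding of ⟨∅, …, ∅⟩ as a list of None values and the function name are illustrative, not from the slides):

```python
# Find-S for conjunctive hypotheses: None = empty constraint, '?' = don't-care.

def find_s(examples, n_attributes):
    h = [None] * n_attributes                  # most specific hypothesis ⟨∅, ..., ∅⟩
    for x, label in examples:
        if label != 1:                         # negative examples cause no change
            continue
        for i, value in enumerate(x):
            if h[i] is None:                   # first positive example copies its values
                h[i] = value
            elif h[i] != value:                # conflicting value: generalize to '?'
                h[i] = '?'
    return h

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]
print(find_s(examples, 6))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

On the EnjoySport examples this reproduces h4 from the next slide.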
Example of Find-S
Instances X (specific → general)                Hypotheses H (specific → general)
                                                h0 = ⟨∅ ∅ ∅ ∅ ∅ ∅⟩
x1 = ⟨Sunny Warm Normal Strong Warm Same⟩ +     h1 = ⟨Sunny Warm Normal Strong Warm Same⟩
x2 = ⟨Sunny Warm High Strong Warm Same⟩ +       h2 = ⟨Sunny Warm ? Strong Warm Same⟩
x3 = ⟨Rainy Cold High Strong Warm Change⟩ -     h3 = ⟨Sunny Warm ? Strong Warm Same⟩
x4 = ⟨Sunny Warm High Strong Cool Change⟩ +     h4 = ⟨Sunny Warm ? Strong ? ?⟩
Problems with Find-S
- Problems:
  - Throws away information!
    - Negative examples are ignored.
  - Can't tell whether it has learned the concept.
    - Depending on H, there might be several h's that fit the TEs!
    - It picks a maximally specific h (why?).
  - Can't tell when the training data is inconsistent,
    - since it ignores negative TEs.
- Advantages:
  - Simple.
  - Outcome is independent of the order of examples.
    - Why?
- Any alternative?
  - Keep all consistent hypotheses!
    - The candidate elimination algorithm.
Consistent Hypothesis
- h is consistent with a set of training examples D of the target concept c
  - if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.
  - Note that consistency is with respect to the specific D.
- Notation:
  Consistent(h, D) ≡ ∀⟨x, c(x)⟩ ∈ D : h(x) = c(x)
- Agnostic hypothesis: may label a training sample erroneously.
  Agnostic(h, D) ≡ ∃⟨x, c(x)⟩ ∈ D : h(x) ≠ c(x)
Version Space
- VS_{H,D}: the subset of hypotheses from H consistent with D,
  - with respect to hypothesis space H and training examples D.
- Notation:
  VS_{H,D} = {h | h ∈ H ∧ Consistent(h, D)}
List-Then-Eliminate Algorithm
1. VersionSpace ← a list of all hypotheses in H.
2. For each training example ⟨x, c(x)⟩,
   remove from VersionSpace any hypothesis h for which h(x) ≠ c(x).
3. Output the list of hypotheses in VersionSpace.

Essentially a brute-force procedure.
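A brute-force sketch for the EnjoySport space (assumed code, not from the slides; it enumerates the 972 conjunctive hypotheses without the empty constraint and filters them):

```python
from itertools import product

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]

def h_of_x(h, x):
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

# Every conjunctive hypothesis without the empty constraint: 4*3*3*3*3*3 = 972.
all_hypotheses = list(product(*[d + ('?',) for d in domains]))

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]

version_space = [h for h in all_hypotheses
                 if all(h_of_x(h, x) == c for x, c in examples)]
print(len(version_space))    # 6 hypotheses remain consistent with these TEs
for h in version_space:
    print(h)
```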
Example of Find-S, Revisited
Instances X (specific → general)                Hypotheses H (specific → general)
                                                h0 = ⟨∅ ∅ ∅ ∅ ∅ ∅⟩
x1 = ⟨Sunny Warm Normal Strong Warm Same⟩ +     h1 = ⟨Sunny Warm Normal Strong Warm Same⟩
x2 = ⟨Sunny Warm High Strong Warm Same⟩ +       h2 = ⟨Sunny Warm ? Strong Warm Same⟩
x3 = ⟨Rainy Cold High Strong Warm Change⟩ -     h3 = ⟨Sunny Warm ? Strong Warm Same⟩
x4 = ⟨Sunny Warm High Strong Cool Change⟩ +     h4 = ⟨Sunny Warm ? Strong ? ?⟩

h4: the least general consistent hypothesis.
The most general hypothesis is restricted by looking at the negative sample!
Version Space for this Example

S: {⟨Sunny Warm ? Strong ? ?⟩}

⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩

G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}
Compact Representation of the Version Space
- Store the most and the least general boundaries of the space.
- Generalize from the most specific boundary
  - using positive samples.
- Specialize from the most general boundary
  - using negative samples.
- All intermediate h's in the VS can be generated:
  - any h in the VS must be consistent with all TEs.
Compact Representation of the Version Space VS_{H,D}
- The general boundary G:
  - the set of its maximally general members consistent with D.
  - Summarizes the negative examples:
    - anything more general wrongly covers a negative TE.
- The specific boundary S:
  - the set of its maximally specific members consistent with D.
  - Summarizes the positive examples:
    - anything more specific fails to cover a positive TE.
Theorem
Every member of the version space lies between the S and G boundaries:
VS_{H,D} = {h | h ∈ H ∧ ∃s ∈ S, ∃g ∈ G : g ≥_g h ≥_g s}

- Must prove:
  1) every h satisfying the RHS is in VS_{H,D};
  2) every member of VS_{H,D} satisfies the RHS.
Proof
Every member of the version space lies between the S and G boundaries:
VS_{H,D} = {h | h ∈ H ∧ ∃s ∈ S, ∃g ∈ G : g ≥_g h ≥_g s}

- For 1), let g, h, s be arbitrary members of G, H, S respectively with g ≥_g h ≥_g s; we prove that h is consistent.
  - s is satisfied by all positive TEs, and so is h, because it is more general;
  - g is not satisfied by any negative TE, and so neither is h;
  - hence h is in VS_{H,D}, since it is satisfied by all positive TEs and by no negative TEs.
- For 2),
  - since h satisfies all positive TEs and no negative TEs, there exist s ∈ S and g ∈ G with h ≥_g s and g ≥_g h.
Candidate Elimination Algorithm
G ← the maximally general hypotheses in H
S ← the maximally specific hypotheses in H
For each training example d, do:
- If d is positive:
  - Remove from G every hypothesis inconsistent with d.
  - For each hypothesis s in S that is inconsistent with d:
    - Remove s from S.
    - Add to S all minimal generalizations h of s such that
      1. h is consistent with d, and
      2. some member of G is more general than h.
    - Remove from S every hypothesis that is more general than another hypothesis in S.
Candidate Elimination Algorithm (cont.)
- If d is negative:
  - Remove from S every hypothesis inconsistent with d.
  - For each hypothesis g in G that is inconsistent with d:
    - Remove g from G.
    - Add to G all minimal specializations h of g such that
      1. h is consistent with d, and
      2. some member of S is more specific than h.
    - Remove from G every hypothesis that is less general than another hypothesis in G.

- Essentially, use
  - positive TEs to generalize S,
  - negative TEs to specialize G.
- The result is independent of the order of the TEs.
- Convergence is guaranteed if:
  - there are no errors, and
  - there is an h in H describing c.
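A hedged Python sketch of candidate elimination for conjunctive hypotheses (the helper names and the None/'?' encoding are assumptions, not from the slides):

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c is not None and (c == '?' or c == v) for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if every instance covered by h2 is also covered by h1 (h1 >=_g h2)."""
    if any(c is None for c in h2):
        return True                              # h2 covers no instance at all
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]                      # maximally specific boundary
    G = [tuple(['?'] * n)]                       # maximally general boundary
    for x, label in examples:
        if label == 1:                           # positive example: generalize S
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                    continue
                # The minimal generalization of a conjunction covering x is unique.
                h = tuple(v if c is None else (c if c == v else '?')
                          for c, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:                                    # negative example: specialize G
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimal specializations: replace one '?' by any value that excludes x.
                for i, values in enumerate(domains):
                    if g[i] != '?':
                        continue
                    for v in values:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.append(h)
            G = [g for g in new_G
                 if not any(g != h and more_general_or_equal(h, g) for h in new_G)]
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1),
]
S, G = candidate_elimination(examples, domains)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

On the EnjoySport data this reproduces the S and G boundaries derived on the following slides.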
Example

Recall: if d is positive,
- remove from G every hypothesis inconsistent with d;
- for each hypothesis s in S that is inconsistent with d:
  - remove s from S;
  - add to S all minimal generalizations h of s that are specializations of a hypothesis in G;
  - remove from S every hypothesis that is more general than another hypothesis in S.

S0: {⟨∅ ∅ ∅ ∅ ∅ ∅⟩}
G0: {⟨? ? ? ? ? ?⟩}

⟨Sunny Warm Normal Strong Warm Same⟩ +

S1: {⟨Sunny Warm Normal Strong Warm Same⟩}
G1: {⟨? ? ? ? ? ?⟩}
Example (contd.)

S1: {⟨Sunny Warm Normal Strong Warm Same⟩}
G1: {⟨? ? ? ? ? ?⟩}

⟨Sunny Warm High Strong Warm Same⟩ +

S2: {⟨Sunny Warm ? Strong Warm Same⟩}
G2: {⟨? ? ? ? ? ?⟩}
Example (contd.)

Recall: if d is negative,
- remove from S every hypothesis inconsistent with d;
- for each hypothesis g in G that is inconsistent with d:
  - remove g from G;
  - add to G all minimal specializations h of g that generalize some hypothesis in S;
  - remove from G every hypothesis that is less general than another hypothesis in G.

S2: {⟨Sunny Warm ? Strong Warm Same⟩}
G2: {⟨? ? ? ? ? ?⟩}

⟨Rainy Cold High Strong Warm Change⟩ -

The current G boundary is incorrect, so it needs to be made more specific.

S3: {⟨Sunny Warm ? Strong Warm Same⟩}
G3: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩, ⟨? ? ? ? ? Same⟩}
Example (contd.)
- Why are there no hypotheses left relating to ⟨Cloudy ? ? ? ? ?⟩?
  - It is inconsistent with S.
- The specialization using the third attribute value, ⟨? ? Normal ? ? ?⟩, is not more general than the specific boundary {⟨Sunny Warm ? Strong Warm Same⟩}.
- The specializations ⟨? ? ? Weak ? ?⟩ and ⟨? ? ? ? Cool ?⟩ are likewise inconsistent with S.
Example (contd.)

S3: {⟨Sunny Warm ? Strong Warm Same⟩}
G3: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩, ⟨? ? ? ? ? Same⟩}

⟨Sunny Warm High Strong Cool Change⟩ +

S4: {⟨Sunny Warm ? Strong ? ?⟩}
G4: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}
Example (contd.)
⟨Sunny Warm High Strong Cool Change⟩ +
- Why does this example remove the hypothesis ⟨? ? ? ? ? Same⟩ from G?
- This hypothesis
  - cannot be specialized, since it would then not cover the new TE;
  - cannot be generalized, because a more general form would cover the negative TE;
  - hence it must be dropped.
Version Space of the Example

S: {⟨Sunny Warm ? Strong ? ?⟩}

⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩

G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

The version space lies between S and G.
Convergence of algorithm
- Convergence is guaranteed if:
  - there are no errors, and
  - there is an h in H describing c.
- Ambiguity is removed from the VS when S = G,
  - containing a single h,
  - once enough TEs have been seen.
- For any incorrectly labeled (e.g., false negative) TE, the algorithm will remove every h consistent with that TE, and hence may remove the correct target concept from the VS.
  - Given enough TEs, the S and G boundaries then converge to an empty VS.
Which Next Training Example?

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

Assume the learner can choose the next TE.
- It should choose d such that the number of hypotheses in the VS is reduced maximally.
  - The best TE satisfies precisely 50% of the hypotheses;
  - this can't always be done.
- Example: ⟨Sunny Warm Normal Weak Warm Same⟩?
  - If positive, it generalizes S.
  - If negative, it specializes G.
Which Next Training Example?

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

The order of examples matters for the intermediate sizes of S and G, but not for the final S and G.
Classifying new cases using VS

S: {⟨Sunny Warm ? Strong ? ?⟩}
⟨Sunny ? ? Strong ? ?⟩   ⟨Sunny Warm ? ? ? ?⟩   ⟨? Warm ? Strong ? ?⟩
G: {⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩}

- Use a voting procedure on the following examples:
  - ⟨Sunny Warm Normal Strong Cool Change⟩
  - ⟨Rainy Cool Normal Weak Warm Same⟩
  - ⟨Sunny Warm Normal Weak Warm Same⟩
  - ⟨Sunny Cold Normal Strong Warm Same⟩
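A small illustration (assumed code, not from the slides) of the voting idea over the six version-space members listed above:

```python
def h_of_x(h, x):
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

# The six members of the version space found for EnjoySport.
version_space = [
    ('Sunny', 'Warm', '?', 'Strong', '?', '?'),
    ('Sunny', '?', '?', 'Strong', '?', '?'),
    ('Sunny', 'Warm', '?', '?', '?', '?'),
    ('?', 'Warm', '?', 'Strong', '?', '?'),
    ('Sunny', '?', '?', '?', '?', '?'),
    ('?', 'Warm', '?', '?', '?', '?'),
]

def vote(x):
    positives = sum(h_of_x(h, x) for h in version_space)
    return positives, len(version_space) - positives      # (+ votes, - votes)

for x in [('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change'),   # (6, 0): unanimous +
          ('Rainy', 'Cool', 'Normal', 'Weak', 'Warm', 'Same'),       # (0, 6): unanimous -
          ('Sunny', 'Warm', 'Normal', 'Weak', 'Warm', 'Same'),       # (3, 3): split
          ('Sunny', 'Cold', 'Normal', 'Strong', 'Warm', 'Same')]:    # (2, 4): majority -
    print(x, vote(x))
```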


Effect of incomplete hypothesis space
- The preceding algorithms work if the target function is in H.
- They will generally not work if the target function is not in H.
- Consider the following examples, which represent the target function "sky = sunny or sky = cloudy":
  - ⟨Sunny Warm Normal Strong Cool Change⟩  Y
  - ⟨Cloudy Warm Normal Strong Cool Change⟩  Y
  - ⟨Rainy Warm Normal Strong Cool Change⟩  N
Effect of incomplete hypothesis space
"sky = sunny or sky = cloudy":
⟨Sunny Warm Normal Strong Cool Change⟩  Y
⟨Cloudy Warm Normal Strong Cool Change⟩  Y
⟨Rainy Warm Normal Strong Cool Change⟩  N

- If we apply the CE algorithm as before, we end up with an empty VS.
- After the first two TEs,
  - S = {⟨? Warm Normal Strong Cool Change⟩}
  - The new hypothesis is overly general: it covers the third (negative) TE!
- Our H does not include the appropriate c.
- We need more expressive hypotheses.
Unbiased Learners
- If there are no limits on the representation of hypotheses (i.e., a full logical representation with and, or, not), the learner can only memorize the examples; no generalization is possible!
- Say there are 5 TEs {x1, x2, x3, x4, x5}, with x4, x5 negative.
- Apply the CE algorithm:
  - S: the disjunction of the positive examples, S = {x1 OR x2 OR x3}
  - G: the negation of the disjunction of the negative examples, G = {NOT (x4 OR x5)}
- The learner would need to see all instances to learn the concept!
- It cannot predict usefully:
  - the TEs get a unanimous vote, but every other x gets a 50/50 vote!
  - For every h in H that predicts +, there is another that predicts -.
Inductive Bias
- Constraints on the representation of hypotheses,
  - e.g., limiting the connectives to conjunctions.
- Allows learning of generalized hypotheses.
- Introduces a bias that depends on the hypothesis representation.
- We need a formal definition of the inductive bias of a learning algorithm.
Inductive system as an equivalent deductive system
- The inductive bias is made explicit in an equivalent deductive system:
  - a logically represented system that produces the same outputs (classifications) from the inputs (TEs, instance x, bias B),
  - e.g., the CE procedure.
Equivalent deductive system
- Inductive bias (IB) of a learning algorithm L:
  - any minimal set of assertions B
  - such that for any target concept c and training examples D, the value c(x) of any instance x can be logically inferred from B, D, and x.
  - For a rote learner, B = {}, and there is no IB.
- Difficult to apply in many cases, but a useful guide.
Inductive bias and specific learning algorithms
- Rote learner:
  - no IB.
- Version space candidate elimination (CE) algorithm:
  - the target concept c can be represented in H.
- Find-S:
  - the target concept c can be represented in H;
  - all instances that are not positive are negative.
Computational Complexity of VS
- The S set for conjunctive feature vectors:
  - linear in the number of features and the number of training examples.
- The G set for conjunctive feature vectors:
  - exponential in the number of training examples.
- In more expressive languages,
  - both S and G can grow exponentially.
- The order of processing examples significantly affects the computational complexity.
Size of S and G?
- n Boolean attributes.
- 1 positive example: (T, T, ..., T)
- n/2 negative examples:
  - (F, F, T, ..., T)
  - (T, T, F, F, T, ..., T)
  - (T, T, T, T, F, F, T, ..., T)
  - ...
  - (T, ..., T, F, F)
- G0: (?, ?, ..., ?)
- G1: (T, ?, ..., ?), (?, T, ?, ..., ?)
- G2: (T, ?, T, ?, ..., ?), (T, ?, ?, T, ..., ?), (?, T, T, ?, ..., ?), (?, T, ?, T, ..., ?)
- |S| = 1
- |G| = 2^(n/2)
Probably Approximately Correct (PAC) learning model
- A consistent hypothesis h of concept c:
  - training error: 0
  - true error ≠ 0
    - error_D(h) = P(c(x) ≠ h(x))
    - D: the population distribution
(Figure: the region where c(x) ≠ h(x), i.e., where h and c disagree on + and - labels.)
Probably Approximately Correct (PAC) learning model
- C: a concept class defined over a set of instances X of length n.
- L: a learner using hypothesis space H.
- C is PAC learnable
  - if for all c ∈ C, distributions D over X, 0 < ε < 1/2 and 0 < δ < 1/2,
  - learner L, with probability at least (1 - δ), outputs a hypothesis h ∈ H such that error_D(h) < ε,
  - in time that is polynomial in 1/ε, 1/δ, n, and size(c).
Sample complexity of a learner
- How many training samples are required to get a successful hypothesis with high probability?
  - The number of samples for a PAC learner?
- Equivalently, how many samples are required so that every consistent hypothesis in the version space is bounded by an error ε?
  - A VS_{H,D} for a target concept c with this property is called ε-exhausted.
Theorem of ε-exhausting the VS
- Given a finite H, a target concept c, and m training samples drawn independently at random to form the data D, for any 0 ≤ ε ≤ 1,
  P(VS_{H,D} is not ε-exhausted) ≤ |H| e^(-εm)

Proof:
Any h with error > ε appears consistent only if all m samples are labeled consistently with it, which happens with probability at most (1 - ε)^m.
Let there be k such hypotheses with error > ε.
The probability that at least one of them is consistent is at most k(1 - ε)^m.
Since 1 - ε ≤ e^(-ε), we get k(1 - ε)^m ≤ k e^(-εm) ≤ |H| e^(-εm).
Sample complexity of a PAC learner
- Bound the probability of producing a consistent hypothesis with error > ε by δ.
- Hence, for a PAC learner,
  |H| e^(-εm) ≤ δ
  ⇒ m ≥ (1/ε) (ln |H| + ln(1/δ))
- This is an overestimate, as the size of the version space is much smaller than |H|.
- m grows linearly with 1/ε and logarithmically with 1/δ.
- It also grows logarithmically with |H|.
Example
- C = target functions of n Boolean attributes in conjunctive form.
  - Each literal can take three states: required true, required false, or ignored (always satisfied).
- H = C
  - A hypothesis is in the same conjunctive form over the literals.
- |H| = ?
  - 3^n
- Hence, m ≥ (1/ε) (n ln 3 + ln(1/δ)).
- Suppose n = 10, ε = 0.1 and δ = 5%:
  - m ≥ (1/0.1) (10 ln 3 + ln(1/0.05)) = 139.82, i.e., 140 samples.
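A quick numerical check of this bound (illustrative code, not from the slides):

```python
import math

def pac_sample_bound(ln_H, eps, delta):
    """Smallest integer m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (ln_H + math.log(1.0 / delta)))

n, eps, delta = 10, 0.1, 0.05
print(pac_sample_bound(n * math.log(3), eps, delta))   # 140
```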
Sample complexity of infinite H
- Dichotomy of a set of instances S:
  - a partitioning into two sets (positive and negative examples).
  - No. of all possible dichotomies: 2^|S|
- Shattering by a hypothesis space H:
  - S is shattered by H if there exists a consistent h for every dichotomy of S.
- Vapnik-Chervonenkis (VC) dimension:
  - the size of the largest finite subset of X shattered by H.
  - VC(H) ≤ log2 |H|
    - Proof: H requires 2^d distinct hypotheses to shatter d instances,
      ⇒ 2^d ≤ |H|. Hence, d = VC(H) ≤ log2 |H|.
A few examples
- H = the set of intervals [a, b] on the real axis.
  - h(x): 1 if x is in [a, b], else 0.
- H shatters any pair of distinct points p < q,
  - e.g., [p-2, p-1] (neither), [p-ε, p+ε] (only p), [q-ε, q+ε] (only q), [p-ε, q+ε] (both).
  - The existence of some hypothesis for each dichotomy is sufficient.
  - The existence of some pair of points that is shattered is sufficient.
- For three points p < q < r:
  - no h realizes the dichotomy with p, r positive and q negative.
- VC(H) = 2.
A few examples (contd.)
- H = the set of straight lines in a plane.
  - h(x): 1 if x lies in the right half-plane, else 0.
- H shatters any pair of distinct points:
  - both points in one half, or the two points in different halves.
- H shatters three non-collinear distinct points:
  - all three in one half, or two in one half and the other point in the opposite half.
  - This need not hold for every set of three points (e.g., collinear points).
- No set of 4 distinct points can be shattered:
  - there exists a dichotomy, with each side containing a pair of points, that is not separable by a straight line.
- VC(H) = 3.
- In general, for (d-1)-dimensional hyperplanes in a d-dimensional space, VC(H) = d + 1.
Another example
- H = all axis-aligned rectangles.
- H shatters at most 4 points.
  - VC(H) = 4.
- It is enough to find some set of 4 points that is shattered;
  - it is not required for all sets of 4 points in the space.
  - 4 points on a straight line are not shattered.
- It is not possible to place 5 points anywhere such that they are shattered.

Courtesy: "Introduction to Machine Learning" by Ethem Alpaydin (Chapter 2, Fig. 2.6)
VC-dimension: Significance
- A measure of model capacity and complexity.
  - Higher dimension: higher capacity, more complex.
  - A look-up table (LUT) has infinite capacity.
- A somewhat pessimistic measure:
  - it does not consider the probability distribution in the feature space.
  - A simple model may discern the classes (when the number of data points is much larger than the VC dimension).
Handling noise in data
- Three major sources:
  - imprecision in the measurement of features;
  - errors in labeling (teacher noise);
  - missing additional attributes in the representation (hidden or latent attributes).
- With noise, a consistent hypothesis may not exist.
- Tolerate training error within a limit, so as to use a simpler model.
Occam's razor
- Given comparable empirical error, a simple (but not too simple) model will generalize better than a complex model.
  - Simpler explanations are more plausible, and any unnecessary complexity should be shaved off.
Inductive bias in data-driven models
- The training data is a small segment of the input space.
  - The smaller the proportion, the greater the inductive bias.
  - A low training error may still give high errors on unseen inputs:
    - the generalization error.
  - The extent to which a model trained on the training set predicts the correct output for new instances is called generalization.
- The higher the proportion of training samples in the input space, the better the model fits and the lower the generalization error.
Matching complexities
- Of the model and of the underlying process generating the data.
  - A less complex model → higher training and generalization error.
    - Underfitting.
  - A more complex model → low training error, but possibly high generalization error.
    - Overfitting.
  - Even when the chosen complexity matches, model fitting requires more data points.
Triple trade-off
- A trade-off between three factors in any data-driven learning algorithm:
  - the complexity of the hypothesis,
  - the amount of training data, and
  - the generalization error on new examples.

Dietterich, T. G. 2003. "Machine Learning." In Nature Encyclopedia of Cognitive Science. London: Macmillan.
Model selection
- Empirical choice of model complexity,
  - e.g., the number of parameters, or the degree of a polynomial for regression.
- Divide the input into 3 sets:
  - training, validation and test.
- Increase model complexity while keeping the training and validation errors low.
  - Cross-validation may be adopted.
- Check the generalization error on the test set.
- Other information-theoretic / likelihood-ratio based approaches also exist.
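A hedged sketch of this procedure on toy data (polynomial regression; the data-generating process and split sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)        # assumed toy process

# Divide the input into training, validation and test sets.
x_tr, y_tr = x[:60], y[:60]
x_va, y_va = x[60:90], y[60:90]
x_te, y_te = x[90:], y[90:]

def validation_mse(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)                   # fit on the training set only
    return np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

# Model selection: pick the complexity (degree) with the lowest validation error.
best_degree = min(range(1, 10), key=validation_mse)
coeffs = np.polyfit(x_tr, y_tr, best_degree)
test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)    # indicates generalization error
print(best_degree, test_mse)
```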
Summary
- Concept learning as search through H.
- General-to-specific ordering over H.
- Version space candidate elimination algorithm.
  - The S and G boundaries characterize the learner's uncertainty.
  - The learner can generate useful queries.
- Inductive leaps are possible only if the learner is biased!
- Inductive learners can be modeled as equivalent deductive systems.
- Concept learning algorithms are unable to handle data with errors.
  - Allow forming hypotheses with low training error,
    - e.g., learning decision trees.


Summary
- Learning allowing low error:
  - supervised learning of a model given labelled data;
  - classification and regression.
- Triple trade-off of learning:
  - model capacity and complexity
    - VC-dimension: the maximum number of points shattered by a hypothesis space;
  - the number of labelled samples;
  - the generalization error.
- Empirical choice of model:
  - training, validation and test sets;
  - three types of errors;
  - choose the model by keeping the training and validation errors low;
  - the generalization error is indicated by the test error.
