Concept Learning: Courtesy: Prof. Pabitra Mitra, CSE, IIT Kharagpur
Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Concept Learning
- To derive a Boolean function from training examples.
- Many candidate ("hypothetical") Boolean functions: find h such that h = c.
- Generate hypotheses for the concept from training examples (TEs).
Representing a Hypothesis
- A hypothesis is a conjunction of constraints.
- Each constraint is a Boolean condition on an attribute's value.
- Three forms:
  - Specific value: Water = Warm
  - Don't-care value: Water = ? (any value satisfies the condition)
  - No value allowed: Water = ∅
- |H| = 5 x 4 x 4 x 4 x 4 x 4 = 5120 syntactically distinct hypotheses
- But every h containing ∅ denotes the empty set of instances (classifies everything as negative)
- Number of semantically distinct hypotheses: 1 + 4 x 3 x 3 x 3 x 3 x 3 = 973 (checked in the sketch below)
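These counts can be reproduced with a short script. A minimal sketch, assuming the standard EnjoySport attribute value counts (Sky: 3; AirTemp, Humidity, Wind, Water, Forecast: 2 each):

    from math import prod

    # Number of possible values for each of the six attributes.
    values_per_attribute = [3, 2, 2, 2, 2, 2]

    # Syntactically distinct hypotheses: each attribute constraint is a value,
    # "?", or the empty constraint.
    syntactic = prod(v + 2 for v in values_per_attribute)

    # Semantically distinct hypotheses: all hypotheses containing an empty
    # constraint denote the same (empty) concept, so count them once.
    semantic = 1 + prod(v + 1 for v in values_per_attribute)

    print(syntactic, semantic)  # 5120 973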
Concept Learning: Task
TASK T: predicting when a person will enjoy a sport
- Target function c: EnjoySport : X → {0, 1}
- The target function c cannot, in general, be known
- Adopt hypotheses H about c
- Form of the hypotheses H:
  - conjunctions of literals, e.g.
  - ⟨?, Cold, High, ?, ?, ?⟩ (see the representation sketch below)
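A hypothesis of this form can be written directly as a tuple, with one entry per attribute. A minimal, illustrative sketch (names and conventions are assumptions, not code from the slides):

    ATTRIBUTES = ("Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast")

    def matches(h, x):
        """True if instance x satisfies every constraint of hypothesis h ("?" = don't care)."""
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    h = ("?", "Cold", "High", "?", "?", "?")
    x = ("Sunny", "Cold", "High", "Strong", "Warm", "Same")
    print(matches(h, x))  # True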
Concept Learning: Experience
EXPERIENCE E
- Instances X: possible days described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast
- Training examples D: positive/negative examples of the target function, {⟨x1, c(x1)⟩, ..., ⟨xm, c(xm)⟩}
Concept Learning: Performance Measure
PERFORMANCE MEASURE P:
Find a hypothesis h in H such that h(x) = c(x) for all x in D (the training examples).
- Several alternative hypotheses may exist that fit the examples.
- Inductive learning assumes that such an h will also be true for unseen examples.
Inductive Learning Hypothesis
Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate it well over unobserved examples.

Find-S trace (from the most specific hypothesis towards more general ones):
h0 = ⟨∅ ∅ ∅ ∅ ∅ ∅⟩
x1 = ⟨Sunny Warm Normal Strong Warm Same⟩ +  →  h1 = ⟨Sunny Warm Normal Strong Warm Same⟩
x2 = ⟨Sunny Warm High Strong Warm Same⟩ +    →  h2 = ⟨Sunny Warm ? Strong Warm Same⟩
x3 = ⟨Rainy Cold High Strong Warm Change⟩ −  →  h3 = ⟨Sunny Warm ? Strong Warm Same⟩
x4 = ⟨Sunny Warm High Strong Cool Change⟩ +  →  h4 = ⟨Sunny Warm ? Strong ? ?⟩
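The same generalization steps can be written as a short program. A minimal Find-S sketch under the tuple/"?"/None conventions assumed above (illustrative, not code from the slides):

    def find_s(examples):
        """examples: list of (instance_tuple, label); label is True for positives."""
        n = len(examples[0][0])
        h = [None] * n                      # h0: the most specific hypothesis
        for x, label in examples:
            if not label:
                continue                    # Find-S ignores negative examples
            for i, xc in enumerate(x):
                if h[i] is None:
                    h[i] = xc               # first positive: copy its values
                elif h[i] != xc:
                    h[i] = "?"              # conflict: generalize to don't-care
        return tuple(h)

    D = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
        (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
        (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
    ]
    print(find_s(D))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')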
Problems with Find-S
- Advantages:
  - simple
  - outcome independent of the order of examples (why?)
- Problems:
  - throws away information: negative examples are ignored
  - can't tell whether it has learned the concept
  - depending on H, there might be several h's that fit the TEs, yet it picks a maximally specific one (why?)
  - can't tell when the training data is inconsistent, since it ignores negative TEs
- Any alternative?
  - Keep all consistent hypotheses: the candidate elimination algorithm
Consistent Hypothesis
- h is consistent with a set of training examples D of target concept c if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.
- Note that consistency is with respect to the specific D.
- Notation:
  Consistent(h, D) ≡ ∀⟨x, c(x)⟩ ∈ D :: h(x) = c(x)
- Agnostic hypothesis: may label a training sample erroneously.
  Agnostic(h, D) ≡ ∃⟨x, c(x)⟩ ∈ D :: h(x) ≠ c(x)
(A consistency check is sketched below.)
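A small check matching this definition, reusing the illustrative tuple representation and matches helper from above (assumptions, not code from the slides):

    def matches(h, x):
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    def consistent(h, D):
        """True iff h classifies every training example in D correctly."""
        return all(matches(h, x) == label for x, label in D)

    D = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    ]
    print(consistent(("Sunny", "Warm", "?", "?", "?", "?"), D))  # True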
Version Space
- VS_H,D: the subset of hypotheses from H consistent with D, with respect to hypothesis space H and training examples D
- Notation:
  VS_H,D = {h | h ∈ H ∧ Consistent(h, D)}
List-Then-Eliminate Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩: remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
(A brute-force sketch follows.)
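A minimal sketch of List-Then-Eliminate over the finite conjunctive hypothesis space; enumerating H is only feasible because the space is tiny here (attribute values are the standard EnjoySport assumptions):

    from itertools import product

    VALUES = [["Sunny", "Rainy", "Cloudy"], ["Warm", "Cold"], ["Normal", "High"],
              ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

    def matches(h, x):
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    def list_then_eliminate(D):
        # Step 1: all non-empty conjunctive hypotheses (a value or "?" per attribute).
        version_space = list(product(*[vals + ["?"] for vals in VALUES]))
        # Step 2: eliminate any hypothesis inconsistent with some training example.
        for x, label in D:
            version_space = [h for h in version_space if matches(h, x) == label]
        return version_space            # Step 3: output the survivors

    D = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    ]
    print(len(list_then_eliminate(D)))  # 60 hypotheses remain consistent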
h4 = ⟨Sunny Warm ? Strong ? ?⟩ is the least general hypothesis consistent with the data.
The negative sample restricts the most general hypotheses.
Version Space for this Example
G = { ⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩ }
Compact Representation of the Version Space VS_H,D
- The general boundary, G:
  - the set of its maximally general members consistent with D
  - summarizes the negative examples; anything more general wrongly covers a negative TE
- The specific boundary, S:
  - the set of its maximally specific members consistent with D
  - summarizes the positive examples; anything more specific fails to cover a positive TE
Theorem
Every member of the version space lies between the S and G boundaries:
VS_H,D = {h | h ∈ H ∧ ∃s ∈ S, ∃g ∈ G (g ≥ h ≥ s)}   (≥ meaning "more general than or equal to")
- Must prove:
  1) every h satisfying the RHS is in VS_H,D;
  2) every member of VS_H,D satisfies the RHS.
Candidate Elimination Algorithm
Initialize G to the set of maximally general hypotheses in H and S to the set of maximally specific hypotheses. Then, for each training example d:
- If d is a positive example:
  - Remove from G every hypothesis inconsistent with d
  - For each hypothesis s in S that is not consistent with d:
    - Remove s from S
    - Add to S all minimal generalizations h of s such that
      1. h is consistent with d, and
      2. some member of G is more general than h
    - Remove from S every hypothesis that is more general than another hypothesis in S
Candidate Elimination Algorithm (cont)
- If d is a negative example:
  - Remove from S every hypothesis inconsistent with d
  - For each hypothesis g in G that is not consistent with d:
    - Remove g from G
    - Add to G all minimal specializations h of g such that
      1. h is consistent with d, and
      2. some member of S is more specific than h
    - Remove from G every hypothesis that is less general than another hypothesis in G
(Minimal generalization/specialization helpers are sketched below.)
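A sketch of the two boundary-update helpers for the conjunctive representation; the function names and the "?"/None constraint conventions are illustrative assumptions, not code from the slides:

    def min_generalizations(s, x):
        """The (single) minimal generalization of s that covers positive instance x."""
        h = list(s)
        for i, xc in enumerate(x):
            if h[i] is None:
                h[i] = xc           # raise an empty constraint to the observed value
            elif h[i] != xc:
                h[i] = "?"          # relax a conflicting constraint to don't-care
        return [tuple(h)]

    def min_specializations(g, x, values):
        """Minimal specializations of g that exclude negative instance x.
        values[i] lists all possible values of attribute i."""
        results = []
        for i, gc in enumerate(g):
            if gc == "?":
                for v in values[i]:
                    if v != x[i]:   # constrain attribute i away from x's value
                        h = list(g)
                        h[i] = v
                        results.append(tuple(h))
        return results

The full algorithm additionally checks the cross-boundary conditions (some member of G more general, some member of S more specific) and prunes non-maximal members, as stated in the steps above.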
Example (contd)
G1 = { ⟨? ? ? ? ? ?⟩ }
G2 = { ⟨? ? ? ? ? ?⟩ }
The negative example is then handled by removing from S every inconsistent hypothesis and specializing G.
Version Space of the Example
S = { ⟨Sunny Warm ? Strong ? ?⟩ }
G = { ⟨Sunny ? ? ? ? ?⟩, ⟨? Warm ? ? ? ?⟩ }
The version space lies between S and G.
Convergence of the Algorithm
- Convergence is guaranteed if:
  - there are no errors in the training examples, and
  - there is some h in H describing c.
- Ambiguity is removed from the VS when S = G
  - containing a single h
  - once enough TEs have been seen
- For a TE with an erroneous label, the algorithm will remove every h consistent with the TE's true classification, and hence may remove the correct target concept from the VS
- If enough TEs are observed, the S and G boundaries will be found to converge to an empty VS
Which Next Training Example?
S = { ⟨Sunny Warm ? Strong ? ?⟩ }
Effect of Incomplete Hypothesis Space
Target concept "sky = sunny or sky = cloudy":
⟨Sunny Warm Normal Strong Cool Change⟩   Y
⟨Cloudy Warm Normal Strong Cool Change⟩  Y
⟨Rainy Warm Normal Strong Cool Change⟩   N
No conjunctive hypothesis is consistent with these three examples (checked below).
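A quick, illustrative check that the conjunctive space cannot express this concept: the least general conjunctive hypothesis covering the two positives already covers the negative.

    def matches(h, x):
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    pos1 = ("Sunny",  "Warm", "Normal", "Strong", "Cool", "Change")
    pos2 = ("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change")
    neg  = ("Rainy",  "Warm", "Normal", "Strong", "Cool", "Change")

    # Least general conjunctive generalization of the two positives.
    h = tuple(a if a == b else "?" for a, b in zip(pos1, pos2))
    print(h)                # ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
    print(matches(h, neg))  # True: the negative example is wrongly covered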
- Constraints on the representation of hypotheses
  - e.g., limiting connectives to conjunctions
  - allow learning of generalized hypotheses
  - but introduce a bias that depends on the hypothesis representation
- Need a formal definition of the inductive bias of a learning algorithm
Inductive System as an Equivalent Deductive System
(Figure: an inductive learner with its training examples viewed as a deductive system whose extra input is the inductive bias.)
Equivalent Deductive System
- Inductive bias (IB) of learning algorithm L:
  - any minimal set of assertions B
  - such that, for any target concept c and training examples D, the classification c(x) of any instance x can be logically inferred from B, D, and x
- For a rote learner, B = {} and there is no IB
- Difficult to apply in many cases, but a useful guide
Inductive Bias and Specific Learning Algorithms
- Rote learner:
  - no IB
- Version space candidate elimination (CE) algorithm:
  - the target concept c can be represented in H
- Find-S:
  - the target concept c can be represented in H;
  - all instances that are not positive are negative.
Computational Complexity of VS
- The S set for conjunctive feature vectors:
  - linear in the number of features and the number of training examples
- The G set for conjunctive feature vectors:
  - exponential in the number of training examples
- In more expressive languages, both S and G can grow exponentially
- The order in which examples are processed can significantly affect computational complexity
Size of S and G?
- n Boolean attributes
- 1 positive example: (T, T, ..., T)
- n/2 negative examples, each setting a disjoint pair of attributes to F:
  (F, F, T, ..., T)
  (T, T, F, F, T, ..., T)
  (T, T, T, T, F, F, T, ..., T)
  ...
  (T, ..., T, F, F)
- Boundary sets while processing the negatives:
  G0: (?, ?, ..., ?)
  G1: (T, ?, ..., ?), (?, T, ?, ..., ?)
  G2: (T, ?, T, ?, ..., ?), (T, ?, ?, T, ..., ?), (?, T, T, ?, ..., ?), (?, T, ?, T, ..., ?)
- |S| = 1
- |G| = 2^(n/2)  (verified in the sketch below)
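A small sanity check of the 2^(n/2) count (illustrative): for this construction, choosing one attribute from each flipped pair and constraining it to T gives a maximally general hypothesis consistent with all examples.

    from itertools import product

    n = 8
    positive = ("T",) * n
    negatives = []
    for i in range(0, n, 2):                  # flip attribute pairs (i, i+1) to F
        x = ["T"] * n
        x[i] = x[i + 1] = "F"
        negatives.append(tuple(x))

    def matches(h, x):
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    G = []
    for choice in product(*[(i, i + 1) for i in range(0, n, 2)]):
        h = ["?"] * n
        for i in choice:                      # one "T" constraint per pair
            h[i] = "T"
        if matches(h, positive) and not any(matches(h, x) for x in negatives):
            G.append(tuple(h))

    print(len(G), 2 ** (n // 2))              # 16 16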
Probably Approximately Correct (PAC) Learning Model
- A hypothesis h consistent with the training data of concept c:
  - training error: 0
  - true error ≠ 0 in general:
    error_D(h) = P_{x~D}(c(x) ≠ h(x))
  - here D is the population (instance) distribution
(Figure: the region where c(x) ≠ h(x) contributes to the true error.)
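A tiny illustration (assumed concept, hypothesis and distribution, not from the slides) of zero training error coexisting with nonzero true error, estimated by sampling:

    import random

    random.seed(0)

    def c(x):          # true concept: x in [0, 0.6]
        return x <= 0.6

    def h(x):          # learned hypothesis: x in [0, 0.5]
        return x <= 0.5

    samples = [random.random() for _ in range(100_000)]    # D = Uniform(0, 1)
    true_error = sum(c(x) != h(x) for x in samples) / len(samples)
    print(round(true_error, 3))   # roughly 0.1 = P(0.5 < x <= 0.6)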
Probably Approximately Correct (PAC) Learning Model
- C: a concept class defined over a set of instances X of length n
- L: a learner using hypothesis space H
- C is PAC learnable by L using H if:
  - for every c ∈ C, every distribution D over X, and every 0 < ε < 1/2, 0 < δ < 1/2,
  - learner L, with probability at least (1 − δ), outputs a hypothesis h ∈ H such that error_D(h) < ε,
  - in time polynomial in 1/ε, 1/δ, n and size(c).
Sample Complexity of a Learner
- How many training samples are required to obtain a successful hypothesis with high probability?
- How many samples does a PAC learner need?
- Equivalently, how many samples are required so that every consistent hypothesis left in the version space has error bounded by ε?
- A version space VS_H,D for a target concept c with this property is called ε-exhausted.
Theorem of ε-exhausting the VS
- Given a finite H, a target concept c, and m training samples drawn independently at random to form the data D, for any 0 ≤ ε ≤ 1:
  P(VS_H,D is not ε-exhausted) ≤ |H| e^(−εm)
Proof:
A hypothesis h with error > ε appears consistent only if it labels all m samples correctly, which happens with probability at most (1 − ε)^m.
Let there be k ≤ |H| such hypotheses with error > ε.
The probability that at least one of them is consistent is at most k(1 − ε)^m.
Since 1 − ε ≤ e^(−ε), we get k(1 − ε)^m ≤ k e^(−εm) ≤ |H| e^(−εm).
Sample Complexity of a PAC Learner
- Require the probability of outputting a consistent hypothesis with error > ε to be at most δ.
- Hence, for a PAC learner:
  |H| e^(−εm) ≤ δ
  ⇒ m ≥ (1/ε)(ln|H| + ln(1/δ)) instances.
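A quick numeric check of this bound with illustrative parameters: the EnjoySport hypothesis space (|H| = 973), ε = 0.1 and δ = 0.05.

    from math import ceil, log

    H_size, eps, delta = 973, 0.1, 0.05
    m = ceil((1 / eps) * (log(H_size) + log(1 / delta)))
    print(m)   # 99 examples suffice for this (eps, delta)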
VC Dimension
- A set of instances is shattered by H if, for every possible dichotomy (labeling) of the set, some h ∈ H is consistent with it.
- VC(H): the size of the largest finite set of instances shattered by H.
- If H shatters a set of d instances, it must contain at least 2^d distinct hypotheses:
  ⇒ 2^d ≤ |H|.  Hence, d = VC(H) ≤ log2 |H|.
A Few Examples
- H = the set of intervals [a, b] on the real axis
  - h(x): 1 if x ∈ [a, b], else 0
- H shatters any pair of distinct points p < q:
  - e.g., [p−2, p−1], [p−ε, p+ε], [q−ε, q+ε], [p, q] realize the four dichotomies (for ε < q − p)
  - the existence of one such hypothesis per dichotomy is sufficient
  - the existence of one shattered pair of points is sufficient
- For any three points p < q < r:
  - no h realizes the dichotomy {p, r | q}, since an interval containing p and r also contains q
- VC(H) = 2. (A brute-force check follows.)
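A brute-force check (illustrative) that intervals shatter two points but not three: it tests whether every dichotomy of a point set is realizable as "exactly the points inside some interval", i.e. whether the chosen subset is contiguous in sorted order.

    from itertools import combinations

    def shatters(points):
        points = sorted(points)
        for k in range(len(points) + 1):
            for subset in combinations(points, k):
                realizable = (k == 0) or all(
                    (p in subset) == (subset[0] <= p <= subset[-1]) for p in points)
                if not realizable:
                    return False
        return True

    print(shatters([1.0, 2.0]))        # True  -> VC >= 2
    print(shatters([1.0, 2.0, 3.0]))   # False -> the {1.0, 3.0} dichotomy fails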
(Figure: an example hypothesis class with VC(H) = 3. Source: Dietterich, T. G. 2003. "Machine Learning." In Nature Encyclopedia of Cognitive Science. London: Macmillan.)
Model Selection
- Empirical choice of model complexity
  - number of parameters
  - degree of a polynomial for regression
- Divide the input into 3 sets: training, validation and test
- Increase model complexity while keeping training and validation error low
- May adopt cross-validation
- Check the generalization error on the test set
- There exist other information-theoretic / likelihood-ratio based approaches
(A small validation-based selection sketch follows.)
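A minimal sketch of validation-based model selection for polynomial regression (synthetic data and candidate degrees are illustrative assumptions); it uses numpy's polyfit/polyval:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 60)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)      # noisy target

    # Split the data into training / validation / test sets.
    idx = rng.permutation(x.size)
    tr, va, te = idx[:30], idx[30:45], idx[45:]

    def val_error(degree):
        coeffs = np.polyfit(x[tr], y[tr], degree)               # fit on training data
        return float(np.mean((np.polyval(coeffs, x[va]) - y[va]) ** 2))

    best_degree = min(range(1, 10), key=val_error)               # lowest validation MSE
    coeffs = np.polyfit(x[tr], y[tr], best_degree)
    test_mse = float(np.mean((np.polyval(coeffs, x[te]) - y[te]) ** 2))
    print(best_degree, round(test_mse, 4))                       # generalization check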
Summary
- Concept learning as search through H
- General-to-specific ordering over H
- Version space candidate elimination algorithm
- S and G boundaries characterize the learner's uncertainty
- The learner can generate useful queries
- Inductive leaps are possible only if the learner is biased!
- Inductive learners can be modeled as equivalent deductive systems
- Concept learning algorithms as presented are unable to handle data with errors
- A hypothesis space allows forming a hypothesis with low training error
- Generalization error is indicated by test error
- Number of labelled samples vs. generalization error (sample complexity)