AIML IMPROVEMENT TEST DOC
1. Write candidate elimination algorithm. Apply the algorithm to obtain the final
version space for the training example,
Sl.No.  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1       Sunny  Warm     Normal    Strong  Warm   Same      Yes
https://youtu.be/40D3G_cCtWc
The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the
set of all hypotheses in H:
Initialize the G boundary set to contain the most general hypothesis in H:
G_0 \leftarrow \{\langle ?, ?, ?, ?, ?, ? \rangle\}
Initialize the S boundary set to contain the most specific (least general) hypothesis:
S_0 \leftarrow \{\langle \emptyset, \emptyset, \emptyset, \emptyset, \emptyset, \emptyset \rangle\}
Given that there are six attributes that could be specified to specialize G2, why are
there only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2
that correctly labels the new example as a negative example, but it is not included in
G3. The reason this hypothesis is excluded is that it is inconsistent with the previously
encountered positive examples
• Consider the fourth training example.
• This positive example further generalizes the S boundary of the version space.
It also results in removing one member of the G boundary, because this member
fails to cover the new positive example
• After processing these four examples, the boundary sets S4 and G4 delimit the
version space of all hypotheses consistent with the set of incrementally observed
training examples
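The boundary-set updates described above can be sketched in code. The following is a minimal Python sketch (not the full algorithm as usually presented) for conjunctive hypotheses, where '?' matches any value and '0' marks the maximally specific hypothesis; the attribute domains below are assumptions for illustration, and only the first row of the table is taken from the source.

```python
# Minimal sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses.
# '?' matches any value; '0' marks the maximally specific (empty) hypothesis.
# The attribute domains used in the example run are assumptions.

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def generalize(s, x):
    """Minimally generalize specific hypothesis s to cover positive instance x."""
    return tuple(xi if si == '0' else (si if si == xi else '?')
                 for si, xi in zip(s, x))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = tuple('0' for _ in range(n))      # most specific boundary
    G = [tuple('?' for _ in range(n))]    # most general boundary
    for x, label in examples:
        if label == 'Yes':
            G = [g for g in G if covers(g, x)]     # drop inconsistent g
            S = generalize(S, x)
        else:
            no_positive_yet = all(si == '0' for si in S)
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g so it no longer covers x, keeping
                # only specializations consistent with the positives seen (S)
                for i in range(n):
                    if g[i] == '?':
                        for v in domains[i]:
                            if v != x[i]:
                                h = g[:i] + (v,) + g[i + 1:]
                                if no_positive_yet or covers(h, S):
                                    new_G.append(h)
            G = new_G
    return S, G
```

This also illustrates the point made above: a minimal specialization of G is kept only if it remains consistent with the previously seen positive examples, which is enforced by the `covers(h, S)` check.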
No.  Refund  MaritalStatus  TaxableIncome  Class
2    No      Married        100K           No
3    No      Single         70K            No
6    No      Married        60K            No
7    Yes     Divorced       220K           No
9    No      Married        75K            No
Attributes
Refund → {Yes, No}
MaritalStatus → {Single, Married, Divorced}
TaxableIncome → {continuous-valued}
Refer Old QP Problem Solutions
4. Draw the perceptron network with the notation. Derive an equation of gradient
descent rule to minimize the error.
Perceptron
Figure 2: A Perceptron
A perceptron takes a vector of real-valued inputs, calculates a linear combination of
these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise.
Given inputs x_1 through x_n, the output O(x_1, \ldots, x_n) computed by the perceptron is

O(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})

Where, \mathrm{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases}

• w_i → real-valued constant, or weight → contribution of input x_i to the perceptron output.
• w_0 → threshold that the weighted combination of inputs must surpass in order for the perceptron to output a 1.

The space H of candidate hypotheses is the set of all possible real-valued weight vectors:

H = \{ \vec{w} \mid \vec{w} \in \mathbb{R}^{n+1} \}
The training error E of the linear unit is defined as

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

To derive the gradient descent rule, differentiate E with respect to each weight w_i:

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

= \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2

= \frac{1}{2} \sum_{d \in D} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)

= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id}) ---(6)

Substituting (6) into the gradient descent update \Delta w_i = -\eta \frac{\partial E}{\partial w_i} gives

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}
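The derived update rule can be exercised with a short sketch. The toy data, learning rate, epoch count, and bias encoding (a constant input x_0 = 1) below are illustrative assumptions, not part of the derivation.

```python
# Batch gradient descent for a linear unit o = w . x, minimizing
# E(w) = 1/2 * sum_d (t_d - o_d)^2 via  delta_w_i = eta * sum_d (t_d - o_d) * x_id
def gradient_descent(X, t, eta=0.05, epochs=500):
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, target in zip(X, t):
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                delta[i] += eta * (target - o) * x[i]  # accumulate the update
        w = [wi + di for wi, di in zip(w, delta)]      # apply once per epoch
    return w

# Toy data: x_0 = 1 is the bias input, targets follow t = 2*x_1 + 1
X = [(1, 0), (1, 1), (1, 2), (1, 3)]
t = [1, 3, 5, 7]
w = gradient_descent(X, t)   # converges toward w = [1, 2]
```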
5. Explain the importance of the terms: (i) Hidden Layer (ii) Generalization (iii)
Overfitting (iv) Stopping criterion
i. Hidden layer
- A hidden layer is a layer of sigmoid units that lies between the network's inputs and outputs.
- One intriguing property of BACKPROPAGATION is its ability to discover
useful intermediate representations at the hidden unit layers inside the
network.
- Because training examples constrain only the network inputs and outputs, the
weight-tuning procedure is free to set weights that define whatever hidden
unit representation is most effective at minimizing the squared error E.
- This can lead BACKPROPAGATION to define new hidden layer features
that are not explicit in the input representation, but which capture properties
of the input instances that are most relevant to learning the target function
Figure (1): An 8 × 3 × 8 network.
• Consider figure (1). Here, the eight network inputs are connected to three
hidden units, which are in turn connected to the eight output units.
Because of this structure, the three hidden units will be forced to re-
represent the eight input values in some way that captures their relevant
features, so that this hidden layer representation can be used by the output
units to compute the correct target values.
• Consider training the network shown in Figure (1) to learn the simple
target function f(\vec{x}) = \vec{x}, where \vec{x} is a vector containing seven 0's and a
single 1. The network must learn to reproduce the eight inputs at the
corresponding eight output units. Although this is a simple function, the
network in this case is constrained to use only three hidden units.
Therefore, the essential information from all eight input units must be
captured by the three learned hidden units.
ii. Generalization
• It means how good our model is at learning from the given data and
applying the learnt information elsewhere.
• When training a neural network, there’s going to be some data which the
Neural Network trains on, and there’s going to be some data reserved for
checking the performance of the Neural Network.
• If the Neural Network performs well on the data which it has not trained
on, we can say it has generalized well on the given data.
• To see the dangers of minimizing the error over the training data, consider
how the error E varies with the number of weight iterations. Figure below
shows this variation for two fairly typical applications of
BACKPROPAGATION.
Figure: error E versus number of weight iterations for two typical applications, (A) and (B).
The lower of the two lines shows the monotonically decreasing error E over the
training set, as the number of gradient descent iterations grows. The upper line shows
the error E measured over a different validation set of examples, distinct from the
training examples. This line measures the generalization accuracy of the network, i.e., the
accuracy with which it fits examples beyond the training data.
The error measured over the validation examples typically first decreases, then
increases, even as the error over the training examples continues to decrease.
This occurs because the weights are being tuned to fit idiosyncrasies of the training
examples that are not representative of the general distribution of examples. The large
number of weight parameters in ANNs provides many degrees of freedom for fitting
such idiosyncrasies.
iii. Overfitting
Consider that network weights are initialized to small random values.
- With weights of nearly identical value, only very smooth decision surfaces
are describable.
- As training proceeds, some weights begin to grow in order to reduce the error
over the training data, and the complexity of the learned decision surface
increases. Thus, the effective complexity of the hypotheses that can be
reached by BACKPROPAGATION increases with the number of weight-
tuning iterations.
- Given enough weight-tuning iterations, BACKPROPAGATION will often
be able to create overly complex decision surfaces that fit noise in the training
data or unrepresentative characteristics of the particular training sample.
- This overfitting problem is analogous to the overfitting problem in decision
tree learning.
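A common stopping criterion that addresses this overfitting problem is early stopping: keep the weights that performed best on a held-out validation set, and stop when validation error no longer improves. The sketch below illustrates only the control logic; `train_step` and `validation_error` are hypothetical hooks standing in for a real training loop.

```python
# Sketch of early stopping as a stopping criterion. `train_step` performs one
# weight-tuning iteration; `validation_error` returns the current error over a
# held-out validation set. Both are placeholders for a real training loop.
def train_with_early_stopping(train_step, validation_error,
                              max_iters=1000, patience=10):
    best_err = float('inf')
    best_iter = 0
    stale = 0
    for it in range(max_iters):
        train_step()                      # one weight-tuning iteration
        err = validation_error()
        if err < best_err:                # new best weights on validation set
            best_err, best_iter, stale = err, it, 0
        else:
            stale += 1
            if stale >= patience:         # validation error stopped improving
                break
    return best_iter, best_err
```

With the U-shaped validation curve described above, this returns the iteration at the bottom of the U rather than training to convergence on the training set.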
2 Green 2 Tall No M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
To Prove: Maximum likelihood hypothesis hML minimizes the sum of the squared
errors between the observed training values di and the hypothesis predictions h(xi)
Proof:
We know that
h_{ML} \equiv \underset{h \in H}{\mathrm{argmax}}\, P(D|h) ---(1)
Assumptions
Fixed set of training instances ⟨x1 … xm ⟩
D→ corresponding sequence of target values D=⟨d1 … dm ⟩
di = f(xi) + ei
Training examples are mutually independent given h → P(D|h) → product of various
p(di|h)
h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} p(d_i|h)

Assuming the noise e_i is drawn from a Normal distribution with zero mean and variance \sigma^2,

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}

It is easier to work with the (monotonic) logarithm of this expression:

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right)

The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2
Maximizing this negative quantity is equivalent to minimizing the
corresponding positive quantity
h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i - h(x_i))^2

Finally, the constant factor \frac{1}{2\sigma^2} is independent of h and can also be discarded:

h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} (d_i - h(x_i))^2
Above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values di and the
hypothesis predictions h(xi).
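The equivalence just proved can be checked numerically: ranking candidate hypotheses by Gaussian log-likelihood must agree with ranking them by sum of squared errors. The data points and the three candidate hypotheses below are illustrative, not from the source.

```python
import math

# Numeric check: under Gaussian noise, the hypothesis with the highest
# log-likelihood ln p(D|h) is also the one with the smallest SSE.
def log_likelihood(h, xs, ds, sigma=1.0):
    return sum(math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2))
               - (d - h(x)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

def sse(h, xs, ds):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 2.1, 3.9, 6.2]                  # noisy observations of f(x) = 2x
hypotheses = [lambda x: 2 * x, lambda x: x, lambda x: 3 * x]
best_by_ll = max(hypotheses, key=lambda h: log_likelihood(h, xs, ds))
best_by_sse = min(hypotheses, key=lambda h: sse(h, xs, ds))
```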
9. Explain the derivation of K-Means algorithm.
Derivation of k-Means Algorithm
k-means problem→ to estimate the parameters 𝜃 = ⟨𝜇1 … 𝜇𝑘 ⟩ that define the
means of k Normal distributions.
Given,
X = {⟨x_i⟩} → observed data
Z = {⟨z_i1, …, z_ik⟩} → unobserved indicator variables, where z_ij indicates which of
the k Normal distributions was used to generate x_i.
To apply EM algorithm → derive an expression for Q(h’|h)
Derive an expression for p(Y|h’)
The probability p(yi|h’) of a single instance yi = ⟨𝑥𝑖 , 𝑧𝑖1 , … , 𝑧𝑖𝑘 ⟩ of the full data
can be written as
p(y_i|h') = p(x_i, z_{i1}, \ldots, z_{ik}|h') = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2}
Here, only one of the z_ij can have the value 1, and all others must be 0.
Given this probability for a single instance 𝑝(𝑦𝑖 |ℎ′ ),the logarithm of the
probability ln P(Y|h’) for all m instances in the data is
\ln P(Y|h') = \ln \prod_{i=1}^{m} p(y_i|h') = \sum_{i=1}^{m} \ln p(y_i|h')

\ln P(Y|h') = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2 \right)
That is, ln P(Y|h’) is a linear function of zij.
In general, for any function f (z) that is a linear function of z, the following
equality holds
E[f(z)] = f(E[z])
And also
E[\ln P(Y|h')] = E\left[ \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2 \right) \right]

= \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2 \right)
Therefore, the function Q(h’|h) for the k means problem is
Q(h'|h) = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2 \right)
Where, ℎ′ = ⟨𝜇′1 , … , 𝜇′𝑘 ⟩
E [zij] → calculated based on current hypothesis h and observed data X.
From k-means Gaussians
E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}} ---(9)
Thus,
• The first (estimation) step of the EM algorithm defines the Q function
based on the estimated E[zij] terms.
• The second (maximization) step then finds the values 𝜇′1 , … , 𝜇′𝑘 that
maximize this Q function.
In the current case
\underset{h'}{\mathrm{argmax}}\, Q(h'|h) = \underset{h'}{\mathrm{argmax}} \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2 \right)

= \underset{h'}{\mathrm{argmin}} \sum_{i=1}^{m} \sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2 ---(10)
Therefore,
The maximum likelihood hypothesis here minimizes a weighted sum of
squared errors, where the contribution of each instance xi to the error that
defines 𝜇𝑗′ is weighted by E[zij] .
The quantity given by Equation (10) is minimized by setting each 𝜇𝑗′ to
the weighted sample mean
\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} ---(11)
Eq. 10 & 11 → Two steps in the k-means algorithm
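The two steps can be sketched directly from Eqs. (9) and (11). This is a minimal 1-D sketch assuming the variance σ² is known and shared by all k distributions; the data points and initial means are illustrative.

```python
import math

# Sketch of the two EM steps for the k-means problem:
#   E-step: compute E[z_ij] per Eq. (9)
#   M-step: set each mu_j to the weighted sample mean per Eq. (11)
def em_kmeans(xs, mus, sigma=1.0, iters=50):
    k = len(mus)
    m = len(xs)
    for _ in range(iters):
        # E-step: expected value of each indicator variable z_ij
        E = []
        for x in xs:
            ws = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            total = sum(ws)
            E.append([w / total for w in ws])
        # M-step: weighted sample means
        mus = [sum(E[i][j] * xs[i] for i in range(m)) /
               sum(E[i][j] for i in range(m))
               for j in range(k)]
    return mus
```

On two well-separated groups of points, the means converge to the per-group averages, since the E[z_ij] become essentially 0/1 weights.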
10.Explain locally weighted linear regression.
LWR is
• LOCAL because nearby or distance-weighted training examples are used to form
the local approximation to f
• WEIGHTED because the contribution of each training example is weighted by its
distance from the query point
• REGRESSION because this is the term used widely in the statistical learning
community for the problem of approximating real-valued functions.
The general approach in LWR:
Given,
xq →new query instance
Approach→ construct an approximation 𝑓̂ that fits the training examples in the
neighborhood surrounding xq.
→ use the approximation to calculate 𝑓̂(𝑥𝑞 )
Here,
𝑓̂(𝑥𝑞 ) → output as the estimated target value for the query instance.
𝑓̂ need not be retained, since a different local approximation will be calculated for each
distinct query instance.
Locally Weighted Linear Regression
Consider,
A case of locally weighted regression in which the target function f is approximated
near xq using a linear function of the form
\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)
From Gradient Descent rule,
E \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 ---(5)

And

\Delta w_j = \eta \sum_{x \in D} (f(x) - \hat{f}(x))\, a_j(x) ---(6)
How shall we modify this procedure to derive a local approximation rather than a
global one?
Ans: Redefine the error criterion E to emphasize fitting the local training examples
There are 3 possible criteria:
Let E(xq) → error is being defined as a function of the query point xq.
1. Minimize the squared error over just the k nearest neighbors:
E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k\ \text{nearest nbrs of}\ x_q} (f(x) - \hat{f}(x))^2
2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq
E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2\, K(d(x_q, x))
3. Combine 1 and 2
E_3(x_q) \equiv \frac{1}{2} \sum_{x \in k\ \text{nearest nbrs of}\ x_q} (f(x) - \hat{f}(x))^2\, K(d(x_q, x))
Criterion 3 is the best approach. If criterion 3 is used and the gradient descent rule in
Eq. (6) is re-derived, we get the following training rule:

\Delta w_j = \eta \sum_{x \in k\ \text{nearest nbrs of}\ x_q} K(d(x_q, x))\, (f(x) - \hat{f}(x))\, a_j(x) ---(7)
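The weighted-error criterion can also be minimized in closed form for a 1-D linear model, which makes for a compact sketch. This uses weighted least squares rather than the gradient rule (7), and the Gaussian kernel K, its width tau, and the toy data are illustrative assumptions.

```python
import math

# Sketch of locally weighted linear regression for one query point x_q:
# fit f_hat(x) = w0 + w1*x by weighted least squares, with each example
# weighted by a Gaussian kernel K(d(x_q, x)) of (assumed) width tau.
def lwr_predict(xs, ys, x_q, tau=1.0):
    # kernel weights K(d(x_q, x)) — decreasing in distance from x_q
    w = [math.exp(-(x - x_q) ** 2 / (2 * tau ** 2)) for x in xs]
    # closed-form weighted least squares for a 1-D line
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * sxx - sx * sx
    w1 = (sw * sxy - sx * sy) / denom
    w0 = (sy - w1 * sx) / sw
    return w0 + w1 * x_q      # local approximation evaluated at the query
```

Note that the fitted (w0, w1) are discarded after each call, matching the point above that a different local approximation is built for each query instance.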
For the BACKPROPAGATION derivation, the error E_d on a single training example d is
defined as the sum over all output units:

E_d(\vec{w}) \equiv \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2
Here outputs is the set of output units in the network, tk is the target value of unit k
for training example d, and ok is the output of unit k given training example d.
The derivation of the stochastic gradient descent rule is conceptually straightforward,
but requires keeping track of a number of subscripts and variables
Derive an expression for \frac{\partial E_d}{\partial w_{ji}} in order to implement the stochastic gradient descent update

\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} ---(a)
We know that the weight w_{ji} can influence the rest of the network only through net_j.
Therefore, we can apply the chain rule:

\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji} ---(1)

Case 1: Training Rule for Output Unit Weights
net_j can influence the network only through o_j, so

\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} ---(2)

The derivative \frac{\partial (t_k - o_k)^2}{\partial o_j} will be zero for all output units k except when k = j. Therefore,

\frac{\partial E_d}{\partial o_j} = \frac{1}{2} \cdot 2 (t_j - o_j) \frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j) ---(3)

Since o_j = \sigma(net_j) for a sigmoid unit,

\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) ---(4)
Combining (a), (1), (3), and (4), we get the stochastic gradient descent rule for output units:

\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j}\, x_{ji}

\Delta w_{ji} = -\eta \left[ -(t_j - o_j)\, o_j (1 - o_j) \right] x_{ji}

\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji} ---(6)
Case 2: Training Rule for Hidden Unit Weights
j→ hidden/internal unit
To derive training rule for wji , consider the indirect ways wji can influence the network and
in turn the Ed.
Refer to all the units immediately downstream of unit j in the network. →
Downstream(j)
netj can influence the network outputs only through units in Downstream(j). Therefore,
\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j}

= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j}

net_k can influence the network only through the outputs of the units in Downstream(j).
Hence, applying the chain rule again,

= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}

= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j}

= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)

Using \delta_j to denote -\frac{\partial E_d}{\partial net_j}, we get

\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}

and the weight update for hidden units is \Delta w_{ji} = \eta\, \delta_j\, x_{ji}
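Both delta rules can be sketched together for a tiny network. This is a minimal sketch for an assumed 2-2-1 sigmoid network with no bias terms; the initial weights and the single training pair used to exercise it are illustrative.

```python
import math

# One BACKPROPAGATION update for a 2-input, 2-hidden, 1-output sigmoid
# network, using the delta rules derived above:
#   output unit:  delta_o = (t - o) * o * (1 - o)
#   hidden unit:  delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, t, W_h, W_o, eta=0.5):
    # forward pass (no bias terms, for brevity)
    h = [sigmoid(sum(wji * xi for wji, xi in zip(row, x))) for row in W_h]
    o = sigmoid(sum(wk * hj for wk, hj in zip(W_o, h)))
    # delta for the single output unit
    delta_o = (t - o) * o * (1 - o)
    # deltas for hidden units: Downstream(j) is just the output unit here
    delta_h = [hj * (1 - hj) * delta_o * W_o[j] for j, hj in enumerate(h)]
    # weight updates: delta_w_ji = eta * delta_j * x_ji
    W_o = [wk + eta * delta_o * hj for wk, hj in zip(W_o, h)]
    W_h = [[wji + eta * delta_h[j] * xi for wji, xi in zip(row, x)]
           for j, row in enumerate(W_h)]
    return W_h, W_o, o
```

Repeating the step on a fixed training pair drives the output toward the target, which is a quick sanity check that the signs in the delta rules are right.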
12.Prove that Maximum likelihood hypothesis hML minimizes the sum of the squared
errors between the observed training values di and the hypothesis predictions h(xi)
To Prove: Maximum likelihood hypothesis hML minimizes the sum of the squared errors
between the observed training values di and the hypothesis predictions h(xi)
h_{ML} \equiv \underset{h \in H}{\mathrm{argmax}}\, P(D|h) ---(3)
Proof: From equation (3) we have
Deriving the maximum likelihood hypothesis starting with our earlier definition of hML,
but using lower case p to refer to the probability density
h_{ML} = \underset{h \in H}{\mathrm{argmax}}\, p(D|h)
Assumptions
Fixed set of training instances ⟨x1 … xm ⟩
D→ corresponding sequence of target values D=⟨d1 … dm ⟩
di = f(xi) + ei
Training examples are mutually independent given h → P(D|h) → product of various
p(di|h)
h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} p(d_i|h)

Assuming the noise e_i is drawn from a Normal distribution with zero mean and variance \sigma^2,

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}

It is easier to work with the (monotonic) logarithm of this expression:

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right)

The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding

h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2
Maximizing this negative quantity is equivalent to minimizing the
corresponding positive quantity
h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i - h(x_i))^2

Finally, the constant factor \frac{1}{2\sigma^2} is independent of h and can also be discarded:

h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} (d_i - h(x_i))^2
Above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values di and
the hypothesis predictions h(xi).