AIML IMPROVEMENT TEST DOC

AIML IMPROVEMENT TEST

1. Write the candidate elimination algorithm. Apply the algorithm to obtain the final
version space for the following training examples.

Sl.No.  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1       Sunny  Warm     Normal    Strong  Warm   Same      Yes
2       Sunny  Warm     High      Strong  Warm   Same      Yes
3       Rainy  Cool     High      Strong  Warm   Change    No
4       Sunny  Warm     High      Strong  Cool   Change    Yes
https://youtu.be/40D3G_cCtWc
The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the
set of all hypotheses in H:
Initialize the G boundary set to contain the most general hypothesis in H
    G0 = ⟨?, ?, ?, ?, ?, ?⟩
Initialize the S boundary set to contain the most specific (least general) hypothesis
    S0 = ⟨∅, ∅, ∅, ∅, ∅, ∅⟩

When the first training example is presented, the CANDIDATE-ELIMINATION
algorithm checks the S boundary and finds that it is overly specific: it fails to cover
the positive example.
The S boundary is therefore revised by moving it to the minimally more general hypothesis
that covers this new example.
No update of the G boundary is needed in response to this training example, because
G0 correctly covers this example.

• When the second training example is observed, it has a similar effect of
generalizing S further to S2, leaving G again unchanged, i.e., G2 = G1 = G0.
Consider the third training example. This negative example reveals that the G boundary
of the version space is overly general, that is, the hypothesis in G incorrectly predicts
that this new example is a positive example.
The hypothesis in the G boundary must therefore be specialized until it correctly
classifies this new negative example

Given that there are six attributes that could be specified to specialize G2, why are
there only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2
that correctly labels the new example as a negative example, but it is not included in
G3. The reason this hypothesis is excluded is that it is inconsistent with the previously
encountered positive examples
• Consider the fourth training example.

• This positive example further generalizes the S boundary of the version space.
It also results in removing one member of the G boundary, because this member
fails to cover the new positive example
• After processing these four examples, the boundary sets S4 and G4 delimit the
version space of all hypotheses consistent with the set of incrementally observed
training examples
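The trace above can be checked mechanically. Below is a minimal, illustrative Python sketch of the CANDIDATE-ELIMINATION boundary updates for this EnjoySport data; the helper names (covers, min_generalize, min_specializations) and the simplified boundary handling are my own assumptions, not part of the original algorithm statement.

```python
# Minimal CANDIDATE-ELIMINATION sketch for conjunctive hypotheses over the
# EnjoySport data. '?' = any value; None = the empty (most specific) constraint.
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cool", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
N = 6  # number of attributes

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def min_generalize(s, x):
    """Minimally generalize S so that it covers the positive example x."""
    return tuple(xv if sv is None else (sv if sv == xv else "?")
                 for sv, xv in zip(s, x))

def min_specializations(g, s, x):
    """Minimal specializations of g that exclude negative x and stay above S."""
    specs = []
    for i in range(N):
        if g[i] == "?" and s[i] not in (None, "?", x[i]):
            specs.append(g[:i] + (s[i],) + g[i + 1:])
    return specs

S = tuple([None] * N)    # S0: most specific hypothesis
G = [tuple(["?"] * N)]   # G0: most general hypothesis

for x, positive in EXAMPLES:
    if positive:
        G = [g for g in G if covers(g, x)]          # drop inconsistent G members
        S = min_generalize(S, x)
    else:
        G = ([spec for g in G if covers(g, x)       # specialize over-general members
              for spec in min_specializations(g, S, x)]
             + [g for g in G if not covers(g, x)])

print("S4 =", S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print("G4 =", G)   # [('Sunny', '?', ...), ('?', 'Warm', ...)]
```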

2. List the issues of decision tree learning.


Avoiding overfitting the data
Reduced-Error Pruning
In reduced-error pruning, each of the decision nodes in the tree is considered a
candidate for pruning.
Pruning a decision node consists of removing the subtree rooted at that node,
making it a leaf node, and assigning it the most common classification of the
training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the
original over the validation set.
Reduced error pruning has the effect that any leaf node added due to coincidental
regularities in the training set is likely to be pruned because these same coincidences
are unlikely to occur in the validation set
Rule Post-Pruning
Rule post-pruning involves the following steps:
Infer the decision tree from the training set, growing the tree until the training data is
fit as well as possible and allowing overfitting to occur.
Convert the learned tree into an equivalent set of rules by creating one rule for each
path from the root node to a leaf node.
Prune (generalize) each rule by removing any preconditions that result in improving its
estimated accuracy.
Sort the pruned rules by their estimated accuracy, and consider them in this sequence
when classifying subsequent instances.
Incorporating Continuous-Valued Attributes
i. Define new discrete valued attributes that partition the continuous attribute
value into a discrete set of intervals.
E.g., {high ≡ Temp > 35 °C, med ≡ 10 °C < Temp ≤ 35 °C, low ≡ Temp ≤ 10 °C}
ii. Use thresholds for splitting nodes: a candidate threshold c on a continuous
attribute A produces the two subsets A ≤ c and A > c (a minimal sketch of choosing
such a threshold is given below).
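To illustrate point (ii), the following sketch picks an information-gain-maximizing threshold for a continuous attribute by testing candidate cut points between adjacent sorted values. The Temperature-style data and the helper names are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between adjacent distinct sorted values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        c = (pairs[i][0] + pairs[i + 1][0]) / 2            # candidate cut point
        left = [l for v, l in pairs if v <= c]
        right = [l for v, l in pairs if v > c]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (c, gain)
    return best

# Illustrative Temperature / PlayTennis-style data (not from the source).
temps = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # picks the cut point 54.0
```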
Alternative Measures for Selecting Attributes
o The problem: information gain favours attributes with many values, so Gain will tend to select them.
o Example: consider the attribute Date, which has a very large number of possible
values. (e.g., March 4, 1979).
o If this attribute is added to the PlayTennis data, it would have the highest
information gain of any of the attributes. This is because Date alone perfectly
predicts the target attribute over the training data. Thus, it would be selected as
the decision attribute for the root node of the tree and lead to a tree of depth one,
which perfectly classifies the training data.
o This decision tree with root node Date is not a useful predictor: although it
perfectly separates the training data, it predicts poorly on subsequent examples.
Handling Training Examples with Missing Attribute Values
• If node n tests A, assign the most common value of A among the other training examples
sorted to node n
• Assign most common value of A among other training examples with same
target value
• Assign a probability pi to each of the possible values vi of A rather than simply
assigning the most common value to A(x)

Handling Attributes with Differing Costs


• In some learning tasks the instance attributes may have associated costs.
• For example, in learning to classify medical diseases, the patients are described
in terms of attributes such as Temperature, BiopsyResult, Pulse,
BloodTestResults, etc.
• These attributes vary significantly in their costs, both in terms of
monetary cost and cost to patient comfort.
• We would prefer decision trees that use low-cost attributes where possible,
relying on high-cost attributes only when needed to produce reliable classifications.
3. Write and explain the decision tree for the following transactions.
Tid  Refund  MaritalStatus  TaxableIncome  Cheat
1    Yes     Single         125k           No
2    No      Married        100k           No
3    No      Single         70k            No
4    Yes     Married        120k           No
5    No      Divorced       95k            Yes
6    No      Married        60k            No
7    Yes     Divorced       220k           No
8    No      Single         85k            Yes
9    No      Married        75k            No
10   No      Single         90k            Yes

Attributes
Refund→ {Yes, No}
MaritalStatus→{Single, Married, Divorced}
TaxableIncome→{continuous-valued}
Refer Old QP Problem Solutions
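Since the full worked solution is in the old QP, the following is only a hypothetical sketch for cross-checking the root split: it computes the information gain of Refund and MaritalStatus over the ten transactions above (TaxableIncome would be handled with a threshold as described in Question 2). The helper names and the use of plain information gain are my assumptions.

```python
import math
from collections import Counter

RECORDS = [
    # (Refund, MaritalStatus, TaxableIncome, Cheat)
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """Information gain of the categorical attribute at attr_index w.r.t. Cheat."""
    labels = [r[3] for r in RECORDS]
    total = entropy(labels)
    for value in set(r[attr_index] for r in RECORDS):
        subset = [r[3] for r in RECORDS if r[attr_index] == value]
        total -= len(subset) / len(RECORDS) * entropy(subset)
    return total

print("Gain(Refund)        =", round(gain(0), 3))
print("Gain(MaritalStatus) =", round(gain(1), 3))
```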

4. Draw the perceptron network with the notation. Derive an equation of gradient
descent rule to minimize the error.
Perceptron

One type of ANN system is based on a unit called a perceptron. A perceptron is a
single-layer neural network.

Figure 2: A Perceptron
A perceptron takes a vector of real-valued inputs, calculates a linear combination of
these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise.
Given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is

$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$

Where,
• wi → real-valued constant, or weight → contribution of input xi to the perceptron output.
• (−w0) → the threshold that the weighted combination of inputs w1x1 + ⋯ + wnxn must surpass in order for the
perceptron to output a 1.

Sometimes, the perceptron function is written as

$o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$

Where,

$\mathrm{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases}$

Learning a perceptron involves choosing values for the weights w0, ..., wn.
Therefore, the space H of candidate hypotheses considered in perceptron learning is
the set of all possible real-valued weight vectors:

$H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{(n+1)}\}$

Derivation of Gradient Descent Rule


Compute the derivative of E with respect to each component of the vector $\vec{w}$; this is the
gradient of E with respect to $\vec{w}$, written $\nabla E(\vec{w})$:

$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$   ---(3)

When interpreted as a vector in weight space, the gradient specifies the direction
that produces the steepest increase in E. The negative of this vector therefore gives the
direction of steepest decrease.
The training rule for gradient descent is

$\vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \qquad \Delta\vec{w} = -\eta\, \nabla E(\vec{w})$   ---(4)

Where,
η → learning rate → step size in the gradient descent search
−ve sign → moves the weight vector in the direction that decreases E.
The training rule for gradient descent can also be written in component form as

$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$   ---(5)

The vector of derivatives $\frac{\partial E}{\partial w_i}$ that forms the gradient can be obtained by differentiating E, where

$E(\vec{w}) \equiv \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$

Then,

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 = \frac{1}{2}\sum_{d \in D}\frac{\partial}{\partial w_i}(t_d - o_d)^2 = \frac{1}{2}\sum_{d \in D} 2(t_d - o_d)\, \frac{\partial}{\partial w_i}(t_d - o_d)$

$= \sum_{d \in D}(t_d - o_d)\, \frac{\partial}{\partial w_i}(t_d - \vec{w}\cdot\vec{x}_d)$

$\frac{\partial E}{\partial w_i} = \sum_{d \in D}(t_d - o_d)(-x_{id})$   ---(6)

Substituting (6) into $\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$ gives

$\Delta w_i = \eta \sum_{d \in D}(t_d - o_d)\, x_{id}$   ---(7)
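A minimal sketch of the batch gradient descent update of Eq. (7) for a linear (unthresholded) unit follows; the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=200):
    """Batch gradient descent for a linear unit o = w . x (Eq. 7).

    X includes a leading column of 1s so that w[0] plays the role of w0."""
    w = np.zeros(X.shape[1])                 # initialize weights
    for _ in range(epochs):
        o = X @ w                            # outputs o_d for all training examples
        w += eta * X.T @ (t - o)             # delta w_i = eta * sum_d (t_d - o_d) x_id
    return w

# Illustrative data: targets generated by t = 1 + 2*x.
x = np.linspace(0, 1, 20)
X = np.c_[np.ones_like(x), x]                # prepend x0 = 1 for the bias weight
t = 1.0 + 2.0 * x
print(gradient_descent(X, t))                # approaches [1.0, 2.0]
```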

5. Explain the importance of the terms: (i) Hidden Layer (ii) Generalization (iii)
Overfitting (iv) Stopping criterion
i. Hidden layer
- A hidden layer is a layer of sigmoid units between the input and output layers of the network.
- One intriguing property of BACKPROPAGATION is its ability to discover
useful intermediate representations at the hidden unit layers inside the
network.
- Because training examples constrain only the network inputs and outputs, the
weight-tuning procedure is free to set weights that define whatever hidden
unit representation is most effective at minimizing the squared error E.
- This can lead BACKPROPAGATION to define new hidden layer features
that are not explicit in the input representation, but which capture properties
of the input instances that are most relevant to learning the target function
Figure (1): An 8 × 3 × 8 network (eight inputs, three hidden units, eight outputs).
• Consider figure (1). Here, the eight network inputs are connected to three
hidden units, which are in turn connected to the eight output units.
Because of this structure, the three hidden units will be forced to re-
represent the eight input values in some way that captures their relevant
features, so that this hidden layer representation can be used by the output
units to compute the correct target values.
• Consider training the network shown in Figure (1) to learn the simple
target function f(x) = x, where x is a vector containing seven 0's and a
single 1. The network must learn to reproduce the eight inputs at the
corresponding eight output units. Although this is a simple function, the
network in this case is constrained to use only three hidden units.
Therefore, the essential information from all eight input units must be
captured by the three learned hidden units.
ii. Generalization
• It means how good our model is at learning from the given data and
applying the learnt information elsewhere.
• When training a neural network, there’s going to be some data which the
Neural Network trains on, and there’s going to be some data reserved for
checking the performance of the Neural Network.
• If the Neural Network performs well on the data which it has not trained
on, we can say it has generalized well on the given data.
• To see the dangers of minimizing the error over the training data, consider
how the error E varies with the number of weight iterations. Figure below
shows this variation for two fairly typical applications of
BACKPROPAGATION.
Figure (A), (B): Error E versus number of weight updates for two typical applications of BACKPROPAGATION (training-set error and validation-set error).
The lower of the two lines shows the monotonically decreasing error E over the
training set, as the number of gradient descent iterations grows. The upper line shows
the error E measured over a different validation set of examples, distinct from the
training examples. This line measures the generalization accuracy of the network-the
accuracy with which it fits examples beyond the training data.
The generalization accuracy measured over the validation examples first
decreases, then increases, even as the error over the training examples continues to
decrease.
This occurs because the weights are being tuned to fit idiosyncrasies of the training
examples that are not representative of the general distribution of examples. The large
number of weight parameters in ANNs provides many degrees of freedom for fitting
such idiosyncrasies.
iii. Overfitting
Consider that network weights are initialized to small random values.
- With weights of nearly identical value, only very smooth decision surfaces
are describable.
- As training proceeds, some weights begin to grow in order to reduce the error
over the training data, and the complexity of the learned decision surface
increases. Thus, the effective complexity of the hypotheses that can be
reached by BACKPROPAGATION increases with the number of weight-
tuning iterations.
- Given enough weight-tuning iterations, BACKPROPAGATION will often
be able to create overly complex decision surfaces that fit noise in the training
data or unrepresentative characteristics of the particular training sample.
- This overfitting problem is analogous to the overfitting problem in decision
tree learning.

iv. Stopping criterion


- Weight Decay
• Decrease each weight by some small factor during each iteration.
• This is equivalent to modifying the definition of E to include a penalty
term corresponding to the total magnitude of the network weights.
• The motivation for this approach is to keep weight values small, to bias
learning against complex decision surfaces.
- Provide a set of validation data to the algorithm in addition to the training
data.
• The algorithm monitors the error with respect to this validation set,
while using the training set to drive the gradient descent search.
• In essence, this allows the algorithm itself to plot the two curves shown
in Figure(A) and (B).
- How many weight-tuning iterations should the algorithm perform
• It should use the number of iterations that produces the lowest error
over the validation set, since this is the best indicator of network
performance over unseen examples.
• Two copies of the network weights are kept: one copy for training and
a separate copy of the best-performing weights thus far, measured by
their error over the validation set.
• Once the trained weights reach a significantly higher error over the
validation set than the stored weights, training is terminated and the
stored weights are returned as the final hypothesis (a minimal sketch of this
early-stopping loop follows).
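A minimal sketch of that early-stopping loop, assuming generic train_step and error helpers supplied by the caller (they are placeholders, not functions defined in this material):

```python
import copy

def train_with_early_stopping(weights, train_step, error, train, val,
                              patience=20, max_iters=10000):
    """Keep two copies of the weights: the current ones and the best so far as
    measured on the validation set; stop once validation error has not
    improved for `patience` iterations and return the stored best weights."""
    best_weights = copy.deepcopy(weights)
    best_val_err = error(weights, val)
    since_best = 0
    for _ in range(max_iters):
        train_step(weights, train)            # one gradient-descent pass (updates weights in place)
        val_err = error(weights, val)
        if val_err < best_val_err:
            best_val_err = val_err
            best_weights = copy.deepcopy(weights)
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:        # validation error keeps rising
                break
    return best_weights
```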
6. The following table gives data set about stolen vehicles. Using Naïve Bayes
classifier classify the new data (Red, SUV, Domestic).
Color   Type    Origin    Stolen
Red     Sports  Domestic  Yes
Red     Sports  Domestic  No
Red     Sports  Domestic  Yes
Yellow  Sports  Domestic  No
Yellow  Sports  Imported  Yes
Yellow  SUV     Imported  No
Yellow  SUV     Imported  Yes
Yellow  SUV     Domestic  No
Red     SUV     Imported  No
Red     Sports  Imported  Yes

Refer Old QP Problem Solutions


Refer https://youtu.be/ptoHrKK3-Fo for sample procedure
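As a cross-check against the referred solution, here is a minimal Naive Bayes sketch for the table above. It uses plain relative-frequency estimates with no smoothing, which is an assumption on my part.

```python
DATA = [
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]

def score(query, cls):
    """Unnormalized P(cls) * product_i P(attribute_i = query_i | cls)."""
    rows = [r for r in DATA if r[-1] == cls]
    p = len(rows) / len(DATA)                      # prior P(cls)
    for i, value in enumerate(query):
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

query = ("Red", "SUV", "Domestic")
scores = {c: score(query, c) for c in ("Yes", "No")}
print(scores)      # {'Yes': 0.024, 'No': 0.072} -> classify (Red, SUV, Domestic) as 'No'
```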
7. Estimate conditional probabilities of each attributes {color, legs, height, smelly}
for the species classes: {M, H} using the data given in the table. Using these
probabilities estimate the probability values for the new instance- (color=Green,
legs=2, height =Tall, and smelly=No)
No.  Color  Legs  Height  Smelly  Species
1    White  3     Short   Yes     M
2    Green  2     Tall    No      M
3    Green  3     Short   Yes     M
4    White  3     Short   Yes     M
5    Green  2     Short   No      H
6    White  2     Tall    No      H
7    White  2     Tall    No      H
8    White  2     Short   Yes     H

Refer Old QP Problem Solutions


Refer https://youtu.be/ptoHrKK3-Fo for sample procedure
8. Prove how maximum likelihood (Bayesian learning) can be used in any learning
algorithm that minimizes the squared error between the observed output values and
the outputs predicted by the hypothesis. (6M) (JULY 19)

To Prove: Maximum likelihood hypothesis hML minimizes the sum of the squared
errors between the observed training values di and the hypothesis predictions h(xi)
Proof:
We know that

$h_{ML} \equiv \underset{h \in H}{\mathrm{argmax}}\; P(D|h)$   ---(1)

From equation (1), we derive the maximum likelihood hypothesis starting with our earlier
definition of hML, but using lower case p to refer to the probability density:

$h_{ML} = \underset{h \in H}{\mathrm{argmax}}\; p(D|h)$

Assumptions
Fixed set of training instances ⟨x1 ... xm⟩
D → corresponding sequence of target values, D = ⟨d1 ... dm⟩
di = f(xi) + ei
Training examples are mutually independent given h → P(D|h) → product of the individual
p(di|h):

$h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} p(d_i|h)$

ei obeys a Normal distribution with zero mean and unknown variance σ².
di must therefore also obey a Normal distribution with variance σ², centered around the true
target value f(xi) rather than zero.
Hence,
p(di|h) can be written as a Normal distribution with variance σ² and mean μ = f(xi).
Because we are writing the expression for the probability of di given that h is the
correct description of the target function f, we also substitute μ = f(xi) = h(xi),
yielding

$h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i-\mu)^2}$

$h_{ML} = \underset{h \in H}{\mathrm{argmax}} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i-h(x_i))^2}$

Rather than maximizing this complicated expression, we choose to
maximize its (less complicated) logarithm:

$h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i-h(x_i))^2$

The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding

$h_{ML} = \underset{h \in H}{\mathrm{argmax}} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i-h(x_i))^2$

Maximizing this negative quantity is equivalent to minimizing the
corresponding positive quantity:

$h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i-h(x_i))^2$

$h_{ML} = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} (d_i-h(x_i))^2$
Above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values di and the
hypothesis predictions h(xi).
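A small numerical illustration (not part of the proof): under Gaussian noise, ranking candidate hypotheses by log-likelihood produces the same ordering as ranking them by negative sum of squared errors, so the argmax and argmin coincide. The toy data, noise variance, and candidate hypotheses below are assumptions.

```python
import math

x = [0.0, 1.0, 2.0, 3.0]
d = [0.1, 1.2, 1.9, 3.2]             # noisy observations of f(x) = x
hypotheses = {"h(x)=x": lambda v: v,
              "h(x)=0.8x": lambda v: 0.8 * v,
              "h(x)=x+0.5": lambda v: v + 0.5}
sigma2 = 0.04                         # assumed noise variance

for name, h in hypotheses.items():
    sse = sum((di - h(xi)) ** 2 for xi, di in zip(x, d))
    loglik = sum(math.log(1 / math.sqrt(2 * math.pi * sigma2))
                 - (di - h(xi)) ** 2 / (2 * sigma2) for xi, di in zip(x, d))
    print(f"{name:12s}  SSE = {sse:.3f}   log-likelihood = {loglik:.2f}")
# The hypothesis with the smallest SSE also has the largest log-likelihood.
```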
9. Explain the derivation of K-Means algorithm.
Derivation of k-Means Algorithm
The k-means problem is to estimate the parameters θ = ⟨μ1, ..., μk⟩ that define the
means of k Normal distributions.
Given,
X = {⟨xi⟩} → the observed data
Z = {⟨zi1, ..., zik⟩} → hidden variables indicating which of the k Normal distributions was
used to generate xi
To apply the EM algorithm we must derive an expression for Q(h'|h); for this we first
derive an expression for p(Y|h').
The probability p(yi|h') of a single instance yi = ⟨xi, zi1, ..., zik⟩ of the full data
can be written as

$p(y_i|h') = p(x_i, z_{i1}, \ldots, z_{ik} \mid h') = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i-\mu_j')^2}$

Here, only one of the zij can have the value 1, and all others must be 0.
Given this probability for a single instance p(yi|h'), the logarithm of the
probability ln P(Y|h') for all m instances in the data is

$\ln P(Y|h') = \ln \prod_{i=1}^{m} p(y_i|h') = \sum_{i=1}^{m} \ln p(y_i|h')$

$\ln P(Y|h') = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i-\mu_j')^2 \right)$

That is, ln P(Y|h') is a linear function of the zij. In general, for any function f(z) that is
linear in z, the following equality holds:

E[f(z)] = f(E[z])

Applying this,

$E[\ln P(Y|h')] = E\left[\sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i-\mu_j')^2 \right)\right] = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i-\mu_j')^2 \right)$

Therefore, the function Q(h'|h) for the k-means problem is

$Q(h'|h) = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i-\mu_j')^2 \right)$

Where, h' = ⟨μ'1, ..., μ'k⟩, and E[zij] is calculated based on the current hypothesis h and the
observed data X. For the mixture of k Gaussians,

$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i-\mu_n)^2}}$   ---(9)

Thus,
• The first (estimation) step of the EM algorithm defines the Q function
based on the estimated E[zij] terms.
• The second (maximization) step then finds the values μ'1, ..., μ'k that
maximize this Q function.
In the current case,

$\underset{h'}{\mathrm{argmax}}\; Q(h'|h) = \underset{h'}{\mathrm{argmax}} \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i-\mu_j')^2 \right) = \underset{h'}{\mathrm{argmin}} \sum_{i=1}^{m} \sum_{j=1}^{k} E[z_{ij}]\,(x_i-\mu_j')^2$   ---(10)

Therefore,
The maximum likelihood hypothesis here minimizes a weighted sum of
squared errors, where the contribution of each instance xi to the error that
defines μ'j is weighted by E[zij].
The quantity given by Equation (10) is minimized by setting each μ'j to
the weighted sample mean

$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$   ---(11)

Equations (10) and (11) describe the two steps of the k-means (EM) algorithm; a minimal
sketch follows.
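A minimal sketch of the two EM steps just derived (Eqs. 9 and 11) for a mixture of k equal-variance Gaussians; the synthetic data, the fixed σ², and the quantile-based initialization are illustrative assumptions.

```python
import numpy as np

def em_k_means(x, k, sigma2=1.0, iters=50):
    """EM for the means of k equal-variance Gaussians (Eqs. 9 and 11)."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))      # spread-out initial means
    for _ in range(iters):
        # E-step (Eq. 9): expected membership E[z_ij] of each point in each Gaussian
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
        E_z = w / w.sum(axis=1, keepdims=True)
        # M-step (Eq. 11): move each mean to the E[z_ij]-weighted sample mean
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu

# Illustrative data: two clusters centered around -4 and +4.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(4, 1, 100)])
print(em_k_means(x, k=2))      # means close to [-4, 4]
```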
10.Explain locally weighted linear regression.
LWR is
• LOCAL because nearby or distance-weighted training examples are used to form
the local approximation to f
• WEIGHTED because the contribution of each training example is weighted by its
distance from the query point
• REGRESSION because this is the term used widely in the statistical learning
community for the problem of approximating real-valued functions.
The general approach in LWR:
Given,
xq →new query instance
Approach→ construct an approximation 𝑓̂ that fits the training examples in the
neighborhood surrounding xq.
→ use the approximation to calculate 𝑓̂(𝑥𝑞 )
Here,
f̂(xq) → the output, i.e., the estimated target value for the query instance.
f̂ need not be retained, as a different local approximation will be calculated for each
distinct query instance.
Locally Weighted Linear Regression
Consider a case of locally weighted regression in which the target function f is
approximated near xq using a linear function of the form

$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$

From the gradient descent rule,

$E \equiv \frac{1}{2}\sum_{x \in D}(f(x) - \hat{f}(x))^2$   ---(5)

and

$\Delta w_j = \eta \sum_{x \in D}(f(x) - \hat{f}(x))\, a_j(x)$   ---(6)
How shall we modify this procedure to derive a local approximation rather than a
global one?
Ans: Redefine the error criterion E to emphasize fitting the local training examples
There are 3 possible criteria:
Let E(xq) denote the error, now defined as a function of the query point xq.

1. Minimize the squared error over just the k nearest neighbors:

$E_1(x_q) \equiv \frac{1}{2}\sum_{x \in k\ \text{nearest nbrs of}\ x_q}(f(x) - \hat{f}(x))^2$

2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq:

$E_2(x_q) \equiv \frac{1}{2}\sum_{x \in D}(f(x) - \hat{f}(x))^2\, K(d(x_q, x))$

3. Combine 1 and 2:

$E_3(x_q) \equiv \frac{1}{2}\sum_{x \in k\ \text{nearest nbrs of}\ x_q}(f(x) - \hat{f}(x))^2\, K(d(x_q, x))$

Criterion 3 is the best approach. If criterion 3 is used and the gradient descent rule in
Eq. (6) is re-derived, we get the following training rule:

$\Delta w_j = \eta \sum_{x \in k\ \text{nearest nbrs of}\ x_q} K(d(x_q, x))\, (f(x) - \hat{f}(x))\, a_j(x)$   ---(7)
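A minimal sketch of locally weighted linear regression that solves criterion 2 in closed form (weighted least squares) instead of running the gradient rule of Eq. (7); the Gaussian kernel, bandwidth τ, and synthetic data are illustrative assumptions.

```python
import numpy as np

def lwr_predict(xq, X, y, tau=0.5):
    """Predict f_hat(xq) by weighted least squares with a Gaussian kernel
    K(d) = exp(-d^2 / (2 tau^2)), i.e. criterion 2 solved in closed form."""
    A = np.c_[np.ones(len(X)), X]                     # features a_0 = 1, a_1(x) = x
    K = np.exp(-((X - xq) ** 2) / (2 * tau ** 2))     # weight of each training point
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)     # local weights w_0, w_1
    return w[0] + w[1] * xq

# Illustrative data: noisy samples of a nonlinear target f(x) = sin(x).
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 80)
y = np.sin(X) + rng.normal(0, 0.1, 80)
print(lwr_predict(1.5, X, y))    # close to sin(1.5) ≈ 0.997
```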

11.Derive the Backpropagation rule


Derivation of the BACKPROPAGATION Rule
Notations
Deriving the stochastic gradient descent rule: stochastic gradient descent involves
iterating through the training examples one at a time, for each training example d
descending the gradient of the error Ed with respect to this single example. For each
training example d, every weight wji is updated by adding to it Δwji:

$\Delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}}$   ..(a)

Where,

$E_d(\vec{w}) \equiv \frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2$
Here outputs is the set of output units in the network, tk is the target value of unit k
for training example d, and ok is the output of unit k given training example d.
The derivation of the stochastic gradient descent rule is conceptually straightforward,
but requires keeping track of a number of subscripts and variables

• xji = the ith input to unit j


• wji = the weight associated with the ith input to unit j
• netj = Σi wjixji (the weighted sum of inputs for unit j )
• oj = the output computed by unit j
• tj = the target output for unit j
• σ = the sigmoid function
• outputs = the set of units in the final layer of the network
• Downstream(j) = the set of units whose immediate inputs include the output of
unit j
Derivation

$\Delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}}$   ..(a)

To implement stochastic gradient descent in (a), we must derive a convenient expression
for $\frac{\partial E_d}{\partial w_{ji}}$.
We know that the weight wji influences the rest of the network only through netj.
Therefore, applying the chain rule,

$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$   ..(1)

Given (1), we need a convenient expression for $\frac{\partial E_d}{\partial net_j}$. Unit j is either an output unit
or a hidden unit; the derivation considers both cases.

Case 1: Training Rule for Output Unit Weights
netj can influence the network only through oj.
Therefore, applying the chain rule,

$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}$   ..(2)

Consider the first term in (2):

$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2$

$\frac{\partial}{\partial o_j}(t_k - o_k)^2$ is zero for all output units k except when k = j.
Therefore, we drop all terms where k ≠ j:

$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}(t_j - o_j)^2 = \frac{1}{2}\, 2(t_j - o_j)\,\frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j)$   ..(3)

Consider the second term in (2).
We know that oj = σ(netj), and the derivative $\frac{\partial o_j}{\partial net_j}$ is the derivative of the sigmoid
function, which equals σ(netj)(1 − σ(netj)).
Therefore,

$\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j(1 - o_j)$   ..(4)

Substituting (3) and (4) in (2), we get

$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j(1 - o_j)$   ..(5)

Combining (a) and (1), we get the stochastic gradient descent rule for output units:

$\Delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}} = -\eta\, \frac{\partial E_d}{\partial net_j}\, x_{ji} = -\eta\,[-(t_j - o_j)\, o_j(1 - o_j)]\, x_{ji}$

$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j(1 - o_j)\, x_{ji}$   ..(6)
Case 2: Training Rule for Hidden Unit Weights
Here j is a hidden (internal) unit. To derive the training rule for wji, consider the indirect
ways in which wji can influence the network and, in turn, Ed. Let Downstream(j) denote the
set of all units immediately downstream of unit j in the network.
netj can influence the network outputs only through the units in Downstream(j). Therefore,

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, \frac{\partial net_k}{\partial net_j}$

netj can influence netk only through oj. Hence, applying the chain rule again,

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, \frac{\partial net_k}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j(1 - o_j)$

Using $\delta_j$ to denote $-\frac{\partial E_d}{\partial net_j}$, we get

$\delta_j = o_j(1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$

and from (a),

$\Delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}} = -\eta\, \frac{\partial E_d}{\partial net_j}\, x_{ji} = \eta\left(-\frac{\partial E_d}{\partial net_j}\right) x_{ji}$

$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$
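A minimal sketch of one stochastic update that applies the two rules just derived (Eq. 6 for output units and the δj rule for hidden units) to a tiny 2-2-1 sigmoid network; the example input, target, initial weights, and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One stochastic gradient step; each weight matrix includes a bias column,
    and x is assumed to carry a leading 1 for that bias."""
    # Forward pass
    o_hidden = sigmoid(W_hidden @ x)                  # hidden unit outputs o_j
    h = np.concatenate(([1.0], o_hidden))             # prepend bias input for output layer
    o_out = sigmoid(W_out @ h)                        # output unit outputs o_k
    # Backward pass: delta_k for outputs, delta_j for hidden units
    delta_out = o_out * (1 - o_out) * (t - o_out)
    delta_hidden = o_hidden * (1 - o_hidden) * (W_out[:, 1:].T @ delta_out)
    # Weight updates: delta_w_ji = eta * delta_j * x_ji
    W_out += eta * np.outer(delta_out, h)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_out

x = np.array([1.0, 0.0, 1.0])            # leading 1 = bias, then two inputs
t = np.array([1.0])                      # target for the single output unit
W_hidden = np.array([[0.1, 0.2, -0.1],   # 2 hidden units x (bias + 2 inputs)
                     [-0.2, 0.1, 0.3]])
W_out = np.array([[0.2, -0.3, 0.1]])     # 1 output unit x (bias + 2 hidden units)
W_hidden, W_out = backprop_step(x, t, W_hidden, W_out)
print(W_out)                             # weights nudged toward the target
```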

12.Prove that Maximum likelihood hypothesis hML minimizes the sum of the squared
errors between the observed training values di and the hypothesis predictions h(xi)
To Prove: The maximum likelihood hypothesis hML minimizes the sum of the squared errors
between the observed training values di and the hypothesis predictions h(xi), i.e.,

$h_{ML} \equiv \underset{h \in H}{\mathrm{argmax}}\; P(D|h) = \underset{h \in H}{\mathrm{argmin}} \sum_{i=1}^{m} (d_i - h(x_i))^2$   ---(3)

Proof: The derivation is identical to the one given in Question 8 above. Starting from
equation (3), assuming di = f(xi) + ei with Gaussian noise ei of zero mean and variance σ²,
and maximizing the logarithm of the likelihood, the same sequence of steps yields the
argmin of the sum of squared errors, which proves the statement.