
Artificial Intelligence and Machine Learning Laboratory – 18CSL76

ALGORITHMS
1. A* Algorithm

2. AO* Algorithm
3. Candidate Elimination Algorithm
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
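The boundary-set updates above can be sketched in Python over conjunctive hypotheses ('?' = any value, '0' = no value accepted). This is a minimal illustration, not the full algorithm: the EnjoySport-style toy data, the single-hypothesis S boundary, and the specialize-toward-S shortcut for the G updates are simplifying assumptions.

```python
# Minimal Candidate Elimination sketch (assumed toy data, simplified G-updates).
# A hypothesis is a tuple of attribute values, '?' (any value) or '0' (no value).

def covers(h, x):
    """True if hypothesis h matches instance x."""
    return all(a == '?' or a == b for a, b in zip(h, x))

def min_generalize(s, x):
    """Minimally generalize specific hypothesis s so that it covers x."""
    return tuple(b if a == '0' else (a if a == b else '?')
                 for a, b in zip(s, x))

def candidate_elimination(examples, n_attrs):
    S = ('0',) * n_attrs            # maximally specific boundary (one hypothesis)
    G = [('?',) * n_attrs]          # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]       # drop inconsistent g
            if not covers(S, x):
                S = min_generalize(S, x)
        else:
            new_G = []
            for g in G:
                if not covers(g, x):                 # g already rejects the negative
                    new_G.append(g)
                    continue
                for i in range(n_attrs):             # minimal specializations of g
                    if g[i] == '?' and S[i] not in ('?', '0', x[i]):
                        new_G.append(g[:i] + (S[i],) + g[i + 1:])
            G = new_G
    return S, G

examples = [
    (('sunny', 'warm', 'normal'), True),
    (('sunny', 'warm', 'high'),   True),
    (('rainy', 'cold', 'high'),   False),
]
S, G = candidate_elimination(examples, 3)
```

On this toy run, the two positives generalize S to ⟨sunny, warm, ?⟩ and the negative specializes G to the two hypotheses that reject it.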
4. ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples.
Target_attribute is the attribute whose value is to be predicted by the tree.
Attributes is a list of other attributes that may be tested by the learned decision tree.
Returns a decision tree that correctly classifies the given Examples.

• Create a Root node for the tree


• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = -
• If Attributes is empty, Return the single-node tree Root, with label = most
common value of Target_attribute in Examples
• Otherwise Begin
o A ← the attribute from Attributes that best classifies Examples (the one with the highest information gain)
o The decision attribute for Root ← A
o For each possible value, vi, of A,
  ▪ Add a new tree branch below Root, corresponding to the test A = vi
  ▪ Let Examples_vi be the subset of Examples that have value vi for A
  ▪ If Examples_vi is empty
    • Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
    • Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
• End
• Return Root
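The recursion above can be sketched compactly in Python, using information gain to pick the best attribute. The weather-style dictionary dataset is an assumed toy example, and for simplicity the sketch only branches on values actually observed in Examples (so the empty-subset leaf case never triggers here).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Entropy reduction obtained by splitting examples on attr."""
    gain = entropy([e[target] for e in examples])
    n = len(examples)
    for v in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                  # all positive or all negative
        return labels[0]
    if not attributes:                         # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    A = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {A: {}}
    for v in set(e[A] for e in examples):      # one branch per observed value
        subset = [e for e in examples if e[A] == v]
        tree[A][v] = id3(subset, target, [a for a in attributes if a != A])
    return tree

data = [
    {'outlook': 'sunny',    'windy': 'false', 'play': 'no'},
    {'outlook': 'sunny',    'windy': 'true',  'play': 'no'},
    {'outlook': 'overcast', 'windy': 'false', 'play': 'yes'},
    {'outlook': 'rainy',    'windy': 'false', 'play': 'yes'},
    {'outlook': 'rainy',    'windy': 'true',  'play': 'no'},
]
tree = id3(data, 'play', ['outlook', 'windy'])
```

On this data, 'outlook' has the highest information gain, so it becomes the decision attribute of the root.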
5. Backpropagation Algorithm
BACKPROPAGATION(training_examples, η, nin, nout, nhidden)
Each training example is a pair of the form (𝑥⃗, 𝑡⃗) where 𝑥⃗ is the vector of
network input values, and 𝑡⃗ is the vector of target network output values.
η is the learning rate (e.g., 0.05).
nin is the number of network input layer units
nout is the number of network output layer units
nhidden is the number of network hidden layer units
Create a feed-forward network with nin inputs, nhidden hidden units, and nout output
units.
Initialize all network weights to small random numbers (e.g., between -.05 and .05)
Until the termination condition is met, Do
For each (𝑥⃗, 𝑡⃗) in training_examples, Do
Propagate the input forward through the network:
i). Input the instance 𝑥⃗ to the network and compute the output ou for every
unit u in the network
ii). For each network output unit k, calculate its error term δk
     δk ← ok (1 – ok) (tk – ok)
iii). For each hidden unit h, calculate its error term δh
     δh ← oh (1 – oh) ∑k∈outputs wkh δk
iv). Update each network weight wji
     wji ← wji + Δwji
     where Δwji = η δj xji, and δj = −∂E/∂netj
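The stochastic update rules above can be sketched in plain Python for a single hidden layer of sigmoid units, with one bias weight per unit. The OR-gate training data and the hyperparameter values (η = 0.5, 2 hidden units, 5000 epochs) are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, eta, n_in, n_hidden, n_out, epochs):
    # w_h[j]: weights into hidden unit j; the last entry is the bias weight.
    w_h = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
           for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # Propagate the input forward through the network.
            h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_h]
            o = [sigmoid(sum(w * v for w, v in zip(ws, h + [1.0]))) for ws in w_o]
            # Output error terms: delta_k = o_k (1 - o_k)(t_k - o_k).
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Hidden error terms: delta_h = o_h (1 - o_h) sum_k w_kh delta_k.
            d_h = [hj * (1 - hj) * sum(w_o[k][j] * d_o[k] for k in range(n_out))
                   for j, hj in enumerate(h)]
            # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
            for k in range(n_out):
                for j, v in enumerate(h + [1.0]):
                    w_o[k][j] += eta * d_o[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_h[j][i] += eta * d_h[j] * v
    return w_h, w_o

def predict(x, w_h, w_o):
    h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_h]
    return [sigmoid(sum(w * v for w, v in zip(ws, h + [1.0]))) for ws in w_o]

# Assumed toy task: learn the OR function.
OR = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
      ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
w_h, w_o = train(OR, eta=0.5, n_in=2, n_hidden=2, n_out=1, epochs=5000)
```

After training, the network output crosses 0.5 exactly where the OR target does.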

6. Naïve Bayes Classifier


• Let X be training instances and y be the corresponding class labels for each
training instance
• For each class Ci calculate the prior probability P(Ci):
  P(Ci) = |Ci| / (Total Number of Instances)
• For each feature Xj and each class Ci calculate the mean (μ) and variance (σ²):
  μij = (1/Ni) ∑_{k=1}^{Ni} Xkj
  σij² = (1/Ni) ∑_{k=1}^{Ni} (Xkj − μij)²
  where Ni is the number of instances in class Ci and Xkj is the jth feature of the kth instance in class Ci.
• For a new instance Xnew:
o For each class Ci, calculate the class-conditional probability P(Xnew,j|Ci) of each feature Xj using the Gaussian probability density function:
  P(Xnew,j|Ci) = (1/√(2π σij²)) exp(−(Xnew,j − μij)² / (2σij²))
  where Xnew,j is the jth feature of the new instance.
• Calculate the posterior probability P(Ci|Xnew) for each class Ci using Bayes’ theorem:
  P(Ci|Xnew) ∝ P(Ci) ∙ ∏_{j=1}^{n} P(Xnew,j|Ci)
• Assign the class label Cpred for Xnew as the class with the highest posterior
probability:
  Cpred = argmax_{Ci} P(Ci|Xnew)
• End
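The steps above map directly onto a small Gaussian Naïve Bayes sketch in Python. The two-cluster toy data is an assumption for illustration, and constant factors are kept (no log-probabilities) to stay close to the formulas.

```python
import math
from collections import defaultdict

def fit(X, y):
    """Per class: prior P(Ci), plus mean and variance of each feature."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for c, rows in groups.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(X), means, variances)
    return model

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(model, x_new):
    """Return argmax over classes of P(Ci) * prod_j P(x_new[j] | Ci)."""
    best_c, best_p = None, -1.0
    for c, (prior, means, variances) in model.items():
        p = prior
        for xj, mu, var in zip(x_new, means, variances):
            p *= gaussian_pdf(xj, mu, var)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

X = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],    # class 0 (assumed toy data)
     [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]    # class 1
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
```

A production implementation would multiply in log space to avoid underflow; on this tiny example plain products suffice.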

7. EM Algorithm
Problem → k-Means
Task → Search for a maximum likelihood hypothesis by repeatedly re-estimating the
expected values of the hidden variables zij given its current hypothesis < µ1 …. µk >
→ Recalculate maximum likelihood hypothesis using these expected values
for the hidden variables.

Procedure:
First initialize the hypothesis to h = < µ1 , µ2 >
Iteratively re-estimate h by repeating the following two steps until the procedure
converges to a stationary value for h.
Step 1: Calculate the expected value E[zij] of each hidden variable zij assuming the
current hypothesis h = < µ1 , µ2 > holds.
Step 2: Calculate a new maximum likelihood hypothesis h’ = < µ1’ , µ2’ > assuming
the value taken on by each hidden variable zij is its expected value E[zij] calculated in
Step 1.
Replace h = < µ1 , µ2 > by h’ = < µ1’ , µ2’ > and iterate
Step 1 must calculate the expected value of zij.
E[zij] → the probability that instance xi was generated by the jth Normal distribution:
E[zij] = p(x = xi | μ = μj) / ∑_{n=1}^{2} p(x = xi | μ = μn)
       = e^(−(xi − μj)² / 2σ²) / ∑_{n=1}^{2} e^(−(xi − μn)² / 2σ²)

First step is implemented by substituting the current values < µ1 , µ2 > and the
observed xi into the above expression.
Second step → use E[zij] calculated in Step 1 to derive a new maximum likelihood
hypothesis h’ = < µ1’ , µ2’ >.
It is
μj ← (∑_{i=1}^{m} E[zij] xi) / (∑_{i=1}^{m} E[zij])        ---(8)
The above expression is similar to
μML = (1/m) ∑_{i=1}^{m} xi        ---(7)
(7) → used to estimate µ for a single Normal distribution.
(8) → the weighted sample mean for µj, with each instance weighted by the
expectation E[zij] that it was generated by the jth Normal distribution .
Conclusion:
The current hypothesis is used to estimate the unobserved variables, and the expected
values of these variables are then used to calculate an improved hypothesis
The EM algorithm repeats the following two steps until convergence:
Step 1: Estimation (E) step: Calculate Q(h’|h) using the current hypothesis h and
the observed data X to estimate the probability distribution over Y.
Q(h’|h) ← E[ln P(Y|h’)|h,X]
Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h' that
maximizes this Q function.
h ← argmax_{h'} Q(h'|h)

k – means Algorithms
k-means problem→ to estimate the parameters 𝜃 = ⟨𝜇1 … 𝜇𝑘 ⟩ that define the means
of k Normal distributions.
Given,
X = {⟨xi⟩} → the observed data
Z = {⟨zi1, …, zik⟩} → indicates which of the k Normal distributions was used to generate xi.
To apply EM algorithm → derive an expression for Q(h’|h)
Derive an expression for p(Y|h’)
The probability p(yi|h’) of a single instance yi = ⟨𝑥𝑖 , 𝑧𝑖1 , … , 𝑧𝑖𝑘 ⟩ of the full data can
be written as
p(yi|h') = p(xi, zi1, …, zik | h') = (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} zij (xi − μ'j)²)
Here, only one of the zij can have the value 1, and all others must be 0.
Given this probability for a single instance 𝑝(𝑦𝑖 |ℎ′ ),the logarithm of the probability
ln P(Y|h’) for all m instances in the data is
ln P(Y|h') = ln ∏_{i=1}^{m} p(yi|h') = ∑_{i=1}^{m} ln p(yi|h')
ln P(Y|h') = ∑_{i=1}^{m} ln ( (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} zij (xi − μ'j)²) )
That is, ln P(Y|h’) is a linear function of zij.
In general, for any function f (z) that is a linear function of z, the following equality
holds
E[f(z)] = f(E[z])
And also
E[ln P(Y|h')] = E[ ∑_{i=1}^{m} ln ( (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} zij (xi − μ'j)²) ) ]
              = ∑_{i=1}^{m} ln ( (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} E[zij] (xi − μ'j)²) )
Therefore, the function Q(h'|h) for the k-means problem is
Q(h'|h) = ∑_{i=1}^{m} ln ( (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} E[zij] (xi − μ'j)²) )
where h' = ⟨μ'1, …, μ'k⟩
E [zij] → calculated based on current hypothesis h and observed data X.
From the k Normal distributions,
E[zij] = e^(−(xi − μj)² / 2σ²) / ∑_{n=1}^{k} e^(−(xi − μn)² / 2σ²)        ---(9)

Thus,
• The first (estimation) step of the EM algorithm defines the Q function based on
the estimated E[zij] terms.
• The second (maximization) step then finds the values 𝜇′1 , … , 𝜇′𝑘 that maximize
this Q function.
In the current case,
argmax_{h'} Q(h'|h) = argmax_{h'} ∑_{i=1}^{m} ln ( (1/√(2πσ²)) e^(−(1/2σ²) ∑_{j=1}^{k} E[zij] (xi − μ'j)²) )
                    = argmin_{h'} ∑_{i=1}^{m} ∑_{j=1}^{k} E[zij] (xi − μ'j)²        ---(10)
Therefore,
The maximum likelihood hypothesis here minimizes a weighted sum of squared
errors, where the contribution of each instance xi to the error that defines 𝜇𝑗′ is
weighted by E[zij] .
The quantity given by Equation (10) is minimized by setting each 𝜇𝑗′ to the
weighted sample mean
μj ← (∑_{i=1}^{m} E[zij] xi) / (∑_{i=1}^{m} E[zij])        ---(11)
Eq. 10 & 11 → Two steps in the k-means algorithm
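The two EM steps (Eq. 9 for the E[zij], Eq. 11 for the weighted sample means) can be sketched for a mixture of two equal-variance 1-D Gaussians. The data points, σ = 1, the iteration count, and the initial means are assumptions chosen for illustration.

```python
import math

def em_two_means(xs, mu_init, sigma=1.0, iters=50):
    """EM for the means of two equal-variance 1-D Normal distributions."""
    mu1, mu2 = mu_init
    for _ in range(iters):
        # E step (Eq. 9): expected value of each hidden variable z_ij.
        E = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            E.append((p1 / (p1 + p2), p2 / (p1 + p2)))
        # M step (Eq. 11): each mean becomes a weighted sample mean.
        mu1 = sum(e1 * x for (e1, _), x in zip(E, xs)) / sum(e1 for e1, _ in E)
        mu2 = sum(e2 * x for (_, e2), x in zip(E, xs)) / sum(e2 for _, e2 in E)
    return mu1, mu2

xs = [0.2, -0.1, 0.1, 5.1, 4.9, 5.0]            # assumed toy sample
mu1, mu2 = em_two_means(xs, mu_init=(0.5, 4.0))
```

With the clusters this well separated, the E[zij] become nearly hard assignments and the means converge to the two cluster averages.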
8. K Nearest Neighbour Algorithm
Let instance x be described by the feature vector
⟨𝑎1 (𝑥), 𝑎2 (𝑥), … , 𝑎𝑛 (𝑥)⟩
Where,
ar (x) → value of the rth attribute of instance x.
The distance between two instances xi and xj, written d(xi, xj), is given by
d(xi, xj) = √( ∑_{r=1}^{n} (ar(xi) − ar(xj))² )

Target function may be either discrete-valued or real-valued.


Consider learning discrete-valued target functions of the form
𝒇: 𝕽𝒏 → 𝑽
Where,
V→ finite set {v1 ,…, vs}
The k-Nearest Neighbour Algorithm for approximating a discrete-valued target
function
Training algorithm:
• For each training example ⟨𝑥, 𝑓(𝑥)⟩ add the example to the list training_examples
Classification Algorithm:
• Given a query instance xq to be classified
  • Let x1 … xk denote the k instances from training_examples that are nearest to xq
  • Return
    f̂(xq) ← argmax_{v∈V} ∑_{i=1}^{k} δ(v, f(xi))
    where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.
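The classification rule above (Euclidean distance plus the δ-vote over the k nearest training examples) is a few lines of Python; the toy dataset and k = 3 are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_examples, xq, k):
    """Return argmax_v sum_{i=1}^{k} delta(v, f(x_i)) over the k nearest x_i."""
    nearest = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [([1.0, 1.0], 'a'), ([1.2, 0.9], 'a'), ([0.9, 1.1], 'a'),
            ([5.0, 5.0], 'b'), ([5.1, 4.9], 'b')]
```

There is no training step beyond storing the examples; all work happens at query time.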

9. Locally Weighted Regression


Given,
xq →new query instance
Approach→ construct an approximation 𝑓̂ that fits the training examples in the
neighborhood surrounding xq.
→ use the approximation to calculate 𝑓̂(𝑥𝑞 )
Here,
𝑓̂(𝑥𝑞 ) → output as the estimated target value for the query instance.
f̂ need not be retained, since a different local approximation will be calculated for each distinct query instance.
Consider,
A case of locally weighted regression in which the target function f is approximated
near xq using a linear function of the form
f̂(x) = w0 + w1 a1(x) + ⋯ + wn an(x)
From the gradient descent rule,
E ≡ (1/2) ∑_{x∈D} (f(x) − f̂(x))²        ---(5)
and
Δwj = η ∑_{x∈D} (f(x) − f̂(x)) aj(x)        ---(6)

Let E(xq) → the error, defined as a function of the query point xq.
1. Minimize the squared error over just the k nearest neighbors:
   E1(xq) ≡ (1/2) ∑_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²
2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:
   E2(xq) ≡ (1/2) ∑_{x∈D} (f(x) − f̂(x))² K(d(xq, x))
3. Combine 1 and 2:
   E3(xq) ≡ (1/2) ∑_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))

Criterion 3 is the best approach. If criterion 3 is used and the gradient descent rule in Eq. (6) is re-derived, we get the following training rule:
Δwj = η ∑_{x ∈ k nearest nbrs of xq} K(d(xq, x)) (f(x) − f̂(x)) aj(x)        ---(7)
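The training rule in Eq. (7) can be sketched in Python for 1-D inputs with a Gaussian kernel as K. The kernel width τ, learning rate, step count, and straight-line toy data are assumptions for illustration.

```python
import math

def lwr_predict(training_examples, xq, k, tau, eta, steps):
    """Fit f_hat(x) = w0 + w1*x near xq using the Eq. (7) gradient rule."""
    kernel = lambda d: math.exp(-d ** 2 / (2 * tau ** 2))   # decreasing K
    nbrs = sorted(training_examples, key=lambda ex: abs(ex[0] - xq))[:k]
    w0 = w1 = 0.0
    for _ in range(steps):
        dw0 = dw1 = 0.0
        for x, fx in nbrs:
            weighted_err = kernel(abs(x - xq)) * (fx - (w0 + w1 * x))
            dw0 += weighted_err          # a_0(x) = 1 (bias term)
            dw1 += weighted_err * x      # a_1(x) = x
        w0 += eta * dw0
        w1 += eta * dw1
    return w0 + w1 * xq                  # local fit evaluated only at xq

# Assumed toy data sampled from f(x) = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
estimate = lwr_predict(data, xq=2.0, k=5, tau=1.0, eta=0.05, steps=2000)
```

Because the local fit is discarded after each query, lwr_predict returns only f̂(xq), and a new set of weights is trained for each distinct query instance.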
