
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

ACADEMIC YEAR 2023-2024 – EVEN SEMESTER (R-2021)


Assessment - I
Answer Key
Answer all questions
PART – A (5 X 2 =10 Marks)
1. Define agents. List the applications of Artificial Intelligence.
An agent is anything that perceives its environment through sensors and acts upon that environment
through actuators. An agent runs in a cycle of perceiving, thinking, and acting. An agent can be:

o Human agent: A human agent has eyes, ears, and other organs that work as sensors, and
hands, legs, and the vocal tract that work as actuators.
o Robotic agent: A robotic agent can have cameras and infrared range finders as sensors
and various motors as actuators.
o Software agent: A software agent can have keystrokes and file contents as sensory input,
act on those inputs, and display output on the screen.

Applications of AI: gaming, healthcare (diagnosis support), finance and banking (fraud detection),
autonomous vehicles, natural language processing (chatbots, translation), robotics, e-commerce
recommendation systems, and agriculture.
2. Given that P(A) = 0.3, P(A|B) = 0.4 and P(B) = 0.5, compute P(B|A).
By Bayes' theorem, P(B|A) = P(A|B) · P(B) / P(A) = (0.4 × 0.5) / 0.3 = 0.2 / 0.3 ≈ 0.667.

3. Compare supervised and unsupervised machine learning.

o Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are
trained using unlabeled data.
o The supervised learning model takes direct feedback to check whether it is predicting the correct
output; the unsupervised learning model does not take any feedback.
o The supervised learning model predicts the output; the unsupervised learning model finds the hidden
patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised
learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when it is given
new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an
unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any
supervision to train the model.
o Supervised learning can be categorized into classification and regression problems; unsupervised
learning can be categorized into clustering and association problems.
o Supervised learning is used when we know the inputs as well as the corresponding outputs;
unsupervised learning is used when we have only input data and no corresponding output data.
o A supervised learning model generally produces a more accurate result; an unsupervised learning
model may give a less accurate result compared with supervised learning.
o Supervised learning is not close to true artificial intelligence, since we first train the model on each
example and only then can it predict the correct output; unsupervised learning is closer to true
artificial intelligence, as it learns much as a child learns daily routines from experience.
o Supervised learning includes algorithms such as linear regression, logistic regression, support vector
machines, multi-class classification, decision trees, and Bayesian logic; unsupervised learning includes
algorithms such as k-means clustering, hierarchical clustering, and the Apriori algorithm.
4. Distinguish between k-means and KNN algorithm.

o Type of learning: kNN is supervised learning; K-means is unsupervised learning.
o Task: kNN performs classification and regression; K-means performs clustering.
o Parameter K: in kNN, K is the number of nearest neighbors; in K-means, K is the number of clusters.
o Input: kNN requires labeled data; K-means works on unlabeled data.
o Distance calculation: kNN can use Euclidean, Manhattan, or other distance metrics; K-means uses the
Euclidean distance between data points and cluster centers.
o Output: kNN predicts or estimates the output variable based on the k nearest neighbors; K-means
groups similar data points into k clusters.
o Application: kNN is used for classification and regression tasks; K-means is used for customer
segmentation, image compression, anomaly detection, and other clustering tasks.
o Limitations: kNN is sensitive to the choice of k and the distance metric; K-means is sensitive to the
initial placement of cluster centers and assumes isotropic, equally sized clusters.

5. What are the three main types of gradient descent algorithms?

There are three types of gradient descent learning algorithms:
 Batch gradient descent
 Stochastic gradient descent
 Mini-batch gradient descent
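
The difference between the three variants is only how many training samples are used for each weight
update. The sketch below is my own illustration (the data, learning rate, and epoch count are made-up);
it fits a least-squares line with all three variants by changing a single batch-size parameter:

    # Minimal sketch: batch, stochastic, and mini-batch gradient descent on linear regression.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=100)
    Xb = np.hstack([X, np.ones((100, 1))])          # add a bias column

    def gradient(w, Xs, ys):
        # Gradient of the mean squared error over the sampled rows.
        return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

    def train(batch_size, lr=0.05, epochs=50):
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            idx = rng.permutation(len(y))
            for start in range(0, len(y), batch_size):
                rows = idx[start:start + batch_size]
                w -= lr * gradient(w, Xb[rows], y[rows])
        return w

    print("batch      :", train(batch_size=len(y)))   # all samples per update
    print("stochastic :", train(batch_size=1))        # one sample per update
    print("mini-batch :", train(batch_size=16))       # small batches per update

All three runs approach approximately [2, -1, 0.5]; they differ in how noisy and how cheap each
individual update is.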
PART – B (5X 10 =50 Marks)
6.a. Analyze the pros and cons of heuristic search strategies in detail with examples.
Best first search
 This algorithm always chooses the path which appears best at that moment. It is a
combination of depth-first search and breadth-first search algorithms and lets us take the
benefit of both. It uses a heuristic function to guide the search: with best-first search, at each
step, we can choose the most promising node.
Best first search algorithm:
 Step 1: Place the starting node into the OPEN list.
 Step 2: If the OPEN list is empty, Stop and return failure.

 Step 3: Remove the node n from the OPEN list which has the lowest value of h(n), and place
it in the CLOSED list.

 Step 4: Expand the node n, and generate the successors of node n.

 Step 5: Check each successor of node n, and find whether any node is a goal node or not. If
any successor node is the goal node, then return success and stop the search, else continue to
next step.

 Step 6: For each successor node, the algorithm evaluates the function f(n) and then
checks whether the node is already in the OPEN or CLOSED list. If the node is in neither
list, add it to the OPEN list.

 Step 7: Return to Step 2.
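
A minimal sketch of the OPEN/CLOSED loop described in the steps above, assuming the graph and the
heuristic values h are given as plain dictionaries (the example graph and h values are illustrative, not
taken from the answer key):

    import heapq

    def best_first_search(graph, h, start, goal):
        open_list = [(h[start], start)]          # priority queue ordered by h(n)
        closed, parent = set(), {start: None}
        while open_list:                         # Step 2: stop with failure if empty
            _, n = heapq.heappop(open_list)      # Step 3: node with lowest h(n)
            if n == goal:                        # Step 5: goal test
                path = []
                while n is not None:
                    path.append(n)
                    n = parent[n]
                return path[::-1]
            closed.add(n)
            for succ in graph[n]:                # Step 4: expand n
                if succ not in closed and succ not in parent:   # Step 6
                    parent[succ] = n
                    heapq.heappush(open_list, (h[succ], succ))
        return None                              # failure

    graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['G'], 'G': []}
    h = {'S': 5, 'A': 2, 'B': 4, 'G': 0}
    print(best_first_search(graph, h, 'S', 'G'))   # ['S', 'A', 'G']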

A* Search Algorithm

A* search is the most commonly known form of best-first search. It uses the heuristic function h(n)
and the cost to reach node n from the start state, g(n). It combines features of uniform-cost search and
greedy best-first search, which lets it solve the problem efficiently.

It finds the shortest path through the search space using the heuristic function. This search algorithm
expands fewer nodes of the search tree and gives optimal results faster.

Algorithm of A* search:

Step 1: Place the starting node in the OPEN list.

Step 2: Check if the OPEN list is empty or not. If the list is empty, then return failure and stops.

Step 3: Select the node from the OPEN list which has the smallest value of the evaluation function
(g + h). If node n is the goal node, then return success and stop; otherwise go to the next step.

Step 4: Expand node n and generate all of its successors, and put n into the closed list. For each suc-
cessor n', check whether n' is already in the OPEN or CLOSED list. If not, then compute the evaluation
function for n' and place it into the Open list.

Step 5: Else, if node n' is already in the OPEN or CLOSED list, attach it to the back
pointer which reflects the lowest g(n') value.

Step 6: Return to Step 2.
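
A compact A* sketch under the same dictionary conventions as above, where each edge carries a step
cost so that f(n) = g(n) + h(n) can be evaluated; the graph, edge costs, and heuristic values are made-up
illustrations:

    import heapq

    def a_star(graph, h, start, goal):
        # graph[n] is a list of (neighbour, step_cost) pairs
        open_list = [(h[start], 0, start, [start])]   # (f = g + h, g, node, path)
        best_g = {start: 0}
        while open_list:
            f, g, n, path = heapq.heappop(open_list)
            if n == goal:
                return path, g
            for succ, cost in graph[n]:
                new_g = g + cost
                if new_g < best_g.get(succ, float('inf')):   # keep the lowest g(n')
                    best_g[succ] = new_g
                    heapq.heappush(open_list, (new_g + h[succ], new_g, succ, path + [succ]))
        return None, float('inf')

    graph = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('G', 12)],
             'B': [('G', 3)], 'G': []}
    h = {'S': 5, 'A': 4, 'B': 2, 'G': 0}
    print(a_star(graph, h, 'S', 'G'))   # (['S', 'A', 'B', 'G'], 6)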

Advantage:

 It is more efficient than BFS and DFS.


 The time complexity of best-first search is much less than that of breadth-first search.
 Best-first search allows us to switch between paths, gaining the benefits of both breadth-
first and depth-first search: depth-first search is good because a solution can be found
without computing all nodes, and breadth-first search is good because it does not get trapped in
dead ends.

Disadvantages:

 Sometimes it covers more distance than necessary, so the path found may not be optimal.


6.b. Give an example of a problem for which breadth first search would work better than depth first search.
Breadth-First Search (BFS):
BFS, Breadth-First Search, is a vertex-based technique for finding the shortest path in a graph. It
uses a Queue data structure that follows first in, first out. In BFS, one vertex is selected at a time;
when it is visited and marked, its adjacent vertices are visited and stored in the queue. It is slower than
DFS.

Depth First Search (DFS):


DFS, Depth First Search, is an edge-based technique. It uses the Stack data structure and works in
two stages: first, visited vertices are pushed onto the stack, and second, if there are no unvisited
adjacent vertices, the visited vertices are popped.

Example where BFS works better: finding the shortest (fewest-edge) path between two nodes of an
unweighted graph, e.g. the minimum number of moves to solve a sliding puzzle or the fewest hops
between two users in a social network. Because BFS explores the graph level by level, the first time it
reaches the goal the path is guaranteed to be the shortest; DFS may wander down a long branch first
and return a much longer path.
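
A small sketch of this shortest-path case; the graph below is an assumption for illustration, with a
direct edge A-E and a long detour, so BFS returns the two-node path while a DFS that happens to
explore B first would return the five-node detour:

    from collections import deque

    def bfs_shortest_path(graph, start, goal):
        queue = deque([[start]])                 # FIFO queue of partial paths
        visited = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path                      # first time goal is reached = fewest edges
            for nbr in graph[node]:
                if nbr not in visited:
                    visited.add(nbr)
                    queue.append(path + [nbr])
        return None

    # Illustrative graph: a direct edge A-E and a long detour A-B-C-D-E.
    graph = {'A': ['B', 'E'], 'B': ['C'], 'C': ['D'], 'D': ['E'], 'E': []}
    print(bfs_shortest_path(graph, 'A', 'E'))    # ['A', 'E']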

7.a. Discuss variable elimination algorithm for answering queries on Bayesian networks
For inferencing in a probabilistic system, it is required to calculate the posterior probability distribution for a
set of query variables, given some observed events [that is, we have some values attached to the
evidence variables].

• Notation Revisited:
The notation used in inferencing is same as the one used in probability theory.

X: Query variable.

E: The set of evidence variables E1, ....., Em, and 'e' is the particular observed event. Y: The set of non-
evidence variables Y1, Y2, ..... Yk [non-evidence variables are also called hidden variables].

X: The complete set of all the variables, where X = {X} ∪ E ∪ Y.

• Generally the query requires the posterior probability distribution P(X | e) [assuming that query
variable is not among the evidence variables, if it is, then posterior distribution for X simply gives
probability 1 to the observed value]. [Note that query can contain more than one variable. For study
purpose we are assuming single variable].
• Example: In the burglary case, if the observed event is JCalls = true and MCalls = true,
the query is 'Has a burglary occurred?'

The probability distribution for this situation would be,

P(Burglary | J calls = true, M calls = true) = < 0.284, 0.716 >


2. Inference by Enumeration

A Bayesian network gives a complete representation of the full joint distribution. These full joint
distributions can be written as product of conditional probabilities from the Bayesian network.

A query can be answered using Bayesian network by computing sums of products of conditional
probabilities from the network.

• The algorithm
The algorithm ENUMERATE-JOINT-ASK gives inference by enumerating on full joint distribution.

Characteristics of algorithm:

1) It takes as input a full joint distribution P and looks up values in it. [The same algorithm can be
modified to take a Bayesian network as input and look up joint entries by multiplying the
corresponding conditional probability table entries from the Bayesian network.]

2) ENUMERATION-JOINT-ASK uses the ENUMERATION-ASK (EA) algorithm, which evaluates the


expression using depth-first recursion. Therefore, the space complexity of EA is only linear in the
number of variables. The algorithm sums over the full joint distribution without ever constructing it
explicitly. The time complexity for a network with n Boolean variables is always O(2^n), which is better
than the O(n·2^n) required in the simple inferencing approach (using the posterior probability).

3) The drawback of the algorithm is that it keeps evaluating repeated sub-expressions, which results in
wasted computation time.

The enumeration algorithm for answering queries on Bayesian network.

• The algorithm
Function ENUMERATION-ASK(X, e, bn) returns a distribution over X

Inputs: X, the query variable

e, observed values for the evidence variables E

bn, a Bayes net with variables {X} ∪ E ∪ Y /* Y = hidden variables */

Q(X) ← a distribution over X, initially empty
for each value xi of X do
    extend e with value xi for X
    Q(xi) ← ENUMERATE-ALL(VARS[bn], e)
return NORMALIZE(Q(X))

Function ENUMERATE-ALL(vars, e) returns a real number

if EMPTY?(vars) then return 1.0

Y ← FIRST(vars)

if Y has a value y in e

    then return P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e)

    else return Σy P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e_y)

where e_y is e extended with Y = y.

Example:

Consider query,

P(Burglary | J calls = true, M calls =true)


Hidden variables in the queries are → Earthquake and Alarm.

using the query equation.

P(Burglary | j, m) = α P(Burglary, j, m) = α Σe Σa P(Burglary, e, a, j, m)

The semantics of Bayesian networks (equation 7.2.1) then gives us an expression in terms of CPT
entries. For simplicity, we will do this just for Burglary = true.

P(b | j, m) = α Σe ΣaP(b)P(e)P(a | b, e)P(j | a) P(m | a)

o To compute this expression, we have to add four terms, each computed by multiplying five
numbers.
o In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm
for a network with n Boolean variables is O(n·2^n).
o An improvement can be obtained from the following simple observations. The P(b) term is a
constant and can be moved outside the summations over a and e, and the P(e) term can be
moved outside the summation over a. Hence, we have
P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)

This expression can be evaluated by looping through the variables in order, multiplying CPT entries as
we go. For each summation, we also need to loop over the variable's possible values. The structure of
this computation is shown in following diagram. Using the numbers from Fig. 7.3.2, we obtain P(b | j,
m) = α × 0.00059224. The corresponding computation for ⌐b yields α × 0.0014919;

Hence,

P(B | j, m)= α < 0.00059224, 0.0014919 >

≈ <0.284, 0.716 >

That is, the chance of burglary, given calls from both neighbours, is about 28 %. Note that in Fig. 7.3.7
the evaluation proceeds top to bottom, multiplying values along each path and summing at the "+" nodes.
Observe that there is repetition of the paths for j and m.
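
The same query can also be checked numerically with a brute-force enumeration sketch. The CPT
values below are the standard textbook burglary-network numbers (assumed here to match Fig. 7.3.2,
since they reproduce the 0.284 / 0.716 result above):

    from itertools import product

    # Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls
    P_B = {True: 0.001, False: 0.999}
    P_E = {True: 0.002, False: 0.998}
    P_A = {  # P(Alarm = true | Burglary, Earthquake)
        (True, True): 0.95, (True, False): 0.94,
        (False, True): 0.29, (False, False): 0.001,
    }
    P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
    P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

    def joint(b, e, a, j, m):
        """Product of CPT entries for one full assignment."""
        pa = P_A[(b, e)]
        p_alarm = pa if a else 1 - pa
        pj = P_J[a] if j else 1 - P_J[a]
        pm = P_M[a] if m else 1 - P_M[a]
        return P_B[b] * P_E[e] * p_alarm * pj * pm

    def query_burglary(j=True, m=True):
        """P(Burglary | JohnCalls=j, MaryCalls=m) by summing out E and A."""
        unnorm = {}
        for b in (True, False):
            unnorm[b] = sum(joint(b, e, a, j, m)
                            for e, a in product((True, False), repeat=2))
        z = sum(unnorm.values())
        return {b: p / z for b, p in unnorm.items()}

    print(query_burglary())   # approximately {True: 0.284, False: 0.716}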

3. The Variable Elimination Algorithm


The enumeration algorithm can be improved substantially by eliminating calculations of repeated sub
expression in tree. Calculation can be done once and save the results for later use. This is a form of
dynamic programming.

Working of variable elimination algorithm

1) It works by evaluating expressions such as [P(b | j, m) = α P(b) Σ e P(e) Σa P(a | b, e) P(j | a) P(m | a)]
in right-to-left order.

2) Intermediate results are stored and summations over each variable are done only for those portions
of the expression that depends on the variable.

For example: Consider the Burglary network.

We evaluate the expression:

3) Factors: Each part of the expression is annotated with the name of the associated variable, these
parts are called factors.

Steps in algorithm:

i) The factor for M, P (m | a), does not require summing over M. Probability is stored, given each value
of a, in a two-element factor,

Note fM means that M was used to produce f.

ii) Store the factor for J as the two-element vector fj (A).

iii) The factor for A is P(a | B, e) which will be a 2×2×2 matrix fA (A, B, E).

iv) Summing out A from the product of these three factors gives a 2×2 matrix whose indices
range over just B and E. We put a bar over A in the name of the matrix to indicate that A has been
summed out:

f_ĀJM(B, E) = Σa fA(a, B, E) × fJ(a) × fM(a)

= fA(a, B, E) × fJ(a) × fM(a) + fA(¬a, B, E) × fJ(¬a) × fM(¬a)

The multiplication process used here is called a pointwise product.

v) Process E in the same way, i.e. sum out E from the product of

fE(E) and f_ĀJM(B, E):

f_ĒĀJM(B) = fE(e) × f_ĀJM(B, e) + fE(¬e) × f_ĀJM(B, ¬e)

vi) Compute the answer simply by multiplying the factor for B, i.e. fB(B) = P(B), by the
accumulated matrix f_ĒĀJM(B):

P(B | j, m) = α fB(B) × f_ĒĀJM(B).

From the above sequence of steps it can be noticed that two computational operations are required.

a) Pointwise product of a pair of factors.


b) Summing out a variable from a product of factors.

a) Pointwise product of a pair of factors: The pointwise product of two factors f1 and f2 yields a new
factor f whose variables are the union of the variables in f1 and f2. Suppose the two factors have
variables Y1, ..., Yk in common. Then we have f(X1, ..., Xj, Y1, ..., Yk, Z1, ..., Zl) = f1(X1, ..., Xj, Y1, ..., Yk) × f2(Y1, ..., Yk,
Z1, ..., Zl). If all the variables are binary, then f1 and f2 have 2^(j+k) and 2^(k+l) entries respectively, and the
pointwise product has 2^(j+k+l) entries.

For example: Given two factors f1(A, B) and f2(B, C) with probability distributions shown below, the
pointwise product f1 × f2 is a factor over (A, B, C).

b) Summing out a variable from a product of factors: This is a straightforward computation. Any
factor that does not depend on the variable to be summed out can be moved outside the summation
process.

For example:

Σe fE(e) × fA(A, B, e) × fJ(A) × fM(A) = fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e). Now the pointwise
product inside the summation is computed and the variable is summed out of the resulting matrix:

fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e) = fJ(A) × fM(A) × fĒA(A, B).

Matrices are not multiplied until we need to sum out a variable from the accumulated product. At that
point, we multiply only those matrices that include the variable to be summed out.

The variable elimination algorithm, which uses these pointwise-product and summing-out procedures, is
shown below:

Function ELIMINATION-ASK(X, e, bn) returns a distribution over X

Inputs: X, the query variable

e, evidence specified as an event

bn, a Bayesian network specifying the joint distribution P(X1, ..., Xn)

factors ← [ ]; vars ← REVERSE(VARS[bn])

for each var in vars do

    factors ← [MAKE-FACTOR(var, e) | factors]

    if var is a hidden variable then

        factors ← SUM-OUT(var, factors)

return NORMALIZE(POINTWISE-PRODUCT(factors))
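
A rough sketch of the two operations named above, using an assumed representation of a factor as a
pair (variable list, table mapping value tuples to numbers); the example factors at the bottom are
made-up and only illustrate the mechanics:

    from itertools import product

    def pointwise_product(f1, f2):
        (v1, t1), (v2, t2) = f1, f2
        vars_out = v1 + [v for v in v2 if v not in v1]        # union of the variables
        table = {}
        for assign in product([True, False], repeat=len(vars_out)):
            row = dict(zip(vars_out, assign))
            key1 = tuple(row[v] for v in v1)
            key2 = tuple(row[v] for v in v2)
            table[assign] = t1[key1] * t2[key2]               # multiply matching entries
        return vars_out, table

    def sum_out(var, factor):
        vars_in, table = factor
        i = vars_in.index(var)
        vars_out = vars_in[:i] + vars_in[i + 1:]
        out = {}
        for assign, p in table.items():
            key = assign[:i] + assign[i + 1:]
            out[key] = out.get(key, 0.0) + p                  # add the var=true and var=false rows
        return vars_out, out

    # Example: f1(A, B) * f2(B, C), then sum out B.
    f1 = (['A', 'B'], {(a, b): 0.1 * (1 + a + 2 * b) for a, b in product([True, False], repeat=2)})
    f2 = (['B', 'C'], {(b, c): 0.2 * (1 + b + c) for b, c in product([True, False], repeat=2)})
    f3 = pointwise_product(f1, f2)
    print(sum_out('B', f3))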

Consider the query P(JohnCalls | Burglary = true).

The first step is to write out the nested summation:

P(J | b) = α P(b) Σe P(e) Σa P(a | b, e) P(J | a) Σm P(M | a).

Evaluating this expression from right to left,

Σm P(M | a) is equal to 1.

Note: The variable M is irrelevant to this query. The result of the query P(JohnCalls | Burglary = true) is
unchanged if we remove MCalls from the network. We can remove any leaf node which is not a query
variable or an evidence variable. After its removal, there may be more leaf nodes and they too may be
irrelevant. Eventually we find that every variable that is not an ancestor of a query variable or evidence
variable is irrelevant to the query. A variable elimination algorithm can remove all these variables
before evaluating the query.

4. The Complexity Involved in Exact Inferencing

The variable elimination algorithm is more efficient than enumeration algorithm because it avoids
repeated computations as well as drops irrelevant variables.

The variable elimination algorithm constructs factors as it operates. The space and time
complexity of variable elimination is directly dependent on the size of the largest factor constructed during
the operation. The factor construction is determined by the order of elimination of the variables
and by the structure of the network, which affects both space and time complexity.

For a more efficient process we can work with singly connected networks, which are also
called polytrees. In a singly connected network, there is at most one undirected path between any two
nodes in the network. Singly connected networks have the property that the time and space
complexity of exact inference in polytrees is linear in the size of the network. Here the size is defined
as the number of CPT entries. If the number of parents of each node is bounded by a constant, then the
complexity will also be linear in the number of nodes.

For example: The burglary network shown in Fig. 7.3.2 is a polytree.

[Note that not every problem can be represented as a polytree.]

In multiply connected networks [in these, there can be more than one undirected path between some
pairs of nodes], variable elimination takes exponential time and
space in the worst case, even when the number of parents per node is bounded. It should be
noted that variable elimination includes inference in propositional logic as a special case, and inference
in Bayesian networks is NP-hard. In fact, it is strictly harder than NP-complete problems.

Clustering algorithm:

1) Clustering algorithms (also known as join tree algorithms) can reduce the inferencing time to
O(n). In clustering, individual nodes of the network are joined to form cluster nodes in such a way that
the resulting network is a polytree.

2) The variable elimination algorithm is efficient for answering individual queries. However, if posterior
probabilities are needed for all the variables in the network, it can be less efficient: in a polytree
network it needs to issue O(n) queries costing O(n) each, for a total of O(n^2) time. The clustering
algorithm improves over this.

For example: The multiply connected network shown in Fig. 7.3.8 (a) can be converted into a polytree
by combining the Sprinkler and Rain nodes into a cluster node called Sprinkler + Rain, as shown in Fig.
7.3.8 (b). The two Boolean nodes are replaced by a meganode that takes on four possible values: TT,
TF, FT, FF. The meganode has only one parent, the Boolean variable Cloudy, so there are two
conditioning cases.

7.b. Describe how Bayesian statistics provides reasoning under various kinds of uncertainty.
Bayes' theorem:

 Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determ-
ines the probability of an event with uncertain knowledge.

 In probability theory, it relates the conditional probability and marginal probabilities of two ran -
dom events.

 Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian infer-
ence is an application of Bayes' theorem, which is fundamental to Bayesian statistics.

 It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

 Bayes' theorem allows updating the probability prediction of an event by observing new informa-
tion of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the prob-
ability of cancer more accurately with the help of age.

Bayes' theorem can be derived using product rule and conditional probability of event A with known
event B:

From the product rule we can write:

1. P(A ⋀ B) = P(A|B) P(B)

Similarly, the probability of event B with known event A:


2. P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = [P(B|A) P(A)] / P(B)    ... (a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most
modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypo-
thesis A given that evidence B has occurred.

P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate the
probability of evidence.

P(A) is called the prior probability, probability of hypothesis before considering the evidence

P(B) is called marginal probability, pure probability of an evidence.

In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be written
as:

P(Ai|B) = P(B|Ai) P(Ai) / Σk P(B|Ak) P(Ak)

where A1, A2, A3, ........, An is a set of mutually exclusive and exhaustive events.

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P( B), and P(A). This is
very useful in cases where we have a good probability of these three terms and want to determine the
fourth one. Suppose we want to perceive the effect of some unknown cause, and want to compute that
cause, then the Bayes' rule becomes:

Example-1:

Question: what is the probability that a patient has the disease meningitis, given a stiff neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80% of the
time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.


o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has
meningitis, so we can calculate the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

Applying Bayes' theorem: P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133 ≈ 1/750.

Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff neck.
8.a. Compare the merits and demerits of Random forest and Naïve Bayes classifier with an example.
Random Forest Algorithm

 Random Forest is a popular machine-learning algorithm that belongs to the supervised learning
technique.

 It can be used for both Classification and Regression problems in ML.

 It is based on the concept of ensemble learning, which is a process of combining multiple classifi-
ers to solve a complex problem and to improve the performance of the model.

 As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and based on the majority votes of predictions, and it predicts the final output.

 The greater number of trees in the forest leads to higher accuracy and prevents the problem
of overfitting.

 The below diagram explains the working of the Random Forest algorithm:
Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

o It takes less training time as compared to other algorithms.


o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make a prediction with each tree created in the first phase and combine them
by majority vote.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
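
A hedged usage sketch with scikit-learn (the answer does not name a library; the synthetic dataset and
the hyper-parameters below are illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

    # n_estimators plays the role of the "number N of decision trees" from Step-3 above.
    model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
    model.fit(X_tr, y_tr)                 # each tree is built on a random subset (Steps 1-2, 4)
    pred = model.predict(X_te)            # majority vote across the trees (Step 5)
    print("accuracy:", accuracy_score(y_te, pred))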

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.

Naïve Bayes Classifier

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes the-
orem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which
helps in building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis,
and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of color,
shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each fea-
ture individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypo-
thesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes
Frequency table for the weather conditions:

Weather    Yes   No

Overcast    5     0
Rainy       2     2

Sunny       3     2

Total      10     4

Likelihood table for the weather conditions:

Weather     No            Yes           Row total

Overcast    0             5             5/14 = 0.35

Rainy       2             2             4/14 = 0.29

Sunny       2             3             5/14 = 0.35

All         4/14 = 0.29   10/14 = 0.71


Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.
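
The same table-based calculation can be reproduced directly from the raw Outlook/Play data; the short
sketch below is my own illustration of the counting and the Bayes' theorem step:

    from collections import Counter

    outlook = ['Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy', 'Sunny',
               'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Overcast', 'Overcast']
    play    = ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
               'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']

    n = len(play)
    play_counts = Counter(play)                                   # {'Yes': 10, 'No': 4}
    sunny_given = Counter(p for o, p in zip(outlook, play) if o == 'Sunny')

    p_sunny = outlook.count('Sunny') / n                          # 5/14 ≈ 0.36
    for cls in ('Yes', 'No'):
        p_cls = play_counts[cls] / n                              # prior P(cls)
        p_sunny_given_cls = sunny_given[cls] / play_counts[cls]   # likelihood P(Sunny|cls)
        posterior = p_sunny_given_cls * p_cls / p_sunny           # Bayes' theorem
        print(cls, round(posterior, 2))
    # Prints roughly Yes 0.6 and No 0.4 (the answer's 0.41 comes from intermediate
    # rounding), so "Yes" wins, matching the worked answer above.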

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the rela-
tionship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these val-
ues are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular doc-
ument belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predic-
tor variables are the independent Booleans variables. Such as if a particular word is present or
not in a document. This model is also famous for document classification tasks.

8.b. Examine the process of computing coefficients in a logistic regression with an example.
 Logistic regression is a supervised learning algorithm that makes use of logistic func-
tions to predict the probability of a binary outcome.

 Logistic regression is defined as a supervised machine learning algorithm that accomplishes


binary classification tasks by predicting the probability of an outcome, event, or observation.

What Is Logistic Regression?


Logistic regression is a supervised machine learning algorithm that accomplishes binary classi-
fication tasks by predicting the probability of an outcome, event, or observation. The model de-
livers a binary or dichotomous outcome limited to two possible outcomes: yes/no, 0/1, or true/
false.
Logistic regression analyzes the relationship between one or more independent variables and classifies
data into discrete classes. It is extensively used in predictive modeling, where the model estimates the
mathematical probability of whether an instance belongs to a specific category or not.
For example, 0 – represents a negative class; 1 – represents a positive class. Logistic regression is com -
monly used in binary classification problems where the outcome variable reveals either of the two cat -
egories (0 and 1).
Some examples of such classifications and instances where the binary response is expected or implied
are:
1. Determine the probability of heart attacks: With the help of a logistic model, medical practition-
ers can determine the relationship between variables such as the weight, exercise, etc., of an individual
and use it to predict whether the person will suffer from a heart attack or any other medical complica-
tion.
2. Possibility of enrolling into a university: Application aggregators can determine the probability of
a student getting accepted to a particular university or a degree course in a college by studying the rela-
tionship between the estimator variables, such as GRE, GMAT, or TOEFL scores.
3. Identifying spam emails: Email inboxes are filtered to determine if the email communication is
promotional/spam by understanding the predictor variables and applying a logistic regression al-
gorithm to check its authenticity.
Key advantages of logistic regression
Logistic Regression Equation and Assumptions
 Logistic regression uses a logistic function called a sigmoid function to map predictions and
their probabilities. The sigmoid function refers to an S-shaped curve that converts any real
value to a range between 0 and 1.
 Moreover, if the output of the sigmoid function (estimated probability) is greater than a pre-
defined threshold on the graph, the model predicts that the instance belongs to that class. If the
estimated probability is less than the predefined threshold, the model predicts that the instance
does not belong to the class.
 For example, if the output of the sigmoid function is above 0.5, the output is considered as 1.
On the other hand, if the output is less than 0.5, the output is classified as 0. Also, if the graph
goes further to the negative end, the predicted value of y will be 0 and vice versa. In other
words, if the output of the sigmoid function is 0.65, it implies that there are 65% chances of the
event occurring; a coin toss, for example.
The sigmoid function is referred to as an activation function for logistic regression and is defined as:

f(value) = 1 / (1 + e^(-value))

where,
 e = base of natural logarithms
 value = numerical value one wishes to transform
The following equation represents logistic regression:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

here,

 x = input value
 y = predicted output
 b0 = bias or intercept term
 b1 = coefficient for input (x)
A short sketch of how these coefficients can be computed from data is given below.
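
This is a minimal gradient-descent sketch on the log-loss with a made-up one-feature dataset; real
libraries use more robust solvers, so treat it only as an illustration of how b0 and b1 are estimated:

    import numpy as np

    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ])   # binary labels

    b0, b1, lr = 0.0, 0.0, 0.1
    for _ in range(5000):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))   # sigmoid prediction for each x
        # Gradients of the average log-loss with respect to the coefficients
        b0 -= lr * np.mean(p - y)
        b1 -= lr * np.mean((p - y) * x)

    print("b0 =", round(b0, 3), "b1 =", round(b1, 3))
    print("P(y=1 | x=2.25) =", round(1 / (1 + np.exp(-(b0 + b1 * 2.25))), 3))  # near 0.5 at the boundary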
Types of Logistic Regression with Examples
1. Binary logistic regression
Binary logistic regression predicts the relationship between the independent and binary dependent vari-
ables. Some examples of the output of this regression type may be, success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.
3. Predicting a team’s win in a football match: Outcome = yes or no.
2. Multinomial logistic regression
A categorical dependent variable has two or more discrete outcomes in a multinomial regression type.
This implies that this regression type has more than two possible outcomes.
Examples:
1. Let’s say you want to predict the most popular transportation type for 2040. Here, trans-
port type equates to the dependent variable, and the possible outcomes can be electric
cars, electric trains, electric buses, and electric bikes.
2. Predicting whether a student will join a college, vocational/trade school, or corporate in-
dustry.
3. Estimating the type of food consumed by pets, the outcome may be wet food, dry food,
or junk food.
3. Ordinal logistic regression
Ordinal logistic regression applies when the dependent variable is in an ordered state (i.e., ordinal).
The dependent variable (y) specifies an order with two or more categories or levels.
Examples: Dependent variables represent,
1. Formal shirt size: Outcomes = XS/S/M/L/XL
2. Survey answers: Outcomes = Agree/Disagree/Unsure
3. Scores on a math test: Outcomes = Poor/Average/Good
Cluster the following eight points (with (x, y) representing locations) into three clusters:
9.a.
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:

 We calculate the distance of each point from each of the center of the three clusters.
 The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10) and each of the
center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In the similar manner, we calculate the distance of other points from each of the center of the three
clusters.
Next,
 We draw a table showing all the results.
 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.

The table below lists, for each given point, its distance from the center (2, 10) of Cluster-01, its
distance from the center (5, 8) of Cluster-02, its distance from the center (1, 2) of Cluster-03, and the
cluster the point belongs to:

A1(2, 10) 0 5 9 C1

A2(2, 5) 5 6 4 C3

A3(8, 4) 12 7 9 C2

A4(5, 8) 5 0 10 C2
A5(7, 5) 10 5 9 C2

A6(6, 4) 10 5 7 C2

A7(1, 2) 9 10 0 C3

A8(4, 9) 3 2 10 C2

From here, New clusters are-


Cluster-01:
First cluster contains points-
 A1(2, 10)
Cluster-02:
Second cluster contains points-
 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
 A8(4, 9)
Cluster-03:
Third cluster contains points-
 A2(2, 5)
 A7(1, 2)
Now,
 We re-compute the new cluster centers.
 The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:

 We have only one point A1(2, 10) in Cluster-01.


 So, cluster center remains the same.

For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This is completion of Iteration-01.
Iteration-02:

 We calculate the distance of each point from each of the center of the three clusters.
 The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10) and each of the
center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(6, 6)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7

In the similar manner, we calculate the distance of other points from each of the center of the three
clusters.
Next,
 We draw a table showing all the results.
 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.

The table below lists, for each given point, its distance from the center (2, 10) of Cluster-01, its
distance from the center (6, 6) of Cluster-02, its distance from the center (1.5, 3.5) of Cluster-03, and
the cluster the point belongs to:
A1(2, 10) 0 8 7 C1

A2(2, 5) 5 5 2 C3

A3(8, 4) 12 4 7 C2

A4(5, 8) 5 3 8 C2

A5(7, 5) 10 2 7 C2

A6(6, 4) 10 2 5 C2

A7(1, 2) 9 9 2 C3

A8(4, 9) 3 5 8 C1

From here, New clusters are-


Cluster-01:
First cluster contains points-
 A1(2, 10)
 A8(4, 9)

Cluster-02:
Second cluster contains points-
 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
Cluster-03:
Third cluster contains points-
 A2(2, 5)
 A7(1, 2)
Now,
 We re-compute the new cluster centers.
 The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This is completion of Iteration-02.
After second iteration, the center of the three clusters are-
 C1(3, 9.5)
 C2(6.5, 5.25)
 C3(1.5, 3.5)
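
The two iterations above can be reproduced with a short sketch; the points, the Manhattan distance,
and the initial centers are exactly those given in the question, while the helper function itself is my own
illustration:

    def kmeans(points, centers, distance, iterations):
        for _ in range(iterations):
            clusters = [[] for _ in centers]
            for p in points:                       # assign each point to its nearest center
                d = [distance(p, c) for c in centers]
                clusters[d.index(min(d))].append(p)
            centers = [                            # recompute centers as cluster means
                (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
                for cl in clusters
            ]
        return clusters, centers

    points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
    manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    clusters, centers = kmeans(points, [(2, 10), (5, 8), (1, 2)], manhattan, iterations=2)
    print(clusters)   # [[(2,10),(4,9)], [(8,4),(5,8),(7,5),(6,4)], [(2,5),(1,2)]]
    print(centers)    # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]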

9.b. Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60), (179,68), (182,72),
(188,77) up to two iterations and show the clusters. Initially choose first two objects as initial
centroids.
Solution:
Given, number of clusters to be created (K) = 2, say c1 and c2,
number of iterations = 2, and
data points (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77).

Also, taking the first two objects as the initial centroids:


Centroid for first cluster c1 = (185, 72)
Centroid for second cluster c2 = (170, 56)

Iteration 1: Calculate the similarity of each point to each centroid using the Euclidean distance

d(a, b) = sqrt((x2 - x1)^2 + (y2 - y1)^2).

Distance of each data point from the cluster centroids (rounded to one decimal):

Point        Distance to c1 (185, 72)   Distance to c2 (170, 56)   Assigned cluster
(185, 72)    0                          21.9                       c1
(170, 56)    21.9                       0                          c2
(168, 60)    20.8                       4.5                        c2
(179, 68)    7.2                        15.0                       c1
(182, 72)    3.0                        20.0                       c1
(188, 77)    5.8                        27.7                       c1

The resulting clusters after the first iteration are:

c1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
c2 = {(170, 56), (168, 60)}

Iteration 2: Re-compute the centroid of each cluster as the mean of its points:

Centroid of c1 = ((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
Centroid of c2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58)

Now, again calculating the distances (rounded to one decimal):

Point        Distance to c1 (183.5, 72.25)   Distance to c2 (169, 58)   Assigned cluster
(185, 72)    1.5                             21.3                       c1
(170, 56)    21.1                            2.2                        c2
(168, 60)    19.8                            2.2                        c2
(179, 68)    6.2                             14.1                       c1
(182, 72)    1.5                             19.1                       c1
(188, 77)    6.5                             26.9                       c1

The resulting clusters after the second iteration are:

c1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
c2 = {(170, 56), (168, 60)}

As we have already completed the two iterations asked for by the question, the numerical ends here.

Since the clustering does not change after the second iteration, the algorithm could be terminated here

even if the question did not say so.

10.a Elaborate the steps in the back propagation-learning algorithm. What is the importance of it in
designing neural networks?

What is Artificial Neural Networks?


 A neural network is a group of connected input/output units where each connection has an
associated weight. It helps you to build predictive models from large databases.

 This model is inspired by the human nervous system. It helps you to carry out tasks such as image
understanding, modeling human learning, computer speech, etc.

What is Backpropagation?
 Backpropagation is the essence of neural network training.

 It is the method of fine-tuning the weights of a neural network based on the error rate obtained
in the previous epoch (i.e., iteration).

 Proper tuning of the weights allows you to reduce error rates and make the model reliable by
increasing its generalization.

 Backpropagation in neural network is a short form for “backward propagation of errors.”

 It is a standard method of training artificial neural networks.

 This method helps calculate the gradient of a loss function with respect to all the weights in the
network.

How Backpropagation Algorithm Works


The back propagation algorithm in a neural network computes the gradient of the loss function for a
single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct com-
putation. It computes the gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.

Consider the following Back propagation neural network example diagram to understand:

1. Inputs X, arrive through the preconnected path


2. Input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
4. Calculate the error in the outputs

ErrorB= Actual Output – Desired Output

5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is
decreased.

Keep repeating the process until the desired output is achieved
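
A compact numpy sketch of steps 1-5 above for a network with one hidden layer, trained on the XOR
problem; the architecture, learning rate, and epoch count are illustrative choices, not from the answer key:

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # step 2: random initial weights
    W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for epoch in range(10000):
        h = sigmoid(X @ W1 + b1)          # step 3: forward pass, hidden layer
        out = sigmoid(h @ W2 + b2)        #         forward pass, output layer
        error = out - y                   # step 4: error at the output
        # step 5: propagate the error backwards (chain rule) and adjust the weights
        d_out = error * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
        W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

    print(out.round(2).ravel())           # approaches [0, 1, 1, 0] as training repeats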

Why We Need Backpropagation?


Most prominent advantages of Backpropagation are:

 Backpropagation is fast, simple and easy to program


 It has no parameters to tune apart from the numbers of input
 It is a flexible method as it does not require prior knowledge about the network
 It is a standard method that generally works well
 It does not need any special mention of the features of the function to be learned.

What is a Feed Forward Network?


A feedforward neural network is an artificial neural network where the nodes never form a cycle. This
kind of neural network has an input layer, hidden layers, and an output layer. It is the first and simplest
type of artificial neural network.

Types of Backpropagation Networks


Two Types of Backpropagation Networks are:

 Static Back-propagation
 Recurrent Backpropagation

Static back-propagation
 It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.

Recurrent Backpropagation
 Recurrent Back propagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.

 The main difference between both of these methods is: that the mapping is rapid in static back-
propagation while it is nonstatic in recurrent backpropagation.

History of Backpropagation

 In 1961, the basics concept of continuous backpropagation were derived in the context of con-
trol theory by J. Kelly, Henry Arthur, and E. Bryson.
 In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
 In 1974, Werbos stated the possibility of applying this principle in an artificial neural network.
 In 1982, Hopfield brought his idea of a neural network.
 In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, back-
propagation gained recognition.
 In 1993, Wan was the first person to win an international pattern recognition contest with the
help of the backpropagation method.

Backpropagation Key Points

 Simplifies the network structure by eliminating weighted links that have the least effect on the
trained network
 You need to study a group of input and activation values to develop the relationship between
the input and hidden unit layers.
 It helps to assess the impact that a given input variable has on a network output. The knowl-
edge gained from this analysis should be represented in rules.
 Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
 Backpropagation takes advantage of the chain and power rules, which allows it to
function with any number of outputs.
Disadvantages of using Backpropagation

 The actual performance of backpropagation on a specific problem is dependent on the input


data.
 Back propagation algorithm in data mining can be quite sensitive to noisy data
 You need to use the matrix-based approach for backpropagation instead of mini-batch.

10.b Explain a deep feedforward neural network with a neat sketch.


 Feedforward neural networks were among the first and most successful learning algorithms.

They are also called deep networks, multi-layer perceptron (MLP), or simply neural networks.

As data travels through the network’s artificial mesh, each layer processes an aspect of the

data, filters outliers, spots familiar entities and produces the final output.

Feedforward neural networks are made up of the following:

Input layer: This layer consists of the neurons that receive inputs and pass them on to the other layers.
The number of neurons in the input layer should be equal to the attributes or features in the dataset.
Output layer: The output layer is the predicted feature and depends on the type of model you’re build-
ing.
Hidden layer: In between the input and output layer, there are hidden layers based on the type of
model. Hidden layers contain a vast number of neurons which apply transformations to the inputs be -
fore passing them. As the network is trained, the weights are updated to be more predictive.
Neuron weights: Weights refer to the strength or amplitude of a connection between two neurons. If
you are familiar with linear regression, you can compare weights on inputs like coefficients. Weights
are often initialized to small random values, such as values in the range 0 to 1.

To better understand how feedforward neural networks function, let’s solve a simple problem — pre-

dicting if it’s raining or not when given three inputs.

x1 - day/night
x2 - temperature
x3 - month

Let’s assume the threshold value to be 20, and if the output is higher than 20 then it will be raining,

otherwise it’s a sunny day. Given a data tuple with inputs (x1, x2, x3) as (0, 12, 11), initial weights of

the feedforward network (w1, w2, w3) as (0.1, 1, 1) and biases as (1, 0, 0).

Here’s how the neural network computes the data in three simple steps:

1. Multiplication of weights and inputs: The input is multiplied by the assigned weight values, which

this case would be the following:


(x1* w1) = (0 * 0.1) = 0

(x2* w2) = (12 * 1) = 12

(x3* w3) = (11 * 1) = 11

2. Adding the biases: In the next step, the product found in the previous step is added to their respect -

ive biases. The modified inputs are then summed up to a single value.

(x1* w1) + b1 = 0 + 1

(x2* w2) + b2 = 12 + 0

(x3* w3) + b3 = 11 + 0

weighted_sum = (x1* w1) + b1 + (x2* w2) + b2 + (x3* w3) + b3 = 24

3. Activation: An activation function is the mapping of the summed weighted input to the output of the

neuron. It is called an activation/transfer function because it governs the threshold at which the neuron

is activated and the strength of the output signal.

4. Output signal: Finally, the weighted sum obtained is turned into an output signal by feeding the

weighted sum into an activation function (also called transfer function). Since the weighted sum in our

example is greater than 20, the perceptron predicts it to be a rainy day.
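
The three steps can be written out directly for the (0, 12, 11) example; the snippet below is only a
restatement of the arithmetic above with a simple step-threshold activation:

    inputs  = [0, 12, 11]          # x1 = day/night, x2 = temperature, x3 = month
    weights = [0.1, 1, 1]
    biases  = [1, 0, 0]

    # Steps 1-2: multiply inputs by weights and add the biases (products alone sum to 23,
    # adding the biases gives 24).
    weighted_sum = sum(x * w + b for x, w, b in zip(inputs, weights, biases))
    print(weighted_sum)            # 24

    # Step activation: fire "rain" if the weighted sum exceeds the threshold of 20.
    threshold = 20
    print("rainy day" if weighted_sum > threshold else "sunny day")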

The image below illustrates this process more clearly.


There are several activation functions for different use cases. The most commonly used activation

functions are relu, tanh and softmax. Here’s a handy cheat sheet:

Calculating the Loss

In simple terms, a loss function quantifies how “good” or “bad” a given model is in classifying the in -
put data. In most learning networks, the loss is calculated as the difference between the actual output

and the predicted output.

Mathematically:

loss = y_{predicted} - y_{original}

The function that is used to compute this error is known as loss function J(.). Different loss functions

will return different errors for the same prediction, having a considerable effect on the performance of

the model.

Gradient Descent

Gradient descent is the most popular optimization technique for feedforward neural networks. The

term "gradient" refers to how much the output of a neural network changes when the inputs

change a little. Technically, it measures how the error changes with respect to a change in the weights. The gradi-

ent can also be defined as the slope of a function: the higher the angle, the steeper the slope and the

faster a model can learn.
