AIML CIA II Question Paper ECE Anskey
o Human agent: A human agent has eyes, ears, and other organs that act as sensors,
and hands, legs, and the vocal tract that act as actuators.
o Robotic agent: A robotic agent can have cameras, infrared range finders, and NLP as sensors,
and various motors as actuators.
o Software agent: A software agent can take keystrokes and file contents as sensory input,
act on those inputs, and display output on the screen.
Applications of AI
2 Given that P(A)=0.3,P(A|B)=0.4 and P(B)=0.5, Compute P(B|A).
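A minimal worked sketch of this computation, using Bayes' rule in the form P(B|A) = P(A|B) P(B) / P(A). The function name `bayes_rule` is illustrative and does not come from the question paper.

```python
def bayes_rule(p_a_given_b: float, p_b: float, p_a: float) -> float:
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a

# Values given in the question: P(A) = 0.3, P(A|B) = 0.4, P(B) = 0.5.
p_b_given_a = bayes_rule(p_a_given_b=0.4, p_b=0.5, p_a=0.3)
print(round(p_b_given_a, 4))  # 0.6667
```

So P(B|A) = 0.2 / 0.3, which is approximately 0.667.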
Distance calculation | Euclidean, Manhattan, or other distance metrics | Euclidean distance between data points and cluster centers
Step 3: Remove the node n with the lowest value of h(n) from the OPEN list and place it in the
CLOSED list.
Step 5: Check each successor of node n to see whether it is a goal node. If any successor is
the goal node, return success and stop the search; otherwise continue to the next step.
Step 6: For each successor node, compute the evaluation function f(n) and check whether the
node is already in the OPEN or CLOSED list. If it is in neither list, add it to the OPEN list.
A* Search Algorithm
A* search is the best-known form of best-first search. It uses the heuristic function h(n)
together with g(n), the cost to reach node n from the start state. It combines features of UCS and
greedy best-first search, which lets it solve the problem efficiently.
It finds the shortest path through the search space using the heuristic function. This search algorithm
expands fewer nodes of the search tree and reaches an optimal result faster.
Algorithm of A* search:
Step 2: Check whether the OPEN list is empty. If it is, return failure and stop.
Step 3: Select the node n from the OPEN list that has the smallest value of the evaluation function
(g + h). If node n is the goal node, return success and stop; otherwise continue.
Step 4: Expand node n, generate all of its successors, and put n into the CLOSED list. For each
successor n', check whether n' is already in the OPEN or CLOSED list. If not, compute the evaluation
function for n' and place it in the OPEN list.
Step 5: Otherwise, if node n' is already in OPEN or CLOSED, attach it to the back pointer that
reflects the lowest g(n') value.
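The steps above can be sketched in Python as follows. This is a minimal illustration, not the textbook's code: the OPEN list is kept as a priority queue, and the example graph, step costs, and heuristic values are hypothetical.

```python
import heapq

def a_star(graph, h, start, goal):
    """Return (cost, path) for a cheapest path, or None if the goal is unreachable.

    graph: dict mapping node -> list of (successor, step_cost)
    h:     dict mapping node -> heuristic estimate of the cost to the goal
    """
    open_heap = [(h[start], 0, start)]   # entries are (f = g + h, g, node)
    best_g = {start: 0}                  # lowest g(n) found so far
    parent = {start: None}               # back pointers
    closed = set()

    while open_heap:                     # Step 2: fail when OPEN is empty
        f, g, node = heapq.heappop(open_heap)   # Step 3: smallest g + h
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return g, path[::-1]
        if node in closed:
            continue                     # stale entry left behind in the heap
        closed.add(node)                 # Step 4: expand n
        for succ, cost in graph.get(node, []):
            new_g = g + cost
            # Step 5: keep the back pointer with the lowest g(n')
            if succ not in best_g or new_g < best_g[succ]:
                best_g[succ] = new_g
                parent[succ] = node
                closed.discard(succ)     # reopen if a cheaper route was found
                heapq.heappush(open_heap, (new_g + h[succ], new_g, succ))
    return None

# Hypothetical graph and heuristic values, for illustration only.
graph = {"S": [("A", 1), ("G", 10)], "A": [("B", 2), ("C", 1)],
         "B": [("D", 5)], "C": [("D", 3), ("G", 4)], "D": [("G", 2)]}
h = {"S": 5, "A": 3, "B": 4, "C": 2, "D": 2, "G": 0}
result = a_star(graph, h, "S", "G")
print(result)  # (6, ['S', 'A', 'C', 'G'])
```

The direct edge S to G costs 10, but the evaluation function g + h leads the search through A and C for a total cost of 6.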
Advantage:
Disadvantages:
7.a. Discuss variable elimination algorithm for answering queries on Bayesian networks
For inference in a probabilistic system, we need to calculate the posterior probability distribution for a
set of query variables, given some observed events [that is, some values assigned to
evidence variables].
• Notation Revisited:
The notation used in inference is the same as the one used in probability theory.
X: Query variable.
E: The set of evidence variables E1, ..., Em; 'e' is the particular observed event.
Y: The set of non-evidence variables Y1, Y2, ..., Yk [non-evidence variables are also called hidden variables].
• Generally the query requires the posterior probability distribution P(X | e) [assuming that the query
variable is not among the evidence variables; if it is, the posterior distribution for X simply gives
probability 1 to the observed value]. [Note that a query can contain more than one variable. For study
purposes we assume a single variable.]
• Example: In the burglary case, suppose the observed event is Jcalls = true and Mcalls = true.
The query is 'Has a burglary occurred?'
A Bayesian network gives a complete representation of the full joint distribution. These full joint
distributions can be written as product of conditional probabilities from the Bayesian network.
A query can be answered using Bayesian network by computing sums of products of conditional
probabilities from the network.
• The algorithm
The algorithm ENUMERATE-JOINT-ASK performs inference by enumerating over the full joint distribution.
Characteristics of algorithm:
1) It takes as input a full joint distribution P and looks up values in it. [The same algorithm can be
modified to take a Bayesian network as input and look up joint entries by multiplying the
corresponding conditional probability table entries from the Bayesian network.]
3) The drawback of the algorithm is that it keeps evaluating repeated subexpressions, which
wastes computation time.
• The algorithm
Function ENUMERATION-ASK(X, e, bn) returns a distribution over X.
  bn, a Bayes net with variables {X} ∪ E ∪ Y /* Y = hidden variables */
  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
    extend e with value xi for X
  Y ← FIRST(vars)
  if Y has value y in e
Example:
Consider query,
The semantics of Bayesian networks (equation 7.2.1) then gives us an expression in terms of CPT
entries. For simplicity, we will do this just for Burglary = true.
o To compute this expression, we have to add four terms, each computed by multiplying five
numbers.
o In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm
for a network with n Boolean variables is O(n·2^n).
o An improvement can be obtained from the following simple observations. The P(b) term is a
constant and can be moved outside the summations over a and e, and the P(e) term can be
moved outside the summation over a. Hence, we have
P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
This expression can be evaluated by looping through the variables in order, multiplying CPT entries as
we go. For each summation, we also need to loop over the variable's possible values. The structure of
this computation is shown in the following diagram. Using the numbers from Fig. 7.3.2, we obtain
P(b | j, m) = α × 0.00059224. The corresponding computation for ¬b yields α × 0.0014919.
Hence,
P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
That is, the chance of a burglary, given calls from both neighbours, is about 28 %. Note: In Fig. 7.3.7,
the evaluation proceeds from top to bottom, multiplying values along each path and summing at the "+" nodes.
Observe that there is repetition of paths for j and m.
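The enumeration above can be checked with a short Python sketch. The CPT values below are an assumption: they are the ones commonly given for the textbook burglary network and are not listed in this answer key (they belong to Fig. 7.3.2). With them, the sketch reproduces the two unnormalized numbers quoted above.

```python
# CPT values assumed from the standard textbook burglary network.
P_B = 0.001                                           # P(Burglary = true)
P_E = 0.002                                           # P(Earthquake = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(j | A)
P_M = {True: 0.70, False: 0.01}                       # P(m | A)

def prob(p_true: float, value: bool) -> float:
    """P(X = value) for a Boolean X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def unnormalized(b: bool) -> float:
    """P(b) * sum_e P(e) * sum_a P(a | b, e) * P(j | a) * P(m | a)."""
    total = 0.0
    for e in (True, False):
        inner = 0.0
        for a in (True, False):
            inner += prob(P_A[(b, e)], a) * P_J[a] * P_M[a]
        total += prob(P_E, e) * inner
    return prob(P_B, b) * total

scores = {b: unnormalized(b) for b in (True, False)}
alpha = 1.0 / sum(scores.values())                    # normalization constant
posterior = {b: alpha * s for b, s in scores.items()}
print(round(posterior[True], 3))   # 0.284
```

Note that the inner sum over a is recomputed for every value of e; this is the repeated work that variable elimination avoids.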
1) It works by evaluating expressions such as [P(b | j, m) = α P(b) Σ e P(e) Σa P(a | b, e) P(j | a) P(m | a)]
in right-to-left order.
2) Intermediate results are stored, and summations over each variable are done only for those portions
of the expression that depend on the variable.
3) Factors: Each part of the expression is annotated with the name of the associated variable; these
parts are called factors.
Steps in algorithm:
i) The factor for M, P(m | a), does not require summing over M. The probability is stored, given each value
of a, in a two-element factor,
iii) The factor for A is P(a | B, e), which will be a 2×2×2 matrix fA(A, B, E).
iv) Sum out A from the product of these three factors. This gives a 2×2 matrix whose indices
range over just B and E. We put a bar over A in the name of the matrix to indicate that A has been
summed out.
v) Process E in the same way, i.e., sum out E from the product of
vi) Compute the answer simply by multiplying the factor for B, i.e., fB(B) = P(B), by the
accumulated matrix:
From the above sequence of steps it can be seen that two computational operations are required.
a) Pointwise product of a pair of factors: The pointwise product of two factors f1 and f2 yields a new
factor f whose variables are the union of the variables in f1 and f2. Suppose the two factors have
variables Y1, ..., Yk in common. Then we have f(X1, ..., Xj, Y1, ..., Yk, Z1, ..., Zl) = f1(X1, ..., Xj, Y1, ..., Yk) f2(Y1, ..., Yk,
Z1, ..., Zl). If all the variables are binary, then f1 and f2 have 2^(j+k) and 2^(k+l) entries, and the pointwise product
has 2^(j+k+l) entries.
For example: Given two factors f1(A, B) and f2(B, C) with probability distributions shown below, the
pointwise product f1 × f2 is given as f3(A, B, C).
b) Summing out a variable from a product of factors: It is a straightforward computation. Any
factor that does not depend on the variable to be summed out can be moved outside the summation
process.
For example:
Σe fE(e) × fA(A, B, e) × fJ(A) × fM(A) = fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e). Now, the pointwise
product inside the summation is computed and the variable is summed out of the resulting matrix.
Matrices are not multiplied until we need to sum out a variable from the accumulated product. At that
point, we multiply those matrices that include the variable to be summed out.
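The two operations can be sketched in Python as follows. This is an illustrative representation, not the textbook's procedure: a factor is assumed to be a pair (variable names, table), where the table maps a tuple of Boolean values to a number. The entries of f1 and f2 are made-up example values.

```python
from itertools import product

def pointwise_product(f1, f2):
    """Multiply two factors; the result ranges over the union of their variables."""
    vars1, table1 = f1
    vars2, table2 = f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    out_table = {}
    for values in product((True, False), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        key1 = tuple(row[v] for v in vars1)
        key2 = tuple(row[v] for v in vars2)
        out_table[values] = table1[key1] * table2[key2]
    return out_vars, out_table

def sum_out(var, factor):
    """Sum a variable out of a factor, removing it from the factor's variables."""
    vars_, table = factor
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    out_table = {}
    for key, value in table.items():
        reduced = key[:idx] + key[idx + 1:]
        out_table[reduced] = out_table.get(reduced, 0.0) + value
    return out_vars, out_table

# Example factors f1(A, B) and f2(B, C) with illustrative entries.
f1 = (("A", "B"), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.9, (False, False): 0.1})
f2 = (("B", "C"), {(True, True): 0.2, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4})
f3 = pointwise_product(f1, f2)   # factor over (A, B, C) with 2^3 entries
f4 = sum_out("B", f3)            # factor over (A, C)
```

With binary variables, f3 has 2^(1+1+1) = 8 entries, matching the entry count formula given above with j = k = l = 1.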
The procedure for pointwise product and summing is given below, the variable elimination algorithm is
shown below:
Σm P(m | a) is equal to 1.
Note: The variable M is irrelevant to this query. The result of the query P(Jcalls | Burglary = true) is
unchanged if we remove Mcalls from the network. We can remove any leaf node that is not a query
variable or an evidence variable. After its removal, there may be more leaf nodes, and they too may be
irrelevant. Eventually we find that every variable that is not an ancestor of a query variable or evidence
variable is irrelevant to the query. A variable elimination algorithm can remove all these variables
before evaluating the query.
The variable elimination algorithm is more efficient than enumeration algorithm because it avoids
repeated computations as well as drops irrelevant variables.
The variable elimination algorithm constructs factors during its operation. The space and time
complexity of variable elimination depends directly on the size of the largest factor constructed during
the operation. The factors constructed are determined by the order of elimination of variables
and by the structure of the network, both of which affect space and time complexity.
For a more efficient process we can use singly connected networks, which are also
called polytrees. In a singly connected network, there is at most one undirected path between any two
nodes. Singly connected networks have the property that the time and space
complexity of exact inference is linear in the size of the network. Here the size is defined
as the number of CPT entries. If the number of parents of each node is bounded by a constant, then the
complexity will also be linear in the number of nodes.
For example: The Burglary network shown in Fig. 7.3.2 is a polytree.
In multiply connected networks [in which there can be multiple undirected paths between two nodes and more
than one directed path between some pair of nodes], variable elimination takes exponential time and
space in the worst case, even when the number of parents per node is bounded. It should be
noted that inference in Bayesian networks includes inference in propositional logic as a special case, so inference
in Bayesian networks is NP-hard. In fact, it is strictly harder than NP-complete problems.
Clustering algorithm:
1) Clustering algorithms (also known as join tree algorithms) can reduce the inference time to
O(n). In clustering, individual nodes of the network are joined to form cluster nodes in such a way that
the resulting network is a polytree.
2) The variable elimination algorithm is efficient for answering individual queries. When posterior
probabilities must be computed for all the variables in the network, however, it can be less efficient: in a polytree
network it needs to issue O(n) queries costing O(n) each, for a total of O(n^2) time. Clustering
algorithms improve on this.
For example: The multiply connected network shown in Fig. 7.3.8 (a) can be converted into a polytree
by combining the Sprinkler and Rain nodes into a cluster node called Sprinkler + Rain, as shown in Fig.
7.3.8 (b). The two Boolean nodes are replaced by a meganode that takes on four possible values: TT,
TF, FT, FF. The meganode has only one parent, the Boolean variable Cloudy, so there are two
conditioning cases.
7.b. Describe how Bayesian statistics provides reasoning under various kinds of uncertainty.
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning. It determines
the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two random
events.
Bayes' theorem was named after the British mathematician Thomas Bayes. Bayesian inference
is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new information
about the real world.
Example: If cancer corresponds to one's age, then by using Bayes' theorem we can determine the
probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A with known
event B.
From the product rule we can write:
P(A ∧ B) = P(A|B) P(B)
Similarly, for the probability of event B with known event A:
P(A ∧ B) = P(B|A) P(A)
Equating the right-hand sides of both equations, we get:
P(A|B) = P(B|A) P(A) / P(B)    ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most
modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate. It is read as the probability of
hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming that the hypothesis is true, we calculate the
probability of the evidence.
P(A) is called the prior probability, the probability of the hypothesis before considering the evidence.
P(B) is called the marginal probability, the probability of the evidence alone.
In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be written
as:
P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is
very useful in cases where we have good estimates of these three terms and want to determine the
fourth one. Suppose we perceive the effect of some unknown cause and want to compute that
cause; then Bayes' rule becomes:
P(cause|effect) = P(effect|cause) P(cause) / P(effect)
Example-1:
Question: What is the probability that a patient with a stiff neck has meningitis?
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80% of the
time. He is also aware of some more facts, which are given as follows:
The known probability that a patient has meningitis is 1/30000.
The known probability that a patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis,
so we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
P(b|a) = P(a|b) P(b) / P(a) = 0.8 × (1/30000) / 0.02 ≈ 0.00133 = 1/750
Hence, we can assume that 1 patient out of 750 patients with a stiff neck has meningitis.
8.a. Compare the merits and demerits of Random forest and Naïve Bayes classifier with an example.
Random Forest Algorithm
Random Forest is a popular machine-learning algorithm that belongs to the supervised learning
technique.
It is based on the concept of ensemble learning, which is the process of combining multiple classifiers
to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and predicts the final output based on the majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the risk
of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
Random Forest works in two phases: the first is to create the random forest by combining N decision trees,
and the second is to make predictions with each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
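The working process above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a full Random Forest: each "tree" is a one-split decision stump, random feature selection is omitted, and the toy dataset and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-split tree: pick the (feature, threshold) with the fewest errors."""
    best = None
    for f in range(len(samples[0][0])):
        for x, _ in samples:
            t = x[f]
            left = [y for xi, y in samples if xi[f] <= t]
            right = [y for xi, y in samples if xi[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    f, t, left_label, right_label = stump
    return left_label if x[f] <= t else right_label

def train_forest(data, n_trees, rng):
    forest = []
    for _ in range(n_trees):                        # N decision trees
        subset = [rng.choice(data) for _ in data]   # random subset, drawn with replacement
        forest.append(train_stump(subset))
    return forest

def predict_forest(forest, x):
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]               # category with the majority of votes

# Hypothetical two-class toy data: (features, label).
data = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
        ((6.0, 6.5), "B"), ((7.0, 6.0), "B"), ((6.5, 7.5), "B")]
forest = train_forest(data, n_trees=15, rng=random.Random(0))
label_low = predict_forest(forest, (0.5, 0.5))
label_high = predict_forest(forest, (9.0, 9.0))
```

Individual stumps trained on different random subsets may disagree, but the majority vote smooths out their individual errors.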
o The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes' theorem
and used for solving classification problems.
o It is mainly used in text classification, which involves a high-dimensional training dataset.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms. It
helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.
The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of color,
shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature
individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the
probability of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play on a particular day according to the
weather conditions. To solve this problem, we follow the steps below:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Likelihood table for the weather conditions:
Weather    No            Yes             P(Weather)
Overcast   0             5               5/14 ≈ 0.36
Rainy      2             2               4/14 ≈ 0.29
Sunny      2             3               5/14 ≈ 0.36
All        4/14 ≈ 0.29   10/14 ≈ 0.71
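Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 ≈ 0.36
P(Yes) = 10/14 ≈ 0.71
So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 ≈ 0.29
P(Sunny) = 5/14 ≈ 0.36
So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 0.40
Since P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.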
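The calculation can be reproduced directly from the 14-row dataset above. The following is a minimal sketch; the helper name `posterior` is illustrative.

```python
from collections import Counter

# The Outlook and Play columns of the dataset, rows 0 to 13.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(label: str, weather: str) -> float:
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    n = len(play)
    label_count = Counter(play)[label]
    weather_count = Counter(outlook)[weather]
    joint_count = sum(1 for o, p in zip(outlook, play)
                      if o == weather and p == label)
    likelihood = joint_count / label_count   # P(weather | label)
    prior = label_count / n                  # P(label)
    evidence = weather_count / n             # P(weather)
    return likelihood * prior / evidence

p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
print(round(p_yes, 2), round(p_no, 2))  # 0.6 0.4
```

Working with exact counts gives 0.6 and 0.4; small differences from hand calculations come from rounding the intermediate probabilities.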
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
that if predictors take continuous values instead of discrete ones, the model assumes that these
values are sampled from a Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, that is, deciding which
category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present
in a document or not. This model is also well known for document classification tasks.
8.b. Examine the process of computing coefficients in a logistic regression with an example.
Logistic regression is a supervised learning algorithm that makes use of the logistic function
to predict the probability of a binary outcome.
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
here,
x = input value
y = predicted output
b0 = bias or intercept term
b1 = coefficient for input (x)
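One common way to compute the coefficients b0 and b1 is gradient descent on the log-loss. The sketch below assumes that approach; the learning rate, the number of epochs, and the hours-studied dataset are hypothetical choices for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function; equal to e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Estimate b0 and b1 by gradient descent on the average log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_b0 = 0.0
        grad_b1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y   # predicted probability minus target
            grad_b0 += error
            grad_b1 += error * x
        b0 -= lr * grad_b0 / n
        b1 -= lr * grad_b1 / n
    return b0, b1

# Hypothetical data: hours studied (x) and pass = 1 / fail = 0 (y).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 0.5)    # predicted probability of passing after 0.5 hours
p_high = sigmoid(b0 + b1 * 4.0)   # predicted probability of passing after 4 hours
```

A positive b1 means that the predicted probability rises with x; a prediction above 0.5 is usually read as the positive class.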
Types of Logistic Regression with Examples
1. Binary logistic regression
Binary logistic regression predicts the relationship between the independent variables and a binary dependent
variable. Some examples of the output of this regression type are success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.
3. Predicting a team’s win in a football match: Outcome = yes or no.
2. Multinomial logistic regression
In multinomial logistic regression, the categorical dependent variable has three or more discrete outcomes
with no natural order.
Examples:
1. Let’s say you want to predict the most popular transportation type for 2040. Here, transport
type is the dependent variable, and the possible outcomes can be electric
cars, electric trains, electric buses, and electric bikes.
2. Predicting whether a student will join a college, a vocational/trade school, or corporate
industry.
3. Estimating the type of food consumed by pets; the outcome may be wet food, dry food,
or junk food.
3. Ordinal logistic regression
Ordinal logistic regression applies when the dependent variable is in an ordered state (i.e., ordinal).
The dependent variable (y) specifies an order with two or more categories or levels.
Examples: Dependent variables represent,
1. Formal shirt size: Outcomes = XS/S/M/L/XL
2. Survey answers: Outcomes = Agree/Disagree/Unsure
3. Scores on a math test: Outcomes = Poor/Average/Good
9.a. Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:
We calculate the distance of each point from each of the three cluster centers.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each of the
three cluster centers-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In a similar manner, we calculate the distance of the other points from each of the three cluster
centers.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
Each point belongs to the cluster whose center is nearest to it.
Given Points   Distance from C1(2, 10)   Distance from C2(5, 8)   Distance from C3(1, 2)   Belongs to Cluster
A1(2, 10)      0                         5                        9                        C1
A2(2, 5)       5                         6                        4                        C3
A3(8, 4)       12                        7                        9                        C2
A4(5, 8)       5                         0                        10                       C2
A5(7, 5)       10                        5                        9                        C2
A6(6, 4)       10                        5                        7                        C2
A7(1, 2)       9                         10                       0                        C3
A8(4, 9)       3                         2                        10                       C2
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This is completion of Iteration-01.
Iteration-02:
We calculate the distance of each point from each of the three cluster centers.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each of the
three cluster centers-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(6, 6)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7
In a similar manner, we calculate the distance of the other points from each of the three cluster
centers.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
Each point belongs to the cluster whose center is nearest to it.
Given Points   Distance from C1(2, 10)   Distance from C2(6, 6)   Distance from C3(1.5, 3.5)   Belongs to Cluster
A1(2, 10)      0                         8                        7                            C1
A2(2, 5)       5                         5                        2                            C3
A3(8, 4)       12                        4                        7                            C2
A4(5, 8)       5                         3                        8                            C2
A5(7, 5)       10                        2                        7                            C2
A6(6, 4)       10                        2                        5                            C2
A7(1, 2)       9                         9                        2                            C3
A8(4, 9)       3                         5                        8                            C1
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
Cluster-03:
Third cluster contains points-
A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This is completion of Iteration-02.
After the second iteration, the centers of the three clusters are-
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
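The two iterations above can be reproduced with a short Python sketch that uses the given distance function ρ(a, b) = |x2 – x1| + |y2 – y1|. The function names are illustrative.

```python
def manhattan(a, b):
    """The distance function from the question: |x2 - x1| + |y2 - y1|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def kmeans_step(points, centers):
    """One iteration: assign each point to its nearest center, then recompute the means."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [manhattan(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    new_centers = []
    for cluster, old in zip(clusters, centers):
        if cluster:
            new_centers.append((sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster)))
        else:
            new_centers.append(old)   # keep the old center if a cluster is empty
    return clusters, new_centers

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers = [(2, 10), (5, 8), (1, 2)]   # initial centers A1, A4, A7
for _ in range(2):                    # Iteration-01 and Iteration-02
    clusters, centers = kmeans_step(points, centers)
print(centers)  # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```

The printed centers match C1(3, 9.5), C2(6.5, 5.25) and C3(1.5, 3.5) obtained by hand.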
9.b. Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60), (179,68), (182,72),
(188,77) up to two iterations and show the clusters. Initially choose first two objects as initial
centroids.
Solution:
Given, number of clusters to be created (K) = 2 say c1 and c2,
number of iterations = 2 and
The given data points can be represented in tabular form as:
As we have already completed the two iterations asked for by the question, the numerical ends here.
Since, the clustering doesn’t change after second iteration, so terminate the iteration even if question
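The two iterations for this data can be sketched as follows. Euclidean distance is an assumption here, since the question does not name a distance measure.

```python
import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
centroids = [data[0], data[1]]          # first two objects, as the question states

for iteration in range(1, 3):           # two iterations
    clusters = [[], []]
    for p in data:
        # Euclidean distance is assumed; the question does not specify a metric.
        nearest = min(range(2), key=lambda k: math.dist(p, centroids[k]))
        clusters[nearest].append(p)
    # New centroid = coordinate-wise mean (both clusters are non-empty for this data).
    centroids = [tuple(sum(coord) / len(c) for coord in zip(*c)) for c in clusters]
    print(iteration, clusters, centroids)
```

Under this assumption both iterations give the same clusters, {(185, 72), (179, 68), (182, 72), (188, 77)} and {(170, 56), (168, 60)}, with centroids (183.5, 72.25) and (169, 58).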
10.a Elaborate the steps in the back propagation-learning algorithm. What is the importance of it in
designing neural networks?
This model builds upon the human nervous system. It helps with tasks such as image understanding,
human learning, and computer speech.
What is Backpropagation?
Backpropagation is the essence of neural network training.
It is the method of fine-tuning the weights of a neural network based on the error rate obtained
in the previous epoch (i.e., iteration).
Proper tuning of the weights allows you to reduce error rates and make the model reliable by
increasing its generalization.
This method helps calculate the gradient of a loss function with respect to all the weights in the
network.
Consider the following Back propagation neural network example diagram to understand:
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is
decreased.
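The idea of travelling back from the output layer to adjust the weights can be sketched for a tiny network. This is a minimal illustration under stated assumptions: two inputs, two sigmoid hidden units, one sigmoid output, squared error, no bias terms, and made-up starting weights and training example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One forward pass, one backward pass, and one weight update (in place)."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output layer: dE/dz for E = 0.5 * (output - target)^2 with a sigmoid unit.
    delta_out = (output - target) * output * (1.0 - output)
    # Hidden layer: the error travels back through the output weights (chain rule).
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(w_out)):
        w_out[j] -= lr * delta_out * hidden[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * (output - target) ** 2   # error measured before this update

# Illustrative starting weights and a single training example.
w_hidden = [[0.15, 0.20], [0.25, 0.30]]
w_out = [0.40, 0.45]
x, target = [0.05, 0.10], 0.01

losses = [train_step(x, target, w_hidden, w_out) for _ in range(200)]
print(losses[0] > losses[-1])  # True
```

Each step uses the gradient of the error with respect to every weight, so repeated steps move the output toward the target and the error shrinks.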
Static Back-propagation
Recurrent Backpropagation
Static back-propagation
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation
Recurrent backpropagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is rapid in static
backpropagation, while it is non-static in recurrent backpropagation.
History of Backpropagation
In 1961, the basic concepts of continuous backpropagation were derived in the context of control
theory by Henry J. Kelley and Arthur E. Bryson.
In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
In 1974, Werbos stated the possibility of applying this principle in an artificial neural network.
In 1982, Hopfield brought his idea of a neural network.
In 1986, through the efforts of David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams,
backpropagation gained recognition.
In 1993, Wan was the first person to win an international pattern recognition contest with the
help of the backpropagation method.
It simplifies the network structure by removing weighted links that have the least effect on the
trained network.
You need to study a group of input and activation values to develop the relationship between
the input and hidden unit layers.
It helps to assess the impact that a given input variable has on a network output. The knowledge
gained from this analysis should be represented in rules.
Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
Backpropagation takes advantage of the chain and power rules, which allows it to
function with any number of outputs.
Disadvantages of using Backpropagation
They are also called deep networks, multi-layer perceptron (MLP), or simply neural networks.
As data travels through the network’s artificial mesh, each layer processes an aspect of the
data, filters outliers, spots familiar entities and produces the final output.
Input layer: This layer consists of the neurons that receive inputs and pass them on to the other layers.
The number of neurons in the input layer should be equal to the attributes or features in the dataset.
Output layer: The output layer is the predicted feature and depends on the type of model you’re
building.
Hidden layer: In between the input and output layer, there are hidden layers based on the type of
model. Hidden layers contain a vast number of neurons which apply transformations to the inputs
before passing them on. As the network is trained, the weights are updated to be more predictive.
Neuron weights: Weights refer to the strength or amplitude of a connection between two neurons. If
you are familiar with linear regression, you can compare weights on inputs like coefficients. Weights
are often initialized to small random values, such as values in the range 0 to 1.
To better understand how feedforward neural networks function, let’s solve a simple problem — pre-
x1 - day/night
x2 - temperature
x3 - month
Let’s assume the threshold value to be 20, and if the output is higher than 20 then it will be raining,
otherwise it’s a sunny day. Given a data tuple with inputs (x1, x2, x3) as (0, 12, 11), initial weights of
the feedforward network (w1, w2, w3) as (0.1, 1, 1) and biases as (1, 0, 0).
Here’s how the neural network computes the data in three simple steps:
1. Multiplication of weights and inputs: The input is multiplied by the assigned weight values, which
2. Adding the biases: In the next step, the products found in the previous step are added to their
respective biases. The modified inputs are then summed up to a single value.
(x1* w1) + b1 = 0 + 1
(x2* w2) + b2 = 12 + 0
(x3* w3) + b3 = 11 + 0
3. Activation: An activation function is the mapping of summed weighted input to the output of the
neuron. It is called an activation/transfer function because it governs the inception at which the neuron
4. Output signal: Finally, the weighted sum obtained is turned into an output signal by feeding the
weighted sum into an activation function (also called transfer function). Since the weighted sum in our
functions are relu, tanh and softmax. Here’s a handy cheat sheet:
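Returning to the rain-prediction example in steps 1 to 4 above, the computation can be sketched as follows. A simple step (threshold) activation is assumed, based on the stated rule that an output higher than 20 means rain.

```python
inputs = [0, 12, 11]        # x1 = day/night, x2 = temperature, x3 = month
weights = [0.1, 1, 1]       # w1, w2, w3
biases = [1, 0, 0]          # b1, b2, b3
threshold = 20

# Steps 1 and 2: multiply each input by its weight, add its bias, then sum.
weighted = [x * w + b for x, w, b in zip(inputs, weights, biases)]
total = sum(weighted)

# Steps 3 and 4: a step activation compares the sum with the threshold.
prediction = "raining" if total > threshold else "sunny"
print(total, prediction)
```

The weighted sum is 1 + 12 + 11 = 24, which is higher than 20, so the network predicts rain for this data tuple.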
In simple terms, a loss function quantifies how “good” or “bad” a given model is in classifying the
input data. In most learning networks, the loss is calculated as the difference between the actual output
Mathematically:
The function that is used to compute this error is known as loss function J(.). Different loss functions
will return different errors for the same prediction, having a considerable effect on the performance of
the model.
Gradient Descent
Gradient descent is the most popular optimization technique for feedforward neural networks. The
term “gradient” refers to the change in the output of a neural network when the inputs
change a little. Technically, it measures the change in error with respect to a change in the weights. The
gradient can also be defined as the slope of a function. The higher the angle, the steeper the slope and the