
Introduction to Descriptive Statistics and Probability for Data Science
- Sandeep Chaurasia
Content
• Introduction
• Measure of Central Tendency (Mean, Mode, Median)
• Measures of Variability (Range, IQR, Variance, Standard Deviation)
• Probability (Bernoulli Trials, Normal Distribution)
• Central Limit Theorem
• Z scores
Descriptive Statistics
• Statistics has become the universal language of the sciences, and data
analysis can lead to powerful results.
• Has there been a significant change in the mean sawtimber volume in the red
pine stands?
• Has there been an increase in the number of invasive species found in the
Great Lakes?
• What proportion of white tail deer in New Hampshire have weights below the
limit considered healthy?
• Did fertilizer A, B, or C have an effect on the corn yield?

Statistics is the science of collecting, organizing, summarizing, analyzing, and interpreting information.
Cont.
• Good statistics come from good samples, and are used to draw
conclusions or answer questions about a population

Fig. Using sample statistics to estimate population parameters.


Descriptive Statistics
• A population is the group to be studied, and population data is a
collection of all elements in the population. For example:
• All the fish in Long Lake.
• All the lakes in the Adirondack Park.
• All the grizzly bears in Yellowstone National Park.
• A sample is a subset of data drawn from the population of interest.
For example:
• 100 fish randomly sampled from Long Lake.
• 25 lakes randomly selected from the Adirondack Park.
• 60 grizzly bears with a home range in Yellowstone National Park.
Descriptive Statistics
• Populations are characterized by descriptive measures called
parameters. Inferences about parameters are based on sample
statistics. For example, the population mean (µ) is estimated by the
sample mean (x̄). The population variance (σ²) is estimated by the
sample variance (s²).
• Variables are the characteristics we are interested in. For example:
• The length of fish in Long Lake.
• The pH of lakes in the Adirondack Park.
• The weight of grizzly bears in Yellowstone National Park.
Variables
• Variables are divided into two major groups: qualitative and
quantitative. Qualitative variables have values that are attributes or
categories. Mathematical operations cannot be applied to qualitative
variables.
• Examples of qualitative variables are gender, race, and petal color.
• Examples of quantitative variables are age, height, and length.
• Quantitative variables can be broken down further into two more
categories: discrete and continuous variables
Cont.

• Descriptive measures of samples are called statistics and are typically written using Roman letters. The sample mean is x̄.
• The sample variance is s² and the sample standard deviation is s. Sample statistics are used to estimate unknown population parameters.
Measures of Center
Mean
The arithmetic mean of a variable, often called the average, is computed by adding up all the values and dividing by the total number of values.

The sample mean is usually the best, unbiased estimate of the population mean. However, the mean is influenced by extreme values (outliers) and may not be the best measure of center with strongly skewed data. The following equations compute the population mean and sample mean:

µ = (Σ xi) / N     and     x̄ = (Σ xi) / n

where xi is an element in the data set, N is the number of elements in the population, and n is the number of elements in the sample data set.
• Find the mean for the following sample data set: 6.4, 5.2, 7.9, 3.4
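Worked: x̄ = (6.4 + 5.2 + 7.9 + 3.4) / 4 = 22.9 / 4 ≈ 5.73.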

Median
• The median of a variable is the middle value of the data set when the
data are sorted in order from least to greatest. It splits the data into
two equal halves with 50% of the data below the median and 50%
above the median.

#1 : 23, 27, 29, 31, 35, 39, 40, 42, 44, 47, 51
#2 : 23, 27, 29, 31, 35, 39, 40, 42, 44, 47
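Worked: in #1 (n = 11, odd), the median is the 6th sorted value, 39. In #2 (n = 10, even), the median is the average of the 5th and 6th values: (35 + 39) / 2 = 37.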
Mode
• The mode is the most frequently occurring value and is commonly used
with qualitative data as the values are categorical. Categorical data cannot
be added, subtracted, multiplied or divided, so the mean and median
cannot be computed.
• Understanding the relationship between the mean and median is
important. It gives us insight into the distribution of the variable. For
example, if the distribution is skewed right (positively skewed), the mean
will increase to account for the few larger observations that pull the
distribution to the right.
Measures of Dispersion
• Measures of center look at the average or middle values of a data set.
Measures of dispersion look at the spread or variation of the
data. Variation refers to the amount that the values vary among
themselves. Values in a data set that are relatively close to each other
have lower measures of variation. Values that are spread farther
apart have higher measures of variation.
Examine the two histograms below. Both groups have the same mean weight, but the values of Group A are more
spread out compared to the values in Group B. Both groups have an average weight of 267 lb. but the weights of
Group A are more variable.
Range
• The range of a variable is the largest value minus the smallest value. It
is the simplest measure and uses only these two values in a
quantitative data set.
• Find the range for the given data set: 12, 29, 32, 34, 38, 49, 57
Range = 57 – 12 = 45
Variance
The variance uses the difference between each value and its arithmetic
mean. The differences are squared to deal with positive and negative
differences. The sample variance (s²) is an unbiased estimator of the
population variance (σ²), with n − 1 degrees of freedom.
• Compute the variance of the sample data: 3, 5, 7.
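Worked: x̄ = (3 + 5 + 7)/3 = 5; the squared deviations are (3 − 5)² = 4, (5 − 5)² = 0 and (7 − 5)² = 4, so s² = (4 + 0 + 4)/(3 − 1) = 4.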

Standard Deviation
The standard deviation is the square root of the variance (both
population and sample). While the sample variance is the positive,
unbiased estimator for the population variance, the units for the
variance are squared. The standard deviation is a common method for
numerically describing the distribution of a variable.

• Compute the standard deviation of the


sample data: 3, 5, 7
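Worked: s = √s² = √4 = 2, using the sample variance computed above.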
Standard Error of the Mean
If we want to estimate the heights of eighty-year-old cherry trees, we can proceed as
follows:
•Randomly select 100 trees
•Compute the sample mean of the 100 heights
•Use that as our estimate

We want to use this sample mean to estimate the true but unknown population mean.
• Sample 1—we compute sample mean x̄
• Sample 2—we compute sample mean x̄
• Sample 3—we compute sample mean x̄

The sample mean (x̄) is a random variable with its own probability distribution, called the sampling distribution of the sample mean. The distribution of the sample mean will have a mean equal to µ and a standard deviation equal to σ/√n.
The standard error is the standard deviation of all possible sample means.
Example #1 :
5 students in a college were selected at random and their ages were found to be 18, 21, 19, 20 and 26.
a) Calculate the standard deviation of the ages in the sample.
b) Calculate the standard error.
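Worked: x̄ = (18 + 21 + 19 + 20 + 26)/5 = 20.8. The squared deviations are 7.84, 0.04, 3.24, 0.64 and 27.04, which sum to 38.8, so s² = 38.8/4 = 9.7 and s ≈ 3.11. The standard error is s/√n = 3.11/√5 ≈ 1.39.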

#2
In a certain university, the mean age of the students is 20.5 with a standard deviation of 0.8.
a) Calculate the standard error of the mean if a sample of 25 students were selected.
b) What would the standard error of the mean be if a sample of 100 students were selected?
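Worked: a) SE = σ/√n = 0.8/√25 = 0.16. b) SE = 0.8/√100 = 0.08; quadrupling the sample size halves the standard error.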
Coefficient of Variation

Comparing standard deviations between different populations or samples is difficult because the standard deviation depends on the units of measure. The coefficient of variation expresses the standard deviation as a percentage of the sample or population mean:

CV = (s / x̄) × 100%

Ex: Store wait time in minutes

Store1 6.5 6.7 6.8 7.2 7.3 7.4 7.9
Store2 4.2 5.4 6.2 7.7 8.4 9.2 9.8
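A minimal sketch (not from the slides) computing the coefficient of variation for both stores with Python's standard library:

from statistics import mean, stdev

store1 = [6.5, 6.7, 6.8, 7.2, 7.3, 7.4, 7.9]
store2 = [4.2, 5.4, 6.2, 7.7, 8.4, 9.2, 9.8]

for name, data in [("Store1", store1), ("Store2", store2)]:
    cv = stdev(data) / mean(data) * 100  # sample CV as a percentage
    print(name, round(cv, 1))            # ≈ 6.8 for Store1 vs ≈ 28.4 for Store2

Although both stores have similar mean wait times, Store2's waits are far more variable relative to its mean.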
Probability Distribution
• To find the probabilities associated with a continuous random variable, we
use a probability density function (PDF).
• A PDF is an equation used to find probabilities for continuous random
variables. The PDF must satisfy the following two rules:
1.The area under the curve must equal one (over all possible values of the random
variable).
2.The probabilities must be equal to or greater than zero for all possible values of the
random variable.
• The Normal Distribution
Many continuous random variables have a bell-shaped or somewhat
symmetric distribution. The curve is bell-shaped, symmetric about the mean,
and defined by µ and σ (the mean and standard deviation).
There are normal curves for every combination of µ and σ. The mean (µ) shifts the curve to the left or right. The
standard deviation (σ) alters the spread of the curve.
The first pair of curves have different means but the same standard deviation.
The second pair of curves share the same mean (µ) but have different standard deviations. The pink curve has a
smaller standard deviation. It is narrower and taller, and the probability is spread over a smaller range of values. The
blue curve has a larger standard deviation. The curve is flatter, and the tails are thicker. The probability is spread over a
larger range of values.
Properties of the normal curve:
• The mean is the center of this distribution and the highest point.
• The curve is symmetric about the mean. (The area to the left of the
mean equals the area to the right of the mean.)
• The total area under the curve is equal to one.
• As x increases and decreases, the curve goes to zero but never
touches.
• The PDF of a normal curve: f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
• A normal curve can be used to estimate probabilities.
• A normal curve can be used to estimate proportions of a population
that have certain x-values.
The Standard Normal Distribution
• There are millions of possible combinations of means and standard
deviations for continuous random variables. Finding probabilities
associated with these variables would require us to integrate the PDF
over the range of values we are interested in. To avoid this, we can
rely on the standard normal distribution.
• We can use the Z-score to standardize any normal random variable,
converting the x-values to Z-scores.
• Example: Mean = 120, Std. Dev. = 12.
• Z = (xi − mean) / std. dev.
Application:
• Standardization: scaling multiple features with the z-score when they are on different scales (StandardScaler in sklearn).
• Comparing scores between different distributions:
Distribution 1: Avg = 181, Std. dev = 12, Real value = 187
Distribution 2: Avg = 182, Std. dev = 5, Real value = 185
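Worked: z1 = (187 − 181)/12 = 0.5 and z2 = (185 − 182)/5 = 0.6, so the second value is relatively farther above its mean even though 187 > 185.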
Probability
Random Variables : A random variable is a variable that takes on different
values determined by chance. In other words, it is a numerical quantity that
varies at random.
Ex. Suppose we flip a fair coin three times and record if it shows a head or a tail. The
outcome or sample space is S={HHH,HHT,HTH,THH,TTT,TTH,THT,HTT}.
Discrete Random Variable: When the random variable can assume only a
countable, sometimes infinite, number of values.
Continuous Random Variable: When the random variable can assume an
uncountable number of values in a line interval.
Probability Functions:
• Probability Mass Function (PMF) for discrete random variable
• Probability Density Function (PDF) for continuous random variable
• Cumulative Distribution Function (CDF): a function that gives the probability that the random variable, X, is less than or equal to the value x.
Example: Consider the data set with the values 0, 1, 2, 3, 4. If X is a random variable of a random draw from these values, what is the probability you select 2?
P(X = 2) = ?

Find the CDF, in tabular form, of the random variable X as defined above.
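Worked: each of the five values is equally likely, so P(X = 2) = 1/5 = 0.2. The CDF in tabular form:

x        0    1    2    3    4
P(X=x)   0.2  0.2  0.2  0.2  0.2
P(X≤x)   0.2  0.4  0.6  0.8  1.0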
Hypothesis Testing
• The first step in hypothesis testing is to set up two competing
hypotheses. The hypotheses are the most important aspect. If the
hypotheses are incorrect, your conclusion will also be incorrect.
• The two hypotheses are named the null hypothesis and the
alternative hypothesis.

The goal of hypothesis testing is to see if there is enough evidence against the null hypothesis. In other words, to
see if there is enough evidence to reject the null hypothesis. If there is not enough evidence, then we fail to reject
the null hypothesis.
• Ex: A man, Mr. XyZ, goes to trial and is tried for the murder of his ex-
wife. He is either guilty or innocent. Set up the null and alternative
hypotheses for this example.
The hypotheses being tested are:
1. Null hypothesis (H0): the man is innocent (the default presumption).
2. Alternative hypothesis (Ha): the man is guilty.
Set-Up for One-Sample Hypotheses


Chi-Square Test for Independence
• The chi-square test is a hypothesis test used when you want to
determine whether there is a relationship between two categorical variables.
• Null hypothesis: there is no relationship between gender and
highest educational attainment.
• Alternative hypothesis: There is a correlation between gender
and the highest educational attainment.
The chi-squared value is calculated via:

χ² = Σ (O − E)² / E

where O is an observed frequency and E the corresponding expected frequency. For a significance level of 5% (here, with one degree of freedom), the critical value is 3.841.

Since the calculated chi-squared value is smaller, there is no significant difference. As a prerequisite for this test, please note that all expected frequencies must be greater than 5.
• A school principal would like to know on which days of the week students are most likely to be absent. The principal expects that students will be absent equally across the five-day school week. The principal selects a random sample of 100 teachers, asking them on which day of the week they had the highest number of student absences. The observed and expected results are shown in the table below. Based on these results, do the highest numbers of absences occur with equal frequency across the days? (Use a 5% significance level.)
Mon Tue Wed Thu Fri
Observed Absences 23 16 14 19 28
Expected Absences 20 20 20 20 20
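Worked: χ² = Σ (O − E)²/E = (9 + 16 + 36 + 1 + 64)/20 = 126/20 = 6.3. With 5 − 1 = 4 degrees of freedom, the 5% critical value is 9.488. Since 6.3 < 9.488, we fail to reject the null hypothesis: the data are consistent with absences occurring equally across the five days.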
• In an antimalarial campaign in India, quinine was administered to 500 persons out of a total population of 2000. The number of fever cases is shown below:

Treatment    Fever   No fever   Total
Quinine      20      480        500
No Quinine   100     1400       1500
Total        120     1880       2000
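Worked (test of independence): each expected count is row total × column total / grand total, e.g. E(Quinine, Fever) = 500 × 120/2000 = 30, giving expected counts 30, 470, 90 and 1410. Then χ² = (20 − 30)²/30 + (480 − 470)²/470 + (100 − 90)²/90 + (1400 − 1410)²/1410 ≈ 3.33 + 0.21 + 1.11 + 0.07 ≈ 4.73. With (2 − 1)(2 − 1) = 1 degree of freedom the 5% critical value is 3.841; since 4.73 > 3.841, quinine appears to have a significant effect on fever incidence.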
Introduction to Analysis of Variance
Make an analysis of the variance on given
data:
• To assess the significance of possible variation in performance on a certain test between the convent schools of a city, a common test was given to a number of students taken at random from the 5th class of each of the three schools concerned. The results are given below:
A B C
9 13 14
11 12 13
13 10 17
9 15 7
8 5 9
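A minimal sketch (not from the slides) running the one-way ANOVA with SciPy:

from scipy.stats import f_oneway

school_a = [9, 11, 13, 9, 8]
school_b = [13, 12, 10, 15, 5]
school_c = [14, 13, 17, 7, 9]

f_stat, p_value = f_oneway(school_a, school_b, school_c)
print(f_stat, p_value)  # if p_value > 0.05, the variation between schools is
                        # not significant at the 5% level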
Introduction to Descriptive
Statistics and Probability for
Data Science
- Sandeep Chaurasia
Random Variable
A random variable is a variable whose value is not known. It can either be
discrete (having a specific value) or continuous (any value in a continuous
range). All possible values that a random variable accepts is also called a
sample space.

Binomial Random Variable: A binomial random variable is the number of successes in an experiment consisting of N trials. Some examples are:

• The number of successes (tails) in an experiment of 100 trials of tossing a coin. Here the sample space is {0, 1, 2, …, 100}.
• The number of successes (fours) in an experiment of 100 trials of rolling a die. Here the sample space is {0, 1, 2, …, 100}.
Binomial Distribution
Binomial distribution is a discrete probability distribution that represents
the probabilities of binomial random variables. The binomial distribution
is a probability distribution associated with a binomial experiment in
which the binomial random variable specifies the number of successes
or failures that occurred within that sample space.

• Example. Suppose you flipped a coin. The probability of getting heads or tails is equal. But what will be the probability of getting six heads in ten flips of the coin? This is where you need the binomial distribution: you can calculate the probability of getting six heads in ten flips of a coin.
• The binomial distribution formula for any random variable X is given by

P(x; n, p) = nCx · p^x · (1 − p)^(n − x)

• where n = the number of experiments, x = 0, 1, 2, 3, 4, … (total number of successes), and p = the probability of success in a single experiment.

Ex: Let's calculate the probability of getting exactly six heads when a coin is tossed ten times.
P(x = 6) = 10C6 · (0.5)^6 · (0.5)^4 = ?
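Worked: 10C6 = 210, so P(x = 6) = 210 × (0.5)^10 = 210/1024 ≈ 0.205.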
Mean and Variance of Binomial Distribution: The mean and variance of the binomial distribution are:
• Mean = np
• Variance = npq
where
• p is the probability of success
• q is the probability of failure (1 − p)
• n is the number of trials.

Properties of a binomial distribution:
1. There are only two possible outcomes: True or False, Yes or No.
2. There are N independent trials.
3. The probability of success is the same in each trial.
4. Only the number of successes out of N independent trials is taken into account.
Examples:
#1: 80% of people who purchase pet insurance are women. If 9
pet insurance owners are randomly selected, find the probability
that exactly 6 are women.

#2: 60% of people who purchase sports cars are men. If 10 sports car owners are randomly selected, find the probability that exactly 7 are men.
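A minimal sketch (not from the slides) solving both examples with SciPy's binomial PMF:

from scipy.stats import binom

print(binom.pmf(6, 9, 0.8))   # #1: P(exactly 6 women of 9)  ≈ 0.176
print(binom.pmf(7, 10, 0.6))  # #2: P(exactly 7 men of 10)   ≈ 0.215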
Bernoulli Distribution
• A discrete probability distribution wherein the random variable can
only have 2 possible outcomes is known as a Bernoulli Distribution. If
in a Bernoulli trial the random variable takes on the value of 1, it
means that this is a success.
• The probability of success is given by p. Similarly, if the value of the
random variable is 0, it indicates failure. The probability of failure is q
or 1 - p
f(x; p) = p^x (1 − p)^(1 − x), x ∈ {0, 1}
The mean or average of a Bernoulli distribution is given by the formula
E[X] = p

To find the variance of a Bernoulli distribution we use E[X²] − (E[X])² and apply its properties. Thus, Var[X] = p(1 − p) for a Bernoulli distribution.

The Bernoulli distribution is the special case of the binomial distribution with only 1 trial: a binomial distribution is given by X ~ Binomial(n, p), and when n = 1 it becomes a Bernoulli distribution.
• Example 1: A basketball player can shoot a ball into the basket with a probability of 0.6. What is the probability that he misses the shot?
We know that the success probability P(X = 1) = p = 0.6. Thus, the probability of failure is P(X = 0) = 1 − p = 1 − 0.6 = 0.4.

• Example 2: If a Bernoulli distribution has parameter 0.45, find its mean.
Solution: X ~ Bernoulli(0.45). Mean E[X] = p = 0.45.

• Example 3: If a Bernoulli distribution has parameter 0.72, find its variance.
Solution: X ~ Bernoulli(0.72). Variance Var[X] = p(1 − p) = 0.72 × 0.28 = 0.2016.
Poisson Distribution
• Poisson distribution is used to estimate how many times an event is
likely to occur within the given period of time. λ is the Poisson rate
parameter that indicates the expected value of the average number
of events in the fixed time interval. Poisson distribution has wide use
in the fields of business as well as in biology.
• Example: a customer care center receives 100 calls per hour, 8
hours a day. As we can see, the calls are independent of
each other. The probability of the number of calls per minute
has a Poisson probability distribution. There can be any number
of calls per minute irrespective of the number of calls received
in the previous minute. Below is the curve of the probabilities
for a fixed value of λ of a function following Poisson
distribution:
For a random discrete variable X that follows the Poisson distribution,
and λ is the average rate of value, then the probability of x is given by:

f(x) = P(X = x) = (e^(−λ) · λ^x) / x!, where

x = 0, 1, 2, 3, …; e is Euler's number (e ≈ 2.718); λ is the average rate of the expected value and λ = variance; also λ > 0
For Poisson distribution, which has λ as the average rate, for a fixed interval of time, then the mean of the
Poisson distribution and the value of variance will be the same. So, for X following Poisson distribution, we can
say that λ is the mean as well as the variance of the distribution.
Hence: E(X) = V(X) = λ

Example 1: In a cafe, the customer arrives at a mean rate of 2 per min. Find the probability of arrival of 5 customers
in 1 minute using the Poisson distribution formula.

Given: λ = 2, and x = 5.

Example 2: Find the mass probability of function at x = 6, if the value of the mean is 3.4.

Given: λ = 3.4, and x = 6.
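Worked: Example 1: P(X = 5) = e^(−2) · 2^5 / 5! = e^(−2) · 32/120 ≈ 0.1353 × 0.2667 ≈ 0.036. Example 2: P(X = 6) = e^(−3.4) · 3.4^6 / 6! ≈ 0.0334 × 1544.8/720 ≈ 0.072.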


Introduction to Regression
- Sandeep Chaurasia
Machine Learning
• [Machine learning is the] field of study that gives computers the
ability to learn without being explicitly programmed. —Arthur
Samuel, 1959

• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. —Tom Mitchell, 1997
Your spam filter is a machine learning program that, given examples of
spam emails (flagged by users) and examples of regular emails
(nonspam, also called “ham”), can learn to flag spam.
The examples that the system uses to learn are called the training set.
Each training example is called a training instance (or sample).
The part of a machine learning system that learns and makes
predictions is called a model.
Neural networks and random forests are examples of models.

In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails (the accuracy).
Examples of Applications:

• Analyzing images of products on a production line to automatically classify them
• Detecting tumors in brain scans
• Automatically classifying news articles
• Automatically flagging offensive comments on discussion forums
• Summarizing long documents automatically
• Creating a chatbot or a personal assistant.
• Forecasting your company’s revenue next year, based on many performance metrics.
• Making your app react to voice commands
• Detecting credit card fraud
• Segmenting clients based on their purchases so that you can design a different marketing
strategy for each segment
• Representing a complex, high-dimensional dataset in a clear and insightful diagram
• Recommending a product that a client may be interested in, based on past purchases
• Building an intelligent bot for a game.
Types of Machine Learning Systems
• How they are supervised during training (supervised, unsupervised, semi-
supervised, self-supervised, and others)

• Whether or not they can learn incrementally on the fly (online versus batch
learning)

• Whether they work by simply comparing new data points to known data
points, or instead by detecting patterns in the training data and building a
predictive model, much like scientists do (instance-based versus model-
based learning)

• These criteria are not exclusive; you can combine them in any way you like
Supervised learning
• In supervised learning, the training set you feed to the algorithm
includes the desired solutions, called labels

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of
a car, given a set of features (mileage, age, brand, etc.). This sort of task is
called regression.
Unsupervised learning
• In unsupervised learning, as you might guess, the training data is
unlabeled. The system tries to learn without a teacher.
• For example, say you have a lot of data about your blog’s visitors. You may want
to run a clustering algorithm to try to detect groups of similar visitors.
• Another important unsupervised task is anomaly detection—for
example, detecting unusual credit card transactions to prevent fraud,
catching manufacturing defects, or automatically removing outliers
from a dataset before feeding it to another learning algorithm. The
system is shown mostly normal instances during training, so it learns
to recognize them; then, when it sees a new instance, it can tell
whether it looks like a normal one or whether it is likely an anomaly
Semi-supervised learning
• Since labeling data is usually time-consuming and costly, you will
often have plenty of unlabeled instances, and few labeled instances.
Some algorithms can deal with data that’s partially labeled. This is
called semi-supervised learning.
Some photo-hosting services, such as
Google Photos, are good examples of
this
Self-supervised learning
• Another approach to machine learning involves actually
generating a fully labeled dataset from a fully unlabeled one.
Again, once the whole dataset is labeled, any supervised
learning algorithm can be used. This approach is called self-
supervised learning.
For example, if you have a large dataset of
unlabeled images, you can randomly mask a
small part of each image and then train a model
to recover the original image. During training,
the masked images are used as the inputs to the
model, and the original images are used as the
labels
Reinforcement learning
• The learning system, called
an agent in this context, can
observe the environment,
select and perform actions,
and get rewards in return (or
penalties in the form of
negative rewards). It must
then learn by itself what is
the best strategy, called a
policy, to get the most
reward over time. A policy
defines what action the
agent should choose when it
is in a given situation.
Batch Versus Online Learning
• Batch learning - In batch learning, the system is incapable of learning
incrementally: it must be trained using all the available data. This will
generally take a lot of time and computing resources, so it is typically
done offline. First the system is trained, and then it is launched into
production and runs without learning anymore; it just applies what it
has learned. This is called offline learning.
• Even a model trained to classify pictures of cats and dogs may need to be
retrained regularly, not because cats and dogs will mutate overnight, but because
cameras keep changing, along with image formats, sharpness, brightness, and
size ratios. Moreover, people may love different breeds next year, or they may
decide to dress their pets with tiny hats—who knows?
Online learning
• In online learning, you train the
system incrementally by feeding it
data instances sequentially, either
individually or in small groups
called minibatches. Each learning
step is fast and cheap, so the
system can learn about new data
on the fly
Instance-Based Versus Model-Based Learning
• Instance-based learning: the system learns the examples by heart,
then generalizes to new cases by using a similarity measure to
compare them to the learned examples (or a subset of them)
A (very basic) similarity
measure between two emails
could be to count the number
of words they have in
common. The system would
flag an email as spam if it has
many words in common with a
known spam email
• Another way to generalize from a set of examples is to build a model
of these examples and then use that model to make predictions. This
is called model-based learning.
# Suppose you want to know if money makes people happy,
so you download the Better Life Index data from the OECD’s
website and World Bank stats about gross domestic product
(GDP) per capita. Then you join the tables and sort by GDP
per capita
life_satisfaction = θ0 + θ1 × GDP_per_capita
This model has two model parameters, θ0 and θ1.

Before you can use your model, you need to define the parameter values θ0 and θ1.

The final trained model is then ready to be used for predictions (e.g., linear regression with one input and one output, using θ0 = 3.75 and θ1 = 6.78 × 10^(−5)).
If all went well, your model will make good predictions. If not, you may
need to use more attributes (employment rate, health, air pollution,
etc.), get more or better-quality training data, or perhaps select a more
powerful model (e.g., a polynomial regression model).
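A minimal sketch of this workflow, assuming a tiny made-up sample of (GDP per capita, life satisfaction) pairs rather than the real OECD/World Bank tables:

from sklearn.linear_model import LinearRegression

X = [[27_000], [32_000], [38_000], [44_000], [55_000]]  # GDP per capita (USD), illustrative
y = [5.7, 6.0, 6.4, 6.8, 7.3]                           # life satisfaction, illustrative

model = LinearRegression()
model.fit(X, y)                   # learns θ0 (model.intercept_) and θ1 (model.coef_)
print(model.predict([[37_655]]))  # predict life satisfaction for a new GDP value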

• You studied the data.


• You selected a model.
• You trained it on the training data (i.e., the learning algorithm
searched for the model parameter values that minimize a cost
function).
• Finally, you applied the model to make predictions on new cases (this
is called inference), hoping that this model will generalize well.
Main Challenges of Machine Learning
• Insufficient Quantity of Training Data
• Nonrepresentative Training Data
• Poor-Quality Data
• Irrelevant Features
• Overfitting the Training Data
• Underfitting the Training Data
Linear Regression
life_satisfaction = θ0 + θ1 × GDP_per_capita.

This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1
are the model’s parameters.

Linear regression model prediction


yˆ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
• ŷ is the predicted value.
• n is the number of features.
• xi is the i-th feature value.
• θj is the j-th model parameter, including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn.
• Linear regression model prediction (vectorized form)
yˆ = h𝛉(x) = 𝛉 ⋅ x

The MSE of a linear regression hypothesis hθ on a training set X is calculated using the MSE cost function for a linear regression model:

MSE(X, hθ) = (1/m) Σ (θᵀx(i) − y(i))², summing over the m training instances.
The Normal Equation
• To find the value of θ that minimizes the MSE, there exists a closed-
form solution— in other words, a mathematical equation that gives
the result directly. This is called the Normal equation

𝛉ˆ = (XᵀX)⁻¹ Xᵀ y

where 𝛉ˆ is the value of θ that minimizes the cost function, X is the matrix of feature values, and y is the vector of target values containing y(1) to y(m).
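A minimal sketch (not the book's exact code) solving the Normal equation with NumPy on randomly generated linear data:

import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)            # one input feature
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3x + Gaussian noise

X_b = np.c_[np.ones((m, 1)), X]         # add x0 = 1 (bias term) to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)                       # ≈ [[4.], [3.]]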
Gradient Descent
• Gradient descent is a generic optimization algorithm capable of finding optimal
solutions to a wide range of problems. The general idea of gradient descent is to
tweak parameters iteratively in order to minimize a cost function.
• In practice, you start by filling θ with random values (this is called random
initialization). Then you improve it gradually, taking one baby step at a time, each
step attempting to decrease the cost function (e.g., the MSE), until the algorithm
converges to a minimum.
The two main challenges with gradient descent. If the random initialization starts the algorithm on the left, then it will
converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a
very long time to cross the plateau. And if you stop too early, you will never reach the global minimum.
The MSE cost function for a linear regression model happens to be a convex
function, which means that if you pick any two points on the curve, the line
segment joining them is never below the curve. This implies that there are no
local minima, just one global minimum.

Gradient descent with (left) and without (right) feature scaling


Batch Gradient Descent
To implement gradient descent, you need to compute the gradient of the cost function with respect to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit.
Compute the partial derivative of the MSE with respect to parameter θj, noted ∂MSE(θ)/∂θj.

Partial derivatives of the cost function:
∂MSE(θ)/∂θj = (2/m) Σ (θᵀx(i) − y(i)) xj(i), summing over the m training instances

Gradient vector of the cost function:
∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y)

Gradient descent step, with learning rate η:
θ(next step) = θ − η ∇θ MSE(θ)
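A minimal sketch (not the book's exact code) of batch gradient descent for linear regression:

import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]          # add bias column

eta = 0.1                                # learning rate
theta = np.random.randn(2, 1)            # random initialization
for epoch in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # full-batch gradient
    theta -= eta * gradients
print(theta)                             # ≈ [[4.], [3.]], matching the Normal equation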
Gradient descent with various learning rates
On the left, the learning rate is too low: the algorithm will eventually reach the solution,
but it will take a long time. In the middle, the learning rate looks pretty good: in just a few
epochs, it has already converged to the solution. On the right, the learning rate is too
high: the algorithm diverges, jumping all over the place and actually getting further and
further away from the solution at every step.
Stochastic Gradient Descent
• The main problem with batch gradient descent is the fact
that it uses the whole training set to compute the gradients
at every step, which makes it very slow when the training
set is large.
• At the opposite extreme, stochastic gradient descent picks
a random instance in the training set at every step and
computes the gradients based only on that single instance.
• Working on a single instance at a time makes the algorithm
much faster because it has very little data to manipulate at
every iteration.
• It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each
iteration
• On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than batch
gradient descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up
and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around,
never settling down.
• Once the algorithm stops, the final parameter values will be good, but not optimal.
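A minimal sketch (not the book's exact code) of stochastic gradient descent with a simple learning schedule that gradually shrinks the learning rate:

import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)      # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]            # add bias column

def learning_schedule(t, t0=5, t1=50):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)              # random initialization
for epoch in range(50):
    for i in range(m):
        idx = np.random.randint(m)                   # pick one random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)     # gradient on that instance only
        theta -= learning_schedule(epoch * m + i) * gradients
print(theta)  # close to [[4.], [3.]], but it keeps bouncing around the minimum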
To minimize the cost function, the model needs to have the best values of θ1 and θ2. Initially the model selects θ1 and θ2 values randomly and then iteratively updates these values in order to minimize the cost function until it reaches the minimum. By the time the model achieves the minimum cost function, it will have the best θ1 and θ2 values. Using these finally updated values of θ1 and θ2 in the hypothesis equation of linear regression, the model predicts the target value in the best manner it can.
Mini-Batch Gradient Descent
• Mini-batch GD computes the gradients on small random sets of
instances called mini-batches. The main advantage of mini-batch GD
over stochastic GD is that you can get a performance boost from
hardware optimization of matrix operations, especially when using
GPUs.
Polynomial Regression
• A simple way to do this is to add powers of each feature as new
features, then train a linear model on this extended set of features.
This technique is called polynomial regression.

Clearly, a straight line will never fit this data properly. So, let's use Scikit-Learn's PolynomialFeatures class to transform our training data, adding the square (second-degree polynomial) of each feature in the training set as a new feature.
For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².
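A minimal sketch (not the book's exact code): second-degree polynomial regression with scikit-learn on noisy quadratic data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)   # quadratic data + noise

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                   # columns: [x, x²]

lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)         # ≈ 2 and ≈ [1, 0.5]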
Learning Curves
• This high-degree polynomial regression
model is severely overfitting the training
data, while the linear model is underfitting
it.
• The model that will generalize best in this
case is the quadratic model, which makes
sense because the data was generated
using a quadratic model.
THE BIAS/VARIANCE TRADE-OFF
• An important theoretical result of statistics and machine learning is the fact that a
model’s generalization error can be expressed as the sum of three very different
errors:
• Bias This part of the generalization error is due to wrong assumptions, such as
assuming that the data is linear when it is actually quadratic.
• A high-bias model is most likely to underfit the training data.

• Variance This part is due to the model’s excessive sensitivity to small variations in
the training data.
• A model with many degrees of freedom (such as a high-degree polynomial model) is
likely to have high variance and thus overfit the training data.
• Irreducible error This part is due to the noisiness of the data itself. The only way
to reduce this part of the error is to clean up the data (e.g., fix the data sources,
such as broken sensors, or detect and remove outliers).
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s
complexity increases its bias and reduces its variance
Regularized Linear Models
• A simple way to regularize a polynomial model is to reduce the number of
polynomial degrees. For a linear model, regularization is typically achieved by
constraining the weights of the model. We will now look at ridge regression, lasso
regression, and elastic net regression.
• Ridge Regression: This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible. Note that the regularization term should only be added to the cost
function during training.
• The hyperparameter α controls how much you want to regularize the model. If α = 0, then ridge
regression is just linear regression. If α is very large, then all weights end up very close to zero and
the result is a flat line going through the data’s mean.

If we define w as the vector of feature weights (θ1 to θn), then the regularization term is equal to α (∥w∥₂)² / m, where ∥w∥₂ represents the ℓ2 norm of the weight vector.
• Lasso Regression: Least absolute shrinkage and selection operator
regression (usually simply called lasso regression) is another
regularized version of linear regression: just like ridge regression, it
adds a regularization term to the cost function, but it uses the ℓ1
norm of the weight vector.

• Notice that the ℓ1 norm is multiplied by 2α, whereas the ℓ2 norm was multiplied by α/m in ridge regression.

• These factors were chosen to ensure that the optimal α value is independent of the training set size: different norms lead to different factors.
Elastic Net Regression: Elastic net regression is a middle ground
between ridge regression and lasso regression. The regularization term
is a weighted sum of both ridge and lasso’s regularization terms, and
you can control the mix ratio r. When r = 0, elastic net is equivalent to
ridge regression, and when r = 1, it is equivalent to lasso regression

Ridge is a good default, but if you suspect that only a few features are useful, you should prefer lasso or elastic
net because they tend to reduce the useless features’ weights down to zero, as discussed earlier . In general,
elastic net is preferred over lasso because lasso may behave erratically when the number of features is greater
than the number of training instances or when several features are strongly correlated.
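A minimal sketch (not the book's exact code) of the three regularized models in scikit-learn; the alpha and l1_ratio values are illustrative:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X = np.random.rand(50, 3)
y = X @ np.array([1.5, 0.0, -2.0]) + 0.1 * np.random.randn(50)  # middle feature is useless

for model in (Ridge(alpha=0.1), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_)  # lasso/elastic net tend to drive the
                                              # useless feature's weight toward zero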
Early Stopping
• A very different way to regularize
iterative learning algorithms such as
gradient descent is to stop training
as soon as the validation error
reaches a minimum. This is called
early stopping.
• With early stopping you just stop
training as soon as the validation
error reaches the minimum. It is
such a simple and efficient
regularization technique that
Geoffrey Hinton called it a “beautiful
free lunch”
With stochastic and mini-batch gradient descent, the curves are not so smooth, and it may be hard to know whether you
have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for
some time then roll back the model parameters to the point where the validation error was at a minimum.
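A minimal sketch (not the book's exact code) of early stopping with scikit-learn's SGDRegressor: train one epoch at a time and keep a copy of the model with the lowest validation error. The data generation here is illustrative:

import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = 6 * np.random.rand(200, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(200, 1)).ravel()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5)

sgd = SGDRegressor(max_iter=1, warm_start=True,  # one epoch per fit() call
                   learning_rate="constant", eta0=0.002, tol=None)
best_model, best_error = None, float("inf")
for epoch in range(500):
    sgd.fit(X_train, y_train)                    # continues where it left off
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_error:                   # remember the best model so far
        best_error, best_model = val_error, copy.deepcopy(sgd)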
Logistic regression is one of the most popular machine learning algorithms for binary
classification. This is because it is a simple algorithm that performs very well on a wide range
of problems.

After reading this you will know:

 How to calculate the logistic function.


 How to learn the coefficients for a logistic regression model using stochastic gradient
descent.
 How to make predictions using a logistic regression model.

Let’s get started.

Tutorial Dataset
In this tutorial we will use a contrived dataset.

This dataset has two input variables (X1 and X2) and one output variable (Y). The input variables are real-valued random numbers drawn from a Gaussian distribution. The output variable has two values, making the problem a binary classification problem.

The raw data is listed below.

X1           X2            Y
2.7810836    2.550537003   0
1.465489372  2.362125076   0
3.396561688  4.400293529   0
1.38807019   1.850220317   0
3.06407232   3.005305973   0
7.627531214  2.759262235   1
5.332441248  2.088626775   1
6.922596716  1.77106367    1
8.675418651  -0.2420686549 1
7.673756466  3.508563011   1

Below is a plot of the dataset. You can see that it is completely contrived and that we can
easily draw a line to separate the classes.

This is exactly what we are going to do with the logistic regression model.
Logistic Regression Tutorial Dataset

Logistic Function
Before we dive into logistic regression, let’s take a look at the logistic function, the heart of
the logistic regression technique.

The logistic function is defined as:

transformed = 1 / (1 + e^-x)

Where e is the numerical constant Euler's number and x is an input we plug into the function.

Let’s plug in a series of numbers from -5 to +5 and see how the logistic function transforms
them:

X    Transformed
-5   0.006692850924
-4   0.01798620996
-3   0.04742587318
-2   0.119202922
-1   0.2689414214
0    0.5
1    0.7310585786
2    0.880797078
3    0.9525741268
4    0.98201379
5    0.9933071491

You can see that all of the inputs have been transformed into the range [0, 1] and that the
smallest negative numbers resulted in values close to zero and the larger positive numbers
resulted in values close to one. You can also see that 0 transformed to 0.5 or the midpoint of
the new range.

From this we can see that as long as our mean value is zero, we can plug in positive and
negative values into the function and always get out a consistent transform into the new
range.

Logistic Function


Logistic Regression Model
The logistic regression model takes real-valued inputs and makes a prediction as to the probability of the input belonging to the positive class (class 1).

If the probability is > 0.5 we can take the output as a prediction for the positive class (class 1), otherwise the prediction is for the other class (class 0).

For this dataset, the logistic regression has three coefficients just like linear regression, for
example:

output = b0 + b1*x1 + b2*x2

The job of the learning algorithm will be to discover the best values for the coefficients (b0,
b1 and b2) based on the training data.

Unlike linear regression, the output is transformed into a probability using the logistic function:

p(class=1) = 1 / (1 + e^(-output))

In your spreadsheet this would be written as:

p(class=1) = 1 / (1 + EXP(-output))

Logistic Regression by Stochastic Gradient Descent

We can estimate the values of the coefficients using stochastic gradient descent.

This is a simple procedure that can be used by many algorithms in machine learning. It works
by using the model to calculate a prediction for each instance in the training set and
calculating the error for each prediction.
We can apply stochastic gradient descent to the problem of finding the coefficients for the
logistic regression model as follows:

Given each training instance:

1. Calculate a prediction using the current values of the coefficients.


2. Calculate new coefficient values based on the error in the prediction.

The process is repeated until the model is accurate enough (e.g. error drops to some desirable
level) or for a fixed number of iterations. You continue to update the model for training
instances and correct errors until the model is accurate enough or cannot be made any
more accurate. It is often a good idea to randomize the order of the training instances shown
to the model to mix up the corrections made.

By updating the model for each training pattern we call this online learning. It is also possible
to collect up all of the changes to the model over all training instances and make one large
update. This variation is called batch learning and might make a nice extension to this tutorial
if you’re feeling adventurous.

Calculate Prediction

Let's start off by assigning 0.0 to each coefficient and calculating the predicted probability for the
first training instance (which belongs to class 0).

B0 = 0.0

B1 = 0.0

B2 = 0.0

The first training instance is: x1=2.7810836, x2=2.550537003, Y=0

Using the above equation we can plug in all of these numbers and calculate a prediction:

prediction = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2)))

prediction = 1 / (1 + e^(-(0.0 + 0.0*2.7810836 + 0.0*2.550537003)))

prediction = 0.5

Calculate New Coefficients

We can calculate the new coefficient values using a simple update equation.

b = b + alpha * (y – prediction) * prediction * (1 – prediction) * x

Where b is the coefficient we are updating and prediction is the output of making a prediction
using the model.
Alpha is a parameter that you must specify at the beginning of the training run. This is the
learning rate and controls how much the coefficients (and therefore the model) changes or
learns each time it is updated. Larger learning rates are used in online learning (when we
update the model for each training instance). Good values might be in the range 0.1 to 0.3.
Let’s use a value of 0.3.

You will notice that the last term in the equation is x; this is the input value for the
coefficient. You will also notice that B0 does not have an input. This coefficient is often called
the bias or the intercept and we can assume it always has an input value of 1.0. This
assumption can help when implementing the algorithm using vectors or arrays.

Let’s update the coefficients using the prediction (0.5) and coefficient values (0.0) from the
previous section.

b0 = b0 + 0.3 * (0 – 0.5) * 0.5 * (1 – 0.5) * 1.0

b1 = b1 + 0.3 * (0 – 0.5) * 0.5 * (1 – 0.5) * 2.7810836

b2 = b2 + 0.3 * (0 – 0.5) * 0.5 * (1 – 0.5) * 2.550537003

or

b0 = -0.0375

b1 = -0.104290635

b2 = -0.09564513761

Repeat the Process

We can repeat this process and update the model for each training instance in the dataset.

A single iteration through the training dataset is called an epoch. It is common to repeat the
stochastic gradient descent procedure for a fixed number of epochs.
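A minimal sketch of the whole procedure in Python, using the dataset, update rule, learning rate (0.3) and epoch count (10) given above:

import math

data = [
    (2.7810836, 2.550537003, 0), (1.465489372, 2.362125076, 0),
    (3.396561688, 4.400293529, 0), (1.38807019, 1.850220317, 0),
    (3.06407232, 3.005305973, 0), (7.627531214, 2.759262235, 1),
    (5.332441248, 2.088626775, 1), (6.922596716, 1.77106367, 1),
    (8.675418651, -0.2420686549, 1), (7.673756466, 3.508563011, 1),
]

def predict(b0, b1, b2, x1, x2):
    # logistic function applied to the linear combination of the inputs
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

b0 = b1 = b2 = 0.0
alpha = 0.3
for epoch in range(10):
    for x1, x2, y in data:
        p = predict(b0, b1, b2, x1, x2)
        # update rule from above: b = b + alpha * (y - p) * p * (1 - p) * x
        b0 += alpha * (y - p) * p * (1 - p) * 1.0
        b1 += alpha * (y - p) * p * (1 - p) * x1
        b2 += alpha * (y - p) * p * (1 - p) * x2

print(b0, b1, b2)  # ≈ -0.4066, 0.8526, -1.1047, the coefficients reported below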

At the end of each epoch you can calculate error values for the model. Because this is a
classification problem, it would be nice to get an idea of how accurate the model is at each
iteration.

The graph below shows a plot of the accuracy of the model over 10 epochs.
Logistic Regression with Gradient Descent Accuracy versus Iteration

You can see that the model very quickly achieves 100% accuracy on the training dataset.

The coefficients calculated after 10 epochs of stochastic gradient descent are:

b0 = -0.4066054641

b1 = 0.8525733164

b2 = -1.104746259

Make Predictions

Now that we have trained the model, we can use it to make predictions.

We can make predictions on the training dataset, but this could just as easily be new data.

Using the coefficients above learned after 10 epochs, we can calculate output values for each
training instance:

0.2987569857
0.145951056
0.08533326531
0.2197373144
0.2470590002
0.9547021348
0.8620341908
0.9717729051
0.9992954521
0.905489323

These are the probabilities of each instance belonging to class=1. We can convert these into
crisp class values using:

prediction = IF (output < 0.5) Then 0 Else 1

With this simple procedure we can convert all of the outputs to class values:

0
0
0
0
0
1
1
1
1
1

Finally, we can calculate the accuracy for the model on the training dataset:

accuracy = (correct predictions / num predictions made) * 100

accuracy = (10 /10) * 100

accuracy = 100%

Summary
In this post you discovered how you can implement logistic regression from scratch, step-by-
step. You learned:

 How to calculate the logistic function.


 How to learn the coefficients for a logistic regression model using stochastic gradient
descent.
 How to make predictions using a logistic regression model.
Machine Learning

 Up until now: how to use a model to make optimal decisions

 Machine learning: how to acquire a model from data / experience


 Learning parameters (e.g. probabilities)
 Learning structure (e.g. BN graphs)
 Learning hidden concepts (e.g. clustering)

 Today: model‐based classification with Naive Bayes


Classification
Example: Spam Filter
 Input: an email
 Output: spam/ham

 Setup:
   Get a large collection of example emails, each labeled "spam" or "ham"
   Note: someone has to hand label all this data!
   Want to learn to predict labels of new, future emails

 Features: The attributes used to make the ham / spam decision
   Words: FREE!
   Text Patterns: $dd, CAPS
   Non-text: SenderInContacts
   …

Example spam: "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …" / "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT." / "99 MILLION EMAIL ADDRESSES FOR ONLY $99"

Example ham: "Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."
Example: Digit Recognition

 Input: images / pixel grids
 Output: a digit 0-9

 Setup:
   Get a large collection of example images, each labeled with a digit
   Note: someone has to hand label all this data!
   Want to learn to predict labels of new, future digit images

 Features: The attributes used to make the digit decision
   Pixels: (6,8)=ON
   Shape Patterns: NumComponents, AspectRatio, NumLoops
   …

[Figure: example digit images labeled 0, 1, 2, 1, and an unlabeled query image, ??]
Other Classification Tasks
 Classification: given inputs x, predict labels (classes) y

 Examples:
 Spam detection (input: document,
classes: spam / ham)
 OCR (input: images, classes: characters)
 Medical diagnosis (input: symptoms,
classes: diseases)
 Automatic essay grading (input: document,
classes: grades)
 Fraud detection (input: account activity,
classes: fraud / no fraud)
 Customer service email routing
 … many more

 Classification is an important commercial technology!


Model‐Based Classification
Model‐Based Classification
 Model‐based approach
 Build a model (e.g. Bayes’ net) where
both the label and features are
random variables
 Instantiate any observed features
 Query for the distribution of the label
conditioned on the features

 Challenges
 What structure should the BN have?
 How should we learn its parameters?
Naïve Bayes for Digits
 Naïve Bayes: Assume all features are independent effects of the label

 Simple digit recognition version:
   One feature (variable) Fij for each grid position <i,j>
   Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  [Diagram: label node Y with feature nodes F1, F2, …, Fn]

 Each input maps to a feature vector, e.g. here: lots of features, each is binary valued

 Naïve Bayes model: P(Y | F1, …, Fn) ∝ P(Y) ∏i P(Fi | Y)

 What do we need to learn?


General Naïve Bayes

 A general Naive Bayes model:
  [Diagram: label node Y with feature nodes F1, F2, …, Fn. P(Y) has |Y| parameters; the full joint table over Y and the features would have |Y| × |F|^n values, while the conditional tables P(Fi|Y) have only n × |F| × |Y| parameters.]

 We only have to specify how each feature depends on the class
 Total number of parameters is linear in n
 Model is very simplistic, but often works anyway
Inference for Naïve Bayes
 Goal: compute posterior distribution over label variable Y
 Step 1: get joint probability of label and evidence for each label: P(Y, f1, …, fn) = P(Y) ∏i P(fi | Y)
 Step 2: sum to get probability of evidence: P(f1, …, fn) = Σy P(y, f1, …, fn)
 Step 3: normalize by dividing Step 1 by Step 2: P(Y | f1, …, fn) = P(Y, f1, …, fn) / P(f1, …, fn)


General Naïve Bayes

 What do we need in order to use Naïve Bayes?


 Inference method (we just saw this part)
 Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
 Use standard inference to compute P(Y|F1…Fn)
 Nothing new here

 Estimates of local conditional probability tables


 P(Y), the prior over labels
 P(Fi|Y) for each feature (evidence variable)
 These probabilities are collectively called the parameters of the model
and denoted by θ
 Up until now, we assumed these appeared by magic, but…
 …they typically come from training data counts: we’ll look at this soon
Example: Conditional Probabilities

Y : P(Y)      Y : P(F3,1=on|Y)   Y : P(F5,5=on|Y)
1 : 0.1       1 : 0.01           1 : 0.05
2 : 0.1       2 : 0.05           2 : 0.01
3 : 0.1       3 : 0.05           3 : 0.90
4 : 0.1       4 : 0.30           4 : 0.80
5 : 0.1       5 : 0.80           5 : 0.90
6 : 0.1       6 : 0.90           6 : 0.90
7 : 0.1       7 : 0.05           7 : 0.25
8 : 0.1       8 : 0.60           8 : 0.85
9 : 0.1       9 : 0.50           9 : 0.60
0 : 0.1       0 : 0.80           0 : 0.80
Naïve Bayes for Text
 Bag‐of‐words Naïve Bayes:
 Features: Wi is the word at position i
 As before: predict label conditioned on feature variables (spam vs. ham)
 As before: assume features are conditionally independent given label
 New: each Wi is identically distributed (word at position i, not the i-th word in the dictionary!)
 Generative model: P(Y, W1, …, Wn) = P(Y) ∏i P(Wi | Y)

 “Tied” distributions and bag‐of‐words


 Usually, each variable gets its own conditional probability distribution P(F|Y)
 In a bag‐of‐words model
 Each position is identically distributed
 All positions share the same conditional probs P(W|Y)
 Why make this assumption?
 Called “bag‐of‐words” because model is insensitive to word order or reordering
Example: Spam Filtering
 Model:

 What are the parameters?

P(Y)           P(W|spam)        P(W|ham)
ham : 0.66     the : 0.0156     the : 0.0210
spam: 0.33     to  : 0.0153     to  : 0.0133
               and : 0.0115     of  : 0.0119
               of  : 0.0095     2002: 0.0110
               you : 0.0093     with: 0.0108
               a   : 0.0086     from: 0.0107
               with: 0.0080     and : 0.0105
               from: 0.0075     a   : 0.0100
               ...              ...

 Where do these tables come from?


Spam Example
Word P(w|spam) P(w|ham) Tot Spam Tot Ham
(prior) 0.33333 0.66666 -1.1 -0.4
Gary 0.00002 0.00021 -11.8 -8.9
would 0.00069 0.00084 -19.1 -16.0
you 0.00881 0.00304 -23.8 -21.8
like 0.00086 0.00083 -30.9 -28.9
to 0.01517 0.01339 -35.1 -33.2
lose 0.00008 0.00002 -44.5 -44.0
weight 0.00016 0.00002 -53.3 -55.0
while 0.00027 0.00027 -61.5 -63.2
you 0.00881 0.00304 -66.2 -69.0
sleep 0.00006 0.00001 -76.0 -80.5

P(spam | w) = 98.9%
Training and Testing
Important Concepts
 Data: labeled instances, e.g. emails marked spam/ham
   Training set
   Held out set
   Test set
 Features: attribute-value pairs which characterize each x
 Experimentation cycle
   Learn parameters (e.g. model probabilities) on training set
   (Tune hyperparameters on held-out set)
   Compute accuracy on test set
   Very important: never "peek" at the test set!
 Evaluation
   Accuracy: fraction of instances predicted correctly
 Overfitting and generalization
   Want a classifier which does well on test data
   Overfitting: fitting the training data very closely, but not generalizing well
   We'll investigate overfitting and generalization formally in a few lectures

[Diagram: the data are split into Training Data, Held-Out Data, and Test Data]
Generalization and Overfitting
Overfitting
[Plot: training data fitted by a degree-15 polynomial, which passes near every point but oscillates wildly between them]
Example: Overfitting

2 wins!!
Example: Overfitting
 Posteriors determined by relative probabilities (odds ratios):

south-west : inf screens : inf


nation : inf minute : inf
morally : inf guaranteed : inf
nicely : inf $205.00 : inf
extent : inf delivery : inf
seriously : inf signature : inf
... ...

What went wrong here?


Generalization and Overfitting
 Relative frequency parameters will overfit the training data!
 Just because we never saw a 3 with pixel (15,15) on during training doesn’t mean we won’t see it at test time
 Unlikely that every occurrence of “minute” is 100% spam
 Unlikely that every occurrence of “seriously” is 100% ham
 What about all the words that don’t occur in the training set at all?
 In general, we can’t go around giving unseen events zero probability

 As an extreme case, imagine using the entire email as the only feature
 Would get the training data perfect (if deterministic labeling)
 Wouldn’t generalize at all
 Just making the bag‐of‐words assumption gives us some generalization, but isn’t enough

 To generalize better: we need to smooth or regularize the estimates


Smoothing
Maximum Likelihood?
 Relative frequencies are the maximum likelihood estimates: P_ML(x) = count(x) / total samples

 Another option is to consider the most likely parameter value given the data (a maximum a posteriori estimate)
Unseen Events
Laplace Smoothing

 Laplace's estimate:
   Pretend you saw every outcome once more than you actually did:

    P_LAP(x) = (count(x) + 1) / (N + |X|)

   Can derive this estimate with Dirichlet priors (see cs281a)
Laplace Smoothing
 Laplace's estimate (extended):
   Pretend you saw every outcome k extra times:

    P_LAP,k(x) = (count(x) + k) / (N + k|X|)

   What's Laplace with k = 0?
   k is the strength of the prior

 Laplace for conditionals:
   Smooth each condition independently:

    P_LAP,k(x|y) = (count(x, y) + k) / (count(y) + k|X|)
Dan Jurafsky

What is the subject of this article?
• MEDLINE Article → MeSH Subject Category Hierarchy:
  • Antagonists and Inhibitors
  • Blood Supply
  • Chemistry
  • Drug Therapy
  • Embryology
  • Epidemiology
  • …
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  • spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  • If rules carefully refined by expert
• But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier:
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …
3%2$456%)&'7$

H(I9/#J(A/"#!:$6&8,:#
• D.=J-#$l\2%n>#^m$0-%&&./0%1(2$=#+"(G$P%&#G$(2$
S%7#&$65-#$
• Z#-.#&$(2$>#67$&.=J-#$6#J6#&#2+%1(2$()$G(05=#2+$
• S%R$()$@(6G&$
3%2$456%)&'7$

B%/#<(;#,5#-,.2"#./'./"/:$(8,:#
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It

γ( manages to be whimsical and


romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
)=c
always happy to see it again
whenever I have a friend who
hasn't seen it yet.
3%2$456%)&'7$

B%/#<(;#,5#-,.2"#./'./"/:$(8,:#
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It

γ( manages to be whimsical and


romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
)=c
always happy to see it again
whenever I have a friend who
hasn't seen it yet.
3%2$456%)&'7$
B%/#<(;#,5#-,.2"#./'./"/:$(8,:E##
6"&:;#(#"6<"/$#,5#-,.2"#
x love xxxxxxxxxxxxxxxx sweet
xxxxxxx satirical xxxxxxxxxx
xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx

γ( xxxxxxxxxxxxx whimsical xxxx


romantic xxxx laughing
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx recommend xxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xx several xxxxxxxxxxxxxxxxx
)=c
xxxxx happy xxxxxxxxx again
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
3%2$456%)&'7$

B%/#<(;#,5#-,.2"#./'./"/:$(8,:#

great 2
love 2

γ( recommend
laugh
1
1
)=c
happy 1
... ...
3%2$456%)&'7$

J(;#,5#-,.2"#5,.#2,06)/:$#03(""&D0(8,:#

Test d$
document

Machine Garbage
parser Learning NLP Collection Planning GUI
language
label learning parser garbage planning ...
translation training tag collection temporal
… algorithm training memory reasoning
shrinkage translation optimization plan
network... language... region... language...
3%2$456%)&'7$

J(A/"K#L63/#M''3&/2#$,#N,06)/:$"#(:2#
@3(""/"#

o V(6$%$G(05=#2+$&$%2G$%$0-%&&$)'

P(d | c)P(c)
P(c | d) =
P(d)
3%2$456%)&'7$

H(I9/#J(A/"#@3(""&D/.#O!P#

MAP is “maximum a
cMAP = argmax P(c | d) posteriori” = most
c∈C likely class

P(d | c)P(c)
= argmax Bayes Rule

c∈C P(d)
= argmax P(d | c)P(c) Dropping the
denominator
c∈C
3%2$456%)&'7$

H(I9/#J(A/"#@3(""&D/.#O!!P#

cMAP = argmax P(d | c)P(c)


c∈C
Document d
= argmax P(x1, x2 ,…, xn | c)P(c) represented as
features
c∈C x1..xn
3%2$456%)&'7$

H(I9/#J(A/"#@3(""&D/.#O!QP#

cMAP = argmax P(x1, x2 ,…, xn | c)P(c)


c∈C

klp4p"op(pm$J%6%=#+#6&$ How often does this


class occur?

,(5-G$(2-7$P#$#&1=%+#G$.)$%$
We can just count the
>#67E$>#67$-%6R#$25=P#6$()$ relative frequencies in
+6%.2.2R$#*%=J-#&$@%&$ a corpus

%>%.-%P-#C$
3%2$456%)&'7$

4638:,)&(3#H(I9/#J(A/"#!:2/'/:2/:0/#
M""6)'8,:"#
P(x1, x2 ,…, xn | c)

• J(;#,5#+,.2"#(""6)'8,:<$I&&5=#$J(&.1(2$G(#&2q+$
=%L#6$
• @,:2&8,:(3#!:2/'/:2/:0/<$I&&5=#$+"#$)#%+56#$
J6(P%P.-.1#&$5l67p)8m$%6#$.2G#J#2G#2+$R.>#2$+"#$0-%&&$)2
P(x1,…, xn | c) = P(x1 | c)• P(x2 | c)• P(x3 | c)•...• P(xn | c)
3%2$456%)&'7$

4638:,)&(3#H(I9/#J(A/"#@3(""&D/.#

cMAP = argmax P(x1, x2 ,…, xn | c)P(c)


c∈C

cNB = argmax P(c j )∏ P(x | c)


c∈C x∈X
3%2$456%)&'7$

M''3A&:;#4638:,)&(3#H(&9/#J(A/"#
@3(""&D/."#$,#B/C$#@3(""&D0(8,:#

positions ←$%--$@(6G$J(&.1(2&$.2$+#&+$G(05=#2+$$$$$$
$ $ $

cNB = argmax P(c j )


c j ∈C
∏ P(xi | c j )
i∈ positions
3%2$456%)&'7$ Sec.13.3

G/(.:&:;#$%/#4638:,)&(3#H(I9/#J(A/"#4,2/3#

• V.6&+$%L#=J+<$=%*.=5=$-.'#-."((G$#&1=%+#&$
• &.=J-7$5&#$+"#$)6#r5#20.#&$.2$+"#$G%+%$
doccount(C = c j )
P̂(c j ) =
N doc
count(wi , c j )
P̂(wi | c j ) =
∑ count(w, c j )
w∈V
3%2$456%)&'7$

7(.()/$/.#/"8)(8,:#

count(wi , c j ) )6%01(2$()$1=#&$@(6G$97$%JJ#%6&$$
P̂(wi | c j ) =
∑ count(w, c j ) %=(2R$%--$@(6G&$.2$G(05=#2+&$()$+(J.0$)8'
w∈V

• ,6#%+#$=#R%;G(05=#2+$)(6$+(J.0$8$P7$0(20%+#2%12R$%--$G(0&$.2$
+".&$+(J.0$
• B&#$)6#r5#207$()$9$.2$=#R%;G(05=#2+$
3%2$456%)&'7$ Sec.13.3

7.,<3/)#-&$%#4(C&)6)#G&R/3&%,,2#
• Q"%+$.)$@#$"%>#$&##2$2($+6%.2.2R$G(05=#2+&$@.+"$+"#$@(6G$
!"#$"%&'#$%2G$0-%&&./#G$.2$+"#$+(J.0$',"&89/$l$()*+%,)-.d$
ˆ count("fantastic", positive)
P("fantastic" positive) = = 0
∑ count(w, positive)
$
$
w∈V

• s#6($J6(P%P.-.1#&$0%22(+$P#$0(2G.1(2#G$%@%7E$2($=%L#6$
+"#$(+"#6$#>.G#20#t$

cMAP = argmax c P̂(c)∏ P̂(xi | c)


i
3%2$456%)&'7$

G('3(0/#O(22FSP#"),,$%&:;#5,.#H(I9/#J(A/"#

ˆ count(wi , c) +1
P(wi | c) =
∑ (count(w, c))+1)
w∈V

count(wi , c) +1
=
 
 ∑ count(w, c) + V
 w∈V 
3%2$456%)&'7$

4638:,)&(3#H(I9/#J(A/"E#G/(.:&:;#

• V6(=$+6%.2.2R$0(6J5&E$#*+6%0+$Vocabulary$
• ,%-05-%+#$5l)8m'+#6=&$ • ,%-05-%+#$5l9<'p')8m'+#6=&$
• V(6$#%0"$)8'.2$($G($ • =>6%8'←$&.2R-#$G(0$0(2+%.2.2R$%--$&:);8'
'&:);8'←'%--$G(0&$@.+"$$0-%&&$h)8' • V(6'#%0"$@(6G$9<'.2$?:)@A$B@CD'
''''"<'←$u$()$(00566#20#&$()$9<'.2$=>6%8'
| docs j |
P(c j ) ← nk + α
| total # documents| P(wk | c j ) ←
n + α | Vocabulary |
3%2$456%)&'7$

H(I9/#J(A/"#(:2#G(:;6(;/#4,2/3&:;#
• ?%n>#$P%7#&$0-%&&./#6&$0%2$5&#$%27$&(6+$()$)#%+56#$
• BZgE$#=%.-$%GG6#&&E$G.01(2%6.#&E$2#+@(6'$)#%+56#&$
• S5+$.)E$%&$.2$+"#$J6#>.(5&$&-.G#&$
• Q#$5&#$,:3A$@(6G$)#%+56#&$$
• @#$5&#$(33$()$+"#$@(6G&$.2$+"#$+#*+$l2(+$%$&5P&#+m$
• !"#2$$
• ?%n>#$P%7#&$"%&$%2$.=J(6+%2+$&.=.-%6.+7$+($-%2R5%R#$
OM$ =(G#-.2RC$
3%2$456%)&'7$ D#0C8OCKC8$

U(0%#03(""#V#(#6:&;.()#3(:;6(;/#),2/3#
• I&&.R2.2R$#%0"$@(6G<$vl@(6G$p$0m$
• I&&.R2.2R$#%0"$&#2+#20#<$vl&p0mhΠ$vl@(6Gp0m$
,-%&&$#:;'
[C8 $b$
b$ -(>#$ +".&$ )52$ /-=$
[C8 $-(>#$
[C8$ [C8$ C[T$ [C[8$ [C8$
[C[8 $+".&$
[C[T $)52$
[C8 $/-=$ vl&$p$J(&m$h$[C[[[[[[T$$
3%2$456%)&'7$ Sec.13.2.1

H(I9/#J(A/"#("#(#G(:;6(;/#4,2/3#
• Q".0"$0-%&&$%&&.R2&$+"#$".R"#6$J6(P%P.-.+7$+($&d$

F(G#-$J(&$ F(G#-$2#R$
[C8 $b$ [CK $b$ b$ -(>#$ +".&$ )52$ /-=$
[C8 $-(>#$ [C[[8 $-(>#$
[C8$ [C8$ [C[8$ [C[T$ [C8$
[C[8 $+".&$ [C[8 $+".&$ [CK$ [C[[8$ [C[8$ [C[[T$ [C8$

[C[T $)52$ [C[[T $)52$


[C8 $/-=$ [C8 $/-=$
vl&pJ(&m$$w$$vl&p2#Rm$
3%2$456%)&'7$
N,0# +,.2"# @3(""#
ˆ = Nc
P(c) !6%.2.2R$ 8$ ,".2#&#$S#.e.2R$,".2#&#$ 0$
N K$ ,".2#&#$,".2#&#$D"%2R"%.$ 0$
O$ ,".2#&#$F%0%($ 0$
ˆ | c) = count(w, c) +1
P(w `$ !('7($4%J%2$,".2#&#$ e$
count(c)+ | V | !#&+$ T$ ,".2#&#$,".2#&#$,".2#&#$!('7($4%J%2$ d$
7.&,."E#
5l)mh$$ O$
`$ 8$ @%,,"&:;#(#03(""E#
5l8mh$$ $ $`$
'

vl0pGTm$$ ∝ $Of`$x$lOf9mO$x$8f8`$x$8f8`$$
$ $y$[C[[[O$
@,:2&8,:(3#7.,<(<&3&8/"E# $
vl,".2#&#p)m$h$ lTz8m$f$l:zNm$h$Nf8`$h$Of9$ $
vl!('7(p)m$$$$h$ l[z8m$f$l:zNm$h$8f8`$ vlepGTm$$ ∝ $8f`$x$lKfMmO$x$KfM$x$KfM$$$
vl4%J%2p)m$$$$$h$ l[z8m$f$l:zNm$h$8f8`$ $ $y$[C[[[8$
vl,".2#&#p8m$h$ l8z8m$f$lOzNm$h$KfM$$ $
vl!('7(p8m$$$$$h$ l8z8m$f$lOzNm$h$KfM$$ $
``$ vl4%J%2p8m$$$$$$h$$ l8z8m$f$lOzNm$h$KfM$$
Machine Learning

Classification Methods
Bayesian Classification, Nearest
Neighbor, Ensemble Methods
Bayesian Classification: Why?

⚫ A statistical classifier: performs probabilistic prediction,


i.e., predicts class membership probabilities
⚫ Foundation: Based on Bayes’ Theorem.
⚫ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
⚫ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data

February 28, 2023 Data Mining: Concepts and Techniques 2


Bayes’ Rule
Understand ing Bayes' rule
P ( d | h) P ( h) d = data
p(h | d ) = h = hypothesis (model)
P(d ) - rearrangin g
p(h | d ) P(d ) = P(d | h ) P(h )
P(d , h) = P(d , h)
the same joint probabilit y
Who is who in Bayes’ rule on both sides

P(h ) : prior belief (probability of hypothesis h before seeing any data)


P(d | h ) : likelihood (probability of the data if the hypothesis h is true)
P( d ) =  P( d | h ) P( h ) : data evidence (marginal probability of the data)
h

P(h | d ) : posterior (probability of hypothesis h after having seen the data d )


Example of Bayes Theorem
⚫ Given:
⚫ A doctor knows that meningitis causes stiff neck 50% of the time
⚫ Prior probability of any patient having meningitis is 1/50,000
⚫ Prior probability of any patient having stiff neck is 1/20

⚫ If a patient has stiff neck, what’s the probability


he/she has meningitis?

P( S | M ) P( M ) 0.5 1 / 50000
P( M | S ) = = = 0.0002
P( S ) 1 / 20
Choosing Hypotheses
⚫ Maximum Likelihood
hypothesis:
hML = arg max P(d | h)
hH

⚫ Generally we want the most


probable hypothesis given
hMAP = arg max P(h | d )
hH
training data.This is the
maximum a posteriori
hypothesis:
⚫ Useful observation: it does
not depend on the
denominator P(d)
Bayesian Classifiers
⚫ Consider each attribute and class label as random
variables

⚫ Given a record with attributes (A1, A2,…,An)


⚫ Goal is to predict class C
⚫ Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )

⚫ Can we estimate P(C| A1, A2,…,An ) directly from


data?
Bayesian Classifiers
⚫ Approach:
⚫ compute the posterior probability P(C | A1, A2, …, An) for all values
of C using the Bayes theorem

P ( A A  A | C ) P (C )
P (C | A A  A ) = 1 2 n

P( A A  A )
1 2 n

1 2 n

⚫ Choose value of C that maximizes


P(C | A1, A2, …, An)

⚫ Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

⚫ How to estimate P(A1, A2, …, An | C )?


Naïve Bayes Classifier
⚫ Assume independence among attributes Ai when class is
given:
⚫ P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
⚫ Can estimate P(Ai| Cj) for all Ai and Cj.
⚫ This is a simplifying assumption which may be violated in
reality
⚫ The Bayesian classifier that uses the Naïve Bayes assumption
and computes the MAP hypothesis is called Naïve Bayes
classifier

cNaive Bayes = arg max P(c) P(x | c) = arg max P(c) P(ai | c)
c c i
How to Estimatel l
Probabilities
a a us
from Data?
e gor
ic
e gor
ic
tinuo
ss
ca
t
ca
t
co
n
cla ⚫ Class: P(C) = Nc/N
Tid Refund Marital Taxable ⚫ e.g., P(No) = 7/10,
Status Income Evade
P(Yes) = 3/10
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No ⚫ For discrete attributes:
4 Yes Married 120K No
5 No Divorced 95K Yes
P(Ai | Ck) = |Aik|/ Nc
k
6 No Married 60K No ⚫ where |Aik| is number of
7 Yes Divorced 220K No instances having attribute Ai and
8 No Single 85K Yes belongs to class Ck
9 No Married 75K No
⚫ Examples:
10 No Single 90K Yes
10

P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
How to Estimate Probabilities
from Data?
⚫ For continuous attributes:
⚫ Discretize the range into bins
⚫ one ordinal attribute per bin
⚫ violates independence assumption
⚫ Two-way split: (A < v) or (A > v)
⚫ choose only one of the two splits as new attribute
⚫ Probability density estimation:
⚫ Assume attribute follows a normal distribution
⚫ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
⚫ Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
How toricEstimate
al ic a l
ous Probabilities from
o or nu
Data? te g
te g
nti
la ss
ca ca co c
Tid Refund Marital
Status
Taxable
Income Evade
⚫ Normal distribution:
( Ai −  ij ) 2
1 −

P( A | c ) =
2  ij2
1 Yes Single 125K No
e
2
i j 2
2 No Married 100K No
ij
3 No Single 70K No
4 Yes Married 120K No
⚫ One for each (Ai,ci) pair
5 No Divorced 95K Yes
⚫ For (Income, Class=No):
6 No Married 60K No
7 Yes Divorced 220K No
⚫ If Class=No
8 No Single 85K Yes ⚫ sample mean = 110

9 No Married 75K No ⚫ sample variance = 2975


10 No Single 90K Yes
10

1 −
( 120−110 ) 2

P( Income = 120 | No) = e 2 ( 2975 )


= 0.0072
2 (54.54)
Naïve Bayesian Classifier:
Training Dataset
age income studentcredit_rating
buys_compu
<=30 high no fair no
<=30 high no excellent no
Class: 31…40 high no fair yes
C1:buys_computer = ‘yes’
>40 medium no fair yes
C2:buys_computer = ‘no’
>40 low yes fair yes
>40 low yes excellent no
New Data:
31…40 low yes excellent yes
X = (age <=30,
<=30 medium no fair no
Income = medium,
Student = yes <=30 low yes fair yes
Credit_rating = Fair) >40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
February 28, 2023 Data Mining: Concepts and Techniques 12
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

First step: Compute P(C) The prior probability of each class can be
computed based on the training tuples:
P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

Second step: compute P(X|Ci)


P(X|buys_computer=yes)= P(age=youth|buys_computer=yes)x
P(income=medium|buys_computer=yes) x
P(student=yes|buys_computer=yes)x
P(credit_rating=fair|buys_computer=yes)
= 0.044
P(age=youth|buys_computer=yes)=0.222
P(income=medium|buys_computer=yes)=0.444
P(student=yes|buys_computer=yes)=6/9=0.667
P(credit_rating=fair|buys_computer=yes)=6/9=0.667
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

Second step: compute P(X|Ci)


P(X|buys_computer=no)= P(age=youth|buys_computer=no)x
P(income=medium|buys_computer=no) x
P(student=yes|buys_computer=no) x
P(credit_rating=fair|buys_computer=no)
= 0.019
P(age=youth|buys_computer=no)=3/5=0.666
P(income=medium|buys_computer=no)=2/5=0.400
P(student=yes|buys_computer=no)=1/5=0.200
P(credit_rating=fair|buys_computer=no)=2/5=0.400
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

We have computed in the first and second steps:


P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
P(X|buys_computer=yes)= 0.044
P(X|buys_computer=no)= 0.019

Third step: compute P(X|Ci)P(Ci) for each class


P(X|buys_computer=yes)P(buys_computer=yes)=0.044 x 0.643=0.028
P(X|buys_computer=no)P(buys_computer=no)=0.019 x 0.357=0.007
The naïve Bayesian Classifier predicts X belongs to class (“buys_computer =
yes”)
a l a l s
u
ric ric uo
Example a te
go
a te
go
ntin
cla
s s
c c co
Tid Refund Marital Taxable
Training set : Status Income Evade

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No k
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10

Given a Test Record:

X = (Refund = No, Married, Income = 120K)


Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Married, Income = 120K)
naive Bayes Classifier:

P(Refund=Yes|No) = 3/7 P(X|Class=No) = P(Refund=No|Class=No)


P(Refund=No|No) = 4/7  P(Married| Class=No)
P(Refund=Yes|Yes) = 0  P(Income=120K| Class=No)
P(Refund=No|Yes) = 1 = 4/7  4/7  0.0072 = 0.0024
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No)=1/7
P(Marital Status=Married|No) = 4/7 P(X|Class=Yes) = P(Refund=No| Class=Yes)
P(Marital Status=Single|Yes) = 2/7  P(Married| Class=Yes)
P(Marital Status=Divorced|Yes)=1/7  P(Income=120K| Class=Yes)
P(Marital Status=Married|Yes) = 0 = 1  0  1.2  10-9 = 0
For taxable income:
If class=No: sample mean=110 Since P(X|No)P(No) > P(X|Yes)P(Yes)
sample variance=2975 Therefore P(No|X) > P(Yes|X)
If class=Yes: sample mean=90
sample variance=25 => Class = No
Avoiding the 0-Probability Problem

⚫ If one of the conditional probability is zero, then the


entire expression becomes zero
⚫ Probability estimation:

N ic
Original : P( Ai | C ) = c: number of classes
Nc
p: prior probability
N ic + 1
Laplace : P( Ai | C ) =
Nc + c m: parameter

N ic + mp
m - estimate : P( Ai | C ) =
Nc + m
Naïve Bayes (Summary)
⚫ Advantage
⚫ Robust to isolated noise points
⚫ Handle missing values by ignoring the instance during probability
estimate calculations
⚫ Robust to irrelevant attributes
⚫ Disadvantage
⚫ Assumption: class conditional independence, which may cause loss
of accuracy
⚫ Independence assumption may not hold for some attribute.
Practically, dependencies exist among variables
⚫ Use other techniques such as Bayesian Belief Networks (BBN)
Remember

⚫ Bayes’ rule can be turned into a classifier


⚫ Maximum A Posteriori (MAP) hypothesis estimation
incorporates prior knowledge; Max Likelihood (ML) doesn’t
⚫ Naive Bayes Classifier is a simple but effective Bayesian
classifier for vector data (i.e. data with several attributes)
that assumes that attributes are independent given the
class.
⚫ Bayesian classification is a generative approach to
classification
Classification Paradigms
⚫ In fact, we can categorize three fundamental approaches
to classification:
⚫ Generative models: Model p(x|Ck) and P(Ck) separately
and use the Bayes theorem to find the posterior
probabilities P(Ck|x)
⚫ E.g. Naive Bayes, Gaussian Mixture Models, Hidden Markov
Models,…
⚫ Discriminative models:
⚫ Determine P(Ck|x) directly and use in decision
⚫ E.g. Linear discriminant analysis, SVMs, NNs,…
⚫ Find a discriminant function f that maps x onto a class
label directly without calculating probabilities

Slide from B.Yanik


Bayesian Belief Networks
⚫ Bayesian belief network allows a subset of the variables to
be conditionally independent
⚫ A graphical model of causal relationships
⚫ Represents dependency among the variables
⚫ Gives a specification of joint probability distribution

❑ Nodes: random variables


X Y ❑ Links: dependency
❑ X and Y are the parents of Z, and Y is
Z the parent of P
P ❑ No dependency between Z and P
❑ Has no loops or cycles

February 28, 2023 Data Mining: Concepts and Techniques 23


Bayesian Belief Network: An Example

Family The conditional probability table


Smoker
History (CPT) for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)

LC 0.8 0.5 0.7 0.1


LungCancer Emphysema ~LC 0.2 0.5 0.3 0.9

CPT shows the conditional probability for


each possible combination of its parents
Derivation of the probability of a
PositiveXRay Dyspnea particular combination of values of X,
from CPT:
n
P( x1 ,..., xn ) =  P( xi | Parents (Y i ))
Bayesian Belief Networks i =1

February 28, 2023 Data Mining: Concepts and Techniques 24


Bayesian network through an example by
creating a directed acyclic graph:
⚫ Example: Harry installed a new burglar alarm at his home to
detect burglary. The alarm reliably responds at detecting a
burglary but also responds for minor earthquakes. Harry has
two neighbors David and Sophia, who have taken a
responsibility to inform Harry at work when they hear the
alarm. David always calls Harry when he hears the alarm, but
sometimes he got confused with the phone ringing and calls
at that time too. On the other hand, Sophia likes to listen to
high music, so sometimes she misses to hear the alarm. Here
we would like to compute the probability of Burglary Alarm.

⚫ Problem: Calculate the probability that alarm has sounded,


but there is neither a burglary, nor an earthquake occurred,
and David and Sophia both called the Harry.

February 28, 2023 Data Mining: Concepts and Techniques 25


•The Bayesian network for the above problem is given below. The network
structure is showing that burglary and earthquake is the parent node of the alarm
and directly affecting the probability of alarm's going off, but David and Sophia's
calls depend on alarm probability.

•The network is representing that our assumptions do not directly perceive the
burglary and also do not notice the minor earthquake, and they also not confer
before calling.

•The conditional distributions for each node are given as conditional probabilities
table or CPT.

•Each row in the CPT must be sum to 1 because all the entries in the table
represent an exhaustive set of cases for the variable.

•In CPT, a boolean variable with k boolean parents contains 2K probabilities.


Hence, if there are two parents, then CPT will contain 4 probability values
List of all events occurring in this network:
•Burglary (B)
•Earthquake(E)
•Alarm(A)
•David Calls(D)
•Sophia calls(S)
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999


= 0.00068045.
CS- 510

DECISION TREE

Sandeep Chaurasia, SPSU


Intro.
• Decision tree learning is one of the most widely used and
practical methods for inductive inference.

• It is a method for approximating discrete-valued functions


that is robust to noisy data and capable of learning
disjunctive expressions.

• Decision tree learning is a method for approximating


discrete-valued target functions, in which the learned
function is represented by a decision tree.

• Learned trees can also be re-represented as sets of if-then


rules to improve human readability.
Sandeep Chaurasia, SPSU
An Example

Sandeep Chaurasia, SPSU


Decision Tree Representation
Decision tree representation:
– Each internal node tests an attribute
– Each branch corresponds to attribute value
– Each leaf node assigns a classification

(Outlook = Sunny Λ Humidity = Normal)


V (Outlook = Overcast)
V(Outlook = Rain Λ Wind = Weak)

An instance is classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of the attribute in
the given example. This process is then repeated for the subtree rooted at the new node.
Sandeep Chaurasia, SPSU
APPROPRIATE PROBLEMS FOR
DECISION TREE LEARNING
• Instances are represented by attribute-value
pairs.
• The target function has discrete output values.
• Disjunctive descriptions may be required.
• The training data may contain errors.
• The training data may contain missing
attribute values.

Sandeep Chaurasia, SPSU


DECISION TREE LEARNING ALGORITHM

Sandeep Chaurasia, SPSU


Target Concept : Play Tennis

Sandeep Chaurasia, SPSU


Which Attribute Is the Best Classifier?
• The central choice in the ID3 algorithm is selecting which attribute
to test at each node in the tree.

• We would like to select the attribute that is most useful for


classifying examples. What is a good quantitative measure of the
worth of an attribute?

• We will define a statistical property, called information gain, that


measures how well a given attribute separates the training
examples according to their target classification. Entropy is a
measure of uncertainty associated with a random variable

• ID3 uses this information gain measure to select among the


candidate attributes at each step while growing the tree.

Sandeep Chaurasia, SPSU


Entropy, that characterizes the (im)purity

• Given a collection S, containing positive and negative


examples of some target concept, the entropy of S relative to
this Boolean classification is where p(+), is the proportion of
positive examples in S and p(-), is the proportion of negative
examples in S.

• In all calculations involving entropy we define 0 log 0 to be 0.

Sandeep Chaurasia, SPSU


A typical Case:- 0log0

The entropy function relative


to a boolean classification,
as the proportion, p(+), of
positive examples varies
between 0 and 1.

Sandeep Chaurasia, SPSU


INFORMATION GAIN
Given entropy as a measure of the impurity in a collection of
training examples, we can now define a measure of the
effectiveness of an attribute in classifying the training data.
The measure we will use, called information gain, is simply the
expected reduction in entropy caused by partitioning the
examples according to this attribute.

Values(A) is the set of all possible values for attribute A, and S, is the
subset of S for which attribute A has value

Sandeep Chaurasia, SPSU


Sandeep Chaurasia, SPSU
Decision Tree - Classification

Sandeep Chaurasia, SPSU


A simple Example #2 Feature

Sandeep Chaurasia, SPSU


Sandeep Chaurasia, SPSU
Sandeep Chaurasia, SPSU
Sandeep Chaurasia, SPSU
Sandeep Chaurasia, SPSU
Sandeep Chaurasia, SPSU
Back to the problem

Sandeep Chaurasia, SPSU


Sandeep Chaurasia, SPSU
Sandeep Chaurasia, SPSU
Decision Tree - Regression
Sandeep Chaurasia
Decision Tree Algorithm
• The core algorithm for building decision trees called ID3 by J. R. Quinlan
which employs a top-down, greedy search through the space of possible
branches with no backtracking. The ID3 algorithm can be used to construct
a decision tree for regression by replacing Information Gain with Standard
Deviation Reduction.
a) Standard Deviation

Standard Deviation (S) is for tree building (branching).

Coefficient of Deviation (CV) is used to decide when to


stop branching. We can use Count (n) as well.

Average (Avg) is the value in the leaf nodes.


b) Standard deviation for two attributes (target and predictor):
Standard Deviation Reduction (SDR)
The standard deviation reduction is based on the
decrease in standard deviation after a dataset is split on
an attribute.
Step 1: The standard deviation of the target is calculated.

Step 2: The dataset is then split on the different attributes.


The standard deviation for each branch is calculated. The
resulting standard deviation is subtracted from the standard
deviation before the split. The result is the standard deviation
reduction.
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively
on the non-leaf branches, until all data is processed
• In practice, we need some termination criteria. For example, when
coefficient of deviation (CV) for a branch becomes smaller than a
certain threshold (e.g., 10%) and/or when too few instances (n)
remain in the branch (e.g., 3).
Step 4b: "Overcast" subset does not need any further splitting because its CV (8%) is less than the threshold
(10%). The related leaf node gets the average of the "Overcast" subset.
Step 4c: However, the "Sunny" branch has an CV (28%) more than the threshold (10%) which needs further splitting.
We select "Temp" as the best best node after "Outlook" because it has the largest SDR.
Because the number of data points
for both branches (FALSE and TRUE) is
equal or less than 3 we stop further
branching and assign the average of
each branch to the related leaf node.

Step 4d: Moreover, the "rainy" branch has


an CV (22%) which is more than the
threshold (10%). This branch needs
further splitting. We select "Temp" as the
best node because it has the largest SDR.
CART Decision Tree Example

An algorithm can be transparent only if its decisions can be read and understood by people
clearly. Even though deep learning is superstar of machine learning nowadays, it is an opaque
algorithm and we do not know the reason of decision. Herein, Decision tree algorithms still
keep their popularity because they can produce transparent decisions. ID3 uses information
gain whereas C4.5 uses gain ratio for splitting. Here, CART is an alternative decision tree
building algorithm. It can handle both classification and regression tasks. This algorithm uses
a new metric named gini index to create decision points for classification tasks. We will
mention a step by step CART decision tree example by hand from scratch.

We will work on same dataset in ID3. There are 14 instances of golf playing decisions based
on outlook, temperature, humidity and wind factors.

Day Outlook Temp. Humidity Wind Decision


1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Gini index

Gini index is a metric for classification tasks in CART. It stores sum of squared probabilities
of each class. We can formulate it as illustrated below.

Gini = 1 – Σ (Pi)2 for i=1 to number of classes

Outlook

Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final
decisions for outlook feature.
Number of
Outlook Yes No
instances
Sunny 2 3 5
Overcast 4 0 4
Rain 3 2 5

Gini(Outlook=Sunny) = 1 – (2/5)2 – (3/5)2 = 1 – 0.16 – 0.36 = 0.48

Gini(Outlook=Overcast) = 1 – (4/4)2 – (0/4)2 = 0

Gini(Outlook=Rain) = 1 – (3/5)2 – (2/5)2 = 1 – 0.36 – 0.16 = 0.48

Then, we will calculate weighted sum of gini indexes for outlook feature.

Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342

Temperature

Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot
and Mild. Let’s summarize decisions for temperature feature.

Number of
Temperature Yes No
instances
Hot 2 2 4
Cool 3 1 4
Mild 4 2 6

Gini(Temp=Hot) = 1 – (2/4)2 – (2/4)2 = 0.5

Gini(Temp=Cool) = 1 – (3/4)2 – (1/4)2 = 1 – 0.5625 – 0.0625 = 0.375

Gini(Temp=Mild) = 1 – (4/6)2 – (2/6)2 = 1 – 0.444 – 0.111 = 0.445

We’ll calculate weighted sum of gini index for temperature feature

Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439

Humidity

Humidity is a binary class feature. It can be high or normal.

Number of
Humidity Yes No
instances
High 3 4 7
Normal 6 1 7

Gini(Humidity=High) = 1 – (3/7)2 – (4/7)2 = 1 – 0.183 – 0.326 = 0.489


Gini(Humidity=Normal) = 1 – (6/7)2 – (1/7)2 = 1 – 0.734 – 0.02 = 0.244

Weighted sum for humidity feature will be calculated next

Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367

Wind

Wind is a binary class similar to humidity. It can be weak and strong.

Number of
Wind Yes No
instances
Weak 6 2 8
Strong 3 3 6

Gini(Wind=Weak) = 1 – (6/8)2 – (2/8)2 = 1 – 0.5625 – 0.062 = 0.375

Gini(Wind=Strong) = 1 – (3/6)2 – (3/6)2 = 1 – 0.25 – 0.25 = 0.5

Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

Time to decide

We’ve calculated gini index values for each feature. The winner will be outlook feature
because its cost is the lowest.

Gini
Feature
index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428

We’ll put outlook decision at the top of the tree.


First decision would be outlook feature

You might realize that sub dataset in the overcast leaf has only yes decisions. This means that
overcast leaf is over.

Tree is over for overcast outlook leaf

We will apply same principles to those sub datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for
temperature, humidity and wind features respectively.

Day Outlook Temp. Humidity Wind Decision


1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gini of temperature for sunny outlook

Number of
Temperature Yes No
instances
Hot 0 2 2
Cool 1 0 1
Mild 1 1 2

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)2 – (2/2)2 = 0

Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)2 – (0/1)2 = 0

Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)2 – (1/2)2 = 1 – 0.25 – 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2

Gini of humidity for sunny outlook

Number of
Humidity Yes No
instances
High 0 3 3
Normal 2 0 2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)2 – (3/3)2 = 0

Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)2 – (0/2)2 = 0

Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0

Gini of wind for sunny outlook

Number of
Wind Yes No
instances
Weak 1 2 3
Strong 1 1 2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)2 – (2/3)2 = 0.266

Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)2 – (1/2)2 = 0.2

Gini(Outlook=Sunny and Wind) = (3/5)x0.266 + (2/5)x0.2 = 0.466

Decision for sunny outlook

We’ve calculated gini index scores for feature when outlook is sunny. The winner is humidity
because it has the lowest value.

Gini
Feature
index
Temperature 0.2
Humidity 0
Wind 0.466

We’ll put humidity check at the extension of sunny outlook.

Sub datasets for high and normal humidity

As seen, decision is always no for high humidity and sunny outlook. On the other hand,
decision will always be yes for normal humidity and sunny outlook. This branch is over.
Decisions for high and normal humidity

Now, we need to focus on rain outlook.

Rain outlook

Day Outlook Temp. Humidity Wind Decision


4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No

We’ll calculate gini index scores for temperature, humidity and wind features when outlook
is rain.

Gini of temprature for rain outlook

Number of
Temperature Yes No
instances
Cool 1 1 2
Mild 2 1 3

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)2 – (1/2)2 = 0.5

Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)2 – (1/3)2 = 0.444

Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466


Gini of humidity for rain outlook

Number of
Humidity Yes No
instances
High 1 1 2
Normal 2 1 3

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)2 – (1/2)2 = 0.5

Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)2 – (1/3)2 = 0.444

Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466

Gini of wind for rain outlook

Number of
Wind Yes No
instances
Weak 3 0 3
Strong 0 2 2

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)2 – (0/3)2 = 0

Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)2 – (2/2)2 = 0

Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

Decision for rain outlook

The winner is wind feature for rain outlook because it has the minimum gini index score in
features.

Gini
Feature
index
Temperature 0.466
Humidity 0.466
Wind 0

Put the wind feature for rain outlook branch and monitor the new sub data sets.
Sub data sets for weak and strong wind and rain outlook

As seen, decision is always yes when wind is weak. On the other hand, decision is always no
if wind is strong. This means that this branch is over.

Final form of the decision tree built by CART algorithm


Data Science & Machine Learning
CS-3203
LECTURE-19-20: SUPPORT VECTOR MACHINE
(METHODS & EXAMPLES)

Spring 2023

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


INTRODUCTION: SUPPORT VECTOR MACHINE
 SVM is related to statistical learning theory

 SVM was first introduced in 1992.

 SVM becomes popular because of its success in


handwritten digit recognition
 1.1% test error rate for SVM. This is the same as the error
rates of a carefully constructed neural network (NN).

 SVM is now regarded as an important example of


“kernel methods”, one of the key area in machine
learning. 2
SUPPORT VECTOR MACHINE: LINEAR CLASSIFIERS

denotes +1
x f yest
denotes -1
Estimation:
f(w,b) = sign(w. x + b)
w: weight vector
Plane x: data vector
Separating different
classes How would you
classify this data?

3
a
LINEAR CLASSIFIERS
x f yest
f(w,b) = sign(w. x + b)
denotes +1
denotes -1

How would you


classify this data?

4
a
LINEAR CLASSIFIERS
x f yest
f(w,b) = sign(w. x + b)
denotes +1
denotes -1

How would you


classify this data?

5
a
LINEAR CLASSIFIERS
x f yest
f(w,b) = sign(w. x +b)
denotes +1
denotes -1

How would you


classify this data?

6
a
LINEAR CLASSIFIERS
x f yest
f(w,b) = sign(w. x + b)
denotes +1
denotes -1

Any of these
would be fine..

..but which is
best?

7
a
CLASSIFIER MARGIN
x f yest
f(w,b) = sign(w. x +b)
denotes +1
denotes -1 Define the margin
of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.

8
a
MAXIMUM MARGIN
x f yest
f(w,b) = sign(w. x - b)
denotes +1
denotes -1 The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM) 9
Linear SVM
a
MAXIMUM MARGIN
x f yest
f(w,b) = sign(w. x + b)
denotes +1
denotes -1 The maximum
margin linear
classifier is the
linear classifier
Support Vectors with the, um,
are those data
points that the maximum margin.
margin pushes This is the
up against
simplest kind of
SVM (Called an
LSVM) 10
Linear SVM
HYPERPLANE : NUMERICAL

The idea behind SVMs is to make use of a (nonlinear)


mapping function Φ that transforms data in input space to
data in feature space in such a way as to render a problem
linearly separable.
The SVM then automatically discovers the optimal
separating hyperplane (which, when mapped back into
input space via φ−1, can be a complex decision surface).

11
HYPERPLANE : NUMERICAL-1

12
HYPERPLANE : NUMERICAL
We would like to discover a simple SVM that accurately
discriminates the two classes. Since the data is linearly separable,
we can use a linear SVM (that is, one whose mapping function Φ()
is the identity function). By inspection, it should be obvious that
there are three support vectors (see Figure 2):

13
HYPERPLANE : NUMERICAL-1

14
HYPERPLANE : NUMERICAL-1

15
HYPERPLANE : NUMERICAL-1

16
HYPERPLANE : NUMERICAL-1

17
HYPERPLANE : NUMERICAL-2

18
HYPERPLANE : NUMERICAL-2
Our goal, again, is to discover a separating hyperplane that
accurately discriminates the two classes. Of course, it is
obvious that no such hyperplane exists in the input space (that
is, in the space in which the original input data live). Therefore,
we must use a nonlinear SVM (that is, one whose mapping
function is a nonlinear mapping from input space into some
feature space).
Define

19
HYPERPLANE : NUMERICAL-2

20
HYPERPLANE : NUMERICAL-2
We again use vectors augmented with a 1 as a bias input and will
differentiate them as before. Now given the [augmented] support
vectors, we must again and values for the αi

21
HYPERPLANE : NUMERICAL-2

22
HYPERPLANE : NUMERICAL-2

23
WHY MAXIMUM MARGIN?

f(w,b) = sign(w. x + b)
denotes +1
denotes -1 The maximum
margin linear
classifier is the
linear classifier
Support Vectors with the,
are those
datapoints that maximum margin.
the margin This is the
pushes up
against simplest kind of
SVM (Called an
LSVM) 24
How to calculate the distance from a point to a line?

denotes +1
denotes -1
x
wx +b = 0

X – Vector
W
W – Normal Vector
b – Scale Value (bias)

◼ In our case, w1*x1+w2*x2+b=0,


◼ thus, w=(w1,w2), x=(x1,x2)

25
ESTIMATE THE MARGIN

denotes +1
denotes -1 x
wx +b = 0

X – Vector
W
W – Normal Vector
b – Scale Value

 What is the distance expression for a point x to a line


wx+b= 0?
xw +b xw +b
d ( x) = =

2 d 2 26
w w
i =1 i
2
LARGE-MARGIN DECISION BOUNDARY
 The decision boundary should be as far away from the
data of both classes as possible
 We should maximize the margin, m
 Distance between the origin and the line wTx=-b is b/||w||

Class 2

Class 1
m
27
FINDING THE DECISION BOUNDARY

 Let {x1, ..., xn} be our data set and let yi  {1,-1} be the
class label of xi
 The decision boundary should classify all points
correctly 
 To see this:

when y= -1, we wish (wx+b)<1,


when y =1, we wish (wx+b)>1.
For support vectors, we wish y(wx+b)=1.
 The decision boundary can be found by solving the
following constrained optimization problem

28
NEXT STEP… OPTIONAL

 Converting SVM to a form we can solve


 Dual form
 Allowing a few errors
 Soft margin
 Allowing nonlinear boundary
 Kernel functions

29
DERIVATION

30
DERIVATION

31
DERIVATION

32
DERIVATION

33
DERIVATION

34
DERIVATION

35
DERIVATION

36
DERIVATION

37
DERIVATION

38
DERIVATION

39
DERIVATION

40
DERIVATION

41
DERIVATION

42
SVM : EXAMPLE

43
SVM : EXAMPLE

44
SVM : EXAMPLE

45
SVM : EXAMPLE

46
THE DUAL PROBLEM
 The new objective function is in terms of ai only
 It is known as the dual problem: if we know w, we know
all ai; if we know all ai, we know w
 The original problem is known as the primal problem

 The objective function of the dual problem needs to be


maximized!
 The dual problem is therefore:

Properties of ai when we introduce The result when we differentiate


47the
the Lagrange multipliers original Lagrangian w.r.t. b
THE DUAL PROBLEM

 This is a quadratic programming (QP) problem


 A global maximum of ai can always be found

 w can be recovered by

48
CHARACTERISTICS OF THE SOLUTION
 Many of the ai are zero (see next page for example)
 w is a linear combination of a small number of data points
 This “sparse” representation can be viewed as data
compression as in the construction of KNN classifier
 xi with non-zero ai are called support vectors (SV)
 The decision boundary is determined only by the SV
 Let tj (j=1, ..., s) be the indices of the s support vectors. We
can write
 For testing with a new data z
 Compute and
classify z as class 1 if the sum is positive, and class 2
otherwise
 Note: w need not be formed explicitly 49
SUPPORT VECTOR: GEOMETRICAL INTERPRETATION

Class 2

a8=0.6 a10=0

a7=0
a2=0
a5=0

a1=0.8
a4=0
a6=1.4
a9=0
a3=0
Class 1
50
LINEAR SVM: EXAMPLE

51
LINEAR SVM: EXAMPLE
 We would like to discover a simple SVM that accurately
discriminates the two classes. Since the data is linearly
separable, we can use a linear SVM.
 By inspection, it is obvious that there are three support
vectors.

52
LINEAR SVM: EXAMPLE

 In what follows we will use vectors augmented with a


1 as a bias input, and for clarity we will differentiate
these with an over-tilde.
So, if s1 = (10), then ~ s1 = (101).
 task is to find values for the i such that (based on SVM
architecture)

53
SUPPORT VECTOR ARCHITECTURE

54
EXAMPLE CONTINUES…

55
ALLOWING ERRORS IN OUR SOLUTIONS
We allow “error” xi in classification; it is based on
the output of the discriminant function wTx+b
 xi approximates the number of misclassified
samples

Class 2

56
Class 1
SOFT MARGIN HYPERPLANE
 If we minimize ixi, xi can be computed by

 xi are “slack variables” in optimization


 Note that xi=0 if there is no error for xi
 xi is an upper bound of t he number of errors

 We want to minimize
 C : tradeoff parameter between error and margin
 The optimization problem becomes
57
EXTENSION TO NON-LINEAR DECISION BOUNDARY

 So far, we have only considered large-margin classifier


with a linear decision boundary

 How to generalize it to become nonlinear?

 Key idea: transform xi to a higher dimensional space to


“make life easier”
 Input space: the space the point xi are located
 Feature space: the space of f(xi) after
transformation
58
TRANSFORMING THE DATA
f( )
f( ) f( )
f( ) f( ) f( )
f(.) f( )
f( ) f( )
f( ) f( )
f( ) f( )
f( ) f( ) f( )
f( )
f( )

Input space Feature space


Note: feature space is of higher dimension
than the input space in practice

 Computation in the feature space can be costly because it


is high dimensional
 The feature space is typically infinite-dimensional!
 The kernel trick comes to rescue
59
THE KERNEL TRICK
 Recall the SVM optimization problem

 The data points only appear as inner product


 As long as we can calculate the inner product in the
feature space, we do not need the mapping explicitly
 Many common geometric operations (angles,
distances) can be expressed by inner products
 Define the kernel function K by
60
AN EXAMPLE FOR F(.) AND K(.,.)
 Suppose f(.) is given as follows

 An inner product in the feature space is

 So, if we define the kernel function as follows, there


is no need to carry out f(.) explicitly

 This use of kernel function to avoid carrying out f(.)


explicitly is known as the kernel trick 61
MORE ON KERNEL FUNCTIONS

 Not all similarity measures can be used as kernel


function, however
 The kernel function needs to satisfy the Mercer function,
i.e., the function is “positive-definite”
 This implies that
 the n by n kernel matrix,
 in which the (i,j)-th entry is the K(xi, xj), is always positive

definite
 This also means that optimization problem can be
solved in polynomial time!

62
EXAMPLES OF KERNEL FUNCTIONS

 Polynomial kernel with degree d

 Gaussian: Radial basis function kernel with width s

 Closely related to radial basis function neural networks


 The feature space is infinite-dimensional

 Sigmoid with parameter k and q

 It does not satisfy the Mercer condition on all k and q


63
Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped to
some higher-dimensional feature space where the training set is
separable:

Φ: x → φ(x)

64
EXAMPLE
 Suppose we have 5 one-dimensional data points
 x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4,
5 as class 2  y1=1, y2=1, y3=-1, y4=-1, y5=1
 We use the polynomial kernel of degree 2
 K(x,y) = (xy+1)2
 C is set to 100
 We first find ai (i=1, …, 5) by

65
EXAMPLE
 By using a Quadratic (QP) solver, we get
 a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833
 Note that the constraints are indeed satisfied
 The support vectors are {x2=2, x4=5, x5=6}
 The discriminant function is

 b is recovered by solving f(2)=1 or by f(5)=-1 or by f(6)=1,


as x2 and x5 lie on the line and x4
lies on the line
 All three give b=9
66
EXAMPLE

Value of discriminant function

class 1 class 2 class 1

1 2 4 5 6

67
DEGREE OF POLYNOMIAL FEATURES

X^1 X^2 X^3

68
X^4 X^5 X^6
CHOOSING THE KERNEL FUNCTION
 Probably the most tricky part of using SVM.

69
SUMMARY: STEPS FOR CLASSIFICATION
 Prepare the pattern matrix
 Select the kernel function to use

 Select the parameter of the kernel function and


the value of C
 You can use the values suggested by the SVM
software, or you can set apart a validation set to
determine the values of the parameter
 Execute the training algorithm and obtain the ai
 Unseen data can be classified using the ai and
the support vectors
70
APPENDIX: DISTANCE FROM A POINT TO A
LINE
 Equation for the line: let u be a variable, then any
point on the line can be described as:
 P = P1 + u (P2 - P1)
 Let the intersect point be u, P2
 Then, u can be determined by:
 The two vectors (P2-P1) is orthogonal to P3-u: P
 That is,
 (P3-P) dot (P2-P1) =0
 P=P1+u(P2-P1)

 P1=(x1,y1),P2=(x2,y2),P3=(x3,y3) P3
P1

71
DISTANCE AND MARGIN

 x = x1 + u (x2 - x1)
y = y1 + u (y2 - y1)

 The distance therefore between the point P3 and the


line is the distance between P=(x,y) above and P3
 Thus,
 d= |(P3-P)|=

72
ARTIFICIAL NEURAL
NETWORKS: AN
INTRODUCTION

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
DEFINITION OF NEURAL NETWORKS
According to the DARPA Neural Network Study (1988, AFCEA
International Press, p. 60):

• ... a neural network is a system composed of many simple processing


elements operating in parallel whose function is determined by network
structure, connection strengths, and the processing performed at
computing elements or nodes.

According to Haykin (1994), p. 2:

A neural network is a massively parallel distributed processor that has a


natural propensity for storing experiential knowledge and making it
available for use. It resembles the brain in two respects:
• Knowledge is acquired by the network through a learning process.
• Interneuron connection strengths known as synaptic weights are
used to store the knowledge.
“Principles of Soft Computing, 2
nd Edition”
by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
BRAIN COMPUTATION
The human brain contains about 10 billion nerve cells, or
neurons. On average, each neuron is connected to other
neurons through approximately 10,000 synapses.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
INTERCONNECTIONS IN BRAIN

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
BIOLOGICAL (MOTOR) NEURON

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
ARTIFICIAL NEURAL NET
 Information-processing system.

 Neurons process the information.

 The signals are transmitted by means of connection links.

 The links possess an associated weight.

 The output signal is obtained by applying activations to the net


input.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
MOTIVATION FOR NEURAL NET

 Scientists are challenged to use machines more effectively for


tasks currently solved by humans.

 Symbolic rules don't reflect processes actually used by humans.

 Traditional computing excels in many areas, but not in others.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
The major areas being:

 Massive parallelism

 Distributed representation and computation

 Learning ability

 Generalization ability

 Adaptivity

 Inherent contextual information processing

 Fault tolerance

 Low energy consumption.


“Principles of Soft Computing, 2nd Edition”
by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
ARTIFICIAL NEURAL NET

W1
X1 Y

W2
X2

The figure shows a simple artificial neural net with two input neurons
(X1, X2) and one output neuron (Y). The inter connected weights are
given by W1 and W2.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
ASSOCIATION OF BIOLOGICAL NET
WITH ARTIFICIAL NET

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
PROCESSING OF AN ARTIFICIAL NET
The neuron is the basic information processing unit of a NN. It consists
of:

1. A set of links, describing the neuron inputs, with weights W1, W2,
…, Wm.

2. An adder function (linear combiner) for computing the weighted


sum of the inputs (real numbers):
m
u   W jX j
j 1

3. Activation function for limiting the amplitude of the neuron output.


y   (u  b)

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
BIAS OF AN ARTIFICIAL NEURON

The bias value is added to the weighted sum

∑wixi so that we can transform it from the origin.

Yin = ∑wixi + b, where b is the bias


x1-x2= -1
x2 x1-x2=0

x1-x2= 1

x1

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
MULTI LAYER ARTIFICIAL NEURAL NET
INPUT: records without class attribute with normalized attributes
values.

INPUT VECTOR: X = { x1, x2, …, xn} where n is the number of


(non-class) attributes.

INPUT LAYER: there are as many nodes as non-class attributes, i.e.


as the length of the input vector.

HIDDEN LAYER: the number of nodes in the hidden layer and the
number of hidden layers depends on implementation.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
OPERATION OF A NEURAL NET

- Bias
x0 w0j


x1 w1j
f
Output y
xn wnj

Input Weight Weighted Activation


vector x vector w sum function

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
WEIGHT AND BIAS UPDATION
Per Sample Updating

• updating weights and biases after the presentation of each sample.

Per Training Set Updating (Epoch or Iteration)

• weight and bias increments could be accumulated in variables and


the weights and biases updated after all the samples of the
training set have been presented.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
STOPPING CONDITION

 All change in weights (wij) in the previous epoch are below some
threshold, or

 The percentage of samples misclassified in the previous epoch is


below some threshold, or

 A pre-specified number of epochs has expired.

 In practice, several hundreds of thousands of epochs may be


required before the weights will converge.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
BUILDING BLOCKS OF ARTIFICIAL NEURAL NET
 Network Architecture (Connection between Neurons)

 Setting the Weights (Training)

 Activation Function

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
“Principles of Soft Computing, 2nd Edition”
by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
LAYER PROPERTIES
 Input Layer: Each input unit may be designated by an attribute
value possessed by the instance.

 Hidden Layer: Not directly observable, provides nonlinearities for


the network.

 Output Layer: Encodes possible values.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
TRAINING PROCESS
 Supervised Training - Providing the network with a series of
sample inputs and comparing the output with the expected
responses.

 Unsupervised Training - Most similar input vector is assigned to


the same output unit.

 Reinforcement Training - Right answer is not provided but


indication of whether ‘right’ or ‘wrong’ is provided.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
ACTIVATION FUNCTION
 ACTIVATION LEVEL – DISCRETE OR CONTINUOUS

 HARD LIMIT FUCNTION (DISCRETE)


• Binary Activation function
• Bipolar activation function
• Identity function

 SIGMOIDAL ACTIVATION FUNCTION (CONTINUOUS)


• Binary Sigmoidal activation function
• Bipolar Sigmoidal activation function

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
ACTIVATION FUNCTION

Activation functions:

(A) Identity

(B) Binary step

(C) Bipolar step

(D) Binary sigmoidal

(E) Bipolar sigmoidal

(F) Ramp

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
CONSTRUCTING ANN
 Determine the network properties:
• Network topology
• Types of connectivity
• Order of connections
• Weight range

 Determine the node properties:


• Activation range

 Determine the system dynamics


• Weight initialization scheme
• Activation – calculating formula
• Learning rule

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
PROBLEM SOLVING
 Select a suitable NN model based on the nature of the problem.

 Construct a NN according to the characteristics of the application


domain.

 Train the neural network with the learning procedure of the


selected model.

 Use the trained network for making inference or solving problems.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
NEURAL NETWORKS
 Neural Network learns by adjusting the weights so as to be able
to correctly classify the training data and hence, after testing phase,
to classify unknown data.

 Neural Network needs long time for training.

 Neural Network has a high tolerance to noisy and incomplete


data.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
McCULLOCH–PITTS NEURON
 Neurons are sparsely and randomly connected

 Firing state is binary (1 = firing, 0 = not firing)

 All but one neuron are excitatory (tend to increase voltage of other
cells)

• One inhibitory neuron connects to all other neurons


• It functions to regulate network activity (prevent too many
firings)

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
LINEAR SEPARABILITY

 Linear separability is the concept wherein the separation of the


input space into regions is based on whether the network response
is positive or negative.

 Consider a network having


positive response in the first
quadrant and negative response
in all other quadrants (AND
function) with either binary or
bipolar data, then the decision
line is drawn separating the
positive response region from
the negative response region.

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
HEBB NETWORK
Donald Hebb stated in 1949 that in the brain, the learning is performed
by the change in the synaptic gap. Hebb explained it:

“When an axon of cell A is near enough to excite cell B, and repeatedly


or permanently takes place in firing it, some growth process or
metabolic change takes place in one or both the cells such that A’s
efficiency, as one of the cells firing B, is increased.”

“Principles of Soft Computing, 2nd Edition”


by S.N. Sivanandam & SN Deepa
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
HEBB LEARNING

• The weights between neurons whose activities are positively correlated are increased:

  dw_ij/dt ∝ correlation(x_i, x_j)

• Associative memory is produced automatically.
• The Hebb rule can be used for pattern association, pattern categorization, pattern classification and over a range of other areas.
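
As a rough numeric illustration of the rate rule above (the step constant eta and the toy activity traces are assumptions for the sketch, not values from the slide):

    # Hebbian growth: dw/dt ~ correlation(x_i, x_j), approximated per step
    eta = 0.1                      # assumed learning-rate constant
    x_i = [1, 1, -1, 1, -1]        # toy activity trace of neuron i
    x_j = [1, 1, -1, -1, -1]       # toy activity trace of neuron j

    w = 0.0
    for a, b in zip(x_i, x_j):
        w += eta * a * b           # grows when the activities are correlated
    print(w)                       # 0.1 * (1 + 1 + 1 - 1 + 1) = 0.3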

FEW APPLICATIONS OF NEURAL NETWORKS

... for their application to optimization. The field of probabilistic reasoning is also sometimes included under the soft computing umbrella for its control of randomness and uncertainty. The importance of soft computing lies in using these methodologies in partnership - they all offer their own benefits which are generally not competitive and can therefore work together. As a result, several hybrid systems were looked at - systems in which such partnerships exist.
2. Artificial Neural Network: An Introduction

Learning Objectives

• The fundamentals of artificial neural networks.
• The evolution of neural networks.
• Comparison between biological neuron and artificial neuron.
• Basic models of artificial neural networks.
• The different types of connections of neural networks, learning and activation functions are included.
• Various terminologies and notations used throughout the text.
• The basic fundamental neuron model - McCulloch-Pitts neuron and Hebb network.
• The concept of linear separability to form decision boundary regions.
2.1 Fundamental Concept

Neural networks are information processing systems which are constructed and implemented to model the human brain. The main objective of neural network research is to develop a computational device for modeling the brain to perform various computational tasks at a faster rate than traditional systems. Artificial neural networks perform various tasks such as pattern matching and classification, optimization, function approximation, vector quantization and data clustering. These tasks are difficult for traditional computers, which are faster in algorithmic computational tasks and precise arithmetic operations. Therefore, for the implementation of artificial neural networks, high-speed digital computers are used, which makes the simulation of neural processes feasible.

2.1.1 Artificial Neural Network

As already stated in Chapter 1, an artificial neural network (ANN) is an efficient information processing system which resembles in characteristics a biological neural network. ANNs possess a large number of highly interconnected processing elements called nodes or units or neurons, which usually operate in parallel and are configured in regular architectures. Each neuron is connected with the others by a connection link. Each connection link is associated with weights which contain information about the input signal. This information is used by the neuron net to solve a particular problem. ANNs' collective behavior is characterized by their ability to learn, recall and generalize training patterns or data similar to that of a human brain. They have the capability to model networks of original neurons as found in the brain. Thus, the ANN processing elements are called neurons or artificial neurons.
It should be noted that each neuron has an internal state of its own. This internal state is called the activation or activity level of the neuron, which is the function of the inputs the neuron receives. The activation signal of a neuron is transmitted to other neurons. Remember, a neuron can send only one signal at a time, which can be transmitted to several other neurons.

To depict the basic operation of a neural net, consider a set of neurons, say X1 and X2, transmitting signals to another neuron, Y. Here X1 and X2 are input neurons, which transmit signals, and Y is the output neuron, which receives signals. Input neurons X1 and X2 are connected to the output neuron Y over weighted interconnection links (W1 and W2) as shown in Figure 2-1.

Figure 2-1 Architecture of a simple artificial neuron net.

For the above simple neuron net architecture, the net input has to be calculated in the following way:

y_in = x1w1 + x2w2

where x1 and x2 are the activations of the input neurons X1 and X2, i.e., the output of the input signals. The output y of the output neuron Y can be obtained by applying the activation function over the net input, i.e., the function of the net input:

y = f(y_in)
Output = Function (net input calculated)

The function to be applied over the net input is called the activation function. There are various activation functions, which will be discussed in the forthcoming sections. The above calculation of the net input is similar to the calculation of the output of a pure linear straight line equation (y = mx). The neural net of a pure linear equation is as shown in Figure 2-2. Here, to obtain the output y, the slope m is directly multiplied with the input signal. This is a linear equation. Thus, when slope and input are linearly varied, the output is also linearly varied, as shown in Figure 2-3. This shows that the weight involved in the ANN is equivalent to the slope of the linear straight line.

Figure 2-2 Neural net of pure linear equation. Figure 2-3 Graph for y = mx.

2.1.2 Biological Neural Network

It is well known that the human brain consists of a huge number of neurons, approximately 10^11, with numerous interconnections. A schematic diagram of a biological neuron is shown in Figure 2-4.

Figure 2-4 Schematic diagram of a biological neuron.

The biological neuron depicted in Figure 2-4 consists of three main parts:

1. Soma or cell body - where the cell nucleus is located.
2. Dendrites - where the nerve is connected to the cell body.
3. Axon - which carries the impulses of the neuron.

Dendrites are tree-like networks made of nerve fiber connected to the cell body. An axon is a single, long connection extending from the cell body and carrying signals from the neuron. The end of the axon splits into fine strands. It is found that each strand terminates into a small bulb-like organ called a synapse. It is through the synapse that the neuron introduces its signals to other neurons. The receiving ends of these synapses on the nearby neurons can be found both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the human brain.

Electric impulses are passed between the synapse and the dendrites. This type of signal transmission involves a chemical process in which specific transmitter substances are released from the sending side of the junction. This results in an increase or decrease in the electric potential inside the body of the receiving cell. If the electric potential reaches a threshold then the receiving cell fires and a pulse or action potential of fixed strength and duration is sent out through the axon to the synaptic junctions of the other cells. After firing, a cell has to wait for a period of time called the refractory period before it can fire again. The synapses are said to be inhibitory if they let passing impulses hinder the firing of the receiving cell, or excitatory if they let passing impulses cause the firing of the receiving cell.
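
A minimal sketch of the Figure 2-1 computation in Python (the input, weight and threshold values are arbitrary illustrations; the binary step is one possible choice for the activation f):

    # Net input of the simple two-input neuron: y_in = x1*w1 + x2*w2
    x1, x2 = 0.5, 0.8          # assumed activations of input neurons X1, X2
    w1, w2 = 0.4, -0.3         # assumed weights on the interconnection links

    y_in = x1 * w1 + x2 * w2   # net input to output neuron Y

    # Output y = f(y_in); here f is a binary step with threshold 0 (an assumption)
    y = 1 if y_in >= 0 else 0
    print(y_in, y)             # -0.04 -> 0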

Figure 2-5 Mathematical model of artificial neuron.

Figure 2-5 shows a mathematical representation of the above-discussed chemical processing taking place in an artificial neuron. In this model, the net input is elucidated as

y_in = x1w1 + x2w2 + ··· + xnwn = Σ_{i=1}^{n} x_i w_i

where i represents the ith processing element. The activation function is applied over it to calculate the output. The weight represents the strength of the synapse connecting the input and the output neurons. A positive weight corresponds to an excitatory synapse, and a negative weight corresponds to an inhibitory synapse.

The terms associated with the biological neuron and their counterparts in the artificial neuron are presented in Table 2-1.

Table 2-1 Terminology relationships between biological and artificial neurons

Biological neuron    Artificial neuron
Cell                 Neuron
Dendrites            Weights or interconnections
Soma                 Net input
Axon                 Output

2.1.3 Brain vs. Computer - Comparison Between Biological Neuron and Artificial Neuron

A comparison could be made between biological and artificial neurons on the basis of the following criteria:

1. Speed: The cycle time of execution in the ANN is of a few nanoseconds whereas in the case of the biological neuron it is of a few milliseconds. Hence, the artificial neuron modeled using a computer is faster.

2. Processing: Basically, the biological neuron can perform massive parallel operations simultaneously. The artificial neuron can also perform several parallel operations simultaneously, but, in general, the artificial neuron network process is faster than that of the brain.

3. Size and complexity: The total number of neurons in the brain is about 10^11 and the total number of interconnections is about 10^15. Hence, it can be noted that the complexity of the brain is comparatively higher, i.e., the computational work takes place not only in the brain cell body but also in the axon, synapse, etc. On the other hand, the size and complexity of an ANN is based on the chosen application and the network designer. The size and complexity of a biological neural network is more than that of an artificial neural network.

4. Storage capacity (memory): The biological neuron stores the information in its interconnections or in synapse strength, but in an artificial neuron it is stored in its contiguous memory locations. In an artificial neuron, the continuous loading of new information may sometimes overload the memory locations. As a result, some of the addresses containing older memory locations may be destroyed. But in the case of the brain, new information can be added in the interconnections by adjusting the strength without destroying the older information. A disadvantage related to the brain is that sometimes its memory may fail to recollect the stored information, whereas in an artificial neuron, once the information is stored in its memory locations, it can be retrieved. Owing to these facts, the adaptability is more toward an artificial neuron.

5. Tolerance: The biological neuron possesses a fault tolerant capability whereas the artificial neuron has no fault tolerance. The distributed nature of the biological neurons enables them to store and retrieve information even when the interconnections in them get disconnected; thus biological neurons are fault tolerant. But in the case of artificial neurons, the information gets corrupted if the network interconnections are disconnected. Biological neurons can accept redundancies, which is not possible in artificial neurons. Even when some cells die, the human nervous system appears to be performing with the same efficiency.

6. Control mechanism: In an artificial neuron modeled using a computer, there is a control unit present in the Central Processing Unit, which can transfer and control precise scalar values from unit to unit, but there is no such control unit for monitoring in the brain. The strength of a neuron in the brain depends on the active chemicals present and whether the neuron connections are strong or weak as a result of structure rather than individual synapses. However, the ANN possesses simpler interconnections and is free from chemical actions similar to those taking place in the brain (biological neuron). Thus, the control mechanism of an artificial neuron is very simple compared to that of a biological neuron.

So, we have gone through a comparison between ANNs and biological neural networks. In short, we can say that an ANN possesses the following characteristics:

1. It is a neurally implemented mathematical model.
2. There exist a large number of highly interconnected processing elements called neurons in an ANN.
3. The interconnections with their weighted linkages hold the informative knowledge.
4. The input signals arrive at the processing elements through connections and connecting weights.
5. The processing elements of the ANN have the ability to learn, recall and generalize from the given data by suitable assignment or adjustment of weights.
6. The computational power can be demonstrated only by the collective behavior of neurons, and it should be noted that no single neuron carries specific information.

The above-mentioned characteristics make the ANNs connectionist models, parallel distributed processing models, self-organizing systems, neuro-computing systems and neuro-morphic systems.

2.2 Evolution of Neural Networks

The evolution of neural networks has been facilitated by the rapid development of the architectures and algorithms that are currently being used. The history of the development of neural networks along with the names of their designers is outlined in Table 2-2. In the later years, the discovery of the neural net resulted in the implementation of optical neural nets, the Boltzmann machine, spatiotemporal nets, pulsed neural networks and support vector machines.

Table 2-2 Evolution of neural networks

• 1943 - McCulloch and Pitts neuron (McCulloch and Pitts): The arrangement of neurons in this case is a combination of logic functions. The unique feature of this neuron is the concept of threshold.
• 1949 - Hebb network (Hebb): It is based upon the fact that if two neurons are found to be active simultaneously then the strength of the connection between them should be increased.
• 1958, 1959, 1962, 1988 - Perceptron (Frank Rosenblatt, Block, Minsky and Papert): Here the weights on the connection path can be adjusted.
• 1960 - Adaline (Widrow and Hoff): Here the weights are adjusted to reduce the difference between the net input to the output unit and the desired output. The result here is very negligible; a mean squared error is obtained.
• 1972 - Kohonen self-organizing feature map (Kohonen): The concept behind this network is that the inputs are clustered together to obtain a fired output neuron. The clustering is performed by a winner-take-all policy.
• 1982, 1984, 1985, 1986, 1987 - Hopfield network (John Hopfield and Tank): This neural network is based on fixed weights. These nets can also act as associative memory nets.
• 1986 - Back-propagation network (Rumelhart, Hinton and Williams): This network is multilayer, with the error being propagated backwards from the output units to the hidden units.
• 1988 - Counter-propagation network (Grossberg): This network is similar to the Kohonen network; here the learning occurs for all units in a particular layer, and there exists no competition among these units.
• 1987-1990 - Adaptive Resonance Theory, ART (Carpenter and Grossberg): The ART network is designed for both binary inputs and analog-valued inputs. Here the input patterns can be presented in any order.
• 1988 - Radial basis function network (Broomhead and Lowe): This resembles a back-propagation network, but the activation function used is a Gaussian function.
• 1988 - Neocognitron (Fukushima): This network is essential for character recognition. The deficiency that occurred in the cognitron network (1975) was corrected by this network.

2.3 Basic Models of Artificial Neural Network

The models of ANN are specified by the three basic entities, namely:

1. the model's synaptic interconnections;
2. the training or learning rules adopted for updating and adjusting the connection weights;
3. their activation functions.

2.3.1 Connections

The neurons should be visualized for their arrangements in layers. An ANN consists of a set of highly interconnected processing elements (neurons) such that each processing element's output is found to be connected through weights to the other processing elements or to itself; delay lead and lag-free connections are allowed. Hence, the arrangements of these processing elements and the geometry of their interconnections are essential for an ANN. The point where a connection originates and terminates should be noted, and the function of each processing element in an ANN should be specified.

Besides the simple neuron shown in Figure 2-1, there exist several other types of neural network connections. The arrangement of neurons to form layers and the connection pattern formed within and between layers is called the network architecture. There exist five basic types of neuron connection architectures. They are:

1. single-layer feed-forward network;
2. multilayer feed-forward network;
3. single node with its own feedback;
4. single-layer recurrent network;
5. multilayer recurrent network.

Figures 2-6 to 2-10 depict the five types of neural network architectures. Basically, neural nets are classified into single-layer or multilayer neural nets. A layer is formed by taking a processing element and combining it with other processing elements. Practically, a layer implies a stage: going stage by stage, the input stage and the output stage are linked with each other. These linked interconnections lead to the formation of various network architectures. When a layer of the processing nodes is formed, the inputs can be connected to these

Figure 2-6 Single-layer feed-forward network.
nodes with various weights, resulting in a series of output neurons; thus, a single-layer feed-forward network is formed.

Figure 2-7 Multilayer feed-forward network.

A multilayer feed-forward network (Figure 2-7) is formed by the interconnection of several layers. The input layer is that which receives the input, and this layer has no function except buffering the input signal. The output layer generates the output of the network. Any layer that is formed between the input and output layers is called a hidden layer. This hidden layer is internal to the network and has no direct contact with the external environment. It should be noted that there may be zero to several hidden layers in an ANN. The more the number of hidden layers, the more is the complexity of the network. This may, however, provide an efficient output response. In the case of a fully connected network, every output from one layer is connected to each and every node in the next layer.

A network is said to be a feed-forward network if no neuron in the output layer is an input to a node in the same layer or in the preceding layer. On the other hand, when outputs can be directed back as inputs to same or preceding layer nodes, then it results in the formation of feedback networks. If the feedback of the output of the processing elements is directed back as input to the processing elements in the same layer then it is called lateral feedback. Recurrent networks are feedback networks with closed loop. Figure 2-8(A) shows a simple recurrent neural network having a single neuron with feedback to itself. Figure 2-9 shows a single-layer network with a feedback connection in which a processing element's output can be directed back to the processing element itself, to another processing element, or to both.

Figure 2-8 (A) Single node with its own feedback. (B) Competitive nets.

Figure 2-9 Single-layer recurrent network.

The architecture of a competitive layer is shown in Figure 2-8(B), the competitive interconnections having fixed weights of -ε. This net is called Maxnet, and will be discussed in the unsupervised learning network category. Apart from the network architectures discussed so far, there also exists another type of architecture with lateral feedback, which is called the on-center-off-surround or lateral inhibition structure. In this

Figure 2-10 Multilayer recurrent network.
structure, each processing neuron receives two different classes of inputs - "excitatory" input from nearby processing elements and "inhibitory" inputs from more distantly located processing elements. This type of interconnection is shown in Figure 2-11.

Figure 2-11 Lateral inhibition (on-center-off-surround) structure.

In Figure 2-11, the connections with open circles are excitatory connections and the links with solid connective circles are inhibitory connections. From Figure 2-10, it can be noted that a processing element output can be directed back to the nodes in a preceding layer, forming a multilayer recurrent network. Also, in these networks, a processing element output can be directed back to the processing element itself and to other processing elements in the same layer. Thus, the various network architectures as discussed in Figures 2-6 to 2-11 can be suitably used for giving an effective solution to a problem by using an ANN.

2.3.2 Learning

The main property of an ANN is its capability to learn. Learning or training is a process by means of which a neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production of the desired response. Broadly, there are two kinds of learning in ANNs:

1. Parameter learning: It updates the connecting weights in a neural net.
2. Structure learning: It focuses on the change in network structure (which includes the number of processing elements as well as their connection types).

The above two types of learning can be performed simultaneously or separately. Apart from these two categories, the learning in an ANN can be generally classified into three categories: supervised learning, unsupervised learning and reinforcement learning. Let us discuss these learning types in detail.

2.3.2.1 Supervised Learning

The learning here is performed with the help of a teacher. Let us take the example of the learning process of a small child. The child doesn't know how to read or write. He or she is taught by the parents at home and by the teacher in school. The children are trained and molded to recognize the alphabet, numerals, etc. Their each and every action is supervised by a teacher. Actually, a child works on the basis of the output that he or she has to produce. All these real-time events involve supervised learning methodology. Similarly, in ANNs following supervised learning, each input vector requires a corresponding target vector, which represents the desired output. The input vector along with the target vector is called a training pair. The network here is informed precisely about what should be emitted as output. The block diagram of Figure 2-12 depicts the working of a supervised learning network.

Figure 2-12 Supervised learning.

During training, the input vector is presented to the network, which results in an output vector. This output vector is the actual output vector. Then the actual output vector is compared with the desired (target) output vector. If there exists a difference between the two output vectors, then an error signal is generated by the network. Hence, the error signal is used for the adjustment of weights until the actual output matches the desired output.
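
The supervised loop just described can be sketched as follows. The specific correction rule (an error-scaled update in the style of the delta rule) and all numeric values are illustrative assumptions; the text only requires that the error signal drive the weight adjustment:

    import numpy as np

    x = np.array([1.0, 0.5])       # input vector (assumed)
    t = 1.0                        # desired (target) output
    w = np.zeros(2)                # initial weights
    alpha = 0.1                    # learning rate

    for _ in range(20):
        y = float(np.dot(x, w))    # actual output (identity activation assumed)
        error = t - y              # error signal from comparing target and actual
        w += alpha * error * x     # adjust weights to reduce the error
    print(w, float(np.dot(x, w)))  # the output approaches the target 1.0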
2.3.2.2 Unsupervised Learning

The learning here is performed without the help of a teacher. Consider the learning process of a tadpole: it learns by itself. A child fish learns to swim by itself; it is not taught by its mother. Thus, its learning process is independent and is not supervised by a teacher. In ANNs following unsupervised learning, the input vectors of similar type are grouped without the use of training data to specify how a typical member of each group looks or to which group a member belongs. In the training process, the network receives the input patterns and organizes these patterns to form clusters. When a new input pattern is applied, the neural network gives an output response indicating the class to which the input pattern belongs. If, for an input, a pattern class cannot be found, then a new class is generated. The block diagram of unsupervised learning is shown in Figure 2-13.

Figure 2-13 Unsupervised learning.

From Figure 2-13 it is clear that there is no feedback from the environment to inform what the outputs should be or whether the outputs are correct. In this case, the network must itself discover patterns, regularities, features or categories from the input data and relations for the input data over the output. While discovering all these features, the network undergoes changes in its parameters. This process is called self-organizing, in which exact clusters will be formed by discovering similarities and dissimilarities among the objects.

2.3.2.3 Reinforcement Learning

This learning process is similar to supervised learning. In the case of supervised learning, the correct target output values are known for each input pattern. But, in some cases, less information might be available. For example, the network might be told that its actual output is only "50% correct" or so. Thus, here only critic information is available, not the exact information. The learning based on this critic information is called reinforcement learning and the feedback sent is called the reinforcement signal. The block diagram of reinforcement learning is shown in Figure 2-14.

Figure 2-14 Reinforcement learning.

Reinforcement learning is a form of supervised learning because the network receives some feedback from its environment. However, the feedback obtained here is only evaluative and not instructive. The external reinforcement signals are processed in the critic signal generator, and the obtained critic signals are sent to the ANN for the adjustment of weights properly so as to get better critic feedback in future. Reinforcement learning is also called learning with a critic, as opposed to learning with a teacher, which indicates supervised learning. So, now you have a fair understanding of the three generalized learning rules used in the training process of ANNs.

2.3.3 Activation Functions

To better understand the role of the activation function, let us assume a person is performing some work. To make the work more efficient and to obtain the exact output, some force or activation may be given. This activation helps in achieving the exact output. In a similar way, the activation function is applied over the net input to calculate the output of an ANN.

The information processing of a processing element can be viewed as consisting of two major parts: input and output. An integration function (say f) is associated with the input of a processing element. This function serves to combine activation, information or evidence from an external source or other processing elements into a net input to the processing element. The nonlinear activation function is used to ensure that a neuron's response is bounded - that is, the actual response of the neuron is conditioned or dampened as a result of large or small activating stimuli and is thus controllable.

Certain nonlinear functions are used to achieve the advantages of a multilayer network over a single-layer network. When a signal is fed through a multilayer network with linear activation functions, the output obtained remains the same as that which could be obtained using a single-layer network. Due to this reason, nonlinear functions are widely used in multilayer networks compared to linear functions.

There are several activation functions. Let us discuss a few in this section:

1. Identity function: It is a linear function and can be defined as

   f(x) = x  for all x

   The output here remains the same as the input. The input layer uses the identity activation function.

2. Binary step function: This function can be defined as

   f(x) = 1 if x ≥ θ; 0 if x < θ

   where θ represents the threshold value. This function is most widely used in single-layer nets to convert the net input to an output that is binary (1 or 0).

3. Bipolar step function: This function can be defined as

   f(x) = 1 if x ≥ θ; -1 if x < θ

   where θ represents the threshold value. This function is also used in single-layer nets to convert the net input to an output that is bipolar (+1 or -1).

4. Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of the relationship between the value of the function at a point and the value of the derivative at that point, which reduces the computational burden during training. Sigmoidal functions are of two types:

   • Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar sigmoid function. It can be defined as

     f(x) = 1 / (1 + e^(-λx))

     where λ is the steepness parameter. The derivative of this function is

     f'(x) = λ f(x)[1 - f(x)]

     Here the range of the sigmoid function is from 0 to 1.

   • Bipolar sigmoid function: This function is defined as

     f(x) = 2 / (1 + e^(-λx)) - 1 = (1 - e^(-λx)) / (1 + e^(-λx))

     where λ is the steepness parameter and the sigmoid function range is between -1 and +1. The derivative of this function is

     f'(x) = (λ/2) [1 + f(x)][1 - f(x)]

     The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as

     h(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = (1 - e^(-2x)) / (1 + e^(-2x))

     The derivative of the hyperbolic tangent function is

     h'(x) = [1 + h(x)][1 - h(x)]

     If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal activation function or the hyperbolic tangent function.
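
These derivative identities are what make sigmoids convenient in back-propagation: the derivative is recovered from the function value alone. A quick numerical check of f'(x) = λf(x)[1 - f(x)] (the test point, step size and λ are arbitrary choices for the sketch):

    import math

    lam = 1.5                                        # steepness parameter (assumed)
    f = lambda x: 1.0 / (1.0 + math.exp(-lam * x))   # binary sigmoid

    x, h = 0.53, 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)        # central-difference derivative
    analytic = lam * f(x) * (1.0 - f(x))             # lam * f(x) * [1 - f(x)]
    print(abs(numeric - analytic) < 1e-6)            # True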
5. Ramp function: The ramp function is defined as

   f(x) = 1 if x > 1; x if 0 ≤ x ≤ 1; 0 if x < 0

The graphical representations of all these activation functions are shown in Figure 2-15(A)-(F).

Figure 2-15 Depiction of activation functions: (A) identity function; (B) binary step function; (C) bipolar step function; (D) binary sigmoidal function; (E) bipolar sigmoidal function; (F) ramp function.

2.4 Important Terminologies of ANNs

This section introduces you to the various terminologies related to ANNs.

2.4.1 Weights

In the architecture of an ANN, each neuron is connected to other neurons by means of directed communication links, and each communication link is associated with weights. The weights contain information about the input signal. This information is used by the net to solve a problem. The weights can be represented in terms of a matrix; the weight matrix can also be called the connection matrix. To form a mathematical notation, it is assumed that there are "n" processing elements in an ANN and each processing element has exactly "m" adaptive weights. Thus, the weight matrix W is defined by

W = [w_1^T; w_2^T; ...; w_n^T] =
    | w11  w12  ···  w1m |
    | w21  w22  ···  w2m |
    |  ·    ·    ·    ·  |
    | wn1  wn2  ···  wnm |

where w_i = [w_i1, w_i2, ..., w_im]^T, i = 1, 2, ..., n, is the weight vector of processing element i, and w_ij is the weight from processing element "i" (source node) to processing element "j" (destination node).

If the weight matrix W contains all the adaptive elements of an ANN, then the set of all W matrices will determine the set of all possible information processing configurations for this ANN. The ANN can be realized by finding an appropriate matrix W. Hence, the weights encode long-term memory (LTM) and the activation states of neurons encode short-term memory (STM) in a neural network.

2.4.2 Bias

The bias included in the network has its impact in calculating the net input. The bias is included by adding a component x0 = 1 to the input vector. Thus, the input vector becomes

X = (1, x1, ..., xi, ..., xn)

The bias is considered like another weight, that is, w_0j = b_j. Consider the simple network shown in Figure 2-16 with bias. From Figure 2-16, the net input to the output neuron Y_j is calculated as

y_inj = Σ_{i=0}^{n} x_i w_ij = x0·w_0j + x1·w_1j + x2·w_2j + ··· + xn·w_nj
      = w_0j + Σ_{i=1}^{n} x_i w_ij

y_inj = b_j + Σ_{i=1}^{n} x_i w_ij
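
Treating the bias as the weight w_0j on a constant input x0 = 1 makes the two formulas above compute the same quantity, as this small sketch shows (the numeric values are the ones used in Solved Problem 2 later in the chapter):

    import numpy as np

    x = np.array([0.2, 0.6])          # inputs x1, x2
    w = np.array([0.3, 0.7])          # weights w_1j, w_2j
    b = 0.45                          # bias b_j

    y_in = b + np.dot(x, w)           # y_inj = b_j + sum_i x_i * w_ij

    # Equivalent form: absorb the bias as weight w_0j on the constant input x0 = 1
    x_aug = np.concatenate(([1.0], x))
    w_aug = np.concatenate(([b], w))
    assert np.isclose(y_in, np.dot(x_aug, w_aug))
    print(y_in)                       # 0.93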
Figure 2-16 Simple net with bias.

The activation function discussed in Section 2.3.3 is applied over this net input to calculate the output. The bias can also be explained as follows: consider the equation of a straight line,

y = mx + c

where x is the input, m is the weight, c is the bias and y is the output. The equation of the straight line can also be represented as a block diagram, shown in Figure 2-17. Thus, bias plays a major role in determining the output of the network.

Figure 2-17 Block diagram for straight line.

The bias can be of two types: positive bias and negative bias. The positive bias helps in increasing the net input of the network and the negative bias helps in decreasing the net input of the network. Thus, as a result of the bias effect, the output of the network can be varied.

2.4.3 Threshold

Threshold is a set value based upon which the final output of the network may be calculated. The threshold value is used in the activation function. A comparison is made between the calculated net input and the threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a direct current (DC) motor. If its maximum speed is 1500 rpm, then the threshold based on the speed is 1500 rpm. If the motor is run at a speed higher than its set threshold, it may damage the motor coils. Similarly, in neural networks, based on the threshold value, the activation functions are defined and the output is calculated. The activation function using the threshold can be defined as

f(net) = 1 if net ≥ θ; -1 if net < θ

where θ is the fixed threshold value.

2.4.4 Learning Rate

The learning rate is denoted by "α." It is used to control the amount of weight adjustment at each step of training. The learning rate, ranging from 0 to 1, determines the rate of learning at each time step.

2.4.5 Momentum Factor

Convergence is made faster if a momentum factor is added to the weight updation process. This is generally done in the back-propagation network. If momentum has to be used, the weights from one or more previous training patterns must be saved. Momentum helps the net make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns.

2.4.6 Vigilance Parameter

2.4.7 Notations

The notations mentioned in this section have been used in this textbook for explaining each network.

x_i: Activation of unit X_i, input signal.
y_j: Activation of unit Y_j, y_j = f(y_inj).
w_ij: Weight on connection from unit X_i to unit Y_j.
b_j: Bias acting on unit j. Bias has a constant activation of 1.
W: Weight matrix, W = {w_ij}.
y_inj: Net input to unit Y_j given by y_inj = b_j + Σ_i x_i w_ij.
||x||: Norm or magnitude of vector X.
θ_j: Threshold for activation of neuron Y_j.
s: Training input vector, s = (s_1, ..., s_i, ..., s_n).
t: Training output vector, t = (t_1, ..., t_j, ..., t_n).
x: Input vector, x = (x_1, ..., x_i, ..., x_n).
Δw_ij: Change in weights given by Δw_ij = w_ij(new) - w_ij(old).
α: Learning rate; it controls the amount of weight adjustment at each step of training.

2.5 McCulloch-Pitts Neuron

2.5.1 Theory

The McCulloch-Pitts neuron was the earliest neural network, discovered in 1943. It is usually called the M-P neuron. The M-P neurons are connected by directed weighted paths. It should be noted that the activation of an M-P neuron is binary, that is, at any time step the neuron may fire or may not fire. The weights associated with the communication links may be excitatory (weight is positive) or inhibitory (weight is negative). All the
excitatory connected weights entering into a particular neuron will have the same weights. The threshold plays a major role in the M-P neuron: there is a fixed threshold for each neuron, and if the net input to the neuron is greater than the threshold, then the neuron fires. Also, it should be noted that any nonzero inhibitory input would prevent the neuron from firing. The M-P neurons are most widely used in the case of logic functions.

2.5.2 Architecture

A simple M-P neuron is shown in Figure 2-18. As already discussed, the M-P neuron has both excitatory and inhibitory connections. It is excitatory with weight (w > 0) or inhibitory with weight -p (p < 0). In Figure 2-18, inputs from X1 to Xn possess excitatory weighted connections and inputs from Xn+1 to Xn+m possess inhibitory weighted interconnections. Since the firing of the output neuron is based upon the threshold, the activation function here is defined as

f(y_in) = 1 if y_in ≥ θ; 0 if y_in < θ

For inhibition to be absolute, the threshold with the activation function should satisfy the following condition:

θ > nw - p

The output will fire if it receives, say, "k" or more excitatory inputs but no inhibitory inputs, where

kw ≥ θ > (k - 1)w

The M-P neuron has no particular training algorithm. An analysis has to be performed to determine the values of the weights and the threshold. Here the weights of the neuron are set along with the threshold to make the neuron perform a simple logic function. The M-P neurons are used as building blocks on which we can model any function or phenomenon that can be represented as a logic function.

Figure 2-18 McCulloch-Pitts neuron model.

2.6 Linear Separability

An ANN does not give an exact solution for a nonlinear problem. However, it provides possible approximate solutions to nonlinear problems. Linear separability is the concept wherein the separation of the input space into regions is based on whether the network response is positive or negative.

A decision line is drawn to separate the positive and negative responses. The decision line may also be called the decision-making line, decision-support line or linear-separable line. The necessity of the linear separability concept was felt in order to classify the patterns based upon their output responses. Generally, the net input calculated to the output unit is given as

y_in = b + Σ_{i=1}^{n} x_i w_i

For example, if the bipolar step activation function is used over the calculated net input (y_in), then the value of the function is 1 for a positive net input and -1 for a negative net input. Also, it is clear that there exists a boundary between the regions where y_in > 0 and y_in < 0. This region may be called the decision boundary and can be determined by the relation

b + Σ_{i=1}^{n} x_i w_i = 0

On the basis of the number of input units in the network, the above equation may represent a line, a plane or a hyperplane. The linear separability of the network is based on the decision-boundary line. If there exist weights (with bias) for which all the training input vectors having positive (correct) response, +1, lie on one side of the decision boundary and all the other vectors having negative (incorrect) response, -1, lie on the other side of the decision boundary, then we can conclude that the problem is "linearly separable."

Consider a single-layer network as shown in Figure 2-19, with bias included. The net input for the network shown in Figure 2-19 is given as

y_in = b + x1w1 + x2w2

The separating line for which the boundary lies between the values x1 and x2, so that the net gives a positive response on one side and a negative response on the other side, is given as

b + x1w1 + x2w2 = 0

Figure 2-19 A single-layer neural net.
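
A McCulloch-Pitts unit with absolute inhibition can be sketched directly from the rules above; the AND configuration (w = 1, θ = 2) anticipates the analysis worked by hand in the solved problems:

    def mp_neuron(excitatory, inhibitory, w=1, theta=2):
        # Fires (returns 1) only if no inhibitory input is active and
        # the excitatory net input reaches the threshold theta.
        if any(inhibitory):          # any nonzero inhibitory input blocks firing
            return 0
        y_in = w * sum(excitatory)   # all excitatory links share the same weight
        return 1 if y_in >= theta else 0

    # AND function: fires only when both excitatory inputs are 1 (theta = 2)
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, mp_neuron([x1, x2], []))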
If the weight w2 is not equal to 0, then we get

x2 = -(w1/w2) x1 - b/w2

Thus, the requirement for the positive response of the net is

x1w1 + x2w2 + b > 0

During the training process, the values of w1, w2 and b are determined so that the net will produce a positive (correct) response for the training data. If, on the other hand, a threshold value is being used, then the condition for obtaining the positive response from the output unit is

net input received > θ (threshold)
y_in > θ
x1w1 + x2w2 > θ

The separating line equation will then be

x1w1 + x2w2 = θ
x2 = -(w1/w2) x1 + θ/w2   (with w2 ≠ 0)

During the training process, the values of w1 and w2 have to be determined so that the net will have a correct response to the training data. For this correct response, the line passes close to the origin. In certain situations, even for a correct response, the separating line does not pass through the origin.

Consider a network having a positive response in the first quadrant and a negative response in all other quadrants (AND function) with either binary or bipolar data; then the decision line is drawn separating the positive response region from the negative response region. This is depicted in Figure 2-20. Thus, based on the conditions discussed above, the equation of this decision line may be obtained.

Figure 2-20 Decision boundary line.

Also, in all the networks that we will be discussing, the representation of data plays a major role. The data representation mode has to be decided - whether it would be in binary form or in bipolar form. It may be noted that the bipolar representation is better than the binary representation: using bipolar data, missing values can be represented by 0 and mistakes by reversing the input value from +1 to -1, or vice versa.

2.7 Hebb Network

2.7.1 Theory

For a neural net, the Hebb learning rule is a simple one. Let us understand it. Donald Hebb stated in 1949 that in the brain, learning is performed by the change in the synaptic gap. Hebb explained it: "When an axon of cell A is near enough to excite cell B, and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both the cells such that A's efficiency, as one of the cells firing B, is increased."

According to the Hebb rule, the weight vector is found to increase proportionately to the product of the input and the learning signal. Here the learning signal is equal to the neuron's output. In Hebb learning, if two interconnected neurons are "on" simultaneously, then the weights associated with these neurons can be increased by the modification made in their synaptic gap (strength). The weight update in the Hebb rule is given by

w_i(new) = w_i(old) + x_i·y

The Hebb rule is more suited for bipolar data than binary data. If binary data is used, the above weight updation formula cannot distinguish two conditions, namely:

1. A training pair in which an input unit is "on" and the target value is "off."
2. A training pair in which both the input unit and the target value are "off."

Thus, there are limitations in Hebb rule application over binary data. Hence, the representation using bipolar data is advantageous.

2.7.2 Flowchart of Training Algorithm

The training algorithm is used for the calculation and adjustment of weights. The flowchart for the training algorithm of the Hebb network is given in Figure 2-21. The notations used in the flowchart have already been discussed in Section 2.4.7. In Figure 2-21, s : t refers to each training input and target output pair. Till there exists a pair of training input and target output, the training process takes place; else, it is stopped.

Figure 2-21 Flowchart of Hebb training algorithm.

2.7.3 Training Algorithm

The training algorithm of the Hebb network is given below:

Step 0: First initialize the weights. Basically, in this network they may be set to zero, i.e., w_i = 0 for i = 1 to n, where "n" may be the total number of input neurons.
Step 1: Steps 2-4 have to be performed for each input training vector and target output pair, s : t.
Step 2: Input unit activations are set. Generally, the activation function of the input layer is the identity function: x_i = s_i for i = 1 to n.
Step 3: Output unit activations are set: y = t.
Step 4: Weight adjustments and bias adjustments are performed:

   w_i(new) = w_i(old) + x_i·y
   b(new) = b(old) + y

The above five steps complete the algorithmic process. In Step 4, the weight updation formula can also be given in vector form as

w(new) = w(old) + x·y

Here the change in weight can be expressed as

Δw = x·y

As a result,

w(new) = w(old) + Δw

The Hebb rule can be used for pattern association, pattern categorization, pattern classification and over a range of other areas.
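
Steps 0-4 translate directly into code. A sketch for the bipolar AND data (the same training pairs worked by hand in Solved Problem 8 below):

    import numpy as np

    # Bipolar AND training pairs: each entry is ((x1, x2), target y)
    samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

    w = np.zeros(2)   # Step 0: initialize weights ...
    b = 0.0           # ... and bias to zero

    for (x1, x2), y in samples:      # Step 1: loop over each pair s : t
        x = np.array([x1, x2])       # Step 2: activate input units, x_i = s_i
        # Step 3: output unit activation is set to the target, y = t
        w = w + x * y                # Step 4: w_i(new) = w_i(old) + x_i * y
        b = b + y                    #         b(new)   = b(old) + y
        print(w, b)                  # traces [1 1] 1, [0 2] 0, [1 1] -1, [2 2] -2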
2.8 Summary

In this chapter we have discussed the basics of an ANN and its growth. A detailed comparison between the biological neuron and the artificial neuron has been included to enable the reader to understand the basic difference between them. An ANN is constructed with a few basic building blocks. The building blocks are based on the models of artificial neurons and the topology of a few basic structures. Concepts of supervised learning, unsupervised learning and reinforcement learning are briefly included in this chapter. Various activation functions and different types of layered connections are also considered here. The basic terminologies of ANNs are discussed with their typical values. A brief description of the McCulloch-Pitts neuron model is provided. The concept of linear separability is discussed and illustrated with suitable examples. Details are provided for the effective training of a Hebb network.

2.9 Solved Problems

1. For the network shown in Figure 1, calculate the net input to the output neuron.

Figure 1 Neural net.

Solution: The given neural net consists of three input neurons and one output neuron. The inputs and weights are

[x1, x2, x3] = [0.3, 0.5, 0.6]
[w1, w2, w3] = [0.2, 0.1, -0.3]

The net input can be calculated as

y_in = x1w1 + x2w2 + x3w3
     = 0.3 × 0.2 + 0.5 × 0.1 + 0.6 × (-0.3)
     = 0.06 + 0.05 - 0.18 = -0.07

2. Calculate the net input for the network shown in Figure 2, with bias included in the network.

Figure 2 Simple neural net.

Solution: The given net consists of two input neurons, a bias and an output neuron. The inputs are
[x1, x2] = [0.2, 0.6] and the weights are [w1, w2] = [0.3, 0.7]. Since the bias is included, b = 0.45 and the bias input x0 is equal to 1. The net input is calculated as

y_in = b + x1w1 + x2w2
     = 0.45 + 0.2 × 0.3 + 0.6 × 0.7
     = 0.45 + 0.06 + 0.42 = 0.93

Therefore y_in = 0.93 is the net input.

3. Obtain the output of the neuron Y for the network shown in Figure 3 using activation functions as: (i) binary sigmoidal and (ii) bipolar sigmoidal.

Figure 3 Neural net.

Solution: The given network has three input neurons with bias and one output neuron. These form a single-layer network. The inputs are [x1, x2, x3] = [0.8, 0.6, 0.4] and the weights are [w1, w2, w3] = [0.1, 0.3, -0.2], with bias b = 0.35 (its input is always 1).

The net input to the output neuron is

y_in = b + Σ_{i=1}^{3} x_i w_i   [n = 3, because only 3 input neurons are given]
     = b + x1w1 + x2w2 + x3w3
     = 0.35 + 0.8 × 0.1 + 0.6 × 0.3 + 0.4 × (-0.2)
     = 0.35 + 0.08 + 0.18 - 0.08 = 0.53

(i) For the binary sigmoidal activation function,

y = f(y_in) = 1 / (1 + e^(-y_in)) = 1 / (1 + e^(-0.53)) = 0.629

(ii) For the bipolar sigmoidal activation function,

y = f(y_in) = 2 / (1 + e^(-y_in)) - 1 = 2 / (1 + e^(-0.53)) - 1 = 0.259
Table 1
Thus, the output of neuron Y can be written as .
0.1 0.35 Xi X2 y ·' y;,=XIW] +l.11V1
1 1 ... ]\
0.6 x,l o.3 ;r y 1 0 0 l ify,.?-2 "'; For inputs
0 1 0 y = f(y;,) = 0 if y;, < 2 j \ ..
0 0 0 \ (1, 1), Yin= 1 X 1 +l X1= 2
/ \""
-0.2 0 (1, 0), Yin= 1 X 1+0 X I= 1
0.4
x, In McCulloch-Pires neuron, only analysis is being where "2" represents che threshold value. (0, 1), Yiu = 0 X 1+1 X 1= 1
performed. Hence, assume che weights be WI = 1
Figure 3 Neural ner. and w1 = 1. The network architecture is shown in ..--- 5. lmplemem ANDNOT function using (0, 0), Yitl = 0 X 1+0 X 1= 0
Figure 4. Wiili chese assumed weights, che nee input McCulloch-Pirrs neuron (use binary data
is calculated for foul inputs: For inputs representation). From the calculated net inputs, it is not possible co
Solution: The given nerwork has three input neu- fire ilie neuron for input (1, 0) only. Hence, t~ese J-.
rons with bias and one output neuron. These form (1,1), y;n=xiwt+X2wz=l x 1+1 xI =2 Solution: In the case of ANDNOT funcrion, the weights are norsUirable. IJI-il'b'l
1\(Jp / . /·
a single-layer network. The inpulS are given as response is true if the first input is true and the Assume one weight as excitato\Y and the qther as --\-- rr
(l,O), Yi11 =XJWJ +X2Wz = 1 X 1 +0 X 1= 1
[xi>X2•X3] = [0.8,0.6,0.4] and the weigh<S are second input is fa1se. For all ocher input variations, inhibitory, i.e., ,.. ' l,t-) ...tit'
[w 1, w,, w3] = [0.1, 0.3, -0.2] with bias b = 0.35
(Q, 1), Ji• = XJ Wj +X2W2 = 1+ 1 X 1 = 1
0 X
rhe response is fa1se. The truth cable for AND NOT
(0,0), )'in =XIWl +X2W2 = 0 X 1 +OX 1 = 0 WI =1, wz=-1
(irs input is always 1). function is given in Table 2.
Now calculate the net input. For the inputs

(1, 1), y_in = 1 × 1 + 1 × (-1) = 0
(1, 0), y_in = 1 × 1 + 0 × (-1) = 1
(0, 1), y_in = 0 × 1 + 1 × (-1) = -1
(0, 0), y_in = 0 × 1 + 0 × (-1) = 0

From the calculated net inputs, it is now possible to fire the neuron for input (1, 0) only, by fixing a threshold of 1, i.e., θ ≥ 1 for the Y unit. Thus,

w1 = 1; w2 = -1; θ ≥ 1

Note: The value of θ is calculated using the following:

θ ≥ nw - p
θ ≥ 2 × 1 - 1   [for "p", only the magnitude is considered]
θ ≥ 1

Thus, the output of neuron Y can be written as

y = f(y_in) = 1 if y_in ≥ 1; 0 if y_in < 1

6. Implement XOR function using McCulloch-Pitts neuron (consider binary data).

Solution: The truth table for the XOR function is given in Table 3.

Table 3
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

In this case, the output is "ON" only for an odd number of 1's; for the rest it is "OFF." The XOR function cannot be represented by a simple single logic function; it is represented as

y = z1 + z2

where

z1 = x1·x̄2   (function 1)
z2 = x̄1·x2   (function 2)
y = z1 OR z2  (function 3)

A single-layer net is not sufficient to represent the function; an intermediate layer is necessary.

Figure 6 Neural net for XOR function (the weights shown are obtained after analysis).

First function (z1 = x1·x̄2): The truth table for function z1 is shown in Table 4.

Table 4
x1  x2  z1
0   0   0
0   1   0
1   0   1
1   1   0

The net representation is given as follows.

Case 1: Assume both weights as excitatory, i.e.,

w11 = w21 = 1

Calculate the net inputs. For inputs

(0, 0), z1_in = 0 × 1 + 0 × 1 = 0
(0, 1), z1_in = 0 × 1 + 1 × 1 = 1
(1, 0), z1_in = 1 × 1 + 0 × 1 = 1
(1, 1), z1_in = 1 × 1 + 1 × 1 = 2

Hence, it is not possible to obtain function z1 using these weights.

Case 2: Assume one weight as excitatory and the other as inhibitory, i.e.,

w11 = 1; w21 = -1

Calculate the net inputs. For inputs

(0, 0), z1_in = 0 × 1 + 0 × (-1) = 0
(0, 1), z1_in = 0 × 1 + 1 × (-1) = -1
(1, 0), z1_in = 1 × 1 + 0 × (-1) = 1
(1, 1), z1_in = 1 × 1 + 1 × (-1) = 0

On the basis of this calculated net input, it is possible to get the required output. Hence,

w11 = 1; w21 = -1; θ ≥ 1 for the z1 neuron

Figure 7 Neural net for Z1.

Second function (z2 = x̄1·x2): The truth table for function z2 is shown in Table 5.

Table 5
x1  x2  z2
0   0   0
0   1   1
1   0   0
1   1   0

Case 1: Assume both weights as excitatory, i.e.,

w12 = w22 = 1

Now calculate the net inputs. For the inputs

(0, 0), z2_in = 0 × 1 + 0 × 1 = 0
(0, 1), z2_in = 0 × 1 + 1 × 1 = 1
(1, 0), z2_in = 1 × 1 + 0 × 1 = 1
(1, 1), z2_in = 1 × 1 + 1 × 1 = 2

Hence, it is not possible to obtain function z2 using these weights.

Case 2: Assume one weight as excitatory and the other as inhibitory, i.e.,

w12 = -1; w22 = 1

Now calculate the net inputs. For the inputs

(0, 0), z2_in = 0 × (-1) + 0 × 1 = 0
(0, 1), z2_in = 0 × (-1) + 1 × 1 = 1
(1, 0), z2_in = 1 × (-1) + 0 × 1 = -1
(1, 1), z2_in = 1 × (-1) + 1 × 1 = 0

On the basis of this calculated net input, it is possible to get the required output. Hence,

w12 = -1; w22 = 1; θ ≥ 1 for the z2 neuron

Figure 8 Neural net for Z2.

Third function (y = z1 OR z2): The truth table for this function is shown in Table 6.

Table 6
x1  x2  z1  z2  y
0   0   0   0   0
0   1   0   1   1
1   0   1   0   1
1   1   0   0   0

Here the net input is calculated using

y_in = z1·v1 + z2·v2

Case 1: Assume both weights as excitatory, i.e.,

v1 = v2 = 1

Now calculate the net input. For inputs

(0, 0), y_in = 0 × 1 + 0 × 1 = 0
(0, 1), y_in = 0 × 1 + 1 × 1 = 1
(1, 0), y_in = 1 × 1 + 0 × 1 = 1
(1, 1), y_in = 0 × 1 + 0 × 1 = 0   (because for x1 = 1 and x2 = 1, z1 = 0 and z2 = 0)
38 Artificial Neural Network: An lntn:iduction 2.9 Solved Problems 39

(Figure 9: Neural net for y = z1 OR z2.) Using a threshold of θ ≥ 1 with v1 = v2 = 1, the output unit realizes y = z1 OR z2, and the net is recognized. Therefore, the analysis made for the XOR function using M-P neurons gives the weights

w11 = w22 = 1 (excitatory)
w12 = w21 = -1 (inhibitory)
v1 = v2 = 1 (excitatory)

where the threshold is taken as 1 (θ = 1) based on the calculated net input.

7. Using the linear separability concept, obtain the response for OR function (take bipolar inputs and bipolar targets).

Solution: Table 7 is the truth table for OR function with bipolar inputs and targets.

Table 7
x1    x2    y
 1     1    1
 1    -1    1
-1     1    1
-1    -1   -1

The truth table inputs and corresponding outputs have been plotted in Figure 10. If the output is 1, it is denoted as "+", else "-". Assuming the boundary passes through the coordinates (-1, 0) and (0, -1), taken as (x1, y1) and (x2, y2), the slope "m" of the straight line can be obtained as

m = (y2 - y1)/(x2 - x1) = (-1 - 0)/(0 + 1) = -1/1 = -1

We now calculate c:

c = y1 - m x1 = 0 - (-1)(-1) = -1

Using these values, the equation for the line is given as

y = mx + c = (-1)x - 1 = -x - 1

Here the quadrants are not x and y but x1 and x2, so the above equation becomes

x2 = -x1 - 1                                          (2.1)

In terms of the weights and bias of the net, the separating line can be written as

x2 = -(w1/w2) x1 - b/w2                               (2.2)

Comparing Eqs. (2.1) and (2.2), we get w1 = 1, w2 = 1 and b = 1. Calculating the net input and output of the OR function on the basis of these weights and bias, we get the entries in Table 8.

Table 8
x1    x2    yin = b + x1 w1 + x2 w2    y
 1     1     3                          1
 1    -1     1                          1
-1     1     1                          1
-1    -1    -1                         -1

Thus, the output of neuron Y can be written as

y = f(yin) = 1 if yin ≥ 1; 0 if yin < 1

where the threshold is taken as 1 (θ = 1) based on the calculated net input. Hence, using the linear separability concept, the response is obtained for the "OR" function. (Figure 10: Graph for OR function.)

8. Design a Hebb net to implement logical AND function (use bipolar inputs and targets).

Solution: The training data for the AND function is given in Table 9.

Table 9
Inputs           Target
x1    x2    b    y
 1     1    1     1
 1    -1    1    -1
-1     1    1    -1
-1    -1    1    -1

The network is trained using the Hebb network training algorithm discussed in Section 2.7.3. Initially the weights and bias are set to zero, i.e.,

w1 = w2 = b = 0

First input [x1 x2 b] = [1 1 1] and target = 1 (i.e., y = 1): Setting the initial weights as old weights and applying the Hebb rule wi(new) = wi(old) + xi y, we get

w1(new) = w1(old) + x1 y = 0 + 1 × 1 = 1
w2(new) = w2(old) + x2 y = 0 + 1 × 1 = 1
b(new) = b(old) + y = 0 + 1 = 1

The weights calculated above are the final weights obtained after presenting the first input; they become the initial weights when the second input pattern is presented. The weight change here is Δwi = xi y, so the weight changes relating to the first input are

Δw1 = x1 y = 1 × 1 = 1;  Δw2 = x2 y = 1 × 1 = 1;  Δb = y = 1

Second input [x1 x2 b] = [1 -1 1] and target y = -1: The initial or old weights here are the final (new) weights obtained by presenting the first input pattern, i.e.,

[w1 w2 b] = [1 1 1]

The weight changes here are

Δw1 = x1 y = 1 × -1 = -1
Δw2 = x2 y = -1 × -1 = 1
Δb = y = -1

The new weights are

w1(new) = w1(old) + Δw1 = 1 - 1 = 0
w2(new) = w2(old) + Δw2 = 1 + 1 = 2
b(new) = b(old) + Δb = 1 - 1 = 0

Similarly, by presenting the third and fourth input patterns, the new weights can be calculated. Table 10 shows the values of the weights for all inputs.

Table 10
Inputs              Weight changes        Weights
x1   x2   b    y    Δw1   Δw2   Δb        w1   w2    b
                                          (0    0    0)
 1    1   1    1      1     1    1         1    1    1
 1   -1   1   -1     -1     1   -1         0    2    0
-1    1   1   -1      1    -1   -1         1    1   -1
-1   -1   1   -1      1     1   -1         2    2   -2
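The pass through Table 10 can be verified with a few lines of code. The following Python sketch is our own illustration (the array layout and variable names are assumptions, not part of the original text):

```python
# Hebb rule for the bipolar AND function: w_i(new) = w_i(old) + x_i * y, b(new) = b(old) + y
import numpy as np

# Bipolar training pairs from Table 9: columns x1, x2; targets y
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])

w = np.zeros(2)  # weights w1, w2 start at zero
b = 0.0          # bias starts at zero

for x, y in zip(X, T):
    w += x * y   # delta_w_i = x_i * y
    b += y       # delta_b = y
    print(f"after input {x}, target {y}: w = {w}, b = {b}")

# The final pass reproduces Table 10: w = [2, 2], b = -2.
# Separating line: x2 = -(w1/w2) x1 - b/w2 = -x1 + 1.
```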

For each training pair, the separating line is obtained from the weights available after that pattern:

x2 = -(w1/w2) x1 - b/w2

For the first input [1 1 1], the separating line is given by

x2 = -(1/1) x1 - 1/1  =>  x2 = -x1 - 1

Similarly, for the second input [1 -1 1], the separating line is

x2 = -(0/2) x1 - 0/2  =>  x2 = 0

For the third input [-1 1 1], it is

x2 = -(1/1) x1 - (-1)/1  =>  x2 = -x1 + 1

Finally, for the fourth input [-1 -1 1], the separating line is

x2 = -(2/2) x1 - (-2)/2  =>  x2 = -x1 + 1

The graphs for each of these separating lines are shown in Figure 11. In this figure, the "+" mark is used for output "1" and the "-" mark for output "-1". From Figure 11 it can be noticed that, for the first input, the decision boundary differentiates only the first and fourth inputs, and not all negative responses are separated from positive responses. When the second input pattern is presented, the decision boundary separates (1, 1) from (1, -1) and (-1, -1), but not (-1, 1). The boundary line is the same for both the third and fourth training pairs, and the decision boundary line obtained from these input training pairs separates the positive response region from the negative response region. Hence, the weights obtained from this are the final weights and are given as

w1 = 2;  w2 = 2;  b = -2

The network can be represented as shown in Figure 12. (Figure 11: Decision boundary for AND function using Hebb rule for each training pair. Figure 12: Hebb net for AND function.)

9. Design a Hebb net to implement OR function (consider bipolar inputs and targets).

Solution: The training pairs for the OR function are given in Table 11.

Table 11
Inputs           Target
x1    x2    b    y
 1     1    1     1
 1    -1    1     1
-1     1    1     1
-1    -1    1    -1

Initially the weights and bias are set to zero, i.e.,

w1 = w2 = b = 0

The network is trained using the Hebb training algorithm discussed in Section 2.7.3, and the final weights are obtained. The weights are considered final if the boundary line obtained from them separates the positive response region from the negative response region. By presenting all the input patterns, the weights are calculated. Table 12 shows the weights calculated for all the inputs.

Table 12
Inputs              Weight changes        Weights
x1   x2   b    y    Δw1   Δw2   Δb        w1   w2    b
                                          (0    0    0)
 1    1   1    1      1     1    1         1    1    1
 1   -1   1    1      1    -1    1         2    0    2
-1    1   1    1     -1     1    1         1    1    3
-1   -1   1   -1      1     1   -1         2    2    2

The final weights after presenting all the input patterns are

w1 = 2;  w2 = 2;  b = 2

Using the final weights, the boundary line equation can be obtained. The separating line equation is

x2 = -(w1/w2) x1 - b/w2 = -(2/2) x1 - 2/2 = -x1 - 1

The decision region for this net is shown in Figure 13. It is observed in Figure 13 that the straight line x2 = -x1 - 1 separates the pattern space into two regions. The input patterns [(1, 1), (1, -1), (-1, 1)] for which the output response is "1" lie on one side of the boundary, and the input pattern (-1, -1) for which the output response is "-1" lies on the other side of the boundary. The network can be represented as shown in Figure 14. (Figure 13: Decision boundary for OR function. Figure 14: Hebb net for OR function.)

10. Use the Hebb rule method to implement XOR function (take bipolar inputs and targets).

Solution: The training patterns for an XOR function are shown in Table 13.

Table 13
Inputs           Target
x1    x2    b    y
 1     1    1    -1
 1    -1    1     1
-1     1    1     1
-1    -1    1    -1

Here, a single-layer network with two input neurons, The XOR function can be made linearly separable by [J,'"2 = X2J = 1 X 1 = 1 w,(new) = w,(old) + x,y = 1 + 1 x -1 = 0
one bias and one output neuron is considered. In solving it in a manner as discussedjn Problem 6. This
method of solving will result in rwo decision bound- l:J.w3 = X3Y = 1 X 1= 1 w4(n.W) = w,(old) + X4J = -1 + 1 x -1 = -2
this case also, the initial weights are assumed to be
zero: ary lines for separating positive and negative regions
ofXOR function.
l:J.w4 =xv= -1 x 1 = -1 u:s(new) = ws(old) +xsy =I+ -1 x -1 = 2

WJ ==Wz='b=O l:J.w5 = XSJ = 1 X 1 = l wG(new) = WG(old) +xGy = -1 + 1 X -1 = -2


11. Using the Hebb rule, find the weights required to
perfotm the following classifications of the give it l:J.wG =XGJ= -1 X l = -1 107(new) = 107(old) +XJy = 1 + 1 x -1 = 0
By using the Hebb training algorithm,· the network is input patterns shown in Figure 16. The pattern
uained and the final weights are calculated as shown is shown as 3 x 3 matrix form in the squares. The
!J,107 = XJY = 1 X 1 = 1 W,(new) = w,(oJd) + XSJ = 1 + 1 X -1 = 0
in the following Table 14.
"+" sytpbols represent the value" 1" and empty l:J.wa=xsy=1xl=l
W<J(new) = w,(old) + x9y = 1 + 1 x -1 = 0
squares indicate "-1." Consider "I" belongs to
IJ,W<j =X<)y= 1 X 1= 1
Table 14 the members of class {so has target value 1) and b(new) = b(old) + y = 1 + 1 x -1 = 0
lnputs Weight changes Weights "0" does not belong to the members of class M=y= 1
--- (so has target value -1). The final weights after presenting rhc second input
l\w1 6.W]. !::J.b WI Wz b pattern are given as
Xj
"' b y We now cakulate the new weights using the formula
(0 0 0) W(newJ=[OOO -22 -20000]

§ill §ili
1 I -1 -1 -1 -1 -1 -1 -1 + +
w;(new) = wi(old) + l:J.wi The weights obtained are indicated in the Hebb net
1 -1 1 1 1 -1 1 0 -2 0 + + shown in Figure 17,
-1 1 1 1 -1 1 I -1 -1 1 Setting the old weights as the initial weights here,
we obrairt 12. Find the weights required to perform the follow-
-1-11-1 1 1 -1 0 0 0 + + +
ing classifications of given input patterns using
'I' ·o· the Hebb rule. The inpurs are "1" where''+"
WJ (new) = WJ (old) + l:J.w1 = 0 + 1 = 1
symbol is present and" -1 ''where"," is presem.
The final weights obtained after presenting aH the Figure 16 Data for input patterns. '"2(new) = '"2(old) + !J,'"2 = 0 + 1 = 1 "L" pattern belongs to the class (target value+ 1)
inpm pauerns do nm give correct output for all pat-
w,(new) = w3(o!d) + IJ,w3 = 0 + 1 = 1 and "U" pattern does not belong to the class
terns. Figure 15 shows that the input patterns are Solution: The training input patterns for the given (target value -1).
linearly non-separable. The graph shown in Figure 15 net (Figure 16) are indicated in Table 15.
indicates that the four input pairs that are present can- Similarly, calculating for other weights we get Solution: The training input patterns for Figure 18
not be divided by a single line m separate them into Table 15 are given in Table 16.
two regions. Thus XORfi.mcrion is a case of a panern Pattern Inputs Target W4(new) = -1, ws(new) = l, WG(new):::: -1,
classification problem, which is not linearly separable. WJ(new) = 1, wa(new) = 1, llJ9(new) = 1, Table 16
XI X2 X3 :Gj xs X6X7xaX9b y
1 1 -1 1 -1 1 1 I ·1 b(new) = 1 Pattern Inputs Target
X, 0 1 1 1 1 -1 1 1 I 1 1 -1 X3 X4 X) XG
X] X2. '-7XSX9 b J
')i!-/ The weights after presenting first input pattern are L 1-1-11-1-1
(-1, 1) (1,1)
\ ;,,_ Here a single-layer ne[Work with nine input netUons,
·P one bias and one output neuron is formed. Set rhe W(new) = [1 1 1 -1 1 -I 1 I 1 1] u -1 1 I -1 -1
+
' initial weights and bias to zero, i.e.,
IN::::\
x, booodmy u'") W] ::=W2=W3=W<i=Ws
Case 2: Now we present the second input pattern
(0). The initial weights used here are the final weights
obtained after presenting the fim input pa~ern. Here,
A single-layer ne[Work with nine input neurons, one
bias and one output neuron is formed. Set the initial
=wG =w-, =wa =llJ9 = b= 0 weights and bias tO zero, i.e.,
the weights are calculated as shown below (y = -1
+ wiclHheinitialweighrsbeing[1ll-11-ll1I1]). W]::=W2=W3=W<j:=W5
(-1, -1) (1, -1) Case 1: Presenting first input panern (I), we calculate
=wG=UJ?=wa=U19=b=O
change in weights: w;(new) = w;(old) + l:J.x; [l:J.w; = x;y] The weights are calculated using
X,
f:..w;=x,y, i= 1 to9 w,(new) = WJ(old) + XiJ =I+ 1 X -1 = 0
w;(new) = w;(old) + x;y
Figure 15 Decision boundary for XOR function.
f:..w 1 = XIJ = 1 X 1= } ""(new)= '"2(old) + x,y = 1 + 1 x -1 = 0

I

The calculated weights are given in Table 17.

x, Table 17
Tatge' Weights

X, "' "'-
X3 X4 X5
Inpuu

X6X7XSX9b ----
j WJ
(0
Uf2 Ul3 W4 W5
0 0 0 0
W6 W'J
0 0
Wg

0
U19
0 0)
b

-1 -1 1 -1 -1
,, -1 -1 1 -1 -1 1 1 1 1, '
-I 0 0 -2 0 0 -2 0 0 0 0
-I I I -I

.... The obtained weights are indicated in rhe Hebb net


The final weights after preseming the rwo input
~hown in Figure 19.
patterns are
y
X, WlnewJ~[OO -200-200001

"'
,, x,

,, ,,
,,
x 9
1 (x9

Figure 17 Hebb ner for the data matrix shown in Figure 16. ....
y
,,

' ' '


"'
' + + ,,

+ + + + "" x,
'
·e ·u·
Figure 18 Input clara for given parrerns. -- ,(X,
Figure 19 flebb ne< of Figure 18.
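Problems 11 and 12 use the same Hebb loop on flattened 3 × 3 patterns; the short Python check below is our own illustration (the array names are assumptions) and reproduces both final weight vectors:

```python
import numpy as np

# 3x3 patterns flattened row-wise; "+" -> 1, blank/"." -> -1
I_pat = np.array([1, 1, 1, -1, 1, -1, 1, 1, 1])
O_pat = np.array([1, 1, 1, 1, -1, 1, 1, 1, 1])
L_pat = np.array([1, -1, -1, 1, -1, -1, 1, 1, 1])
U_pat = np.array([1, -1, 1, 1, -1, 1, 1, 1, 1])

def hebb(patterns, targets):
    w, b = np.zeros(9), 0.0
    for x, y in zip(patterns, targets):
        w, b = w + x * y, b + y          # w_i(new) = w_i(old) + x_i * y
    return w, b

print(hebb([I_pat, O_pat], [1, -1]))     # -> [0 0 0 -2 2 -2 0 0 0], b = 0 (Problem 11)
print(hebb([L_pat, U_pat], [1, -1]))     # -> [0 0 -2 0 0 -2 0 0 0], b = 0 (Problem 12)
```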

I 2.10 Review Questions 2. Calculate the output of neuron Y for the net
shown in Figure 21. Use binary and bipolar
{b) Construct a recurrent network with four
input nodes, three hidden nodes and two output
l. Define an artificial neural network. 15. What is the necessity of activation function? sigmoidal activation functions. nodes that has feedback links from the hidden
layer to the input layer.
2. Srate ilie properties of the processing element of 16. List the commonly used accivation functions.
an artificial neural network. 6:. l)singlinear separability oo~cept, obtain the
17. What is me impact of weight in an anifidal .0.9
response for NAND funccion.
3. How many signals can be sent by a neuron at a neural network?
particular rime instant? 7. Design a Hebb net to implement logical AND
18. What is the mher name for weight? 0.7~ y
function with
4. Draw a simple artificial neuron and discuss dte 19. Define bias and threshold. (a) binary inputs and targets and
calculation of net input.
20. What is a learning rate parameter? (b) binary inputs and bipolar targets.
5. What is the influence of a linear equation over
21. How does a momentum factor make faster 8. Implement NOR function using Hebb net with
the net input calculation?
convergence of a network? Figure 21 Neural net. {a) bipolar inputs and targets and
6. List the main components of ilie biological (b) bipolar inputs and binary targets.
22. State the role of vigilance parameter iE:l ART 3. Design neural networks wiili only one M-P
neuron.
network. neuron that implements the three basic logic 9. Classify the input panerns shown in Figure 22
7. Compare and contrast biological neuron and using Hebb training algorithm.
artificial neuron. 23. Why is the McCu!loch-Pins neuron widely used operations:

8. Srate ilie characteristics of an artificial neural


in logic functions?
(i) NOT (x.J; + + + + + +
"
network. 24. Indicate the difference between excitatory and (ii) OR (x,, X2h + + +
inhibitory weighted interconnections. + + + + + +
9. Discuss in derail ilie historical development of (iii) NAND (x" "2), where x1 and"2 E {0, 1].
25. Define linear separability. + + +
artificial neural networks.
4. (a) Show that ilie derivative of unipolar sig- + + + + +
10. What are the basic models of an artificial neural 26. Justify- XOR function is non·linearly separable 'A' 'E'
moidal function is -1
network? by a single decision boundary line. Target value + 1

27. How can the equation ofa straight line be formed j'(x) =AJ(x)[1 - [(x)j
11. Define net architecmre and give ilS classifica·
tlons. using linear separability? Figure 22 Inpur panern.
(b) Show that the derivative of bipolar sigmoidal
12. Define learning. 28. In what ways is bipolar representation better rhan ftmcrion is
binary representation? 10. Using Hebb rule, find dte weighLS required ro
13. Differentiate beP.veen supervised and unsuper- A perform following classifications. The vecrors
vised learning. 29. Stare the uaining algorithm used for the Hebb /' (x) = 2[1 +f(x)][1 - [(x)]
(1 -1 1 -1) and (111-1) belong to class (target
nerwork.
14. How is the critic information used in the learning 5. {a) Construct a feed-forward nerwork wirh five ,aJue+1);,eetors(-1-11l)and(11-1-l)
process? 30. Compare feed·fonvard and feedback network. input nodes, three hidden nodes and four output do nor belong to class (target value -1). Also
nodes that has lateral inhibition structure in the using each of training xvecmrs as input, test the
I 2.11 Exercise Problems output layer. response of net.

1. For the neP.vork shown in Figure 20, calculate the net input to rhe output neuron.

I 2.12 Projects

~
1. Write a program to classify ilie letters and numer- 2. Wtit$:.~~ira~ programs for implementing logic
y als using Hebb learning rule. Take a pair of letters functions usin~cCulloch-Pitts neuron.
- 6 or numerals of your own. Also, after training 3. Write a computer program to train a Madaline to
0.2 the fl.erwork, test the response of ilie net using perform AND function, using MRI algorithm.
suitable activation function. Perform the clas-
0.3 ~
4: Write a program for implementing BPN for
sification using bipolar data as well as binary
training a single·hidden·layer back-propagation
dara.
Figure 20 Neural net.
CHAPTER 3

SUPERVISED LEARNING
NETWORK

“Principles of Soft Computing, 2nd Edition”
by S.N. Sivanandam & S.N. Deepa
Copyright © 2011 Wiley India Pvt. Ltd. All rights reserved.
DEFINITION OF SUPERVISED LEARNING NETWORKS

• Training and test data sets
• Training set: input & target are specified

PERCEPTRON NETWORKS

• Linear threshold unit (LTU)

Inputs x1, …, xn (with x0 = 1 for the bias) carry weights w1, …, wn (and w0). The unit computes the weighted sum

net = Σ (i = 0 to n) wi xi

and outputs

f(net) = 1 if Σ (i = 0 to n) wi xi > 0
       = -1 otherwise
PERCEPTRON LEARNING

wi ← wi + Δwi
Δwi = η (t - o) xi

where
t = c(x) is the target value,
o is the perceptron output,
η is a small constant (e.g., 0.1) called the learning rate.

• If the output is correct (t = o), the weights wi are not changed.

• If the output is incorrect (t ≠ o), the weights wi are changed such that the output of the perceptron for the new weights is closer to t.

• The algorithm converges to the correct classification
  • if the training data is linearly separable, and
  • η is sufficiently small.
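As a concrete sketch of this rule (the Python code, toy data and variable names below are our own illustration, not from the slides), one training pass looks like:

```python
import numpy as np

def perceptron_step(w, x, t, eta=0.1):
    """One perceptron update: w_i <- w_i + eta * (t - o) * x_i."""
    o = 1 if np.dot(w, x) > 0 else -1  # linear threshold output
    return w + eta * (t - o) * x       # no change when t == o

# Toy linearly separable data (first component x0 = 1 is the bias input)
X = np.array([[1, 2.0, 1.0], [1, -1.0, -1.5], [1, 1.5, 0.5], [1, -2.0, -0.5]])
T = np.array([1, -1, 1, -1])

w = np.zeros(3)
for _ in range(20):                    # a few epochs suffice here
    for x, t in zip(X, T):
        w = perceptron_step(w, x, t)
print(w)                               # a weight vector separating the two classes
```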
LEARNING ALGORITHM

• Epoch: Presentation of the entire training set to the neural network.

• In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).

• Error: The error value is the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it outputs 1, then Error = -1.
• Target Value, T: When we are training a network we not only present it with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the training value will be 1.

• Output, O: The output value from the neuron.

• Ij: Inputs being presented to the neuron.

• Wj: Weight from input neuron (Ij) to the output neuron.

• LR: The learning rate. This dictates how quickly the network converges. It is set by a matter of experimentation. It is typically 0.1.
TRAINING ALGORITHM

• Adjust neural network weights to map inputs to outputs.

• Use a set of sample patterns where the desired output (given the inputs presented) is known.

• The purpose is to learn to recognize features which are common to good and bad exemplars.
MULTILAYER PERCEPTRON

(Figure: input signals (external stimuli) enter the input layer and pass through adjustable weights to the output layer, which produces the output values.)
LAYERS IN NEURAL NETWORK

• The input layer:
  • Introduces input values into the network.
  • No activation function or other processing.

• The hidden layer(s):
  • Perform classification of features.
  • Two hidden layers are sufficient to solve any problem.
  • Features imply more layers may be better.

• The output layer:
  • Functionally is just like the hidden layers.
  • Outputs are passed on to the world outside the neural network.
ADAPTIVE LINEAR NEURON (ADALINE)
In 1959, Bernard Widrow and Marcian Hoff of Stanford developed
models they called ADALINE (Adaptive Linear Neuron) and MADALINE
(Multilayer ADALINE). These models were named for their use of
Multiple ADAptive LINear Elements. MADALINE was the first neural
network to be applied to a real world problem. It is an adaptive filter
which eliminates echoes on phone lines.

ADALINE MODEL

ADALINE LEARNING RULE

The Adaline network uses the Delta Learning Rule. This rule is also called the Widrow-Hoff Learning Rule or the Least Mean Square (LMS) Rule. The delta rule for adjusting the weights is given as (i = 1 to n):

wi(new) = wi(old) + α (t - yin) xi
USING ADALINE NETWORKS

• Initialize
  • Assign random weights to all links

• Training
  • Feed in known inputs in random sequence
  • Simulate the network
  • Compute error between the input and the output (error function)
  • Adjust weights (learning function)
  • Repeat until total error < ε

• Thinking
  • Simulate the network
  • Network will respond to any input
  • Does not guarantee a correct solution even for trained inputs
MADALINE NETWORK
MADALINE is a Multilayer Adaptive Linear Element. MADALINE was the first neural network to be applied to a real-world problem. It is used in several adaptive filtering processes.
BACK PROPAGATION NETWORK

• A training procedure which allows multilayer feed-forward neural networks to be trained.

• Can theoretically perform "any" input-output mapping.

• Can learn to solve linearly inseparable problems.
MULTILAYER FEEDFORWARD NETWORK

(Figure: a feed-forward network with input units I0-I3, hidden units h0-h2 and output units o0-o1; signals flow from the inputs through the hidden layer to the outputs.)
MULTILAYER FEEDFORWARD NETWORK: ACTIVATION AND TRAINING

• For feed-forward networks:
  • A continuous activation function can be differentiated, allowing gradient descent.
  • Back-propagation is an example of a gradient-descent technique.
  • Uses sigmoid (binary or bipolar) activation function.
In multilayer networks, the activation function is
usually more complex than just a threshold function,
like 1/[1+exp(-x)] or even 2/[1+exp(-x)] – 1 to allow for
inhibition, etc.

GRADIENT DESCENT

• Gradient-Descent(training_examples, η)

• Each training example is a pair of the form <(x1,…,xn), t>, where (x1,…,xn) is the vector of input values, t is the target output value, and η is the learning rate (e.g., 0.1).

• Initialize each wi to some small random value.

• Until the termination condition is met, Do
  • Initialize each Δwi to zero.
  • For each <(x1,…,xn), t> in training_examples Do
  • Input the instance (x1,…,xn) to the linear unit and compute the output o.
  • For each linear unit weight wi Do
    • Δwi = Δwi + η (t - o) xi
  • For each linear unit weight wi Do
    • wi = wi + Δwi
MODES OF GRADIENT DESCENT

• Batch mode: gradient descent over the entire data D
  w = w - η ∇ED[w]
  ED[w] = 1/2 Σ (d ∈ D) (td - od)²

• Incremental mode: gradient descent over individual training examples d
  w = w - η ∇Ed[w]
  Ed[w] = 1/2 (td - od)²

• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
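A compact way to see the two modes side by side is the sketch below (Python; the linear unit and the toy data are our own illustration):

```python
import numpy as np

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])  # first column is the bias input
t = np.array([1.0, -0.5, 2.5])
eta = 0.05

w_batch = np.zeros(2)
for _ in range(200):                       # batch mode: one update per epoch
    o = X @ w_batch
    w_batch += eta * X.T @ (t - o)         # w <- w - eta * grad E_D[w]

w_inc = np.zeros(2)
for _ in range(200):                       # incremental mode: one update per example
    for x_d, t_d in zip(X, t):
        o_d = x_d @ w_inc
        w_inc += eta * (t_d - o_d) * x_d   # w <- w - eta * grad E_d[w]

print(w_batch, w_inc)                      # nearly equal for small eta
```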

SIGMOID ACTIVATION FUNCTION

Inputs x0 = 1, x1, …, xn with weights w0, w1, …, wn feed a unit with

net = Σ (i = 0 to n) wi xi,   o = σ(net) = 1/(1 + e^(-net))

σ(x) is the sigmoid function 1/(1 + e^(-x)); its derivative is

dσ(x)/dx = σ(x) (1 - σ(x))

Derive gradient descent rules to train:
• one sigmoid unit:
  ∂E/∂wi = -Σ (over d) (td - od) od (1 - od) xi,d
• multilayer networks of sigmoid units: back-propagation.
BACKPROPAGATION TRAINING ALGORITHM

• Initialize each wi to some small random value.

• Until the termination condition is met, Do
  • For each training example <(x1,…,xn), t> Do
    • Input the instance (x1,…,xn) to the network and compute the network outputs ok
    • For each output unit k
      δk = ok (1 - ok)(tk - ok)
    • For each hidden unit h
      δh = oh (1 - oh) Σk wh,k δk
    • For each network weight wi,j Do
      wi,j = wi,j + Δwi,j, where Δwi,j = η δj xi,j
BACKPROPAGATION

• Gradient descent over the entire network weight vector.

• Easily generalized to arbitrary directed graphs.

• Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights).

• Often includes a weight momentum term:
  Δwi,j(t) = η δj xi,j + α Δwi,j(t - 1)

• Minimizes error over the training examples.

• Will it generalize well to unseen instances (over-fitting)?

• Training can be slow: typically 1000-10000 iterations (use Levenberg-Marquardt instead of gradient descent).
APPLICATIONS OF BACKPROPAGATION NETWORK

• Load forecasting problems in power systems.
• Image processing.
• Fault diagnosis and fault detection.
• Gesture recognition, speech recognition.
• Signature verification.
• Bioinformatics.
• Structural engineering design (civil).
RADIAL BASIS FUNCTION NETWORK

• The radial basis function (RBF) network is a classification and functional approximation neural network developed by M.J.D. Powell.

• The network uses the most common nonlinearities such as sigmoidal and Gaussian kernel functions.

• The Gaussian functions are also used in regularization networks.

• The Gaussian function is generally defined as f(x) = e^(-x²).
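As an illustration of how such Gaussian units are typically combined in an RBF network (the centers, width and least-squares output fit below are our own assumptions, not details from the slide):

```python
import numpy as np

def gaussian_rbf(x, centers, sigma):
    """Hidden-layer activations: exp(-||x - c||^2 / (2 sigma^2)) per center."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy 1-D regression: fit y = sin(x) with a few Gaussian units
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X).ravel()
centers = np.linspace(-3, 3, 7)[:, None]

H = gaussian_rbf(X, centers, sigma=1.0)        # design matrix of RBF activations
w, *_ = np.linalg.lstsq(H, y, rcond=None)      # linear output weights
print(np.abs(H @ w - y).max())                 # small residual on the training grid
```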

RADIAL BASIS FUNCTION NETWORK

SUMMARY

This chapter discussed several supervised learning networks:

• Perceptron
• Adaline
• Madaline
• Back-propagation network
• Radial basis function network

Apart from those mentioned above, there are several other supervised neural networks, such as tree neural networks, wavelet neural networks, functional link neural networks and so on.
Chapter 3

Supervised Learning Network

Learning Objectives:

• The basic networks in supervised learning.
• How the perceptron learning rule is better than the Hebb rule.
• The original perceptron layer description.
• Delta rule with single output unit.
• Architecture, flowchart, training algorithm and testing algorithm for perceptron, Adaline, Madaline, back-propagation and radial basis function networks.
• The various learning factors used in BPN.
• An overview of Time Delay, Functional Link, Wavelet and Tree Neural Networks.
• Difference between back-propagation and RBF networks.
and resting algorithm for perceptron, RBF networks.

I 3.1 Introduction
The chapter covers major topics involving supervised learning networks and their associated single-layer
and multilayer feed-forward networks. The following topics have been discussed in derail- rh'e- perceptron
learning r'Ule for simple perceptrons, the delta rule (Widrow-Hoff rule) for Adaline and single-layer feed-
forward flC[\VOrks with continuous activation functions, and the back-propagation algorithm for multilayer
feed-forward necworks with cominuous activation functions. ln short, ali the feed-forward networks have
been explored.

I 3.2 Perceptron Networks


1 3.2.1 Theory
Percepuon networks come under single-layer feed-forward networks and are also called simple perceptrons.
As described in Table 2-2 (Evolution of Neural Networks) in Chapter 2, various cypes of perceptrons were
designed by Rosenblatt (1962) and Minsky-Papert (1969, 1988). However, a simple perceprron network was
discovered by Block in 1962.
The key points to be noted in a perccptron necwork are:

I. The perceptron network consists of three units, namely, sensory unit (input unit), associator unit (hidden
unit), response unit (output unit).
~
2. The sensory units are connected to associator units with fixed weights having values 1, 0 or -1, which are assigned at random.

3. The binary activation function is used in the sensory unit and the associator unit.

4. The response unit has an activation of 1, 0 or -1. The binary step with fixed threshold θ is used as activation for the associator. The output signals that are sent from the associator unit to the response unit are only binary.

5. The output of the perceptron network is given by

   y = f(yin)

   where f(yin) is the activation function, defined as

   f(yin) = 1 if yin > θ; 0 if -θ ≤ yin ≤ θ; -1 if yin < -θ

6. The perceptron learning rule is used in the weight updation between the associator unit and the response unit. For each training input, the net will calculate the response and it will determine whether or not an error has occurred.

7. The error calculation is based on the comparison of the values of the targets with those of the calculated outputs.

8. The weights on the connections from the units that send the nonzero signal will get adjusted suitably.

9. The weights will be adjusted on the basis of the learning rule if an error has occurred for a particular training pattern, i.e.,

   wi(new) = wi(old) + α t xi
   b(new) = b(old) + α t

If no error occurs, there is no weight updation and hence the training process may be stopped. In the above equations, the target value "t" is +1 or -1 and α is the learning rate. In general, these learning rules begin with an initial guess at the weight values; successive adjustments are then made on the basis of the evaluation of an objective function. Eventually, the learning rules reach a near-optimal or optimal solution in a finite number of steps.

A perceptron network with its three units is shown in Figure 3-1. As shown in Figure 3-1, a sensory unit can be a two-dimensional matrix of 400 photodetectors upon which a lighted picture with geometric black and white pattern impinges. These detectors provide a binary (0/1) electrical signal if the input signal is found to exceed a certain value of threshold. Also, these detectors are connected randomly with the associator unit. The associator unit is found to consist of a set of subcircuits called feature predicates. The feature predicates are hard-wired to detect the specific feature of a pattern and are equivalent to feature detectors. For a particular feature, each predicate is examined with a few or all of the responses of the sensory unit. It can be found that the results from the predicate units are also binary (0/1). The last unit, i.e. the response unit, contains the pattern recognizers or perceptrons. The weights present in the input layers are all fixed, while the weights on the response unit are trainable. (Figure 3-1: Original perceptron network.)

3.2.2 Perceptron Learning Rule

In the case of the perceptron learning rule, the learning signal is the difference between the desired and actual response of a neuron. The perceptron learning rule is explained as follows.

Consider a finite number "n" of input training vectors, with their associated target (desired) values x(n) and t(n), where "n" ranges from 1 to N. The target is either +1 or -1. The output "y" is obtained on the basis of the net input calculated and the activation function applied over the net input:

y = f(yin) = 1 if yin > θ; 0 if -θ ≤ yin ≤ θ; -1 if yin < -θ

The weight updation in case of perceptron learning is as shown: if y ≠ t, then

w(new) = w(old) + α t x   (α: learning rate)

else, we have

w(new) = w(old)
(Figure 3-2: Single classification perceptron network.)

The perceptron rule converges to correct weights for all training patterns, and this learning takes place within a finite number of steps provided that the solution exists.

3.2.3 Architecture

In the original perceptron network, the output obtained from the associator unit is a binary vector, and hence that output can be taken as the input signal to the response unit, and classification can be performed. Here only the weights between the associator unit and the output unit can be adjusted, and the weights between the sensory and associator units are fixed. As a result, the discussion of the network is limited to a single portion. Thus, the associator unit behaves like the input unit. A simple perceptron network architecture is shown in Figure 3-2.

In Figure 3-2, there are n input neurons, 1 output neuron and a bias. The input-layer and output-layer neurons are connected through a directed communication link, which is associated with weights. The goal of the perceptron net is to classify the input pattern as a member or not a member of a particular class.

3.2.4 Flowchart for Training Process

The flowchart for the perceptron network training is shown in Figure 3-3. The network has to be suitably trained to obtain the response. The flowchart depicted here presents the flow of the training process. As depicted in the flowchart, first the basic initialization required for the training process is performed. The entire loop of the training process continues until each training input pair has been presented to the network. The training (weight updation) is done on the basis of the comparison between the calculated and desired output: if y ≠ t, then w1(new) = w1(old) + α t x1 and b(new) = b(old) + α t; otherwise the weights and bias are unchanged. The loop is terminated if there is no change in weight. (Figure 3-3: Flowchart for perceptron network with single output.)

3.2.5 Perceptron Training Algorithm for Single Output Classes

The perceptron algorithm can be used for either binary or bipolar input vectors, having bipolar targets, threshold being fixed and variable bias. The algorithm discussed in this section is not particularly sensitive to the initial values of the weights or the value of the learning rate. In the algorithm discussed below, initially the inputs are assigned. Then the net input is calculated. The output of the network is obtained by applying the activation function over the calculated net input. On performing comparison over the calculated and
54 Supervised Learning Network 3.2 Parceptron Networks 55

ilie desired output, the weight updation process is carried out. The entire neMork is trained based on the Step 2: Perform Steps 3--5 for each bipolar or binary training vector pair s:t.
mentioned stopping criterion. The algorithm of a percepuon network is as follows: Step 3, Set activation (identity) of each input unit i = 1 ton:

I StepO: Initi-alize ili~weights a~d th~bia~for ~ ~culation they can b-e set to zero). Also initialize the / x;;= ~{
learning race a(O < a,;:= 1). For simplicity a is set to 1.
Step 1: Perform Steps 2-6 until the final stopping condition is false. Step 4, irst, the net input is calculated as i A
Step 2: Perform Steps 3-5 for each training pair indicated by s:t. I ,_..... It: -.-' --.~ ',_ -~,
---- :::::::::J~ "'( ;-;· J)'
Step 3: The input layer containing input units is applied with identity activation functions: ~----

(,. Yinj = bj + Lx;wij


n
.1~'
r<t' V" p•\ \ '' /
. u· \'.
'
(}.C: \\ , :/ r·
x; =si
\ ~
i=l
. v··· '<.''
Step 4: Calculate the output of the nwvork. To do so, first obtain the net input: Then activations are applied over the net input to calculate the output response:
"
~
Yin= b+ Lx;w; ify;11j > 9
i=I
Jj = f(y;.y) = { if-9 :S.Jinj :S.9 II
,,
where "n" is the number of input neurons in the input layer. Then apply activations over the net -I ify;11j < -9
input calculated to obmin the output:
Step 5: Make adjustment in weights and bias for j = I to m and i = I to n.
~
ify,:n>B
y= f(y;.) = { if -8 S.y;, s.B If;· # Jj• then
-I ify;n < -9 Wij(new) = Wij(old) + CXfjXi
Step 5, Weight and bias adjustment: Compare ilie value of the actual (calculated) output and desired bj(new) = bj(old) + Ofj
(target) output. else, we have \1
Ify i' f, then wij(new) = Wij(old) li'
w;(new) = w;(old) + atx; ~{new) = ~{old)

b(new) = b(old) + Of
Step 6: Test for the stopping condition, i.e., if there is no change in weights then stop the training process,
else, we have
1 else stan again from Step 2. 1
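A runnable sketch of this single-output training algorithm (Python; the bipolar AND data and the θ = 0.2 threshold are our own example choices, not from the text):

```python
import numpy as np

def activation(yin, theta=0.2):
    # f(yin) = 1 if yin > theta; 0 if -theta <= yin <= theta; -1 if yin < -theta
    return 1 if yin > theta else (-1 if yin < -theta else 0)

# Bipolar AND training pairs s:t
S = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])

w, b, alpha = np.zeros(2), 0.0, 1.0          # Step 0
changed = True
while changed:                               # Steps 1-6
    changed = False
    for x, t in zip(S, T):                   # Steps 2-3
        yin = b + x @ w                      # Step 4
        y = activation(yin)
        if y != t:                           # Step 5
            w, b, changed = w + alpha * t * x, b + alpha * t, True

print(w, b)  # a separating weight set for AND, e.g. w = [1, 1], b = -1
```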
3.2.6 Perceptron Training Algorithm for Multiple Output Classes

For multiple output classes, the perceptron training algorithm is as follows:

Step 0: Initialize the weights, biases and learning rate suitably.
Step 1: Check for the stopping condition; if it is false, perform Steps 2-6.
Step 2: Perform Steps 3-5 for each bipolar or binary training vector pair s:t.
Step 3: Set activation (identity) of each input unit i = 1 to n:

        xi = si

Step 4: First, the net input is calculated as

        yinj = bj + Σ (i = 1 to n) xi wij

        Then activations are applied over the net input to calculate the output response:

        yj = f(yinj) = 1 if yinj > θ; 0 if -θ ≤ yinj ≤ θ; -1 if yinj < -θ

Step 5: Make adjustment in weights and bias for j = 1 to m and i = 1 to n. If tj ≠ yj, then

        wij(new) = wij(old) + α tj xi
        bj(new) = bj(old) + α tj

        else, we have

        wij(new) = wij(old)
        bj(new) = bj(old)

Step 6: Test for the stopping condition, i.e., if there is no change in weights then stop the training process, else start again from Step 2.

It can be noticed that after training, the net classifies each of the training vectors. The above algorithm is suited for the architecture shown in Figure 3-4.

3.2.7 Perceptron Network Testing Algorithm

It is best to test the network performance once the training process is complete. For efficient performance of the network, it should be trained with more data. The testing algorithm (application procedure) is as follows:

Step 0: The initial weights to be used here are taken from the training algorithm (the final weights obtained during training).
Step 1: For each input vector x to be classified, perform Steps 2-3.
Step 2: Set activations of the input unit.
r
56 Supervised Learriing Network
~- 3.3 Adaptive Unear Neuron (Adaline) 57

~~
~,
~~3.3 Adaptive Linear Neuron (Adaline)
'
1
I 3.3.1 Theory ,

x, 'x,
The unirs with linear activation function are called li~ear.~ts. A network ~ith a single linear unit is called
an Adaline (adaptive linear neuron). That is, in an Adaline, the input-output relationship is linear. Adaline

./~~ \~\J
/w,, uses bipolar activation for its input signals and its target output. The weights be.cween the input and the
omput are adjustable. The bias in Adaline acts like an adjustable weighr, whose connection is from a unit
with activations being always 1. Ad.aline is a net which has only one output unit. The Adaline nerwork may
w,l be trained using delta rule. The delta rule may afso be called as least mean square (LMS) rule or Widrow~Hoff
Xi
(x;)~ "/ ~ y 1:
~(s)--
YJ
I • -----+- YJ
rule. This learning rule is found to minimize the mean~squared error between the activation and the target
value.

I 3.3.2 Delta Rule for Single Output Unit

The Widrow-Hoff rule is very similar to percepuon learning rule. However, rheir origins are different. The
perceptron learning rule originates from the Hebbian assumption while the delta rule is derived from the
x, ( x,).£::::___ _ _~
w -
gradienc~descem method (it can be generalized to more than one layer). Also, the perceptron learning rule
stops after a finite number ofleaming steps, but the gradient~descent approach concinues forever, converging
Figure 3·4 Network archirecture for percepuon network for several output classes. only asymptotically to the solution. The delta rule updates the weights between the connections so as w
minimize the difference between the net input ro the output unit and the target value. The major aim is to
Step 3: Obrain the· response of output unit. minimize the error over all training parrerns. This is done by reducing the error for each pattern, one at a
rime.
The delta rule for adjusting rhe weight of ith pattern {i = 1 ro n) is
Yin = L" x;w; / ' ·
i=l
D.w; = a(t- y1,)x1
where D.w; is the weight change; a the learning rate; xthe vector of activation of input unit;y;, the net input
I if y;, > 8
to output unit, i.e., Y Li=l
= x;w;; t rhe target output. The deha rule in case of several output units for
Y = f(yhl) = { _o ~f ~e sy;, ~8 _,/'\ adjusting the weight from ith input unit to the jrh output unit (for each pattern) is
1 tfy111 <-8 IJ.wij = a(t;- y;,,j)x;

Thus, the testing algorithm resLS the performance of nerwork. I 3.3.3 Architeclure

As already stated, Adaline is a single~unir neuron, which receives input from several units and also from one
unit called bias. An Adaline inodel is shown in Figure 3~5. The basic Adaline model consists of trainable
weights. Inputs are either of the two values (+ 1 or -1) and the weights have signs (positive or negative).
The condition for separaring the response &om re~o is Initially, random weights are assigned. The net input calculated is applied to a quantizer transfer function
(possibly activation function) that restOres the output to +1 or -1. The Adaline model compares the actual
WJXJ + tiJ2X]. + b> (} output with the target output and on the basis of the training algorithm, the weights are adjusted.

_______
The condition for separating the resPonse from_...r~~o t~~ion of nega~ve
..
~--
is I 3.3.4 Flowchart lor Training Process

WI X} + 'WJ.X]_ + b < -(} The flowchan for the training process is shown in Figure 3~6. This gives a picrorial representation of the
network training. The conditions necessary for weight adjustments have co be checked carefully. The weights
The conditions- above are stated for a siilgie:f.i~p;;~~~ ~~~~;~k~ith rwo Input neurons and one output and other required parameters are initialized. Then the net input is calculated, output is obtained and compared
neuron and one bias. with the desired output for calculation of error. On the basis of the error Factor, weights are adjusted.
(Figure 3-5: Adaline model. The net input yin = Σ xi wi + b is compared with the target t by an output-error generator, e = t - yin, and the adaptive algorithm driven by this error adjusts the weights under a learning supervisor. Figure 3-6: Flowchart for Adaline training process. Initial weights, bias and learning rate α are set; for each training pair s:t the input layer is activated, the net input is calculated, the weights and bias are updated, and the cycle repeats until the error condition Ei = Es is met.)

3.3.5 Training Algorithm

The Adaline network training algorithm is as follows:

Step 0: Weights and bias are set to some random values but not zero. Set the learning rate parameter α.
Step 1: Perform Steps 2-6 when the stopping condition is false.
Step 2: Perform Steps 3-5 for each bipolar training pair s:t.
Step 3: Set activations for input units i = 1 to n:

        xi = si

Step 4: Calculate the net input to the output unit:

        yin = b + Σ (i = 1 to n) xi wi

Step 5: Update the weights and bias for i = 1 to n:

        wi(new) = wi(old) + α (t - yin) xi
        b(new) = b(old) + α (t - yin)

Step 6: If the highest weight change that occurred during training is smaller than a specified tolerance, then stop the training process; else continue. This is the test for the stopping condition of a network.

The range of the learning rate can be between 0.1 and 1.0.
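Steps 0-6 map directly onto a few lines of Python. This sketch is our own illustration (the linearly generated targets, α and tolerance are assumptions); it uses the largest weight change per epoch as the stopping test:

```python
import numpy as np

# Adaline training (delta/LMS rule). The targets below are linearly realizable,
# so the per-pattern weight changes shrink to zero and the tolerance test fires.
S = np.array([[1.0, 1], [1, -1], [-1, 1], [-1, -1]])
T = S @ np.array([0.5, -0.3]) + 0.2          # a target the linear unit can realize

w, b = np.array([0.05, -0.02]), 0.1          # Step 0: small nonzero values
alpha, tol = 0.1, 1e-6

while True:                                  # Steps 1-6
    largest_change = 0.0
    for x, t in zip(S, T):                   # Steps 2-3
        yin = b + x @ w                      # Step 4: linear net input
        dw, db = alpha * (t - yin) * x, alpha * (t - yin)   # Step 5
        w, b = w + dw, b + db
        largest_change = max(largest_change, abs(dw).max(), abs(db))
    if largest_change < tol:                 # Step 6: stop when changes are tiny
        break

print(w, b)                                  # recovers w ~ [0.5, -0.3], b ~ 0.2
```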

3.3.6 Testing Algorithm

It is essential to perform the testing of a network that has been trained. When training is completed, the Adaline can be used to classify input patterns. A step function is used to test the performance of the network. The testing procedure for the Adaline network is as follows:

Step 0: Initialize the weights. (The weights are obtained from the training algorithm.)
Step 1: Perform Steps 2-4 for each bipolar input vector x.
Step 2: Set the activations of the input units to x.
Step 3: Calculate the net input to the output unit:

        yin = b + Σ xi wi

Step 4: Apply the activation function over the net input calculated:

        y = 1 if yin ≥ 0; -1 if yin < 0

3.4 Multiple Adaptive Linear Neurons

3.4.1 Theory

The multiple adaptive linear neurons (Madaline) model consists of many Adalines in parallel with a single output unit whose value is based on certain selection rules. It may use the majority vote rule. On using this rule, the output would have as answer either true or false. On the other hand, if the AND rule is used, the output is true if and only if both the inputs are true, and so on. The weights that are connected from the Adaline layer to the Madaline layer are fixed, positive and possess equal values. The weights between the input layer and the Adaline layer are adjusted during the training process. The Adaline and Madaline layer neurons have a bias of excitation "1" connected to them. The training process for a Madaline system is similar to that of an Adaline.

3.4.2 Architecture

A simple Madaline architecture is shown in Figure 3-7, which consists of "n" units of input layer, "m" units of Adaline layer and "1" unit of the Madaline layer. Each neuron in the Adaline and Madaline layers has a bias of excitation 1. The Adaline layer is present between the input layer and the Madaline (output) layer; hence, the Adaline layer can be considered a hidden layer. The use of the hidden layer gives the net a computational capability which is not found in single-layer nets, but this complicates the training process to some extent. The Adaline and Madaline models can be applied effectively in communication systems of adaptive equalizers and adaptive noise cancellation and other cancellation circuits. (Figure 3-7: Architecture of Madaline layer.)

3.4.3 Flowchart of Training Process

The flowchart of the training process of the Madaline network is shown in Figure 3-8. In case of training, the weights between the input layer and the hidden layer are adjusted, and the weights between the hidden layer and the output layer are fixed. The time taken for the training process in the Madaline network is very high compared to that of the Adaline network.

3.4.4 Training Algorithm

In this training algorithm, only the weights between the hidden layer and the input layer are adjusted, and the weights for the output units are fixed. The weights v1, v2, ..., vm and the bias b0 that enter into output unit Y are determined so that the response of unit Y is 1. Thus, the weights entering the Y unit may be taken as

v1 = v2 = ... = vm = 1/2

and the bias can be taken as

b0 = 1/2

The activation for the Adaline (hidden) and Madaline (output) units is given by

f(x) = 1 if x ≥ 0; -1 if x < 0

Step 0: Initialize the weights. The weights entering the output unit are set as above. Set initial small random values for the Adaline weights. Also set the initial learning rate α.
Step 1: When the stopping condition is false, perform Steps 2-8.
Step 2: For each bipolar training pair s:t, perform Steps 3-7.
Step 3: Activate input layer units. For i = 1 to n:

        xi = si

Step 4: Calculate the net input to each hidden Adaline unit:

        zinj = bj + Σ (i = 1 to n) xi wij,  j = 1 to m
(Figure 3-8: Flowchart for training of Madaline. The fixed weights and bias between the hidden and output layers are set first; small random weights are chosen for the Adaline layer and α is initialized. For each training pair, the input units are activated (xi = si), the net input to each hidden unit zinj = bj + Σ xi wij (j = 1 to m) and its output zj = f(zinj) are found, and then the net input to the output unit yin = b0 + Σ zj vj and the output y = f(yin) are calculated. If t = y, no update is made. If t = 1, the weights are updated on the unit zj whose net input is closest to zero: bj(new) = bj(old) + α(1 - zinj), wij(new) = wij(old) + α(1 - zinj) xi. If t = -1, the weights are updated on all units zk that have positive net input: bk(new) = bk(old) + α(-1 - zink), wik(new) = wik(old) + α(-1 - zink) xi. Training stops when no weight changes occur or a specified number of epochs is completed.)

Step 5: Calculate the output of each hidden unit:

        zj = f(zinj)

Step 6: Find the output of the net:

        yin = b0 + Σ (j = 1 to m) zj vj
        y = f(yin)

Step 7: Calculate the error and update the weights.

        1. If t = y, no weight updation is required.
        2. If t ≠ y and t = +1, update weights on zj, where the net input is closest to 0 (zero):

           bj(new) = bj(old) + α (1 - zinj)
           wij(new) = wij(old) + α (1 - zinj) xi

        3. If t ≠ y and t = -1, update weights on units zk whose net input is positive:

           wik(new) = wik(old) + α (-1 - zink) xi
           bk(new) = bk(old) + α (-1 - zink)

Step 8: Test for the stopping condition. (If there is no weight change, or the weights reach a satisfactory level, or a specified maximum number of iterations of weight updation has been performed, then stop; else continue.)

Madalines can be formed with the weights on the output unit set to perform some logic functions. If there are only two hidden units present, or if there are more than two hidden units, then the "majority vote rule" function may be used.
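Step 7's "nudge the responsible Adalines" logic (the MRI rule) is easiest to see in code. The sketch below is our own illustration; the bipolar XOR data, seed and constants are assumptions, and convergence from a random start is typical but not guaranteed:

```python
import numpy as np

f = lambda x: np.where(x >= 0, 1, -1)        # bipolar step activation

# Bipolar XOR, which a single Adaline cannot learn
S = np.array([[1.0, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([-1, 1, 1, -1])

rng = np.random.default_rng(3)
W = rng.normal(0, 0.5, (2, 2))               # input -> Adaline weights (trainable)
b = rng.normal(0, 0.5, 2)                    # Adaline biases (trainable)
v, b0, alpha = np.full(2, 0.5), 0.5, 0.5     # fixed output weights: an OR unit

for _ in range(100):
    for x, t in zip(S, T):
        z_in = b + x @ W                     # Step 4
        z = f(z_in)                          # Step 5
        y = f(b0 + z @ v)                    # Step 6
        if y == t:                           # Step 7, case 1: no update
            continue
        if t == 1:                           # case 2: push nearest-zero unit up
            j = np.argmin(np.abs(z_in))
            b[j] += alpha * (1 - z_in[j])
            W[:, j] += alpha * (1 - z_in[j]) * x
        else:                                # case 3: push positive units down
            for k in np.where(z_in > 0)[0]:
                b[k] += alpha * (-1 - z_in[k])
                W[:, k] += alpha * (-1 - z_in[k]) * x

print([int(f(b0 + f(b + x @ W) @ v)) for x in S])  # typically matches T = [-1, 1, 1, -1]
```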

3.5 Back-Propagation Network

3.5.1 Theory

The back-propagation learning algorithm is one of the most important developments in neural networks (Bryson and Ho, 1969; Werbos, 1974; Lecun, 1985; Parker, 1985; Rumelhart, 1986). This network has re-awakened the scientific and engineering community to the modeling and processing of numerous quantitative phenomena using neural networks. The learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with continuous differentiable activation functions. The networks associated with the back-propagation learning algorithm are also called back-propagation networks (BPNs). For a given set of training input-output pairs, this algorithm provides a procedure for changing the weights in a BPN to classify the given input patterns correctly. The basic concept for this weight update algorithm is simply the gradient-descent method, as used in the case of simple perceptron networks with differentiable units. This is a method where the error is propagated back to the hidden units. The aim of the neural network is to train the net to achieve a balance between the net's ability to respond (memorization) and its ability to give reasonable responses to inputs that are similar, but not identical, to the ones used in training (generalization).

The back-propagation algorithm differs from other networks in respect of the process by which the weights are calculated during the learning period of the network. The general difficulty with multilayer perceptrons is calculating the weights of the hidden layers in an efficient way that would result in a very small or zero output error. When the hidden layers are increased, the network training becomes more complex. To update weights, the error must be calculated. The error, which is the difference between the actual (calculated) and the desired (target) output, is easily measured at the output layer. It should be noted that at the hidden layers there is no direct information of the error. Therefore, other techniques should be used to calculate an error at the hidden layer which will cause minimization of the output error, and this is the ultimate goal.

The training of the BPN is done in three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the error, and the updation of weights. The testing of the BPN involves the computation of the feed-forward phase only. There can be more than one hidden layer (more beneficial), but one hidden layer is sufficient. Even though the training is very slow, once the network is trained it can produce its outputs very rapidly.

3.5.2 Architecture

A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are the connections from the units whose activation is always 1. The bias terms also act as weights. Figure 3-9 shows the architecture of a BPN, depicting only the direction of information flow for the feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.

The inputs sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (-1, +1). The activation function could be any function which increases monotonically and is also differentiable. (Figure 3-9: Architecture of a back-propagation network.)

I 3.5.3 Flowchart for Training Process

The flowchart for rhe training process using a BPN is shown in Figure 3-10. The terminologies used in the
flowchart and in the uaining algorithm are as follows:
x = input training vecro.r (XJ, ... , x;, ... , x11 )
t = target output vector (t), ... , t/r, ... , tm) -
a = learning rate parameter
x; :;::. input unit i. (Since rhe input layer uses identity activation function, the input and output signals © "
here are same.)
VOj = bias on jdi hidd~n unit
wok = bias on kch output unit FOr each No
~=hidden unirj. The net inpUt to Zj is training pair >-~----(B
x. t
"
Zinj = llOj +I: XjVij
i=l Yes
and rhe output is
Zj = f(zi"j) Receive Input signal x1 &
transmit to hidden unit

Jk = output unit k. The net input m Yk is


p

]ink = Wok + L ZjWjk In hidden unit, calculate o/p,


j=:l "
Z;nj::: Voj + i~/iVij

z;=f(Z;nj), ]=1top
and rhe output is i= 1\o n
y; = f(y,";)

Ok =. error correction weight adjusrmen~. for Wtk ~hat is due tO an error at output unit Yk• which is
back-propagared m the hidden uni[S thai feed into u~
Of = error correction weight adjustment for Vij that is due m the back-proEagation of error to the
hidden uni<zj- b>• '\f"-( L""'-'iJ ~-fe_,l.. ,,'-'.fJ Z-J' ...--
Also, ir should be noted that tOe commonly used acrivarion functions are l:imary sigmoidal and bipolar
sigmoidal activation functions (discussed in Section 2.3.3). These functions are used in the BPN because of Calculate output signal from
the following characteristics: (i) continui~; (ii) djffereorjahilit:ytlm) nQndeCreasing mon0£9.11Y· output layer,
p
The range of binary sigmoid is fio;Q to 1, and for bipolar sigmoid it is from -1 to+ 1. Yink =- Wok+ :E z,wik
"'
Yk = f(Yink), k =1 tom

I 3.5.4 Training Algorilhm

The error back-propagation learning algorithm can be oudined in ilie following algorithm:

Figure 3·10
!Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Sreps 2-9 when stopping condition is false.
Step 2: Perform Steps 3-8 for~ traini~~r.

I
L
Supervised learning Network
3.5 Back·Propagation Network 69
68
Feed-forward phase (Phase I):

Step 3: Each input unit receives input signal xi and sends it to the hidden unit (i = 1 to n).
Step 4: Each hidden unit zj (j = 1 to p) sums its weighted input signals to calculate the net input:

        zinj = v0j + Σ (i = 1 to n) xi vij

        Calculate the output of the hidden unit by applying its activation function over zinj (binary or bipolar sigmoidal activation function):

        zj = f(zinj)

        and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit yk (k = 1 to m), calculate the net input:

        yink = w0k + Σ (j = 1 to p) zj wjk

        and apply the activation function to compute the output signal:

        yk = f(yink)

Back-propagation of error (Phase II):

Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input training pattern and computes the error correction term:

        δk = (tk - yk) f'(yink)

        The derivative f'(yink) can be calculated as in Section 2.3.3. On the basis of the calculated error correction term, update the change in weights and bias:

        Δwjk = α δk zj;  Δw0k = α δk

        Also, send δk to the hidden layer backwards.
Step 7: Each hidden unit (zj, j = 1 to p) sums its delta inputs from the output units:

        δinj = Σ (k = 1 to m) δk wjk

        The term δinj gets multiplied with the derivative of f(zinj) to calculate the error term:

        δj = δinj f'(zinj)

        The derivative f'(zinj) can be calculated as discussed in Section 2.3.3, depending on whether the binary or bipolar sigmoidal function is used. On the basis of the calculated δj, update the change in weights and bias:

        Δvij = α δj xi;  Δv0j = α δj
Weight and bias updation (Phase III):

Step 8: Each output unit (yk, k = 1 to m) updates the bias and weights:

    wjk(new) = wjk(old) + Δwjk
    w0k(new) = w0k(old) + Δw0k

Each hidden unit (zj, j = 1 to p) updates its bias and weights:

    vij(new) = vij(old) + Δvij
    v0j(new) = v0j(old) + Δv0j

Step 9: Check for the stopping condition. The stopping condition may be a certain number of epochs reached or the actual output equaling the target output.

The above algorithm uses the incremental approach for updation of weights, i.e., the weights are changed immediately after a training pattern is presented. There is another way of training called batch-mode training, where the weights are changed only after all the training patterns are presented. The effectiveness of the two approaches depends on the problem, but batch-mode training requires additional local storage for each connection to maintain the immediate weight changes. When a BPN is used as a classifier, it is equivalent to the optimal Bayesian discriminant function for asymptotically large sets of statistically independent training patterns.

The problem in this case is whether the back-propagation learning algorithm can always converge and find proper weights for the network even after enough learning. It will converge since it implements a gradient descent on the error surface in the weight space, and this will roll down the error surface to the nearest minimum error and stop. This becomes true only when the relation existing between the input and the output training patterns is deterministic and the error surface is deterministic. This is not the case in the real world because the produced square-error surfaces are always random. This is the stochastic nature of the back-propagation algorithm, which is purely based on the stochastic gradient-descent method. The BPN is a special case of stochastic approximation.

If the BPN algorithm converges at all, then it may get stuck in local minima and may be unable to find satisfactory solutions. The randomness of the algorithm helps it get out of local minima. The error functions may have a large number of global minima because of permutations of weights that keep the network input-output function unchanged. This causes the error surfaces to have numerous troughs.

3.5.5 Learning Factors of Back-Propagation Network

The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is based on some important learning factors such as the initial weights, the learning rate, the updation rule, the size and nature of the training set, and the architecture (number of layers and number of neurons per layer).

3.5.5.1 Initial Weights

The ultimate solution may be affected by the initial weights of a multilayer feed-forward network. They are initialized at small random values. The choice of initial weights determines how fast the network converges. The initial weights cannot be very high because the sigmoidal activation functions used here may get saturated from the beginning itself and the system may be stuck at a local minimum or at a very flat plateau at the starting point itself. One method of choosing the weights is choosing them in the range

    [ -3/√oi , 3/√oi ]

where oi is the number of units that feed into unit i. Another method scales the randomly chosen weights as

    vij(new) = γ vij(old) / ||vj(old)||

where vj is the average weight calculated for all values of i, and the scale factor γ = 0.7(P)^(1/n) ("n" is the number of input neurons and "P" is the number of hidden neurons).

3.5.5.2 Learning Rate α

The learning rate (α) affects the convergence of the BPN. A larger value of α may speed up the convergence but might result in overshooting, while a smaller value of α has the opposite effect. The range of α from 10^-3 to 10 has been used successfully for several back-propagation algorithmic experiments. Thus, a large learning rate leads to rapid learning but there is oscillation of weights, while a lower learning rate leads to slower learning.

3.5.5.3 Momentum Factor

The gradient descent is very slow if the learning rate α is small and oscillates widely if α is too large. One very efficient and commonly used method that allows a larger learning rate without oscillations is adding a momentum factor to the normal gradient-descent method.

The momentum factor is denoted by η ∈ [0, 1] and the value of 0.9 is often used for the momentum factor. Also, this approach is more useful when some training data are very different from the majority of the data. A momentum factor can be used with either pattern-by-pattern updating or batch-mode updating. In case of batch mode, it has the effect of complete averaging over the patterns. Even though the averaging is only partial in the pattern-by-pattern mode, it leaves some useful information for weight updation.

The weight updation formulas used here are

    wjk(t+1) = wjk(t) + α δk zj + η [wjk(t) - wjk(t-1)]

and

    vij(t+1) = vij(t) + α δj xi + η [vij(t) - vij(t-1)]

where the bracketed terms are the previous weight changes Δwjk(t) and Δvij(t). The momentum factor also helps in faster convergence.
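To make the preceding steps concrete, here is a minimal NumPy sketch of one training iteration (Steps 3-8) with the momentum update; the layer sizes, learning rate and momentum value are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Minimal sketch of Steps 3-8 for a 1-hidden-layer BPN with momentum.
# Sizes, alpha and eta are illustrative assumptions.
n, p, m = 2, 3, 1          # input, hidden, output units
alpha, eta = 0.25, 0.9     # learning rate, momentum factor

rng = np.random.default_rng(0)
V, v0 = rng.uniform(-0.5, 0.5, (n, p)), np.zeros(p)    # input -> hidden
W, w0 = rng.uniform(-0.5, 0.5, (p, m)), np.zeros(m)    # hidden -> output
dV_prev, dW_prev = np.zeros_like(V), np.zeros_like(W)  # previous changes

f = lambda u: 1.0 / (1.0 + np.exp(-u))                 # binary sigmoid

def train_step(x, t):
    global V, v0, W, w0, dV_prev, dW_prev
    # Phase I: feedforward
    z = f(v0 + x @ V)
    y = f(w0 + z @ W)
    # Phase II: back-propagation of error; f'(u) = f(u)(1 - f(u))
    delta_k = (t - y) * y * (1 - y)
    delta_j = (delta_k @ W.T) * z * (1 - z)
    # Phase III: weight and bias updation with momentum
    dW = alpha * np.outer(z, delta_k) + eta * dW_prev
    dV = alpha * np.outer(x, delta_j) + eta * dV_prev
    W += dW; w0 += alpha * delta_k
    V += dV; v0 += alpha * delta_j
    dW_prev, dV_prev = dW, dV
    return y
```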
3.5.5.4 Generalization

The best network for generalization is the BPN. A network is said to be generalized when it sensibly interpolates with input patterns that are new to the network. When there are many trainable parameters for the given amount of training data, the network learns well but does not generalize well. This is usually called overfitting or overtraining. One solution to this problem is to monitor the error on the test set and terminate the training when the error increases. With a small number of trainable parameters, the network fails to learn the training data and performs very poorly on the test data. For improving the ability of the network to generalize from a training data set to a test data set, it is desirable to make small changes in the input space of a pattern without changing the output components. This is achieved by introducing variations in the input space of training patterns as part of the training set. However, computationally, this method is very expensive. Also, a net with a large number of nodes is capable of memorizing the training set at the cost of generalization. As a result, smaller nets are preferred over larger ones.

3.5.5.5 Number of Training Data

The training data should be sufficient and proper. There exists a rule of thumb which states that the training data should cover the entire expected input space, and while training, training-vector pairs should be selected randomly from the set. Assume the input space is linearly separable into "L" disjoint regions with their boundaries being part of hyperplanes. Let "T" be the lower bound on the number of training patterns. Then, choosing T such that T/L >> 1 will allow the network to discriminate pattern classes using fine piecewise hyperplane partitioning. Also, in some cases, scaling or normalization has to be done to help learning.

3.5.5.6 Number of Hidden Layer Nodes

If there exists more than one hidden layer in a BPN, the calculations performed for a single layer are repeated for all the layers and are summed up at the end. In the case of all multilayer feed-forward networks, the size of a hidden layer is very important. The number of hidden units required for an application needs to be determined separately. The size of a hidden layer is usually determined experimentally. For a network of a reasonable size, the size of hidden nodes is a relatively small fraction of the input layer. For example, if the network does not converge to a solution, it may need more hidden nodes. On the other hand, if the network converges, the user may try a very few hidden nodes and then settle finally on a size based on overall system performance.

3.5.6 Testing Algorithm of Back-Propagation Network

The testing procedure of the BPN is as follows:

Step 0: Initialize the weights. The weights are taken from the training algorithm.
Step 1: Perform Steps 2-4 for each input vector.
Step 2: Set the activation of input unit xi (i = 1 to n).
Step 3: Calculate the net input to hidden unit zj and its output. For j = 1 to p,

    zinj = v0j + Σ (i=1 to n) xi vij
    zj = f(zinj)

Step 4: Now compute the output of the output layer unit. For k = 1 to m,

    yink = w0k + Σ (j=1 to p) zj wjk
    yk = f(yink)

Use sigmoidal activation functions for calculating the output.

3.6 Radial Basis Function Network

3.6.1 Theory

The radial basis function (RBF) is a classification and functional approximation neural network developed by M.J.D. Powell. The network uses the most common nonlinearities such as sigmoidal and Gaussian kernel functions. The Gaussian functions are also used in regularization networks. The response of such a function is positive for all values of y; the response decreases to 0 as |y| → ∞. The Gaussian function is generally defined as

    f(y) = e^(-y²)

The derivative of this function is given by

    f'(y) = -2y e^(-y²) = -2y f(y)

The graphical representation of this Gaussian function is shown in Figure 3-11.

When the Gaussian potential functions are being used, each node is found to produce an identical output for inputs existing within the fixed radial distance from the center of the kernel; they are found to be radially symmetric, and hence the name radial basis function network. The entire network forms a linear combination of the nonlinear basis functions.

Figure 3-11 Gaussian kernel function.
3.6.2 Architecture

The architecture for the radial basis function network (RBFN) is shown in Figure 3-12. The architecture consists of two layers whose output nodes form a linear combination of the kernel (or basis) functions computed by means of the RBF nodes or hidden layer nodes. The basis function (nonlinearity) in the hidden layer produces a significant nonzero response to the input stimulus it has received only when the input falls within a small localized region of the input space. This network can also be called a localized receptive field network.

Figure 3-12 Architecture of RBF.

3.6.3 Flowchart for Training Process

The flowchart for the training process of the RBF is shown in Figure 3-13. In this case, the centers of the RBF functions have to be chosen and then, based on all parameters, the output of the network is calculated.

3.6.4 Training Algorithm

The training algorithm describes in detail all the calculations involved in the training process depicted in the flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply a supervised learning algorithm to the hidden and output layers for fine-tuning of the network. The training algorithm is given as follows.

Step 0: Set the weights to small random values.
Step 1: Perform Steps 2-8 when the stopping condition is false.
Step 2: Perform Steps 3-7 for each input.
Step 3: Each input unit (xi for all i = 1 to n) receives input signals and transmits them to the next hidden layer unit.

Figure 3-13 Flowchart for the training process of RBF.
Step 4: Calculate the radial basis function.
Step 5: Select the centers for the radial basis function. The centers are selected from the set of input vectors. It should be noted that a sufficient number of centers have to be selected to ensure adequate sampling of the input vector space.
Step 6: Calculate the output from the hidden layer unit:

    vi(xi) = exp[ - Σ (j=1 to r) (xji - x̂ji)² / σi² ]

where x̂ji is the center of the ith RBF unit for the input variables; σi the width of the ith RBF unit; xji the jth variable of the input pattern.
Step 7: Calculate the output of the neural network:

    ynm = Σ (i=1 to k) wim vi(xi) + w0

where k is the number of hidden layer nodes (RBF functions); ynm the output value of the mth node in the output layer for the nth incoming pattern; wim the weight between the ith RBF unit and the mth output node; w0 the biasing term at the nth output node.
Step 8: Calculate the error and test for the stopping condition. The stopping condition may be a number of epochs or a certain extent of weight change.

Thus, a network can be trained using the RBFN.
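As an illustration of Steps 6-7, the sketch below computes the output of a Gaussian-kernel RBF network whose centers were picked from the input vectors; the centers, width and weights are made-up values for demonstration, not values from the text.

```python
import numpy as np

# Sketch of an RBF forward pass (Steps 6-7). Centers are chosen from the
# training inputs; the width and output weights here are illustrative values.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # input vectors
centers = X.copy()                   # Step 5: centers taken from the inputs
sigma = 0.7                          # assumed common RBF width
w = np.array([0.2, -0.4, 0.9, 0.1])  # hidden -> output weights (assumed)
w0 = 0.05                            # output bias (assumed)

def rbf_output(x):
    # Step 6: Gaussian response of each hidden unit
    v = np.exp(-np.sum((x - centers) ** 2, axis=1) / sigma ** 2)
    # Step 7: linear combination of the basis responses
    return w @ v + w0

for x in X:
    print(x, "->", rbf_output(x))
```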

3.7 Time Delay Neural Network

The neural network has to respond to a sequence of patterns. Here the network is required to produce a particular output sequence in response to a particular sequence of inputs. A shift register can be considered as a tapped delay line. Consider a case of a multilayer perceptron where the tapped outputs of the delay line are applied to its inputs. This type of network constitutes a time delay neural network (TDNN). The output consists of a finite temporal dependence on its inputs, given as

    U(t) = F[x(t), x(t-1), ..., x(t-n)]

where F is any nonlinearity function. The multilayer perceptron with delay line is shown in Figure 3-14. When the function U(t) is a weighted sum, the TDNN is equivalent to a finite impulse response (FIR) filter. In a TDNN, when the output is fed back through a unit delay into the input layer, the net computed is equivalent to an infinite impulse response (IIR) filter. Figure 3-15 shows a TDNN with output feedback.

Figure 3-14 Time delay neural network (FIR filter).
Figure 3-15 TDNN with output feedback (IIR filter).

Thus, a neuron with a tapped delay line is called a TDNN unit, and a network which consists of TDNN units is called a TDNN. A specific application of TDNNs is speech recognition. The TDNN can be trained using the back-propagation learning rule with a momentum factor.

3.8 Functional Link Networks

These networks are specifically designed for handling linearly non-separable problems using an appropriate input representation. Thus, a suitable enhanced representation of the input data has to be found out. This can be achieved by increasing the dimensions of the input space. The expanded input data, rather than the actual input data, is then used for training. In this case, higher-order input terms are chosen so that they are linearly independent of the original pattern components. Thus, the input representation has been enhanced and linear separability can be achieved in the extended space. One of the functional link model networks is shown in Figure 3-16. This model is helpful for learning continuous functions. For this model, the higher-order input terms are obtained using orthogonal basis functions such as sin(πx), cos(πx), sin(2πx), cos(2πx), etc.

Figure 3-16 Functional link network model.

The most common example of linear nonseparability is the XOR problem. The functional link networks help in solving this problem. The inputs now are

    x1   x2   x1x2    t
     1    1     1    -1
     1   -1    -1     1
    -1    1    -1     1
    -1   -1     1    -1

Figure 3-17 The XOR problem.

Thus, it can be easily seen that the functional link network shown in Figure 3-17 can be used for solving this problem. The functional link network consists of only one layer; therefore, it can be trained using the delta learning rule instead of the generalized delta learning rule used in BPN. As a result, the learning speed of the functional link network is faster than that of the BPN. (A short sketch of this input expansion is given below.)
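The following minimal sketch, assuming a simple delta-rule (LMS) unit and the product term x1·x2 as the enhanced input, shows that the expanded XOR data becomes linearly separable; the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

# Functional link expansion for XOR: add the product term x1*x2.
# With this extra dimension the bipolar XOR data is linearly separable,
# so a single-layer delta-rule unit suffices. Parameters are arbitrary.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([-1., 1., 1., -1.])
X_expanded = np.column_stack([X, X[:, 0] * X[:, 1]])  # [x1, x2, x1*x2]

w, b, alpha = np.zeros(3), 0.0, 0.1
for _ in range(50):                       # delta-rule (LMS) training
    for x, target in zip(X_expanded, t):
        y_in = b + x @ w
        w += alpha * (target - y_in) * x
        b += alpha * (target - y_in)

print(np.sign(X_expanded @ w + b))        # expected: [-1, 1, 1, -1]
```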
3.9 Tree Neural Networks

The tree neural networks (TNNs) are used for the pattern recognition problem. The main concept of this network is to use a small multilayer neural network at each decision-making node of a binary classification tree for extracting the non-linear features. TNNs completely exploit the power of tree classifiers by using appropriate local features at the different levels and nodes of the tree. A binary classification tree is shown in Figure 3-18.

Figure 3-18 Binary classification tree.

The decision nodes are present as circular nodes and the terminal nodes are present as square nodes. The terminal node has a class label denoted by C associated with it. The rule base is formed in the decision nodes (a splitting rule in the form f(x) < θ). The rule determines whether the pattern moves to the right or to the left. Here, f(x) indicates the associated feature of the pattern and "θ" is the threshold. The pattern will be given the class label of the terminal node on which it has landed. The classification here is based on the fact that the appropriate features can be selected at different nodes and levels in the tree. The output feature y = f(x) obtained by a multilayer network at a particular decision node is used in the following way:

    x directed to left child node tL, if y < 0
    x directed to right child node tR, if y ≥ 0

The algorithm for a TNN consists of two phases:

1. Tree growing phase: In this phase, a large tree is grown by recursively finding the rules for splitting until all the terminal nodes have pure or nearly pure class membership, else it cannot split further.
2. Tree pruning phase: Here a smaller tree is selected from the pruned subtrees to avoid the overfitting of data.

The training of a TNN involves two nested optimization problems. In the inner optimization problem, the BPN algorithm can be used to train the network for a given pair of classes. On the other hand, in the outer optimization problem, a heuristic search method is used to find a good pair of classes. The TNN, when tested on a character recognition problem, decreases the error rate and the size of the tree relative to that of the standard classification tree design methods. The TNN can be implemented for the waveform recognition problem. It obtains comparable error rates and the training here is faster than the large BPN for the same application. Also, TNN provides a structured approach to neural network classifier design problems.

3.10 Wavelet Neural Networks

The wavelet neural network (WNN) is based on the wavelet transform theory. This network helps in approximating arbitrary nonlinear functions. The powerful tool for function approximation is wavelet decomposition.

Let f(x) be a piecewise continuous function. This function can be decomposed into a family of functions, which is obtained by dilating and translating a single wavelet function φ: R^n → R as

    f(x) = Σ (i=1 to N) wi det[Di^(1/2)] φ[Di(x - ti)]

where Di = diag(di), with di ∈ R+ the dilation vectors; ti the translation vectors; det[·] the determinant operator. The wavelet function φ selected should satisfy some properties. For selecting φ: R^n → R, the condition may be

    φ(x) = φ1(x1) φ1(x2) ··· φ1(xn)   for x = (x1, x2, ..., xn)
where

    φ1(x) = -x exp(-x²/2)

is called a scalar wavelet. The network structure can be formed based on the wavelet decomposition as

    y(x) = Σ (i=1 to N) wi φ[Di(x - ti)] + ȳ

where ȳ helps to deal with nonzero mean functions on finite domains. For proper dilation, a rotation can be made for better network operation:

    y(x) = Σ (i=1 to N) wi φ[Di Ri(x - ti)] + ȳ

where Ri are the rotation matrices. The network which performs according to the above equation is called a wavelet neural network. This is a combination of translation, rotation and dilation; if a wavelet is lying on the same line, then it is called a wavelon, in comparison to the neurons in neural networks. The wavelet neural network is shown in Figure 3-19.

Figure 3-19 Wavelet neural network.

3.11 Summary

In this chapter we have discussed the supervised learning networks. In most of the classification and recognition problems, the widely used networks are the supervised learning networks. The architecture, the learning rule, the flowchart for the training process and the training algorithm are discussed in detail for the perceptron network, Adaline, Madaline, back-propagation network and radial basis function network. The perceptron network can be trained for single output classes as well as multi-output classes. Also, many Adaline networks combine together to form a Madaline network. These networks are trained using the delta learning rule. The back-propagation network is the most commonly used network in real-time applications. The error is back-propagated here and is fine-tuned for achieving better performance. The basic difference between the back-propagation network and the radial basis function network is the activation function used. The radial basis function network mostly uses the Gaussian activation function. Apart from these networks, some special supervised learning networks such as time delay neural networks, functional link networks, tree neural networks and wavelet neural networks have also been discussed.

3.12 Solved Problems

1. Implement AND function using perceptron network for bipolar inputs and targets.

Solution: Table 1 shows the truth table for AND function with bipolar inputs and targets.

Table 1
    x1   x2    t
     1    1    1
     1   -1   -1
    -1    1   -1
    -1   -1   -1

The perceptron network, which uses the perceptron learning rule, is used to train the AND function. The network architecture is as shown in Figure 1. The input patterns are presented to the network one by one. When all the four input patterns are presented, one epoch is said to be completed. The initial weights and threshold are set to zero, i.e., w1 = w2 = b = 0 and θ = 0. The learning rate α is set equal to 1.

Figure 1 Perceptron network for AND function.

For the first input pattern, x1 = 1, x2 = 1 and t = 1, with weights and bias w1 = 0, w2 = 0 and b = 0:

Calculate the net input

    yin = b + x1w1 + x2w2 = 0 + 1 × 0 + 1 × 0 = 0

The output y is computed by applying the activation function over the net input calculated:

    y = f(yin) = { 1 if yin > θ;  0 if -θ ≤ yin ≤ θ;  -1 if yin < -θ }

Here we have taken θ = 0. Hence, when yin = 0, y = 0.

Check whether t = y. Here, t = 1 and y = 0, so t ≠ y; hence weight updation takes place:

    wi(new) = wi(old) + α t xi
    w1(new) = w1(old) + α t x1 = 0 + 1 × 1 × 1 = 1
    w2(new) = w2(old) + α t x2 = 0 + 1 × 1 × 1 = 1
    b(new) = b(old) + α t = 0 + 1 × 1 = 1

Here, the changes in weights are

    Δw1 = α t x1;   Δw2 = α t x2;   Δb = α t

The weights w1 = 1, w2 = 1, b = 1 are the final weights after the first input pattern is presented. The same process is repeated for all the input patterns. The process can be stopped when all the targets become equal to the calculated output or when a separating line is obtained using the final weights for separating the positive responses from negative responses. Table 2 shows the training of the perceptron network until its target and calculated output converge for all the patterns. (A programmatic sketch of this training loop is given below.)
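Before tabulating the remaining iterations, here is a small sketch of the perceptron training loop used in these solved problems; it reproduces the update wi(new) = wi(old) + α·t·xi whenever t ≠ y. The epoch cap is an arbitrary safeguard, not part of the original procedure.

```python
import numpy as np

# Perceptron training rule used in Problems 1-5 (bipolar activation with
# threshold theta). The epoch cap is an arbitrary safeguard.
def perceptron_train(X, T, alpha=1.0, theta=0.0, max_epochs=10):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(max_epochs):
        changed = False
        for x, t in zip(X, T):
            y_in = b + x @ w
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:                    # update only on error
                w += alpha * t * x
                b += alpha * t
                changed = True
        if not changed:                   # converged: one full clean epoch
            break
    return w, b

# AND function with bipolar inputs and targets (Problem 1)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
T = np.array([1, -1, -1, -1])
print(perceptron_train(X, T))             # expected: w = [1, 1], b = -1
```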
Table 2

                            Calculated     Weight changes      Weights
 Input       Target  Net input  output                       (w1  w2   b)
 x1  x2  1     (t)     (yin)      (y)     Δw1  Δw2  Δb       (0   0   0)
EPOCH-1
  1   1  1      1        0         0        1    1    1       1    1    1
  1  -1  1     -1        1         1       -1    1   -1       0    2    0
 -1   1  1     -1        2         1        1   -1   -1       1    1   -1
 -1  -1  1     -1       -3        -1        0    0    0       1    1   -1
EPOCH-2
  1   1  1      1        1         1        0    0    0       1    1   -1
  1  -1  1     -1       -1        -1        0    0    0       1    1   -1
 -1   1  1     -1       -1        -1        0    0    0       1    1   -1
 -1  -1  1     -1       -3        -1        0    0    0       1    1   -1

The final weights and bias after the second epoch are

    w1 = 1, w2 = 1, b = -1

Since the threshold for the problem is zero, the equation of the separating line is

    x2 = -(w1/w2) x1 - (b/w2)

Thus, using the final weights we obtain

    x2 = -(1/1) x1 - (-1/1) = -x1 + 1

It can be easily found that the above straight line separates the positive response region from the negative response region, as shown in Figure 2.

Figure 2 Decision boundary for AND function in perceptron training (θ = 0).

The same methodology can be applied for implementing other logic functions such as OR, ANDNOT, NAND, etc. If there exists a threshold value θ ≠ 0, then two separating lines have to be obtained, i.e., one to separate the positive response from zero and the other for separating zero from the negative response.

2. Implement OR function with binary inputs and bipolar targets using perceptron training algorithm up to 3 epochs.

Solution: The truth table for OR function with binary inputs and bipolar targets is shown in Table 3.

Table 3
    x1   x2    t
     1    1    1
     1    0    1
     0    1    1
     0    0   -1

The perceptron network, which uses the perceptron learning rule, is used to train the OR function. The network architecture is shown in Figure 3. The initial values of the weights and bias are taken as zero, i.e.,

    w1 = w2 = b = 0

Also, the learning rate is 1 and the threshold is 0.2. So the activation function becomes

    y = f(yin) = { 1 if yin > 0.2;  0 if -0.2 ≤ yin ≤ 0.2;  -1 if yin < -0.2 }

Figure 3 Perceptron network for OR function.

The network is trained as per the perceptron training algorithm and the steps are as in Problem 1 (given for the first pattern). Table 4 gives the network training for 3 epochs.

Table 4

                            Calculated     Weight changes      Weights
 Input       Target  Net input  output                       (w1  w2   b)
 x1  x2  1     (t)     (yin)      (y)     Δw1  Δw2  Δb       (0   0   0)
EPOCH-1
  1   1  1      1        0         0        1    1    1       1    1    1
  1   0  1      1        2         1        0    0    0       1    1    1
  0   1  1      1        2         1        0    0    0       1    1    1
  0   0  1     -1        1         1        0    0   -1       1    1    0
EPOCH-2
  1   1  1      1        2         1        0    0    0       1    1    0
  1   0  1      1        1         1        0    0    0       1    1    0
  0   1  1      1        1         1        0    0    0       1    1    0
  0   0  1     -1        0         0        0    0   -1       1    1   -1
EPOCH-3
  1   1  1      1        1         1        0    0    0       1    1   -1
  1   0  1      1        0         0        1    0    1       2    1    0
  0   1  1      1        1         1        0    0    0       2    1    0
  0   0  1     -1        0         0        0    0   -1       2    1   -1

The final weights at the end of the third epoch are

    w1 = 2, w2 = 1, b = -1

Further epochs have to be done for the convergence of the network.

3. Find the weights using perceptron network for ANDNOT function when all the inputs are presented only one time. Use bipolar inputs and targets.

Solution: The truth table for ANDNOT function is shown in Table 5.

Table 5
    x1   x2    t
     1    1   -1
     1   -1    1
    -1    1   -1
    -1   -1   -1

The network architecture of ANDNOT function is shown in Figure 4. Let the initial weights be zero and α = 1, θ = 0. For the first input sample, we compute the net input as

    yin = b + Σ (i=1 to 2) xi wi = b + x1w1 + x2w2 = 0 + 1 × 0 + 1 × 0 = 0

Figure 4 Network for ANDNOT function.
Applying the activation function over the net input, we obtain

    y = f(yin) = { 1 if yin > 0;  0 if yin = 0;  -1 if yin < 0 }

Hence, the output y = f(yin) = 0. Since t ≠ y, the new weights are computed as

    w1(new) = w1(old) + α t x1 = 0 + 1 × (-1) × 1 = -1
    w2(new) = w2(old) + α t x2 = 0 + 1 × (-1) × 1 = -1
    b(new) = b(old) + α t = 0 + 1 × (-1) = -1

The weights after presenting the first sample are w = [-1 -1 -1].

For the second input sample, we calculate the net input as

    yin = b + x1w1 + x2w2 = -1 + 1 × (-1) + (-1) × (-1) = -1 - 1 + 1 = -1

The output y = f(yin) = -1 is obtained by applying the activation function. Since t ≠ y, the new weights are calculated as

    w1(new) = w1(old) + α t x1 = -1 + 1 × 1 × 1 = 0
    w2(new) = w2(old) + α t x2 = -1 + 1 × 1 × (-1) = -2
    b(new) = b(old) + α t = -1 + 1 × 1 = 0

The weights after presenting the second sample are w = [0 -2 0].

For the third input sample, x1 = -1, x2 = 1, t = -1, the net input is calculated as

    yin = b + Σ xi wi = b + x1w1 + x2w2 = 0 + (-1) × 0 + 1 × (-2) = -2

The output is obtained as y = f(yin) = -1. Since t = y, no weight changes take place. Thus, even after presenting the third input sample, the weights are w = [0 -2 0].

For the fourth input sample, x1 = -1, x2 = -1, t = -1, the net input is calculated as

    yin = b + x1w1 + x2w2 = 0 + (-1) × 0 + (-1) × (-2) = 0 + 0 + 2 = 2

The output is obtained as y = f(yin) = 1. Since t ≠ y, the new weights on updating are given as

    w1(new) = w1(old) + α t x1 = 0 + 1 × (-1) × (-1) = 1
    w2(new) = w2(old) + α t x2 = -2 + 1 × (-1) × (-1) = -1
    b(new) = b(old) + α t = 0 + 1 × (-1) = -1

The weights after presenting the fourth input sample are w = [1 -1 -1]. One epoch of training for ANDNOT function using perceptron network is tabulated in Table 6.

Table 6

                            Calculated     Weights
 Input       Target  Net input  output   (w1  w2   b)
 x1  x2  1     (t)     (yin)      (y)    (0   0   0)
  1   1  1     -1        0         0     -1   -1   -1
  1  -1  1      1       -1        -1      0   -2    0
 -1   1  1     -1       -2        -1      0   -2    0
 -1  -1  1     -1        2         1      1   -1   -1

4. Find the weights required to perform the following classification using perceptron network. The vectors (1, 1, 1, 1) and (-1, 1, -1, -1) are belonging to the class (so have target value 1); the vectors (1, 1, 1, -1) and (1, -1, -1, 1) are not belonging to the class (so have target value -1). Assume learning rate as 1 and initial weights as 0.

Solution: The truth table for the given vectors is given in Table 7.

Table 7
    x1   x2   x3   x4    b   Target (t)
     1    1    1    1    1        1
    -1    1   -1   -1    1        1
     1    1    1   -1    1       -1
     1   -1   -1    1    1       -1

Let w1 = w2 = w3 = w4 = b = 0 and the learning rate α = 1. Since the threshold θ = 0.2, the activation function is

    y = { 1 if yin > 0.2;  0 if -0.2 ≤ yin ≤ 0.2;  -1 if yin < -0.2 }

The net input is given by

    yin = b + x1w1 + x2w2 + x3w3 + x4w4

The training is performed and the weights are tabulated in Table 8.

Table 8

                                                Weight changes              Weights
 Inputs               Target  Net input  Output                     (w1  w2  w3  w4   b)
 (x1  x2  x3  x4  b)    (t)     (yin)      (y)  (Δw1 Δw2 Δw3 Δw4 Δb) (0   0   0   0   0)
EPOCH-1
 ( 1   1   1   1  1)     1        0         0     1   1   1   1   1    1   1   1   1   1
 (-1   1  -1  -1  1)     1       -1        -1    -1   1  -1  -1   1    0   2   0   0   2
 ( 1   1   1  -1  1)    -1        4         1    -1  -1  -1   1  -1   -1   1  -1   1   1
 ( 1  -1  -1   1  1)    -1        1         1    -1   1   1  -1  -1   -2   2   0   0   0
EPOCH-2
 ( 1   1   1   1  1)     1        0         0     1   1   1   1   1   -1   3   1   1   1
 (-1   1  -1  -1  1)     1        3         1     0   0   0   0   0   -1   3   1   1   1
 ( 1   1   1  -1  1)    -1        3         1    -1  -1  -1   1  -1   -2   2   0   2   0
 ( 1  -1  -1   1  1)    -1       -2        -1     0   0   0   0   0   -2   2   0   2   0
EPOCH-3
 ( 1   1   1   1  1)     1        2         1     0   0   0   0   0   -2   2   0   2   0
 (-1   1  -1  -1  1)     1        2         1     0   0   0   0   0   -2   2   0   2   0
 ( 1   1   1  -1  1)    -1       -2        -1     0   0   0   0   0   -2   2   0   2   0
 ( 1  -1  -1   1  1)    -1       -2        -1     0   0   0   0   0   -2   2   0   2   0

Thus, in the third epoch, all the calculated outputs become equal to the targets and the network has converged. The network convergence can also be checked by forming separating line equations for separating positive response regions from zero and zero from negative response regions. The network architecture is shown in Figure 5.

Figure 5 Network architecture.

5. Classify the two-dimensional input pattern shown in Figure 6 using perceptron network. The symbol "*" indicates the data representation to be +1 and "•" indicates data to be -1. The patterns are I-F. For pattern I, the target is +1, and for F, the target is -1.

Figure 6 I-F data representation.
Solution: The training patterns for this problem are tabulated in Table 9.

Table 9
 Pattern   x1  x2  x3  x4  x5  x6  x7  x8  x9   1   Target (t)
    I       1   1   1  -1   1  -1   1   1   1   1        1
    F       1   1   1   1   1   1   1  -1  -1   1       -1

The initial weights are all assumed to be zero, i.e., θ = 0 and α = 1. The activation function is given by

    y = { 1 if yin > 0;  0 if yin = 0;  -1 if yin < 0 }

For the first input sample, x1 = [1 1 1 -1 1 -1 1 1 1], t = 1, the net input is calculated as

    yin = b + Σ (i=1 to 9) xi wi
        = b + x1w1 + x2w2 + x3w3 + x4w4 + x5w5 + x6w6 + x7w7 + x8w8 + x9w9
        = 0 + 1×0 + 1×0 + 1×0 + (-1)×0 + 1×0 + (-1)×0 + 1×0 + 1×0 + 1×0 = 0

Therefore, by applying the activation function, the output is given by y = f(yin) = 0. Now since t ≠ y, the new weights are computed as

    w1(new) = w1(old) + α t x1 = 0 + 1 × 1 × 1 = 1
    w2(new) = 0 + 1 × 1 × 1 = 1
    w3(new) = 0 + 1 × 1 × 1 = 1
    w4(new) = 0 + 1 × 1 × (-1) = -1
    w5(new) = 0 + 1 × 1 × 1 = 1
    w6(new) = 0 + 1 × 1 × (-1) = -1
    w7(new) = 0 + 1 × 1 × 1 = 1
    w8(new) = 0 + 1 × 1 × 1 = 1
    w9(new) = 0 + 1 × 1 × 1 = 1
    b(new) = b(old) + α t = 0 + 1 × 1 = 1

The weights after presenting the first input sample are

    w = [1 1 1 -1 1 -1 1 1 1 1]

For the second input sample, x2 = [1 1 1 1 1 1 1 -1 -1], t = -1, the net input is calculated as

    yin = b + Σ (i=1 to 9) xi wi
        = 1 + 1×1 + 1×1 + 1×1 + 1×(-1) + 1×1 + 1×(-1) + 1×1 + (-1)×1 + (-1)×1 = 2

Therefore the output is given by y = f(yin) = 1. Since t ≠ y, the new weights are

    w1(new) = w1(old) + α t x1 = 1 + 1 × (-1) × 1 = 0
    w2(new) = 1 + 1 × (-1) × 1 = 0
    w3(new) = 1 + 1 × (-1) × 1 = 0
    w4(new) = -1 + 1 × (-1) × 1 = -2
    w5(new) = 1 + 1 × (-1) × 1 = 0
    w6(new) = -1 + 1 × (-1) × 1 = -2
    w7(new) = 1 + 1 × (-1) × 1 = 0
    w8(new) = 1 + 1 × (-1) × (-1) = 2
    w9(new) = 1 + 1 × (-1) × (-1) = 2
    b(new) = b(old) + α t = 1 + 1 × (-1) = 0

The weights after presenting the second input sample are

    w = [0 0 0 -2 0 -2 0 2 2 0]

The network architecture is as shown in Figure 7. The network can be further trained for its convergence.

Figure 7 Network architecture.

6. Implement OR function with bipolar inputs and targets using Adaline network.

Solution: The truth table for OR function with bipolar inputs and targets is shown in Table 10.

Table 10
    x1   x2    t
     1    1    1
     1   -1    1
    -1    1    1
    -1   -1   -1

Initially all the weights and the bias are assumed to be small random values, say 0.1, and the learning rate is also set to 0.1. A threshold on the least mean square error may also be set. The weights are calculated until the least mean square error is obtained.

The initial weights are taken to be w1 = w2 = b = 0.1 and the learning rate α = 0.1. For the first input sample, x1 = 1, x2 = 1, t = 1, we calculate the net input as

    yin = b + Σ (i=1 to 2) xi wi = b + x1w1 + x2w2 = 0.1 + 1 × 0.1 + 1 × 0.1 = 0.3

Now compute (t - yin) = (1 - 0.3) = 0.7. Updating the weights we obtain

    wi(new) = wi(old) + α(t - yin)xi

where α(t - yin)xi is called the weight change Δwi. The new weights are obtained as

    w1(new) = w1(old) + Δw1 = 0.1 + 0.1 × 0.7 × 1 = 0.1 + 0.07 = 0.17
    w2(new) = w2(old) + Δw2 = 0.1 + 0.1 × 0.7 × 1 = 0.17
    b(new) = b(old) + Δb = 0.1 + 0.1 × 0.7 = 0.17

where

    Δw1 = α(t - yin)x1;   Δw2 = α(t - yin)x2;   Δb = α(t - yin)

Now we calculate the error:

    E = (t - yin)² = (0.7)² = 0.49

The final weights after presenting the first input sample are w = [0.17 0.17 0.17] and error E = 0.49. (The sketch below repeats this update over all samples and epochs.)
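Here is a minimal sketch of the Adaline (delta-rule/LMS) training loop just described; the initial values follow the worked example, while the epoch count is an arbitrary choice. Running it should reproduce, approximately, the per-epoch errors tabulated in Table 12 below (3.02, 1.938, ...).

```python
import numpy as np

# Adaline (LMS) training for bipolar OR, following the worked example:
# all weights start at 0.1, alpha = 0.1. Epoch count is arbitrary.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
T = np.array([1., 1., 1., -1.])
w, b, alpha = np.array([0.1, 0.1]), 0.1, 0.1

for epoch in range(5):
    total_error = 0.0
    for x, t in zip(X, T):
        y_in = b + x @ w
        w += alpha * (t - y_in) * x        # delta rule on the *net* input
        b += alpha * (t - y_in)
        total_error += (t - y_in) ** 2
    print(f"epoch {epoch + 1}: total mean square error = {total_error:.4f}")
```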
These calculations are performed for all the input samples and the error is calculated. One epoch is completed when all the input patterns are presented. Summing up all the errors obtained for each input sample during one epoch gives the total mean square error of that epoch. The network training is continued until this error is minimized to a very small value.

Adopting the method above, the network training is done for OR function using Adaline network and is tabulated below in Table 11 for α = 0.1.

Table 11

                      Net                     Weight changes              Weights
 Inputs     Target   input                                          (w1     w2      b)    Error
 x1  x2  1    t       yin    (t - yin)    Δw1     Δw2     Δb       (0.1    0.1    0.1)  (t - yin)²
EPOCH-1
  1   1  1    1     0.3       0.7        0.07    0.07    0.07     0.17   0.17   0.17     0.49
  1  -1  1    1     0.17      0.83       0.083  -0.083   0.083    0.253  0.087  0.253    0.69
 -1   1  1    1     0.087     0.913     -0.0913  0.0913  0.0913   0.1617 0.1783 0.3443   0.83
 -1  -1  1   -1     0.0043   -1.0043     0.1004  0.1004 -0.1004   0.2621 0.2787 0.2439   1.01
EPOCH-2
  1   1  1    1     0.7847    0.2153     0.0215  0.0215  0.0215   0.2837 0.3003 0.2654   0.046
  1  -1  1    1     0.2488    0.7512     0.0751 -0.0751  0.0751   0.3588 0.2251 0.3405   0.564
 -1   1  1    1     0.2069    0.7931    -0.0793  0.0793  0.0793   0.2795 0.3044 0.4198   0.629
 -1  -1  1   -1    -0.1641   -0.8359     0.0836  0.0836 -0.0836   0.3631 0.3880 0.3362   0.699
EPOCH-3
  1   1  1    1     1.0873   -0.0873    -0.0087 -0.0087 -0.0087   0.3543 0.3793 0.3275   0.0076
  1  -1  1    1     0.3025    0.6975     0.0697 -0.0697  0.0697   0.4241 0.3096 0.3973   0.487
 -1   1  1    1     0.2827    0.7173    -0.0717  0.0717  0.0717   0.3523 0.3813 0.4690   0.515
 -1  -1  1   -1    -0.2647   -0.7353     0.0735  0.0735 -0.0735   0.4259 0.4548 0.3954   0.541
EPOCH-4
  1   1  1    1     1.2761   -0.2761    -0.0276 -0.0276 -0.0276   0.3983 0.4272 0.3678   0.076
  1  -1  1    1     0.3389    0.6611     0.0661 -0.0661  0.0661   0.4644 0.3611 0.4339   0.437
 -1   1  1    1     0.3307    0.6693    -0.0669  0.0669  0.0669   0.3974 0.4280 0.5009   0.448
 -1  -1  1   -1    -0.3246   -0.6754     0.0675  0.0675 -0.0675   0.4650 0.4956 0.4333   0.456
EPOCH-5
  1   1  1    1     1.3939   -0.3939    -0.0394 -0.0394 -0.0394   0.4256 0.4562 0.3939   0.155
  1  -1  1    1     0.3634    0.6366     0.0637 -0.0637  0.0637   0.4893 0.3925 0.4576   0.405
 -1   1  1    1     0.3609    0.6391    -0.0639  0.0639  0.0639   0.4254 0.4564 0.5215   0.408
 -1  -1  1   -1    -0.3603   -0.6397     0.0640  0.0640 -0.0640   0.4894 0.5204 0.4575   0.409

The total mean square error after each epoch is given in Table 12.

Table 12
 Epoch     Total mean square error
 Epoch 1          3.02
 Epoch 2          1.938
 Epoch 3          1.5506
 Epoch 4          1.417
 Epoch 5          1.377

Thus from Table 12, it can be noticed that as training goes on, the error value gets minimized. Hence, further training can be continued for further minimization of error. The network architecture of Adaline network for OR function is shown in Figure 8.

Figure 8 Network architecture of Adaline for OR function (final weights: w1 = 0.4893, w2 = 0.5204, b = 0.4575).

7. Use Adaline network to train ANDNOT function with bipolar inputs and targets. Perform 2 epochs of training.

Solution: The truth table for ANDNOT function with bipolar inputs and targets is shown in Table 13.

Table 13
    x1   x2    t
     1    1   -1
     1   -1    1
    -1    1   -1
    -1   -1   -1

Initially the weights and bias have assumed a random value, say 0.2. The learning rate is also set to 0.2. The weights are calculated until the least mean square error is obtained. The initial weights are w1 = w2 = b = 0.2, and α = 0.2. For the first input sample, x1 = 1, x2 = 1, t = -1, we calculate the net input as

    yin = b + x1w1 + x2w2 = 0.2 + 1 × 0.2 + 1 × 0.2 = 0.6

Now compute (t - yin) = (-1 - 0.6) = -1.6. Updating the weights we obtain

    wi(new) = wi(old) + α(t - yin)xi

The new weights are obtained as

    w1(new) = w1(old) + α(t - yin)x1 = 0.2 + 0.2 × (-1.6) × 1 = -0.12
    w2(new) = w2(old) + α(t - yin)x2 = 0.2 + 0.2 × (-1.6) × 1 = -0.12
    b(new) = b(old) + α(t - yin) = 0.2 + 0.2 × (-1.6) = -0.12

Now we compute the error:

    E = (t - yin)² = (-1.6)² = 2.56

The final weights after presenting the first input sample are w = [-0.12 -0.12 -0.12] and error E = 2.56.

The operational steps are carried out for 2 epochs of training and the network performance is noted. It is tabulated as shown in Table 14.

Table 14

                      Net                     Weight changes            Weights
 Inputs     Target   input                                         (w1     w2     b)   Error
 x1  x2  1    t       yin    (t - yin)    Δw1    Δw2    Δb        (0.2   0.2   0.2) (t - yin)²
EPOCH-1
  1   1  1   -1     0.60     -1.60      -0.32  -0.32  -0.32     -0.12  -0.12  -0.12    2.56
  1  -1  1    1    -0.12      1.12       0.22  -0.22   0.22      0.10  -0.34   0.10    1.25
 -1   1  1   -1    -0.34     -0.66       0.13  -0.13  -0.13      0.24  -0.48  -0.03    0.43
 -1  -1  1   -1     0.21     -1.21       0.24   0.24  -0.24      0.48  -0.23  -0.27    1.47
EPOCH-2
  1   1  1   -1    -0.02     -0.98      -0.195 -0.195 -0.195     0.28  -0.43  -0.46    0.95
  1  -1  1    1     0.25      0.76       0.15  -0.15   0.15      0.43  -0.58  -0.31    0.57
 -1   1  1   -1    -1.33      0.33      -0.065  0.065  0.065     0.37  -0.51  -0.25    0.106
 -1  -1  1   -1    -0.11     -0.90       0.18   0.18  -0.18      0.55  -0.33  -0.43    0.81

The total mean square error at the end of two epochs is the summation of the errors of all input samples, as shown in Table 15.

Table 15
 Epoch     Total mean square error
 Epoch 1          5.71
 Epoch 2          2.43

Hence from Table 15, it is clearly understood that the mean square error decreases as training progresses. Also, it can be noted that, if training is continued, by about the sixth epoch the error becomes approximately equal to 1. The network architecture for ANDNOT function using Adaline network is shown in Figure 9.

Figure 9 Network architecture for ANDNOT function using Adaline network.

8. Using Madaline network, implement XOR function with bipolar inputs and targets. Assume the required parameters for training of the network.

Solution: The training pattern for XOR function is given in Table 16.

Table 16
    x1   x2    t
     1    1   -1
     1   -1    1
    -1    1    1
    -1   -1   -1

The Madaline Rule I (MRI) algorithm, in which the weights between the hidden layer and output layer remain fixed, is used for training the network. Initializing the weights to small random values, the network architecture is as shown in Figure 10, with initial weights [w11 w21 b1] = [0.05 0.2 0.3], [w12 w22 b2] = [0.1 0.2 0.15] and [v1 v2 b3] = [0.5 0.5 0.5].

Figure 10 Network architecture of Madaline for XOR function (initial weights given).

For the first input sample, x1 = 1, x2 = 1, target t = -1, and learning rate α equal to 0.5:

Calculate the net input to the hidden units:

    zin1 = b1 + x1w11 + x2w21 = 0.3 + 1 × 0.05 + 1 × 0.2 = 0.55
    zin2 = b2 + x1w12 + x2w22 = 0.15 + 1 × 0.1 + 1 × 0.2 = 0.45

Calculate the outputs z1, z2 by applying the activation over the net inputs computed. The activation function is given by

    f(zin) = { 1 if zin ≥ 0;  -1 if zin < 0 }

Hence,

    z1 = f(zin1) = f(0.55) = 1
    z2 = f(zin2) = f(0.45) = 1

After computing the output of the hidden units, find the net input entering the output unit:

    yin = b3 + z1v1 + z2v2 = 0.5 + 1 × 0.5 + 1 × 0.5 = 1.5

Apply the activation function over the net input yin to calculate the output y:

    y = f(yin) = f(1.5) = 1

Since t ≠ y, weight updation has to be performed. Also, since t = -1, the weights are updated on z1 and z2, which have positive net input. Since here both net inputs zin1 and zin2 are positive, updating the weights and bias on both hidden units, we obtain

    wij(new) = wij(old) + α(t - zinj)xi
    bj(new) = bj(old) + α(t - zinj)

This implies:

    w11(new) = w11(old) + α(t - zin1)x1 = 0.05 + 0.5(-1 - 0.55) × 1 = -0.725
    w21(new) = w21(old) + α(t - zin1)x2 = 0.2 + 0.5(-1 - 0.55) × 1 = -0.575
    b1(new) = b1(old) + α(t - zin1) = 0.3 + 0.5(-1 - 0.55) = -0.475
    w12(new) = w12(old) + α(t - zin2)x1 = 0.1 + 0.5(-1 - 0.45) × 1 = -0.625
    w22(new) = w22(old) + α(t - zin2)x2 = 0.2 + 0.5(-1 - 0.45) × 1 = -0.525
    b2(new) = b2(old) + α(t - zin2) = 0.15 + 0.5(-1 - 0.45) = -0.575

All the weights and bias between the input layer and hidden layer are adjusted. This completes the training for the first sample. The same process is repeated until the weights converge; when t = 1 and y = -1, MRI adjusts only the hidden unit whose net input is closest to zero, moving it toward +1. It is found that the weights converge at the end of 3 epochs. Table 17 shows the training performance of Madaline network for XOR function.

Table 17

 Inputs  Target      Net inputs       Outputs                       Weights
 x1  x2    t       zin1     zin2     z1  z2  yin   y    w11     w21     b1      w12     w22     b2
EPOCH-1
  1   1   -1      0.55     0.45      1   1   1.5   1  -0.725  -0.575  -0.475  -0.625  -0.525  -0.575
  1  -1    1     -0.625   -0.675    -1  -1  -0.5  -1   0.0875 -1.3875  0.3375 -0.625  -0.525  -0.575
 -1   1    1     -1.1375  -0.475    -1  -1  -0.5  -1   0.0875 -1.3875  0.3375 -1.3625  0.2125  0.1625
 -1  -1   -1      1.6375   1.3125    1   1   1.5   1   1.4065 -0.069  -0.98   -0.207   1.369  -0.994
EPOCH-2
  1   1   -1      0.3565   0.168     1   1   1.5   1   0.7285 -0.75   -1.66   -0.791   0.785  -1.58
  1  -1    1     -0.1845  -3.154    -1  -1  -0.5  -1   1.3205 -1.34   -1.068  -0.791   0.785  -1.58
 -1   1    1     -3.728   -0.002    -1  -1  -0.5  -1   1.3205 -1.34   -1.068  -1.29    1.29   -1.08
 -1  -1   -1     -1.0495  -1.071    -1  -1  -0.5  -1   1.3205 -1.34   -1.068  -1.29    1.29   -1.08
EPOCH-3
  1   1   -1     -1.0865  -1.083    -1  -1  -0.5  -1   1.32   -1.34   -1.07   -1.29    1.29   -1.08
  1  -1    1      1.5915  -3.655     1  -1   0.5   1   1.32   -1.34   -1.07   -1.29    1.29   -1.08
 -1   1    1     -3.728    1.501    -1   1   0.5   1   1.32   -1.34   -1.07   -1.29    1.29   -1.08
 -1  -1   -1     -1.0495  -1.071    -1  -1  -0.5  -1   1.32   -1.34   -1.07   -1.29    1.29   -1.08

The network architecture for Madaline network with final weights for XOR function is shown in Figure 11.

Figure 11 Madaline network for XOR function (final weights given).
-I 3.12 Solved Problems 93
I Compute rhe final weights of the network:
[v12 vn "02l = [-0.3 0.40.5] and [w, w, wo] = [0.4 This implies
0.1 -0.2], and the learning' rate is a = 0.25. Acti- v11(new) = VIt(old)+b.vJI = 0.6 + 0 = 0.6
!, = (I - 0.5227) (0.2495) = 0.1191
vation function used is binary sigmoidal activation vn(new) = vn(old)+t.v12 = -0.3 + 0 = -0.3 .
function and is given by Find the change5~Ulweights be~een hidden and "21 (new) = "21 (oldl+<'>"21
output layer:.
I = -0.1 + 0.00295 = -0.09705
f(x) = I+ ,-• <'>wi = a!1 ZI = 0.25 X 0.1191 X 0.5498 vu(new) = vu(old)+t>vu
,-- 0.0164 ::>
= 0.4 + 0.0006125 = 0.4006125
Given the output sample [x 1, X2] = [0, 1] and target
t= 1, t.w, = a!1 Z2 = 0.25 X 0.1191 X 0.7109 w,(new) = w1(old)+t.w, = 0.4 + 0.0164,
Calculate the net input: For zt layer ---=o:o2iT7 = 0.4164
Figure 13 Network.
<'>wo = a! 1 = 0.25 x 0.1191 = 0.02978 w2(now) = w,(old)+<'>W2 = 0.1 + 0.02!17
Zinl = !lQJ + XJ + X2V21
V11
Compute the error portion 8j between input and = 0.!2!17 For z2layer
= 0.3+0 X 0.6+ I X -0.1 = 0.2 hidden layer (j = 1 to 2): VOl (new) = VOl (old)+<'>•OI = 0.3 + 0.00295
For z2 layer ~f'( = 0.30295
z;,2 = V02 + XJVJ2 + X2V22
Dj= O;,j Zinj)
= 0.5 + (-1) X -0.3 +I X 0.4 = l.2
vo2(new) = 1102(old)+.6.vo2
Zjril = VQ2 + Xj V!2 + X2.V1.2 '
O;,j= I:okwjk = 0.5 + 0.0006125 = 0.5006!25 Applying activation to calculate the output, we
= 0.5 + 0 X -0.3 +I X 0.4 = 0.9 k=!/
.,.(new)= .,.(old)+8wo = -0.2 + 0.02976 obtain
8;nj = 81 Wj! I·.' only one output neuron]
Applying activation co calculate Ute output, we 1_ 1 _ t'0.4
obrain ------
=>!;,I= !1 wn = 0.1191.K0ft = 0.04764
-~
= -0.!7022
Thus, the final weights hav~ been computed for the
t-"inl
ZI =f(z; 1l = - - - = - - = -0.!974
n 1 + t'-z:;nl 1 + /1.4
I
ZI = f(z;,,) = - - - = - - - = 0.5498
1 + e-z.o.1 1 + t-0.2
I =>O;,z = Ot Wzl = 0.1191
_,- X 0.1 = 0.01191
_-:~
network shown in Figure 12.
zz =/(z;,2) = -
1- t'-Z:,;JL
- - = - -1- 2 = 0.537
l - t'-1.2
Error, 81 =O;,,f'(Zirll). 1+t-Zin2 1 +e-.
I 1 19. Find rhe new weights, using back-propagation
z2 = f(z· 2l = - - - = - - - = 0.7109 j'(z;,I) = f(z;,,) [1- f(z;,,)] network for the network shown in Figure 13.
m 1 + e-Zilll 1 + e-0.9 Calculate lhe net input entering the output layer.
= 0.5498[1- 0.5498] = 0.2475 The network is presented with the input pat- For y layer
Calculate the net input entering the output layer. 0 1 =8;,1/'(z;,J) tern l-1, 1] and the target output is +1. Use a
For y layer
= 0.04764 X 0.2475 = 0.0118
learning rate of a = 0.25 and bipolar sigmoidal Yin= WO + ZJWJ +zzWz
activation function. = -0.2 + (-0.1974) X 0.4 + 0.537 X 0.1
Ji11 = WO+ZJWJ +z2wz Error, Oz =0;,a/'(z;,2) Sn_ly.tion: The initial weights are [vii VZI vod = [0.6 = -0.22526
= -0.2 + 0.5498 X 0.4 + 0.7109 X 0.1
j'(z;,) = f(z;d [1 - f(z;,2)] ·0.1 0.3], [v12 "22 vo2l = [ -0.3 0.4 0.5] and [w,
= 0.09101 Wz wo] = [0.4 0.1 -0.2], and die learning rme is Applying activations to calculate the output, we
= 0.7109[1 - 0.7!09] = 0.2055
Applying activations to calculate the output, we
a= 0.25. obtain
Oz =8;,zf' (z;,2) Activation function used is binary sigmoidal 1 1 0.22526
obtain
= 0.01191 X 0.2055 = 0.00245 activacion function and is given by 1 - t'- '" _-_--",=< -0.1!22
1 1
y = f(y;,) = l + t'-y,.. = 1 + 11-22526

Y = f{y;n) = ~ = 1 + e-0.09101 = 0.5227 Now find rhe changes in weights between input 2 1 -e-x
and hidden layer: f (x )----1---
- 1 +e-x - 1 +e-x Compute the error portion 8k:
Compute the error portion 811.:
.6.v 11 =a0 1x1 =0.25 x0.0118 x0=0 Given the input sample [x1, X21 = [-1, l] and target !, = (t, - yllf' (y;,,)
!,= (t,- y,)f'(y,,.,) <'>"21 = a!pQ=0.25 X 0.0118 X I =0.00295 t= 1:

Now

f'(J;,) = f(y;,)[1 - f(J;,)] = 0.5227[1- 0.5227]


<'>vo1 =a!, =0.25 x0.0118=0.00295
.6.v 12 =a82x1 =0.25 x0.00245 xO=O
Calculate the net input: For ZJ layer

Zin\ =VOl +xJVJJ +X2t121


Now

'
----------------
I f'(J;.) = 0.5[1 + f(J;,)] [I- f(J;,)]
= 0.5[! - 0.1122][1 + 0.1122] = 0.4937 .
-- .
-~~
ll:"22 =a!2X'2 =0.25 X 0.00245 X I =0.0006125 I = Q.3 + (-1) X 0.6 +I X -0.1 = -0.4
!' (J;,) = 0.2495 <'>v02 =a!2=0.25 x 0.00245 =0.0006!25
I '-..
-·---
)
l _...-/

l
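The hand computation in Problem 9 can be checked with a few lines of NumPy; this sketch simply replays the same forward pass and single weight update.

```python
import numpy as np

# Verify Problem 9: one backprop step, binary sigmoid, x=[0,1], t=1, alpha=0.25.
f = lambda u: 1 / (1 + np.exp(-u))
x, t, alpha = np.array([0., 1.]), 1.0, 0.25
V = np.array([[0.6, -0.3], [-0.1, 0.4]])   # rows: x1, x2; cols: z1, z2
v0 = np.array([0.3, 0.5])
w, w0 = np.array([0.4, 0.1]), -0.2

z = f(v0 + x @ V)                          # [0.5498, 0.7109]
y = f(w0 + z @ w)                          # 0.5227
delta_k = (t - y) * y * (1 - y)            # 0.1191
delta_j = delta_k * w * z * (1 - z)        # [0.0118, 0.00245]

w, w0 = w + alpha * delta_k * z, w0 + alpha * delta_k
V, v0 = V + alpha * np.outer(x, delta_j), v0 + alpha * delta_j
print(np.round(w, 5), round(w0, 5))        # [0.4164, 0.12117], -0.17022
print(np.round(V, 5), np.round(v0, 5))
```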
3.13 Review Questions

1. What is supervised learning and how is it different from unsupervised learning?
2. How does learning take place in supervised learning?
3. From a mathematical point of view, what is the process of learning in supervised learning?
4. What is the building block of the perceptron?
5. Does perceptron require supervised learning? If no, what does it require?
6. List the limitations of perceptron.
7. State the activation function used in perceptron network.
8. What is the importance of threshold in perceptron network?
9. Mention the applications of perceptron network.
10. What are feature detectors?
11. With a neat flowchart, explain the training process of perceptron network.
12. What is the significance of error signal in perceptron network?
13. State the testing algorithm used in perceptron algorithm.
14. How is the linear separability concept implemented using perceptron network training?
15. Define perceptron learning rule.
16. Define delta rule.
17. State the error function for delta rule.
18. What is the drawback of using optimization algorithm?
19. What is Adaline?
20. Draw the model of an Adaline network.
21. Explain the training algorithm used in Adaline network.
22. How is a Madaline network formed?
23. Is it true that Madaline network consists of many perceptrons?
24. State the characteristics of weighted interconnections between Adaline and Madaline.
25. How is training adopted in Madaline network using majority vote rule?
26. State few applications of Adaline and Madaline.
27. What is meant by epoch in training process?
28. What is meant by gradient descent method?
29. State the importance of back-propagation algorithm.
30. What is called as memorization and generalization?
31. List the stages involved in training of back-propagation network.
32. Draw the architecture of back-propagation algorithm.
33. State the significance of error portions δk and δj in BPN algorithm.
34. What are the activations used in back-propagation network algorithm?
35. What is meant by local minima and global minima?
36. Derive the generalized delta learning rule.
37. Derive the derivations of the binary and bipolar sigmoidal activation function.
38. What are the factors that improve the convergence of learning in BPN network?
39. What is meant by incremental learning?
40. Why is gradient descent method adopted to minimize error?
41. What are the methods of initialization of weights?
42. What is the necessity of momentum factor in weight updation process?
43. Define "overfitting" or "overtraining."
44. State the techniques for proper choice of learning rate.
45. What are the limitations of using momentum factor?
46. How many hidden layers can there be in a neural network?
47. What is the activation function used in radial basis function network?
48. Explain the training algorithm of radial basis function network.
49. By what means can an IIR and an FIR filter be formed in neural network?
50. What is the importance of functional link network?
51. Write a short note on binary classification tree neural network.
52. Explain in detail about wavelet neural network.

3.14 Exercise Problems

1. Implement NOR function using perceptron network for bipolar inputs and targets.
2. Find the weights required to perform the following classifications using perceptron network. The vectors (1, 1, -1, -1) and (1, -1, 1, -1) are belonging to the class (so have target value 1); the vectors (-1, -1, -1, 1) and (-1, -1, 1, 1) are not belonging to the class (so have target value -1). Assume learning rate 1 and initial weights 0.
The neuron

The sigmoid equation is what is typically used as a transfer function between neurons. It is similar to the step function, but is continuous and differentiable:

    σ(x) = 1 / (1 + e^(−x))        (1)

Figure: The Sigmoid Function

One useful property of this transfer function is the simplicity of computing its derivative. Let's do that now...
The derivative of the sigmoid transfer function

    d/dx σ(x) = d/dx [ 1/(1 + e^(−x)) ]
              = e^(−x) / (1 + e^(−x))²
              = ((1 + e^(−x)) − 1) / (1 + e^(−x))²
              = (1 + e^(−x))/(1 + e^(−x))² − (1/(1 + e^(−x)))²
              = σ(x) − σ(x)²

    σ′ = σ(1 − σ)
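A quick numerical check of the identity σ′ = σ(1 − σ), comparing a finite-difference estimate against the closed form; the step size and test points are arbitrary choices.

```python
import numpy as np

# Check sigma'(x) = sigma(x) * (1 - sigma(x)) by finite differences.
sigma = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)   # central difference
analytic = sigma(x) * (1 - sigma(x))
print(np.max(np.abs(numeric - analytic)))           # ~1e-11 or smaller
```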
Single input neuron

Figure: A Single-Input Neuron

In the above figure (2) you can see a diagram representing a single neuron with only a single input. Without a bias term, the equation defining the figure is

    O = σ(ξω)

and with a bias term θ included it becomes

    O = σ(ξω + θ)
Multiple input neuron

Figure: A Multiple Input Neuron

Figure 3 is the diagram representing the following equation:

    O = σ(ω₁ξ₁ + ω₂ξ₂ + ω₃ξ₃ + θ)
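A tiny sketch of the multiple-input neuron above; the weights, bias and inputs are arbitrary demonstration values.

```python
import numpy as np

# Forward pass of a single three-input neuron: O = sigma(omega . xi + theta).
sigma = lambda x: 1 / (1 + np.exp(-x))
omega = np.array([0.5, -1.0, 0.25])   # arbitrary weights
theta = 0.1                           # arbitrary bias
xi = np.array([1.0, 0.5, -2.0])       # arbitrary inputs
O = sigma(omega @ xi + theta)
print(O)
```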
A neural network

Figure: A layer
Figure: A neural network (layers labelled I, J, K)
The back propagation algorithm

Notation
- x_j^ℓ: Input to node j of layer ℓ
- W_ij^ℓ: Weight from layer ℓ − 1 node i to layer ℓ node j
- σ(x) = 1/(1 + e^(−x)): Sigmoid Transfer Function
- θ_j^ℓ: Bias of node j of layer ℓ
- O_j^ℓ: Output of node j in layer ℓ
- t_j: Target value of node j of the output layer
The error calculation

Given a set of training data points t_k and output layer outputs O_k, we can write the error as

    E = (1/2) Σ_{k∈K} (O_k − t_k)²

We let the error of the network for a single training iteration be denoted by E. We want to calculate ∂E/∂W_jk^ℓ, the rate of change of the error with respect to the given connective weight, so we can minimize it.

Now we consider two cases: the node is an output node, or it is in a hidden layer...
Output layer node

∂E/∂W_jk = ∂/∂W_jk [ (1/2) Σ_{k∈K} (O_k − t_k)^2 ]
         = (O_k − t_k) ∂/∂W_jk O_k
         = (O_k − t_k) ∂/∂W_jk σ(x_k)
         = (O_k − t_k) σ(x_k)(1 − σ(x_k)) ∂/∂W_jk x_k
         = (O_k − t_k) O_k (1 − O_k) O_j

For notation purposes I will define δ_k to be the expression
(O_k − t_k) O_k (1 − O_k), so we can rewrite the equation above as

∂E/∂W_jk = O_j δ_k

where
δ_k = O_k (1 − O_k)(O_k − t_k)
Hidden layer node

∂E/∂W_ij = ∂/∂W_ij [ (1/2) Σ_{k∈K} (O_k − t_k)^2 ]
         = Σ_{k∈K} (O_k − t_k) ∂/∂W_ij O_k
         = Σ_{k∈K} (O_k − t_k) ∂/∂W_ij σ(x_k)
         = Σ_{k∈K} (O_k − t_k) σ(x_k)(1 − σ(x_k)) ∂x_k/∂W_ij
         = Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) (∂x_k/∂O_j)(∂O_j/∂W_ij)
         = Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk ∂O_j/∂W_ij
         = (∂O_j/∂W_ij) Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk
         = O_j (1 − O_j) (∂x_j/∂W_ij) Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk
         = O_j (1 − O_j) O_i Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk

But, recalling our definition of δ_k, we can write this as

∂E/∂W_ij = O_i O_j (1 − O_j) Σ_{k∈K} δ_k W_jk

Similar to before, we will now define all terms besides the O_i to be
δ_j, so we have

∂E/∂W_ij = O_i δ_j
How weights affect errors

For an output layer node k ∈ K:

∂E/∂W_jk = O_j δ_k,  where δ_k = O_k (1 − O_k)(O_k − t_k)

For a hidden layer node j ∈ J:

∂E/∂W_ij = O_i δ_j,  where δ_j = O_j (1 − O_j) Σ_{k∈K} δ_k W_jk
What about the bias?

If we incorporate the bias term θ into the equation, you will find
that

∂O/∂θ = O(1 − O) ∂θ/∂θ

and because ∂θ/∂θ = 1, we can view the bias term as the output of a
node which is always one.
This holds for any layer ℓ we are concerned with; a substitution
into the previous equations gives us

∂E/∂θ_ℓ = δ_ℓ

(because a constant output of one replaces the O_{ℓ−1} from the
"previous layer").
The back propagation algorithm
1. Run the network forward with your input data to get the
network output.
2. For each output node compute
δ_k = O_k (1 − O_k)(O_k − t_k)
3. For each hidden node calculate
δ_j = O_j (1 − O_j) Σ_{k∈K} δ_k W_jk
4. Update the weights and biases as follows.
Given
∆W = −η δ_ℓ O_{ℓ−1}
∆θ = −η δ_ℓ
apply
W + ∆W → W
θ + ∆θ → θ
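A compact NumPy sketch of these four steps for a network with one hidden layer (the layer sizes, learning rate, and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Illustrative sizes: 3 inputs (layer I), 4 hidden nodes (J), 2 outputs (K)
W_ij = rng.normal(size=(3, 4));  theta_j = np.zeros(4)
W_jk = rng.normal(size=(4, 2));  theta_k = np.zeros(2)
eta = 0.5                                  # learning rate

O_i = np.array([0.1, 0.9, 0.3])            # input pattern
t_k = np.array([1.0, 0.0])                 # target values

# 1. Run the network forward
O_j = sigmoid(O_i @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# 2. Output-layer deltas: delta_k = O_k (1 - O_k)(O_k - t_k)
delta_k = O_k * (1 - O_k) * (O_k - t_k)

# 3. Hidden-layer deltas: delta_j = O_j (1 - O_j) * sum_k delta_k W_jk
delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)

# 4. Weight and bias updates: Delta W = -eta * delta_l * O_{l-1}
W_jk += -eta * np.outer(O_j, delta_k);  theta_k += -eta * delta_k
W_ij += -eta * np.outer(O_i, delta_j);  theta_j += -eta * delta_j
```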
Feature Extraction

Bag of Words
and
Term Frequency–Inverse Document Frequency (TF-IDF)
BOW (count vectorizer)
• The length of every vector = the size of the vocabulary.
Tf-idf / Term frequency
• Concept: a word present in all documents is the least relevant.
• Till now, tf-idf has been applied to single words only.
• The tf-idf vectorizer can also be applied to n-grams (e.g. word bigrams), in
which case it will calculate the relevant word bigrams.
Example #2
• Example: we are given 4 reviews for an Italian pasta dish.
• Review 1: This pasta is very tasty and affordable.
• Review 2: This pasta is not tasty and is affordable.
• Review 3: This pasta is delicious and cheap.
• Review 4: Pasta is tasty and pasta tastes good.

• Now, if we count the number of unique words across all four
reviews, we get a total of 12 unique words:
1. 'This'  2. 'pasta'  3. 'is'  4. 'very'  5. 'tasty'  6. 'and'
7. 'affordable'  8. 'not'  9. 'delicious'  10. 'cheap'  11. 'tastes'  12. 'good'

• Now, if we take the first review (Review 1: This pasta is very tasty and
affordable) and plot the count of each word in a table, row 1
corresponds to the index of the unique words and row 2 corresponds
to the number of times each word occurs in the review. The same can
be done for Review 4: Pasta is tasty and pasta tastes good.
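A sketch of this bag-of-words count using scikit-learn's CountVectorizer (assuming scikit-learn is installed; note that its default tokenizer lowercases words, so 'This' becomes 'this'):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reviews)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # the 12 unique (lowercased) words
print(counts.toarray()[0])                   # word counts for Review 1
print(counts.toarray()[3])                   # word counts for Review 4
```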
• BOW doesn't work very well when there are small changes in the
terminology we are using, as here we have sentences with similar
meanings built from different words.
• It also produces vectors with lots of zero scores, called sparse
vectors or sparse representations.
• Sparse vectors require more memory and computational
resources when modeling, and the vast number of positions or
dimensions can make the modeling process very challenging for
traditional algorithms.
There are simple text cleaning techniques that can be used as a first step, such as:
• Ignoring case
• Ignoring punctuation
• Ignoring frequent words that don't contain much information, called stop words, like "a," "of," etc.
• Fixing misspelled words
• Reducing words to their stem (e.g. "play" from "playing") using stemming algorithms
N-grams Model:
• A more sophisticated approach is to create a vocabulary of
grouped words. This changes the scope of the vocabulary and
allows the bag-of-words model to capture a little more meaning
from the document.
• In this approach, each word or token is called a “gram”.
Creating a vocabulary of two-word pairs is, in turn, called
a bigram model. Again, only the bigrams that appear in the
corpus are modeled, not all possible bigrams.
An N-gram is an N-token sequence of words: a 2-gram
(more commonly called a bigram) is a two-word sequence
of words like "please turn", "turn your", or "your homework",
and a 3-gram (more commonly called a trigram) is a three-
word sequence of words like "please turn your" or "turn your homework".
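A short sketch of building a bigram vocabulary with CountVectorizer's ngram_range parameter (the sample sentence is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["please turn your homework in on time"]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams too
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(docs)
print(bigram_vectorizer.get_feature_names_out())
# ['homework in' 'in on' 'on time' 'please turn' 'turn your' 'your homework']
```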
TF-IDF:
tf–idf or TFIDF, short for term frequency-inverse document
frequency, is a numerical statistic that is intended to reflect how
important a word is to a document in a collection or corpus.
The tf–idf value increases proportionally to the number of times a
word appears in the document and is offset by the number of
documents in the corpus that contain the word, which helps to
adjust for the fact that some words appear more frequently in
general.
tf–idf is one of the most popular term-weighting schemes today;
83% of text-based recommender systems in digital libraries use
tf–idf.
This concept includes:
• Counts: count the number of times each word appears in a document.
• Frequencies: calculate the frequency with which each word appears in a document out of all
the words in the document.
Term frequency:
• Term frequency (TF) is used in connection with information
retrieval and shows how frequently an expression (term, word)
occurs in a document.
• Term frequency indicates the significance of a particular term
within the overall document. It is the number of times a word w_i
occurs in a review r_j divided by the total number of words in
review r_j.

TF can be read as the probability of finding a word in a document
(review).
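Written as a formula (a standard formulation consistent with the description above):

\[
\mathrm{TF}(w_i, r_j) \;=\; \frac{\text{number of occurrences of } w_i \text{ in } r_j}{\text{total number of words in } r_j}
\]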
Inverse document frequency:
• The inverse document frequency is a measure of how much
information the word provides, i.e., if it’s common or rare across all
documents.
• It is used to calculate the weight of rare words across all
documents in the corpus.
• The words that occur rarely in the corpus have a high IDF score. It
is the logarithmically scaled inverse fraction of the documents that
contain the word (obtained by dividing the total number of
documents by the number of documents containing the term, and
then taking the logarithm of that quotient)
Term frequency–inverse document frequency:
• TF-IDF is calculated as the product of the two statistics:

tf-idf(w, d) = TF(w, d) × IDF(w),  where IDF(w) = ln(number of docs / number of docs containing w)
A high weight in tf–idf is reached by a high term frequency (in the given document) and a low
document frequency of the term in the whole collection of documents; the weights hence tend
to filter out common terms.

Since the ratio inside the IDF's log function is always greater than or equal to 1, the value of
IDF (and tf–idf) is greater than or equal to 0.

As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing
the IDF and tf–idf closer to 0.

TF-IDF gives larger values to less frequent words in the document corpus. The TF-IDF value is
high when both the IDF and TF values are high, i.e. the word is rare in the whole corpus but
frequent within a document.
TF-IDF also doesn't capture the semantic meaning of words.
Let's take an example to get a clearer understanding.
• Sentence 1: The car is driven on the road.
• Sentence 2: The truck is driven on the highway.
• In this example, each sentence is a separate document.
• We will now calculate the TF-IDF for the above two documents, which
represent our corpus.
From the table (reproduced by the sketch below), we can see that the TF-IDF of common
words is zero, which shows they are not significant.
On the other hand, the TF-IDF values of "car", "truck", "road", and "highway"
are non-zero; these words have more significance.
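A sketch of this computation with scikit-learn's TfidfVectorizer (note that sklearn uses a smoothed IDF by default, i.e. idf = ln((1 + N) / (1 + df)) + 1, so words shared by both documents get a small non-zero weight rather than exactly zero as in the hand calculation, but rare words are still weighted up relative to their counts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
# Print each word's tf-idf weight in document 1
for word, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(f"{word:10s} {score:.3f}")
```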

Reviewing- TFIDF is the product of the TF and IDF scores of the


term.
TF = number of times the term appears in the doc/total number of
words in the doc
IDF = ln(number of docs/number docs the term appears in)
Higher the TFIDF score, the rarer the term is and vice-versa.
TFIDF is successfully used by search engines like Google, as a
ranking factor for content.
The whole idea is to weigh down the frequent terms while scaling up
the rare ones.
Word2Vec:
• Word embedding is a word representation type that allows
machine learning algorithms to understand words with similar
meanings. It is a language modeling and feature learning
technique that maps words into vectors of real numbers using
neural networks, probabilistic models, or dimension reduction
on the word co-occurrence matrix. Some word embedding
models are Word2vec (Google), GloVe (Stanford), and fastText
(Facebook).
• The Word2Vec model is used for learning vector representations of
words called "word embeddings".
• This is typically done as a preprocessing step, after which the
learned vectors are fed into a discriminative model (typically an
RNN) to generate predictions and perform all sorts of interesting
things.
• It captures the semantic meaning of words.
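A minimal training sketch with the gensim library (assuming gensim 4.x; the toy corpus and parameter values are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["pasta", "is", "tasty", "and", "affordable"],
    ["pasta", "is", "delicious", "and", "cheap"],
    ["the", "truck", "is", "driven", "on", "the", "highway"],
]

# vector_size: embedding dimension; window: context size; min_count: ignore rarer words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["pasta"].shape)                 # (50,) embedding vector
print(model.wv.most_similar("tasty", topn=3))  # nearest words in embedding space
```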
SOM – Self-Organizing Maps
K-Means Clustering
Hierarchical Clustering

Sandeep Chaurasia
Self Organizing Maps – Kohonen Maps
• A Self Organizing Map (or Kohonen Map or SOM) is a type of
Artificial Neural Network, also inspired by biological
models of neural systems from the 1970s.
• It follows an unsupervised learning approach and trains its
network through a competitive learning algorithm.
• SOM is used for clustering and mapping (or dimensionality
reduction), mapping multidimensional data onto a lower-
dimensional space, which reduces complex problems to an
easier interpretation.
• SOM has two layers: one is the input layer and the other is
the output layer.
• The architecture of a Self Organizing Map with two clusters and
n input features per sample is given below:
Self Organizing Maps – Kohonen Maps
• Let's say the input data has size (m, n), where m is the number of training
examples and n is the number of features in each example.
• First, the algorithm initializes weights of size (n, C), where C is the number of clusters.
• Then, iterating over the input data, for each training example it updates the
winning vector (the weight vector with the shortest distance (e.g. Euclidean
distance) from the training example).
• The weight update rule is given by:
w_ij = w_ij(old) + alpha(t) * (x_ik − w_ij(old))
where alpha is the learning rate at time t, j denotes the winning vector, i denotes
the i-th feature of the training example, and k denotes the k-th training example in
the input data.
After training the SOM network, the trained weights are used for clustering new
examples: a new example falls in the cluster of its winning vector.
Self Organizing Maps – Kohonen Maps
• Training:
• Step 1: Initialize the weights w_ij (random values may be assumed). Initialize the
learning rate α.
• Step 2: Calculate the squared Euclidean distance:
D(j) = Σ_i (w_ij − x_i)^2, for i = 1 to n and j = 1 to m
• Step 3: Find the index J for which D(j) is minimum; this is the winning index.
• Step 4: For each j within a specific neighborhood of J, and for all i, calculate the new
weight:
w_ij(new) = w_ij(old) + α[x_i − w_ij(old)]
• Step 5: Update the learning rate, for example by:
α(t+1) = 0.5 * α(t)
• Step 6: Test the stopping condition.
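A minimal sketch of these six steps in NumPy (the toy data, cluster count, and decay schedule are illustrative assumptions; the neighborhood is reduced to the winner alone for simplicity):

```python
import numpy as np

def train_som(X, n_clusters=2, alpha=0.5, epochs=10):
    """Toy SOM where each update touches only the winning vector."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, n_clusters))           # Step 1: random weights of size (n, C)
    for _ in range(epochs):
        for x in X:
            # Step 2: squared Euclidean distance to each cluster's weight vector
            D = ((W - x[:, None]) ** 2).sum(axis=0)
            J = int(np.argmin(D))             # Step 3: winning index
            # Step 4: move the winning weight vector toward the example
            W[:, J] += alpha * (x - W[:, J])
        alpha *= 0.5                          # Step 5: decay the learning rate
    return W

X = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = train_som(X)
# A new example falls in the cluster of its winning vector
x_new = np.array([1.0, 1.0, 1.0, 0.0])
print(np.argmin(((W - x_new[:, None]) ** 2).sum(axis=0)))
```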
SOM Example
K-Means Clustering
• K-Means Clustering is an unsupervised machine learning algorithm, which
groups an unlabeled dataset into different clusters.
• Unsupervised machine learning is the process of teaching a computer to
use unlabeled, unclassified data and enabling the algorithm to operate on
that data without supervision. Without any previous data training, the
machine's job in this case is to organize unsorted data according to
similarities, patterns, and differences.
• The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more comparable
to one another and different from the data points in the other groups. It is
essentially a grouping of things based on how similar and different they are to
one another.
• We are given a data set of items, with certain features and values for these
features (like a vector). The task is to categorize those items into groups.
To achieve this, we will use the K-means algorithm, an unsupervised
learning algorithm. 'K' in the name of the algorithm represents the number
of groups/clusters we want to classify our items into.

• The algorithm works as follows:


• First, we randomly initialize k points, called means or cluster centroids.
• We categorize each item to its closest mean and we update the mean’s coordinates, which
are the averages of the items categorized in that cluster so far.
• We repeat the process for a given number of iterations and at the end, we have our clusters.
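A compact NumPy sketch of these steps (the data, k, and iteration count are illustrative):

```python
import numpy as np

def kmeans(X, k=2, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial means (centroids)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each item to its closest mean
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update each mean to the average of the items assigned to it
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
means, labels = kmeans(X, k=2)
print(means)   # final cluster centroids
print(labels)  # cluster assignment of each item
```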
K-Means Example
Hierarchical Clustering
• Hierarchical clustering in machine learning: hierarchical clustering is
another unsupervised machine learning algorithm, used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster
analysis or HCA.
• In this algorithm, we develop the hierarchy of clusters in the form of a
tree, and this tree-shaped structure is known as the dendrogram.

The hierarchical clustering technique has two approaches:
• Agglomerative: a bottom-up approach, in which the
algorithm starts by taking all data points as single clusters and merges
them until one cluster is left.
• Divisive: the divisive algorithm is the reverse of the agglomerative algorithm, as
it is a top-down approach.
Agglomerative Hierarchical Clustering
• The agglomerative hierarchical clustering algorithm is a popular
example of HCA. To group the data points into clusters, it follows
the bottom-up approach.
• This means the algorithm considers each data point as a single
cluster at the beginning, and then starts combining the closest
pairs of clusters. It does this until all the clusters are
merged into a single cluster that contains all the data points.
• This hierarchy of clusters is represented in the form of the
dendrogram.
Figure: Agglomerative clustering, steps 1–6
Working of the Dendrogram in Hierarchical Clustering
The dendrogram is a tree-like structure that is mainly used to record
each merge step the HC algorithm performs. In the dendrogram plot,
the y-axis shows the Euclidean distances between the data points, and
the x-axis shows all the data points of the given dataset.
Measures for the distance between two clusters
The distance between two clusters is crucial for hierarchical
clustering. There are various ways to calculate the distance between two clusters,
and these choices decide the rule for clustering. These measures are called linkage
methods. Some of the popular linkage methods are given below:

Single Linkage: the shortest distance between the closest points of the
clusters.
Complete Linkage: the farthest distance between two points of two
different clusters. It is one of the popular linkage methods, as it forms tighter
clusters than single linkage.
Average Linkage: the linkage method in which the distance between each pair
of points (one from each cluster) is added up and then divided by the total
number of pairs to calculate the average distance between two clusters. It is
also one of the most popular linkage methods.
Centroid Linkage: the linkage method in which the distance between the
centroids of the clusters is calculated.
Single Linkage Example
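A sketch of single-linkage clustering with SciPy (the toy points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# method='single' merges clusters by the shortest distance
# between their closest points
Z = linkage(X, method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)  # e.g. [1 1 1 2 2 2]
```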
