Cs601pc - Machine Learning Unit - 2-1

B.Tech.
III YEAR II Semester
MACHINE LEARNING
CS601PC: MACHINE LEARNING
DR DV RAMANA,
DATA STRATEGIST –CONSULTANT
AND
DVR
ACADEMIC ADVISOR
Prerequisites
Data Structures
Knowledge of statistical methods
Course Objective
This course explains machine learning techniques such as decision tree learning and Bayesian learning.
To understand computational learning theory.
To study pattern comparison techniques.
Course Outcomes
Understand the concepts of computational intelligence, such as machine learning.
Acquire the skill to apply machine learning techniques to real-world problems in different areas.
Understand neural networks and their use in machine learning applications.
References
TEXT BOOK
Machine Learning, Tom M. Mitchell, McGraw-Hill
REFERENCE BOOK:
Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis
Agenda – Unit- II
Artificial Neural Networks - 1
Introduction - Artificial Neural Networks
Neural network representation

Appropriate problems for neural network learning
Perceptrons
Multilayer networks and the back-propagation algorithm
Remarks on the Back-Propagation algorithm
An illustrative example: Face recognition
Advanced topics in artificial neural networks.
Agenda – Unit- II
Evaluating Hypotheses
Motivation
Estimating hypothesis accuracy

Basics of sampling theory
A general approach for deriving confidence intervals

Difference in error of two hypotheses
Comparing learning algorithms
INTRODUCTION
Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued target functions from examples.
Biological Motivation
• The study of artificial neural networks (ANNs) has been inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons.
• The human information processing system is built from neurons in the brain: the neuron is the basic building-block cell that communicates information to and from various parts of the body.
• The simplest model of a neuron treats it as a threshold unit, a processing element (PE).
• It collects inputs and produces an output if the sum of the inputs exceeds an internal threshold value.
Facts of Human Neurobiology
• Number of neurons ~ 10^11
• Connections per neuron ~ 10^4 to 10^5
• Neuron switching time ~ 0.001 second (10^-3 s)
• Scene recognition time ~ 0.1 second
• At 10^-3 s per switch, 0.1 s allows only about 100 sequential inference steps, which does not seem like enough
• Highly parallel computation based on distributed representation
NEURAL NETWORK REPRESENTATIONS
• A prototypical example of ANN learning is provided by Pomerleau's (1993)
system ALVINN, which uses a learned ANN to steer an autonomous vehicle
driving at normal speeds on public highways.
• The input to the neural network is a 30x32 grid of pixel intensities obtained from
a forward-pointed camera mounted on the vehicle.
• The network output is the direction in which the vehicle is steered.
Artificial neural networks (ANNs) provide a general, practical method for learning
Real-valued
Discrete-valued, and
Vector-valued functions from examples
Algorithms such as backpropagation use gradient descent to tune network parameters to best fit a training set of input-output pairs
ANN learning is robust to errors in the training data and has been successfully
applied to problems such as
Interpreting visual scenes
Speech recognition, and
Learning robot control strategies
ANNs are computational models that work similarly to the human nervous system.
There are several kinds of ANNs.
These types of networks are implemented using mathematical operations and a set of parameters that determine the output.
NNs can be
Biological models
Artificial models
The motivation is the desire to produce artificial systems capable of sophisticated computations similar to those of the human brain.
Brain vs Digital Computers
Computers require hundreds of cycles to simulate a firing of a neuron
The brain can fire all the neurons in a single step
Parallelism
Serial computers require billions of cycles to perform some tasks but the
brain takes less than a second
Example
Face Recognition
Human Brain
• 10^11 neurons
• On average, each is connected to 10^4 others
• The fastest neuron switching times are known to be on the order of 10^-3 seconds, quite slow compared to computer switching speeds of 10^-10 seconds.
• It requires approximately 10^-1 seconds to visually recognize your mother.
• Highly parallel computation based on distributed representations.
• ANNs are loosely motivated by biological neurons.
• Individual ANN units output a single constant value, whereas biological neurons output a complex time series of spikes.
• One group of researchers uses ANNs to study and model biological learning processes.
• Another group uses ANNs to build effective machine learning algorithms, regardless of whether they mirror biological processes.
Connectionist Models
Consider Humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4 to 10^5
Scene recognition time ~ 0.1 second
100 inference steps do not seem like enough
Properties of Neural Networks
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
Input is high-dimensional and discrete or real-valued (example: sensor input)
When to consider Neural Networks?
Input is high-dimensional and discrete or real-valued (e.g., sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
Speech phoneme recognition
Image classification
Financial prediction
Neuron
Neurons, also known as nerve cells, send and receive signals from your brain.
While neurons have a lot in common with other types of cells, they are structurally and functionally unique.
Specialized projections called axons allow neurons to transmit electrical and chemical signals to other cells.
Parts of a Neuron
Characteristics of Artificial Neural Network
Large number of very simple, neuron-like processing elements
Large number of weighted connections between the elements
Distributed representation of knowledge over the connections
Knowledge is acquired by network through a learning process.
Applications of ANN
Controlling the movements of a robot based on self-perception and other information
Deciding the category of potential food items in an artificial world
Recognizing a visual object
Predicting where a moving object will go when a robot wants to catch it
Strengths of Artificial Neural Network
1. The greatest power of a neural network is that, even when endowed with only a finite number of hidden units, it can approximate any continuous function to any desired degree of accuracy.
This is commonly referred to as the universal approximation property.
2. No prior knowledge of the data-generating process is needed to implement a neural network (NN).
3. The problem of model misspecification does not occur.
Adaptive Learning
An ability to learn how to do tasks based on the data given for training or initial experience
Self Organization
An artificial neural network (ANN) can create its own organization or representation of the information it receives during learning.
Weakness of Artificial Neural Network
1. Adding too many hidden units invites the problem of overfitting the data
2. Constructing a neural network (NN) model can be a time-consuming process
Neural Network Representation
Neural networks are biologically inspired algorithms that attempt to mimic the functions of neurons in the brain.
Each neuron acts as a computational unit, accepting input from the dendrites and outputting a signal through the axon terminals.
Actions are triggered when a specific combination of neurons is activated.
A single-hidden-layer neural network has one input layer, one hidden layer, and an output layer. Here the output layer is a single node, and it is responsible for generating the predicted value ŷ.
We have the following parts of the neural network:
x1, x2 and x3 are the inputs of the neural network. These elements are scalars and they are stacked vertically. Together they form the input layer.
Variables in a hidden layer are not seen in the input set. Thus, it is called a hidden layer.
The output layer consists of a single neuron only, and ŷ is the output of the neural network.
In the training set we see what the inputs are and what the output should be. But the values in the hidden layer are not seen in the training set, so the name hidden layer just means you don't see it in the training set.
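The layer structure described above can be sketched in a few lines of Python. This is only an illustration: the sigmoid activation, the choice of two hidden units, and all weight and bias values are assumptions made for the example, not values from the slides.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass: input layer -> one hidden layer -> single output unit."""
    # Each hidden unit takes a weighted sum of all inputs plus its bias.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_hidden, b_hidden)
    ]
    # The single output unit combines the hidden activations into y_hat.
    y_hat = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, y_hat

# Three inputs x1, x2, x3 (hypothetical values).
x = [1.0, 0.5, -1.5]
# Two hidden units, each with three incoming weights (hypothetical values).
W_hidden = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]
b_hidden = [0.0, 0.1]
w_out = [0.5, -0.3]
b_out = 0.2

hidden, y_hat = forward(x, W_hidden, b_hidden, w_out, b_out)
print("hidden activations:", hidden)
print("y_hat:", y_hat)
```

The `hidden` values are exactly the quantities that never appear in a training set: only `x` and the desired output are given there.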
ANN learning is well-suited to problems in which the training data corresponds to
noisy, complex sensor data, such as inputs from cameras and microphones
The BACKPROPAGATION algorithm is the most commonly used ANN learning
technique
ANN learning technique is appropriate for problems with the following
characteristics:
Instances are represented by many attribute-value pairs
The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes
The training examples may contain errors.
The ability of humans to understand the learned target function is not important.
The target function to be learned is defined over instances that can be described by
a vector of predefined features, such as the pixel values in the ALVINN example
These input attributes may be highly correlated or independent of one another

Input values can be any real values.
The target function output may be discrete-valued, real-valued, or a vector of several real-
or discrete-valued attributes
In the ALVINN system, for example, the value of each output is some real number between 0 and 1, which corresponds to the confidence in predicting the corresponding steering direction
We can also train a single network to output both the steering command and
suggested acceleration, simply by concatenating the vectors that encode these two
output predictions
The training examples may contain errors.
ANN learning methods are quite robust to noise in the training data
Long training times are acceptable
Network training algorithms typically require longer training times than, say, decision tree learning algorithms
Training times can range from a few seconds to many hours, depending on
factors such as the number of weights in the network, the number of
training examples considered, and the settings of various learning
algorithm parameters
Fast evaluation of the learned target function may be required
Although ANN learning times are relatively long, evaluating the learned
network, in order to apply it to a subsequent instance, is typically very fast
Example
ALVINN applies its neural network several times per second to continually
update its steering command as the vehicle drives forward
The ability of humans to understand the learned target function is not important.
The weights learned by neural networks are often difficult for humans to interpret
Learned neural networks are less easily communicated to humans than learned rules
Perceptron
A perceptron is an artificial neuron.
The perceptron is the simplest possible neural network, and neural networks are important building blocks of machine learning.
A perceptron is a simple model of a biological neuron in an artificial neural network.
Perceptron is also the name of an early algorithm for supervised learning of binary classifiers.
A perceptron is a neural network unit that performs certain computations to detect features or patterns in the input data.
A perceptron is a function that maps its input x, multiplied by the learned weight coefficients, to an output value f(x).
Perceptrons were proposed as a way to simulate brain principles, with the ability to learn and make decisions.
The original Perceptron was designed to take a number of binary inputs, and produce one binary
output (0 or 1)
The idea was to use different weights to represent the importance of each input, and that the
sum of the values should be greater than a threshold value before making a decision like true
or false (0 or 1).
Perceptron Example
Imagine a perceptron (in your brain)
The perceptron tries to decide if you should go to a
concert
Is the artist good? Is the weather good?
What weights should these facts have?
The Perceptron Algorithm
Frank Rosenblatt suggested this algorithm:
Set a threshold value
Multiply each input by its weight
Sum all the results
Activate the output

1. Set a threshold value:
Threshold = 1.5
2. Multiply each input by its weight:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:
0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (the weighted sum)
4. Activate the output:
Return true if the sum > 1.5 ("Yes, I will go to the concert")
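The four steps of the concert example can be reproduced with a short Python sketch. The inputs, weights and threshold are the ones from the worked example; the function name `perceptron_output` is just a label chosen for this illustration.

```python
def perceptron_output(inputs, weights, threshold):
    """Return the weighted sum and whether it exceeds the threshold."""
    # Steps 2 and 3: multiply each input by its weight and sum the results.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: activate the output only if the sum is above the threshold.
    return weighted_sum, weighted_sum > threshold

inputs = [1, 0, 1, 0, 1]                # x1..x5 from the example
weights = [0.7, 0.6, 0.5, 0.3, 0.4]     # w1..w5 from the example
threshold = 1.5                         # step 1

weighted_sum, go_to_concert = perceptron_output(inputs, weights, threshold)
print("weighted sum:", round(weighted_sum, 2))
print("go to the concert:", go_to_concert)
```

Changing a single weight, for example lowering w1 below 0.6, drops the sum under the threshold and flips the decision, which is the sense in which weights encode the importance of each input.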
Representational Power of Perceptrons
The Perceptron Training Rule
The learning problem is to determine a weight vector that causes the perceptron to produce
the correct +1 or -1 output for each of the given training examples
To learn an acceptable weight vector:
Begin with random weights, then iteratively apply the perceptron to each training example,
modifying the perceptron weights whenever it misclassifies an example
This process is repeated, iterating through the training examples as many times as needed
until the perceptron classifies all training examples correctly.
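Weights are modified at each step according to the perceptron training rule, which revises the weight w_i associated with input x_i according to the rule
w_i \leftarrow w_i + \Delta w_i, where \Delta w_i = \eta (t - o) x_i
Here t is the target output for the current training example, o is the output generated by the perceptron, and \eta is a positive constant called the learning rate.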
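As a minimal sketch, the perceptron training rule w_i ← w_i + η (t − o) x_i can be applied to the Boolean AND function with +1/−1 targets. Several choices here are assumptions for illustration only: the weights start at zero rather than at random values (to keep the run reproducible), η = 0.5, a constant input of 1 is prepended so that the first weight plays the role of the threshold, and training stops after at most 100 passes.

```python
def predict(weights, x):
    """Threshold unit: +1 if the weighted sum is positive, otherwise -1."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

def train_perceptron(examples, eta=0.5, max_epochs=100):
    """Apply the perceptron training rule until every example is classified."""
    weights = [0.0] * len(examples[0][0])   # assumption: start from zeros
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                # w_i <- w_i + eta * (t - o) * x_i
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
        if errors == 0:
            break                            # all examples classified correctly
    return weights

# Each example is ((1, x1, x2), target); the leading 1 feeds the bias weight.
and_examples = [
    ((1, 0, 0), -1),
    ((1, 0, 1), -1),
    ((1, 1, 0), -1),
    ((1, 1, 1), 1),
]

w = train_perceptron(and_examples)
print("learned weights:", w)
print("predictions:", [predict(w, x) for x, _ in and_examples])
```

Because AND is linearly separable, the loop terminates with all four examples classified correctly; on data that is not linearly separable the same loop would simply run until `max_epochs`.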
Limitations of the perceptron
While the perceptron classifies the instances of AND/OR/NOT well, the model has limitations.
Linear models like the perceptron with a Heaviside activation function are not universal function approximators; they cannot represent some functions.
Specifically, linear models can only learn to separate linearly separable datasets.
The linear classifiers that we have examined find a hyperplane that separates the positive class from the negative class; if no hyperplane exists that can separate the classes, the problem is not linearly separable.
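A small experiment can make this limitation concrete. The sketch below searches a grid of integer weights for a single threshold unit that reproduces a Boolean truth table. The grid range of -3 to 3 and the helper names are assumptions for illustration; the search is a demonstration rather than a proof, although XOR is in fact not linearly separable for any choice of weights.

```python
from itertools import product

def classify(w, x1, x2):
    """Single threshold unit with bias w0 and input weights w1, w2."""
    w0, w1, w2 = w
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

def find_separating_weights(truth_table, grid=range(-3, 4)):
    """Return the first weight triple on the grid that fits the table, or None."""
    for w in product(grid, repeat=3):
        if all(classify(w, x1, x2) == t for (x1, x2), t in truth_table.items()):
            return w
    return None

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_w = find_separating_weights(AND)
or_w = find_separating_weights(OR)
xor_w = find_separating_weights(XOR)
print("AND:", and_w)
print("OR: ", or_w)
print("XOR:", xor_w)
```

The search finds weights for AND and OR but returns `None` for XOR, which is the kind of function that motivates adding hidden layers.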
Multilayer networks and the back-propagation algorithm
Multilayer networks solve classification problems on data that is not linearly separable by employing hidden layers, whose units sit between the input and the output and whose values are not observed directly.
The additional hidden units can be interpreted geometrically as additional hyperplanes, which enhance the separation capacity of the network.
Backpropagation (backward propagation of errors) is an important mathematical tool for improving the accuracy of predictions in data mining and machine learning.
Essentially, backpropagation is an algorithm used to calculate derivatives quickly.
The backpropagation algorithm performs learning on a multilayer feed-forward neural network
Backpropagation algorithm iteratively learns a set of weights for prediction of the class label of tuples
A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output
layer
Backpropagation is a common method for training a neural network.
Backpropagation is a method to calculate the gradient of the loss function with respect to the weights in an
artificial neural network.
Backpropagation is commonly used as a part of algorithms that optimize the performance of the network by adjusting the weights, for example in the gradient descent algorithm.
Back propagation algorithm
The backpropagation algorithm computes the gradient of the loss function with respect to each weight by the chain rule.
It computes the gradient efficiently one layer at a time, iterating backward from the last layer, unlike a naive direct computation.
Backpropagation computes the gradient, but it does not define how the gradient is used; that is the job of an optimizer such as gradient descent.
A backpropagation learning procedure computes the mean-squared error between the desired output and the actual output, and adjusts the weight and bias coefficients until the mean-squared error falls below a predetermined value.
Steps involved in Backpropagation:

Step – 1: Forward Propagation
Step – 2: Backward Propagation
Step – 3: Putting all the values together and calculating the updated weight value.
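The three steps can be traced on the smallest possible network: one input, one hidden unit and one output unit. The sketch below is an illustration under stated assumptions (sigmoid units, squared-error loss, no bias terms, and hypothetical values for x, t, the weights and the learning rate); it also compares the chain-rule gradient with a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Step 1: forward propagation through hidden unit h and output unit o."""
    h = sigmoid(w1 * x)
    o = sigmoid(w2 * h)
    return h, o

def loss(x, t, w1, w2):
    _, o = forward(x, w1, w2)
    return 0.5 * (t - o) ** 2

def backward(x, t, w1, w2):
    """Step 2: backward propagation of the error using the chain rule."""
    h, o = forward(x, w1, w2)
    delta_out = -(t - o) * o * (1 - o)            # dE/d(net input of o)
    grad_w2 = delta_out * h                       # dE/dw2
    delta_hidden = delta_out * w2 * h * (1 - h)   # dE/d(net input of h)
    grad_w1 = delta_hidden * x                    # dE/dw1
    return grad_w1, grad_w2

# Hypothetical example values.
x, t = 1.5, 1.0
w1, w2 = 0.4, -0.6
eta = 0.5

grad_w1, grad_w2 = backward(x, t, w1, w2)

# Step 3: put the values together to get the updated weights.
new_w1 = w1 - eta * grad_w1
new_w2 = w2 - eta * grad_w2

# Sanity check: central finite differences should agree with the chain rule.
eps = 1e-6
num_grad_w1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
num_grad_w2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)

print("loss before:", loss(x, t, w1, w2))
print("loss after: ", loss(x, t, new_w1, new_w2))
```

The backward pass reuses `delta_out` when computing `delta_hidden`, which is the layer-by-layer reuse that makes backpropagation cheaper than differentiating the loss separately for every weight.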
Backpropagation is an essential mechanism by which neural networks are trained.
It is used to fine-tune the weights of a neural network (the model) with respect to the error produced in the previous iteration.
Backpropagation acts like a messenger that tells the model whether the network made a mistake as soon as it makes a prediction.
Backpropagation is about transmitting information and relating this information to the error generated by the model when a guess was made.
The backpropagation method seeks to reduce this error, which is measured by the loss function.
How does backpropagation work
The network in this example has four layers: an input layer, a first hidden layer, a second hidden layer, and a final output layer.
The three main kinds of layer are:
Input layer
Hidden layer
Output layer
Each layer has its own role in producing the desired result.
The backpropagation approach can be summarized as follows:
The input layer receives x
The input is modeled using weights w
Each hidden layer calculates its output, and the data is then ready at the output layer
The difference between the actual output and the desired output is known as the error
Go back to the hidden layers and adjust the weights so that this error is reduced in future runs
This process is repeated until we get the desired output.
The training phase is done with supervision. Once the model is stable, it is used in production.
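The repeat-until-good-enough loop described above can be sketched as a small training routine. Everything specific here is an assumption for illustration: a single hidden layer with two sigmoid units, a learning rate of 0.5, 2000 passes over the data, a fixed random seed, and the Boolean OR function as the supervised training set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return weights and error per epoch."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    # The last weight in every row is a bias weight fed by a constant 1.0.
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for x, t in examples:
            # Forward pass: the input layer receives x, hidden units compute outputs.
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W_h]
            hb = h + [1.0]
            o = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            # Error between the desired output t and the actual output o.
            total_error += 0.5 * (t - o) ** 2
            # Backward pass: error terms for the output and hidden units.
            delta_o = o * (1 - o) * (t - o)
            delta_h = [h[j] * (1 - h[j]) * w_o[j] * delta_o for j in range(n_hidden)]
            # Adjust the weights so the error is reduced in future runs.
            w_o = [w + eta * delta_o * v for w, v in zip(w_o, hb)]
            for j in range(n_hidden):
                W_h[j] = [w + eta * delta_h[j] * v for w, v in zip(W_h[j], xb)]
        history.append(total_error)
    return W_h, w_o, history

or_examples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
W_h, w_o, history = train(or_examples)
print("error in first epoch:", history[0])
print("error in last epoch: ", history[-1])
```

Plotting or printing `history` shows the summed error shrinking as the passes repeat, which is the stopping signal referred to by "until we get the desired output".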
Back Propagation Algorithm
The Backpropagation algorithm learns the weights for a multilayer network, given a network with a
fixed set of units and interconnections
Backpropagation employs gradient descent to attempt to minimize the squared error between the
network output values and the target values for these outputs
Because we are considering networks with multiple output units rather than single units as before, we begin by redefining E to sum the errors over all of the network output units:
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2
where outputs is the set of output units in the network, and t_{kd} and o_{kd} are the target and output values associated with the kth output unit and training example d.
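For reference, the weight updates that gradient descent on this error yields for networks of sigmoid units can be written as below, following the stochastic gradient descent form given in Mitchell's textbook. The notation is assumed here rather than taken from the slides: \eta is the learning rate, x_{ji} is the input from unit i into unit j, w_{ji} is the corresponding weight, and \delta_j is the error term of unit j.

```latex
% Error term for each output unit k
\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)

% Error term for each hidden unit h
\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \, \delta_k

% Update for every network weight
w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta \, \delta_j \, x_{ji}
```

The factor o(1 - o) is the derivative of the sigmoid function, so a different activation function would change that factor but not the overall structure of the rules.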
Convergence and Local Minima
Representational power of feed forward networks
Hypothesis space search and inductive bias
Hidden Layer representation
Generalization Overfitting and stopping criterion
Functions
Boolean
Continuous
Arbitrary
Convergence and Local Minima
Backpropagation over multilayer networks is only guaranteed to converge toward some local minimum in E, and not necessarily to the global minimum error.
Despite the lack of assured convergence to the global minimum error, backpropagation is a highly effective function approximation method in practice.
When gradient descent falls into a local minimum with respect to one of the weights, it will not necessarily be in a local minimum with respect to the other weights.
A second perspective on local minima can be gained by considering the manner in which network weights evolve as the number of training iterations increases.
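The difference between a local and a global minimum is easy to see on a one-dimensional toy error surface. The function below is not a neural network error function; it is a hypothetical polynomial chosen only because it has two minima, and the step size and iteration count are likewise assumptions.

```python
def f(x):
    """Toy error surface with a global minimum near -1.30 and a local one near 1.13."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, eta=0.01, steps=500):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= eta * grad_f(x)
    return x

x_left = gradient_descent(-2.0)    # starts on the left slope
x_right = gradient_descent(2.0)    # starts on the right slope

print("from -2.0:", x_left, "error:", f(x_left))
print("from +2.0:", x_right, "error:", f(x_right))
```

Both runs converge, but the run started at 2.0 settles in the shallower minimum with a higher error, which is why the starting weights can matter for the final result of gradient descent.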
Back-Propagation algorithm - Advantages
Backpropagation is fast, simple and easy to program
Backpropagation has no parameters to tune apart from the numbers of input
Backpropagation is a flexible method as it does not require prior knowledge about the network
Back-Propagation algorithm - Disadvantages
The performance of backpropagation on a specific problem depends on the input data
It can be sensitive to complex or noisy data
It requires the derivatives of the activation functions to be known at network design time
Back-Propagation algorithm - Applications
Improving the accuracy of predictions in data mining and machine learning
An illustrative example: Artificial Neural Networks-Face Recognition
The main benefit of neural networks in facial recognition is the ability to train a system to capture a complex class of facial patterns.
Neural networks are non-linear models, which is one reason they are widely used for facial recognition.
OpenCV provides a commonly used method for face detection.
This method first extracts Haar features from face images to build a large sample set and then uses the AdaBoost algorithm to train the face detector.
Facial recognition is a technology that can recognize a person just by looking at their face.
Facial recognition uses machine learning techniques to identify, collect, store, and evaluate face characteristics so that they can be matched to photos of people in a database.
How AI face recognition works
Facial recognition software reads the geometry of your face.
Key factors include the distance between your eyes and the distance from forehead to chin.
The software identifies facial landmarks (one system identifies 68 of them) that are key to distinguishing your face.
The result: your facial signature.
Biometric technologies use unique traits of particular parts of the body, and include facial recognition, iris scans, fingerprint scans, etc.
AI transforms these distinctive traits into codes that are easy for the system to process.
Face Detection:
The main function of this step is to determine
1. Whether human faces appear in a given image, and
2. Where these faces are located.

• The expected outputs of this step are patches containing each face in the input image. To make the subsequent face recognition system more robust and easier to design, face alignment is performed to normalize the scales and orientations of these patches.
• Besides serving as pre-processing for face recognition, face detection can be used for region-of-interest detection, retargeting, video and image classification, etc.
Feature Extraction:
• After the face detection step, human-face patches are extracted from images.
• Directly using these patches for face recognition has some disadvantages.
• First, each patch usually contains over 1000 pixels, which is too many to build a robust recognition system.
• Second, face patches may be taken from different camera alignments, with different facial expressions and illuminations, and may suffer from occlusion and clutter.
• To overcome these drawbacks, feature extraction is performed to do information packing, dimension reduction, salience extraction, and noise cleaning.
• After this step, a face patch is usually transformed into a vector of fixed dimension or a set of fiducial points and their corresponding locations.
Face Recognition:
• After formalizing the representation of each face, the last step is to recognize the identities of these faces.
• To achieve automatic recognition, a face database must be built. For each person, several images are taken and their features are extracted and stored in the database.
• Then, when an input face image comes in, we perform face detection and feature extraction, and compare its features to each face class stored in the database.
Face detection algorithms are classified into four categories:
• Knowledge-based
• Feature invariant
• Template matching
• Appearance-based
1. Knowledge-based methods:
• These rule-based methods encode human knowledge of what constitutes a typical face.
• Usually, the rules capture the relationships between facial features.
• These methods are designed mainly for face localization, which aims to determine the image position of a
single face.
2. Feature invariant approaches:

• These algorithms aim to find structural features that exist even when the pose, viewpoint, or lighting
conditions vary, and then use these to locate faces. These methods are designed mainly for face
localization.
• In contrast to the knowledge-based methods, the feature invariant approaches start with feature extraction and face candidate finding, and later verify each candidate by the spatial relations among these features, while the knowledge-based methods usually exploit information from the whole image and are sensitive to complicated backgrounds and other factors.
3. Template matching:
Template matching methods use pre-defined or parameterized face templates to locate or detect faces by the correlation between the templates and input images.
Example: a human face can be divided into eyes, face contour, nose, and mouth.
A face model can also be built from edges, using an edge detection method.
This approach is simple to implement, but on its own it is inadequate for face detection.
Deformable templates have been proposed to deal with this limitation.
4. Appearance-based methods:
In contrast to template matching, the models (or templates) are learned from a set of training images that should capture the representative variability of facial appearance. These learned models are then used for detection.
The appearance-based method depends on a set of representative training face images to find face models.
The appearance-based approach generally performs better than the other approaches.
In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face images.
This method is also used in feature extraction for face recognition.
How the Face Detection Works:
There are many techniques to detect faces; with the help of these techniques, we can identify faces with higher accuracy.
These techniques follow almost the same procedure for face detection, whether implemented with OpenCV, neural networks, Matlab, etc.
Face detection works so as to detect multiple faces in an image.
Neural Nets for Face Recognition
(http://www.cs.cmu.edu/tom/faces.html)
• Training images: 20 different persons with 32 images per person.
• After training on 260 images, the network achieves an accuracy of 90% over a separate test set.
• Algorithm parameters: learning rate η = 0.3, momentum α = 0.3
Alternative Error Functions
• Penalize large weights (weight decay): reduces the risk of overfitting
• Train on target slopes as well as values
• Minimize the cross entropy: for learning a probabilistic output function (chapter 6)
-\sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log(1 - o_d) \right]
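A minimal sketch of the cross-entropy error, the negative sum over training examples of t_d log o_d + (1 − t_d) log(1 − o_d), is shown below. The targets and outputs are hypothetical values, and the function assumes every output lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def cross_entropy(targets, outputs):
    """Cross entropy between 0/1 targets and outputs strictly inside (0, 1)."""
    return -sum(
        t * math.log(o) + (1 - t) * math.log(1 - o)
        for t, o in zip(targets, outputs)
    )

targets = [1, 0, 1]                 # hypothetical target values t_d
confident = [0.9, 0.1, 0.8]         # outputs close to the targets
hesitant = [0.6, 0.4, 0.5]          # outputs closer to 0.5

print("confident outputs:", cross_entropy(targets, confident))
print("hesitant outputs: ", cross_entropy(targets, hesitant))
```

Outputs that sit closer to their targets give a lower cross entropy, so minimizing it pushes each output toward the probability that the target is 1.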
Recurrent Networks
Figure panels: (a) feedforward network, (b) recurrent network, (c) recurrent network unfolded in time
Dynamically Modifying Network Structure
• To improve generalization accuracy and training efficiency
• Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
– Start with the simplest possible network (no hidden units) and add complexity
• LeCun et al. 1990
– Start with a complex network and prune it as we find that certain connections are inessential.
Evaluating Hypotheses
Hypothesis
Hypothesis is a supposition or proposed explanation made on the basis of limited evidence as a starting
point for further investigation
A hypothesis can be defined as a “claim” about the truth of something that exists in the world.
Until the claim is proven to be true, it is called a hypothesis. Once the claim is proved, it becomes the new truth about the thing.
Example
A claim is made that students studying for more than 6 hours a day get more than 90% of marks in their examination.
Now, this is just a claim and not the truth in the real world.
For the claim to become accepted as truth, it needs to be supported by evidence.
To support or reject this claim, one needs to do some empirical analysis by gathering data samples and evaluating the claim.
The process of gathering data and evaluating the claim with the goal of rejecting or failing to reject the claim is termed hypothesis testing. Note the wording: “failing to reject”.
It means that we don't have enough evidence to reject the claim; until new evidence comes up, the claim can be treated as the truth.
There are different techniques to test the claim in order to reach a conclusion about whether the hypothesis can be used to represent the truth of the world.
A hypothesis is a statement of a relationship between two or more variables.
• A hypothesis is an explanation for something.
• It is a provisional idea, an educated guess that requires some evaluation.
• A good hypothesis is testable; it can be either true or false.
• In science, a hypothesis must be falsifiable, meaning that there exists a test whose outcome could mean
that the hypothesis is not true. The hypothesis must also be framed before the outcome of the test is known.
• A good hypothesis fits the evidence and can be used to make predictions about new observations or new
situations.
• The hypothesis that best fits the evidence and can be used to make predictions is called a theory, or is part of a theory.
• Hypothesis in Science: Provisional explanation that fits the evidence and can be confirmed or disproved.
Hypothesis
• Statistical hypothesis tests are techniques used to calculate a test statistic that quantifies an observed “effect.”
• The statistic can then be interpreted in order to determine how likely it is to observe such an effect if
a relationship does not exist.
• If the likelihood is very small, then it suggests that the effect is probably real.
• If the likelihood is large, then we may have observed a statistical fluctuation, and the effect is
probably not real.
• One hypothesis is that there is no difference between the population means, based on the data samples.
• This is a hypothesis of no effect and is called the null hypothesis. The statistical
hypothesis test is used to either reject this hypothesis or fail to reject (retain) it.
• We do not say “accept” because the outcome is probabilistic and could still be wrong, just with a low
probability.
Hypothesis
• If the null hypothesis is rejected, then we assume the alternative hypothesis that there exists some difference between the means.
• Null Hypothesis (H0): Suggests no effect.
• Alternate Hypothesis (H1): Suggests some effect.
• Statistical hypothesis tests don’t comment on the size of the effect, only the likelihood of the presence or absence of the effect in the population, based on the observed samples of data.
• Hypothesis in Statistics: Probabilistic explanation about the presence of a relationship between observations.
Hypothesis in Machine Learning
• Machine learning, specifically supervised learning, can be described as the desire to use available data to
learn a function that best maps inputs to outputs.
• Technically, this is a problem called function approximation, where we are approximating an unknown target
function (that we assume exists) that can best map inputs to outputs on all possible observations from the
problem domain.
• An example of a model that approximates the target function and performs mappings of inputs to outputs
is called a hypothesis in machine learning.
• The choice of algorithm (e.g. neural network) and the configuration of the algorithm (e.g. network topology
and hyperparameters) define the space of possible hypotheses that the model may represent.
• Learning for a machine learning algorithm involves navigating the chosen space of hypotheses toward the best,
or a good enough, hypothesis that approximates the target function
• A common notation is used where lowercase-h (h) represents a given specific hypothesis and uppercase-h
(H) represents the hypothesis space that is being searched.
• h (hypothesis): A single hypothesis, e.g. an instance or specific candidate model that maps inputs to outputs
and can be evaluated and used to make predictions.
• H (hypothesis set): A space of possible hypotheses for mapping inputs to outputs that can be searched, often
constrained by the choice of the framing of the problem, the choice of model and the choice of model
configuration.
Motivation
Evaluating the performance of learning systems is important because
Learning systems are usually designed to predict the class of “future” unlabeled data points
In some cases, evaluating hypotheses is an integral part of the learning process (example, when
pruning a decision tree)
Motivation
Underfitting
A statistical model or a machine learning algorithm is said to underfit when it cannot
capture the underlying trend of the data.
It’s just like trying to fit into undersized pants!
Underfitting hurts the accuracy of our machine learning model on both training data and new data
Underfitting simply means that our model or algorithm does not fit the data
well enough
Underfitting usually happens when we have too little data to build an accurate model, or when
we try to fit a linear model to non-linear data
In such cases, the rules of the machine learning model are too simple
to describe the data, and therefore the model will probably make a lot of wrong predictions
Underfitting can be reduced by using more data and by giving the model more capacity, for example
more informative features
Motivation
Techniques to reduce Underfitting
Increase model complexity

Increase the number of features, performing feature engineering
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results
Motivation
Overfitting
A statistical model is said to be overfitted when it fits the training data too closely, including its noise,
just like fitting ourselves into oversized pants!
When a model has too much freedom relative to the amount of training data, it starts learning from the noise and inaccurate
data entries in our data set
Then the model does not categorize new data correctly, because it has memorized too many details and noise
Non-parametric and non-linear methods are especially prone to overfitting because these types
of machine learning algorithms have more freedom in building the model based on the dataset
and therefore they can build unrealistic models
One way to avoid overfitting is to use a linear algorithm if we have linear data, or to constrain
parameters such as the maximal depth if we are using decision trees
Motivation
Techniques to reduce Overfitting
Increase training data
Reduce model complexity
Early stopping during the training phase (keep an eye on the validation loss during training and
stop as soon as it begins to increase)
Ridge Regularization and Lasso Regularization
Use dropout for neural networks to tackle overfitting.
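The effect of model complexity on underfitting and overfitting can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: the target function sin(2πx), the noise level, the sample sizes and the polynomial degrees are all made up for the example and do not come from the slides. It fits polynomials of increasing degree to a small training set and reports the error on the training set and on a large held-out set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Noisy samples of an assumed target function sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(1000)    # large held-out set

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree to the training data.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The degree-1 model is too simple for the curved target (underfitting), while a high-degree model typically has a much lower training error than test error (overfitting).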
Motivation – Underfitting and Overfitting
Motivation
In many cases it is important to evaluate the performance of learned hypotheses as precisely
as possible
One reason is simply to understand whether to use the hypothesis
Example:
When learning from a limited-size database indicating the effectiveness of different
medical treatments, it is important to understand as precisely as possible the accuracy of
the learned hypotheses
A second reason is that evaluating hypotheses is an integral component of many learning
methods
Example:
In post-pruning decision trees to avoid overfitting, we must evaluate the impact of possible
pruning steps on the accuracy of the resulting decision tree
Difficulties in Evaluating Hypotheses when only limited data are available
It is important to understand the likely errors inherent in estimating the accuracy of the
pruned and unpruned tree
Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful
When we must learn a hypothesis and estimate its future accuracy given only a limited set of
data, two key difficulties arise:
Bias in the estimate
First, the observed accuracy of the learned hypothesis over the training examples is often a
poor estimator of its accuracy over future examples
Because the learned hypothesis was derived from these examples, they will typically provide an
optimistically biased estimate of hypothesis accuracy over future examples
This is especially likely when the learner considers a very rich hypothesis space, enabling it to
overfit the training examples
To obtain an unbiased estimate of future accuracy, we typically test the hypothesis on some
set of test examples chosen independently of the training examples and the hypothesis
Variance in the estimate
Second, even if the hypothesis accuracy is measured over an unbiased set of test examples
independent of the training examples, the measured accuracy can still vary from the true
accuracy, depending on the makeup of the particular set of test examples
The smaller the set of test examples, the greater the expected variance
When evaluating a learned hypothesis we are most often interested in estimating the accuracy
with which it will classify future instances.
At the same time, we would like to know the probable error in this accuracy estimate (i.e., what
error bars to associate with this estimate)
Example
We consider the following setting for the learning problem
There is some space of possible instances X (e.g., the set of all people) over which various
target functions may be defined (e.g., people who plan to purchase new skis this year).
We assume that different instances in X may be encountered with different frequencies
A convenient way to model this is to assume there is some unknown probability distribution
D that defines the probability of encountering each instance in X (e.g., D might assign a
higher probability to encountering 19-year-old people than 109-year-old people)
Questions Considered
• Given the observed accuracy of a hypothesis over a limited sample of
data, how well does this estimate its accuracy over additional
examples?
• Given that one hypothesis outperforms another over some sample
data, how probable is it that this hypothesis is more accurate, in
general?
• When data is limited what is the best way to use this data to both learn
a hypothesis and estimate its accuracy?
Example
We consider the following setting for the learning problem
Notice that D says nothing about whether x is a positive or negative example; it only determines
the probability that x will be encountered
The learning task is to learn the target concept or target function f by considering a space
H of possible hypotheses
We assume that different instances in X may be encountered with different frequencies
Training examples of the target function f are provided to the learner by a trainer who
draws each instance independently, according to the distribution D, and who then forwards
the instance x along with its correct target value f (x) to the learner
To illustrate
Consider learning the target function "people who plan to purchase new skis this year," given
a sample of training data collected by surveying people as they arrive at a ski resort
In this case the instance space X is the space of all people, who might be described by
attributes such as their
Age
Occupation,
How many times they skied last year, etc
The distribution D specifies for each person x the probability that x will be encountered as
the next person arriving at the ski resort
The target function $f : X \rightarrow \{0, 1\}$ classifies each person according to whether or not they
plan to purchase skis this year.
To illustrate
Within this general setting we are interested in the following two questions:
1. Given a hypothesis h and a data sample containing n examples drawn at random
according to the distribution D, what is the best estimate of the accuracy of h over
future instances drawn from the same distribution?
2. What is the probable error in this accuracy estimate?
Sample Error and True Error
The sample error is the fraction of the sample that the hypothesis misclassifies
It is computed directly from the available data and is used to estimate the true error
The sample error (denoted $error_S(h)$) of hypothesis h with respect to target function f and data
sample S is:
$error_S(h) = \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x))$
where n is the number of examples in S, and the quantity $\delta(f(x), h(x))$ is 1 if $f(x) \neq h(x)$, and 0
otherwise.
The true error is the probability that a random instance drawn from the population is misclassified
It is the error over the whole population, which usually cannot be computed exactly and must be estimated
The true error (denoted $error_D(h)$) of hypothesis h with respect to target function f and
distribution D is the probability that h will misclassify an instance drawn at random according to D.
$error_D(h) = \Pr_{x \in D}[f(x) \neq h(x)]$
The true error is calculated with respect to the data distribution D
The sample error is the fraction of the sample that is misclassified
In the figure, the sample error is 0.2 (1 out of 5 instances misclassified inside the OUR SAMPLE circle)
The true error is 0.5 (10 out of 20 instances misclassified over the whole DATA DISTRIBUTION)
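The two numbers in this example can be reproduced with a few lines of code. This is a minimal sketch: the label lists are hypothetical and were chosen only so that the counts match the example (1 mistake in 5, 10 mistakes in 20). It also assumes that the 20 instances make up the whole population and are equally likely, so that the fraction misclassified equals the true error.

```python
def sample_error(targets, predictions):
    """Fraction of examples where the hypothesis disagrees with the target."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    mistakes = sum(1 for f_x, h_x in zip(targets, predictions) if f_x != h_x)
    return mistakes / len(targets)


# Hypothetical labels chosen to reproduce the numbers in the example.
f_sample = [1, 0, 1, 1, 0]           # true labels f(x) for the 5 sampled instances
h_sample = [1, 0, 1, 0, 0]           # predictions h(x); one disagreement
f_population = [1] * 10 + [0] * 10   # true labels for all 20 instances
h_population = [0] * 20              # h is wrong on the first 10 instances

print(sample_error(f_sample, h_sample))          # 0.2
print(sample_error(f_population, h_population))  # 0.5
```

The gap between 0.2 and 0.5 shows why a single small sample can give a misleading picture of the true error.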
Advanced topics in artificial neural networks.
Multimodal neurons have been observed in some of the most advanced neural networks to date.
Researchers have found that these neurons can respond to a cluster of abstract concepts
centered around a common high-level theme rather than a specific visual feature.
A recurrent neural network (RNN) is a type of artificial neural network which uses sequential data or time series
data.
RNNs are mainly used for
Sequence Classification: Sentiment Classification & Video Classification
Sequence Labelling: Part-of-speech tagging & Named entity recognition
Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous
step is fed as input to the current step
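The idea that the previous step's output is fed back as input can be sketched as a simple recurrent layer. This is a minimal illustration, not a trained model: the layer sizes, the random weights, the tanh activation and the names `W_xh`, `W_hh` and `rnn_forward` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 4   # illustrative sizes, not from the slides
W_xh = rng.normal(0.0, 0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(0.0, 0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)


def rnn_forward(sequence):
    """Run a simple (Elman-style) recurrent layer over a sequence of input vectors."""
    h = np.zeros(hidden_size)          # initial state before any input is seen
    states = []
    for x_t in sequence:
        # The previous step's output h re-enters the computation here.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)


sequence = rng.normal(size=(5, input_size))   # 5 time steps
states = rnn_forward(sequence)
print(states.shape)   # (5, 4)
```

Because each state depends on the one before it, changing an early input changes every later state, which is what lets the network carry information along a sequence.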
Random Variable
A random variable can be viewed as the name of an experiment with a probabilistic
outcome. Its value is the outcome of the experiment.
Probability Distribution
A probability distribution for a random variable Y specifies the probability $\Pr(Y = y_i)$
that Y will take on the value $y_i$, for each possible value $y_i$.
Expected value
The expected value, or mean, of a random variable Y is $E[Y] = \sum_i y_i \Pr(Y = y_i)$. The
symbol $\mu_Y$ is commonly used to represent E[Y]
Variance of a Random Variable
The variance of a random variable is $Var(Y) = E[(Y - \mu_Y)^2]$. The variance characterizes
the width or dispersion of the distribution about its mean
Standard Deviation
The standard deviation of Y is $\sqrt{Var(Y)}$.
The symbol $\sigma_Y$ is often used to represent the standard deviation of Y
Binomial Distribution
The Binomial distribution gives the probability of observing r heads in a series of n
independent coin tosses, if the probability of heads in a single toss is p
Normal Distribution
The Normal distribution is a bell-shaped probability distribution that covers many
natural phenomena
Central Limit Theorem
The Central Limit Theorem is a theorem stating that the sum of a large number of
independent, identically distributed random variables approximately follows a Normal
distribution
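The definitions of expected value, variance and standard deviation can be checked numerically on the Binomial distribution. The sketch below uses the standard Binomial probability formula, which is not written out on the slide, and the values n = 40 and p = 0.3 are chosen only for illustration.

```python
from math import comb, sqrt


def binomial_pmf(r, n, p):
    """Probability of exactly r heads in n independent tosses with P(heads) = p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 40, 0.3   # illustrative values
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

# Apply the definitions directly: E[Y] = sum_i y_i Pr(Y = y_i), Var(Y) = E[(Y - E[Y])^2].
mean = sum(r * pr for r, pr in enumerate(pmf))
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
std_dev = sqrt(variance)

print(f"sum of probabilities = {sum(pmf):.6f}")
print(f"E[Y] = {mean:.4f}   (n*p = {n * p:.4f})")
print(f"Var(Y) = {variance:.4f}   (n*p*(1-p) = {n * p * (1 - p):.4f})")
print(f"standard deviation = {std_dev:.4f}")
```

The computed mean and variance agree with the known closed forms np and np(1 - p) for a Binomial random variable.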
Estimator
An estimator is a random variable Y used to estimate some parameter p of an
underlying population
Estimation Bias
The estimation bias of Y as an estimator for p is the quantity (E[Y] - p).
An unbiased estimator is one for which the bias is zero.
Confidence Interval
An N% confidence interval estimate for parameter p is an interval that includes p with
probability N%
BASICS OF SAMPLING THEORY
Error Estimation and Estimating Binomial Proportions
Collect a random sample S of n independently drawn instances from the distribution D, and then
measure the sample error $error_S(h)$.
If we repeat this experiment many times, each time drawing a different random sample $S_i$ of size n, we
would expect to observe different values for the various $error_{S_i}(h)$, depending on random differences
in the makeup of the various $S_i$. We say that $error_{S_i}(h)$, the outcome of the ith such experiment, is a
random variable
• Imagine that we were to run k random experiments, measuring the random variables $error_{S_1}(h), error_{S_2}(h), \ldots, error_{S_k}(h)$, and plotted a histogram displaying the frequency with which each possible error value is
observed.
• As k grows, the histogram would approach a particular probability distribution called the Binomial
distribution (shown in the accompanying figure).
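The repeated-sampling experiment described here can be simulated directly. This is a sketch under assumptions: the true error of 0.3, the sample size n = 40 and the number of repetitions k = 5000 are made-up values, and each instance is treated as an independent trial that h misclassifies with probability equal to the true error.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)   # fixed seed so the run is repeatable

true_error = 0.3   # assumed error_D(h); unknown in practice
n = 40             # size of each sample S_i
k = 5000           # number of repeated experiments


def draw_sample_error():
    """Draw n instances and return the fraction that h misclassifies."""
    mistakes = sum(1 for _ in range(n) if rng.random() < true_error)
    return mistakes / n


errors = [draw_sample_error() for _ in range(k)]

print(f"mean of sample errors: {mean(errors):.4f} (true error {true_error})")
print(f"std of sample errors:  {pstdev(errors):.4f} "
      f"(Binomial prediction {sqrt(true_error * (1 - true_error) / n):.4f})")
```

The sample errors scatter around the true error, and their spread is close to the Binomial value sqrt(p(1 - p)/n), which is the quantity the confidence intervals later in this unit are built on.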
The general process includes the following steps:
1. Identify the underlying population parameter p to be estimated
Example: $error_D(h)$
2. Define the estimator Y (e.g., $error_S(h)$).
It is desirable to choose a minimum-variance, unbiased estimator
3. Determine the probability distribution $D_Y$ that governs the
estimator Y, including its mean and variance
4. Determine the N% confidence interval by finding thresholds L
and U such that N% of the mass in the probability distribution $D_Y$ falls
between L and U.
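Applied to the true error of a hypothesis, the four steps above lead to the usual approximate interval error_S(h) ± z_N sqrt(error_S(h)(1 - error_S(h))/n). The sketch below assumes that the Binomial distribution of the sample error is well approximated by a Normal distribution, which is commonly taken to be reasonable for n of about 30 or more. The function name and the numbers in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def error_confidence_interval(sample_err, n, confidence=0.95):
    """Approximate two-sided interval for the true error (Normal approximation)."""
    # z_N is the point that leaves (1 - confidence)/2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(sample_err * (1 - sample_err) / n)
    return sample_err - half_width, sample_err + half_width


# Hypothetical measurement: h misclassifies 12 of 40 test examples.
low, high = error_confidence_interval(12 / 40, 40)
print(f"95% interval for error_D(h): [{low:.3f}, {high:.3f}]")
```

Here L and U are the two ends of the returned interval. With only 40 test examples the interval is wide (roughly 0.30 ± 0.14), which reflects the variance problem discussed earlier.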
Central Limit Theorem:
Consider independent, identically distributed random variables $Y_1, \ldots, Y_n$ with mean $\mu$ and finite variance $\sigma^2$, and let $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$ be their sample mean.
The Central Limit Theorem states that "the probability distribution governing $\bar{Y}_n$ approaches a Normal distribution as
$n \rightarrow \infty$, regardless of the distribution that governs the underlying random variables $Y_i$"
Furthermore, the mean of the distribution governing $\bar{Y}_n$ approaches $\mu$
and the standard deviation approaches $\sigma / \sqrt{n}$
A statistical hypothesis test is a procedure for deciding between two possible
statements about a population
The phrase "significance test" means the same thing as the phrase "hypothesis test"
A hypothesis test is a statistical method that uses sample data to evaluate a
hypothesis about a population
The general goal of a hypothesis test is to rule out chance as a plausible explanation
for the results from a research study
The goal in hypothesis testing is to analyze a sample in an attempt to distinguish
between population characteristics that are likely to occur and population
characteristics that are unlikely to occur
Consider the case where we have two hypotheses $h_1$ and $h_2$ for some discrete-valued target function.
Hypothesis $h_1$ has been tested on a sample $S_1$ containing $n_1$ randomly drawn examples, and $h_2$ has
been tested on an independent sample $S_2$ containing $n_2$ examples drawn from the same
distribution.
Suppose we wish to estimate the difference d between the true errors of these two hypotheses, $d = error_D(h_1) - error_D(h_2)$
The difference between the sample errors, $\hat{d} = error_{S_1}(h_1) - error_{S_2}(h_2)$, gives an unbiased estimate of d, that is, $E[\hat{d}] = d$
Difference in Error of two Hypotheses
● Let $h_1$ and $h_2$ be two hypotheses for some discrete-valued target function. $h_1$ has been
tested on a sample $S_1$ containing $n_1$ randomly drawn examples, and $h_2$ has been tested on
an independent sample $S_2$ containing $n_2$ examples drawn from the same distribution.
● Let’s estimate the difference between the true errors of these two hypotheses, d, by
computing the difference between the sample errors:
● $\hat{d} = error_{S_1}(h_1) - error_{S_2}(h_2)$
● The approximate N% confidence interval for d is:
● $\hat{d} \pm z_N \sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$
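The interval for the difference in errors can be computed as follows. This is a sketch under assumptions: the function name and the example error rates are hypothetical, and the Normal approximation is assumed to hold for both samples.

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(err1, n1, err2, n2, confidence=0.95):
    """Approximate interval for d = error_D(h1) - error_D(h2)."""
    d_hat = err1 - err2                               # difference of sample errors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # z_N for the chosen level
    # Variances of the two independent sample errors add.
    sigma = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma


# Hypothetical measurements: h1 errs on 30% of 100 examples, h2 on 20% of 100.
low, high = difference_confidence_interval(0.30, 100, 0.20, 100)
print(f"d_hat = {0.30 - 0.20:.2f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

In this example the interval contains 0, so with 100 test examples each the observed 10-point gap is not enough to conclude at the 95% level that the two hypotheses differ in true error.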
Comparing Learning Algorithms
● Which of $L_A$ and $L_B$ is the better learning method on average for learning some particular
target function f?
● To answer this question, we wish to estimate the expected value of the difference in their
errors: $E_{S \subset D}[error_D(L_A(S)) - error_D(L_B(S))]$
● Of course, since we have only a limited sample $D_0$, we estimate this quantity by dividing $D_0$
into a training set $S_0$ and a testing set $T_0$ and measuring:
$error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0))$
● Problem: We are only measuring the difference in errors for one training set $S_0$ rather than
the expected value of this difference over all samples S drawn from D
● Solution: k-fold Cross-Validation
k-Fold Cross-Validation
1. Partition the available data $D_0$ into k disjoint
subsets $T_1, T_2, \ldots, T_k$ of equal size, where this size
is at least 30.
2. For i from 1 to k, do
use $T_i$ for the test set, and the remaining data for
training set $S_i$
• $S_i \leftarrow \{D_0 - T_i\}$
• $h_A \leftarrow L_A(S_i)$
• $h_B \leftarrow L_B(S_i)$
• $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
3. Return the value avg(δ), where
$avg(\delta) = \frac{1}{k} \sum_{i=1}^{k} \delta_i$
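The procedure above can be sketched in plain Python. Everything specific in this example is an assumption made for illustration: the synthetic data set, the two toy learners (a majority-class learner standing in for L_A and a 1-nearest-neighbour learner standing in for L_B), and the choice k = 10 with 300 examples so that each test fold has 30 examples.

```python
import random


def error(h, examples):
    """Fraction of (x, label) examples that hypothesis h misclassifies."""
    return sum(1 for x, label in examples if h(x) != label) / len(examples)


def k_fold_deltas(data, learner_a, learner_b, k):
    """Return the per-fold differences delta_i = error_Ti(hA) - error_Ti(hB)."""
    folds = [data[i::k] for i in range(k)]   # k disjoint subsets T_1..T_k
    deltas = []
    for i, test_set in enumerate(folds):
        # S_i is all of the data except the current test fold T_i.
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_a = learner_a(train_set)
        h_b = learner_b(train_set)
        deltas.append(error(h_a, test_set) - error(h_b, test_set))
    return deltas


def majority_learner(train):
    """Hypothetical learner A: always predict the most common training label."""
    ones = sum(label for _, label in train)
    prediction = 1 if 2 * ones >= len(train) else 0
    return lambda x: prediction


def nearest_neighbour_learner(train):
    """Hypothetical learner B: predict the label of the closest training point."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]


rng = random.Random(0)
# Synthetic data: label is 1 when x > 0.5, with 10% of labels flipped as noise.
data = []
for _ in range(300):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

deltas = k_fold_deltas(data, majority_learner, nearest_neighbour_learner, k=10)
avg_delta = sum(deltas) / len(deltas)
print(f"avg(delta) = {avg_delta:.3f}")
```

A positive avg(δ) means learner A made more errors than learner B on average over the folds. Both learners are trained and tested on the same splits, which is what makes the comparison a paired one.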
Confidence of the k-fold Estimate
● The approximate N% confidence interval for estimating
$E_{S \subset D_0}[error_D(L_A(S)) - error_D(L_B(S))]$ using avg(δ) is given by:
$avg(\delta) \pm t_{N,k-1} \, s_{avg(\delta)}$
where $t_{N,k-1}$ is a constant similar to $z_N$ (see [Mitchell, Table 5.6]) and
$s_{avg(\delta)}$ is an estimate of the standard deviation of the distribution
governing avg(δ):
$s_{avg(\delta)} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - avg(\delta))^2}$
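Given the k per-fold differences, the interval can be computed as in the sketch below. The δ values are hypothetical, and the t value is passed in as a parameter because it has to be looked up in a t table; 2.262 is the standard two-sided 95% value for k - 1 = 9 degrees of freedom.

```python
from math import sqrt


def k_fold_confidence_interval(deltas, t_value):
    """Interval avg(delta) +/- t * s for the paired k-fold estimate."""
    k = len(deltas)
    avg = sum(deltas) / k
    # Estimated standard deviation of avg(delta), with the 1/(k(k-1)) factor.
    s = sqrt(sum((d - avg) ** 2 for d in deltas) / (k * (k - 1)))
    return avg - t_value * s, avg + t_value * s


# Hypothetical per-fold differences from a 10-fold comparison.
deltas = [0.05, 0.02, 0.08, -0.01, 0.04, 0.03, 0.06, 0.00, 0.07, 0.02]
# Two-sided 95% t value for k - 1 = 9 degrees of freedom, from a standard t table.
T_95_DF9 = 2.262

low, high = k_fold_confidence_interval(deltas, T_95_DF9)
print(f"avg(delta) = {sum(deltas) / len(deltas):.3f}, "
      f"95% interval: [{low:.3f}, {high:.3f}]")
```

For these made-up values the whole interval lies above 0, which would suggest that the first learner has the higher error at the 95% level.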
Assignment - 2
Please complete the following UNIT-2 videos and prepare notes on them
Introduction to Artificial Neural Networks & their Representation of Neural Networks |ML| https://www.youtube.com/watch?v=WtUfElRD3rU
An Introduction to Artificial Neural Networks - Big Data Analytics Tutorial by Mahesh Huddar https://www.youtube.com/watch?v=5glV6PQAekY
https://www.coursera.org/lecture/neural-networks-deep-learning/neural-networks-overview-qg83v
Neural Networks Representation in Machine Learning https://www.youtube.com/watch?v=uGWV4Mbf0PU
Remarks On Back Propagation Algorithm |ML| https://www.youtube.com/watch?v=RQmaFw7Pc_k
Neural Network Representation | Deep Learning https://www.youtube.com/watch?v=MnKPzItvqQw
What Is Backpropagation? — A Step-By-Step Guide To Backpropagation https://medium.com/edureka/backpropagation-bd2cf8fdde81
Back Propagation Algorithm With Example Part-1 |ML| https://www.youtube.com/watch?v=3oOTnFM-3lg
6 Basics Of Sampling Theory |ML| https://www.youtube.com/watch?v=WdYZkv2rtCo
Evaluating The Hypothesis - Motivation, Estimating Hypothesis Accuracy |ML| https://www.youtube.com/watch?v=qXX5xyZj31s

The Perceptron and The Perceptron training rule |ML| https://www.youtube.com/watch?v=LbLcEiSfA-c
Difference in Error Of Two Hypothesis - Hypothesis Testing - Type 1 & Type 2 Errors |ML| https://www.youtube.com/watch?v=XMT3QMVXadA
I will check everyone's notes on the above videos. This is your assignment.
Unit-1 Introduction to Decision Trees https://www.youtube.com/watch?v=FuJVLsZYkuE

Unit-1 Overfitting https://www.youtube.com/watch?v=y6SpA2Wuyt8
DR DV RAMANA,
DATA STRATEGIST –CONSULTANT
AND
CHIEF ACADEMIC ADVISOR
MAIL ADDRESS: drdvramana@pallaviengineeringcollege.ac.in
TO CONTACT: +91 9959423084