Unit IV

SKN Sinhgad College of Engineering, Korti,
Pandharpur
Class-Bachelor of Engineering
Subject-Machine Learning
Department of Computer Science & Engineering
Presentation Prepared By
Mr. Subhash V. Pingale
Unit-IV
Validating Machine Learning

Explaining how correct sampling is critical in machine learning
Highlighting errors dictated by bias and variance
Proposing different approaches to validation and testing
Warning against biased samples, overfitting, underfitting, and snooping
2 Thursday 25 April 2024

Statistics assumes that the future won't differ too much from the past. Thus, you can base future predictions on past data by employing random sampling theory.
If you select examples randomly, without any selection criterion, you have a good chance of choosing examples that won't differ much from future examples. In statistical terms, you can expect the distribution of your present sample to closely resemble the distribution of future samples.
To ensure that your algorithm is learning correctly from data, you should always check what the algorithm has learned from in-sample data (the data used for training) by testing your hypothesis on some out-of-sample data. Out-of-sample data is data you didn't have at learning time, and it should represent the kind of data you need to create forecasts.
Looking for generalization
Generalization is the capability to learn, from the data at hand, general rules that you can apply to all other data. Out-of-sample data therefore becomes essential to figuring out whether learning from data is possible, and to what extent.

Bias and Variance
Bias is the difference between the average prediction of our model and the correct value.
• A model with high bias pays very little attention to the training data and oversimplifies the problem. It always leads to high error on both training and test data.
Variance is the variability of the model's prediction for a given data point across different training sets; it tells us how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

High bias: regardless of the training sample, or the size of the training sample, the model will produce consistent errors.

High variance: different samples of training data yield different model fits.

Bias-Variance Trade Off
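The trade-off can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings that are not from the slides: a true function f(x) = x^2, Gaussian noise, 20 training points, and two hypothetical models (a constant predictor and a 1-nearest-neighbour predictor). It repeatedly draws new training samples and measures the bias and variance of each model's prediction at one point.

```python
import random
import statistics


def true_f(x):
    # Assumed ground-truth function for this illustration.
    return x * x


def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def predict_constant(xs, ys, x0):
    # Very simple model: ignores x0 and always predicts the training mean.
    return statistics.fmean(ys)


def predict_nearest(xs, ys, x0):
    # Very flexible model: copies the label of the nearest training point.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_and_variance(predict, x0=0.9, repeats=500, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(repeats):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    bias = statistics.fmean(preds) - true_f(x0)   # average prediction minus truth
    variance = statistics.pvariance(preds)        # spread across training sets
    return bias, variance


bias_c, var_c = bias_and_variance(predict_constant)
bias_n, var_n = bias_and_variance(predict_nearest)
print("constant model: bias", bias_c, "variance", var_c)
print("1-NN model:     bias", bias_n, "variance", var_n)
```

Under these assumptions the constant model shows the larger bias and the nearest-neighbour model the larger variance, which is the pattern the definitions above describe.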

Under-fitting:
A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data.
• Underfitting destroys the accuracy of our machine learning model.
• Its occurrence simply means that our model or algorithm does not fit the data well enough.

• It usually happens when the model is too simple for the data, for example when we try to fit a linear model to non-linear data, or when we have too few informative features. In such cases the rules of the machine learning model are too simple and rigid to describe the data, and therefore the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using a more expressive model and by adding informative features; removing features through feature selection tends to make it worse.

Techniques to reduce under-fitting:
Increase model complexity
Increase the number of features, performing feature
engineering
Remove noise from the data.
Increase the number of epochs or increase the
duration of training to get better results.
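As a rough illustration of the second technique, the sketch below fits a straight line to data generated from a quadratic, then fits the same kind of model on an engineered feature z = x^2. The data, the noise level, and the helper names `fit_line` and `mse` are assumptions made for this example only.

```python
import random


def fit_line(zs, ys):
    """Least-squares fit of y = b0 + b1 * z; returns (b0, b1)."""
    n = len(zs)
    mz = sum(zs) / n
    my = sum(ys) / n
    sxy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    sxx = sum((z - mz) ** 2 for z in zs)
    b1 = sxy / sxx
    return my - b1 * mz, b1


def mse(zs, ys, b0, b1):
    return sum((y - (b0 + b1 * z)) ** 2 for z, y in zip(zs, ys)) / len(ys)


rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]            # inputs from -3.0 to 3.0
ys = [x * x + rng.gauss(0.0, 0.5) for x in xs]   # non-linear target plus noise

# Underfitting: a straight line in x cannot follow the curve.
b0, b1 = fit_line(xs, ys)
mse_linear = mse(xs, ys, b0, b1)

# Feature engineering: the same linear model on z = x^2 fits well.
zs = [x * x for x in xs]
c0, c1 = fit_line(zs, ys)
mse_engineered = mse(zs, ys, c0, c1)

print("training MSE, linear in x:  ", mse_linear)
print("training MSE, linear in x^2:", mse_engineered)
```

The high training error of the first fit is the signature of underfitting: the model is wrong even on the data it was trained on.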

Overfitting:
A statistical model is said to be overfitted when it fits the training data too closely. When a model is too flexible for the amount of data available, or is trained for too long, it starts learning from the noise and inaccurate data entries in our data set. The model then does not categorize new data correctly, because it has absorbed too many details and too much noise.
• Overfitting is most common with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models.

• A way to avoid overfitting is to use a linear algorithm if we have linear data, or to constrain parameters such as the maximal depth if we are using decision trees.

Techniques to reduce overfitting:
Increase training data.
Reduce model complexity.
Early stopping during the training phase (keep an eye on the validation loss over the training period and stop training as soon as it begins to increase).
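Early stopping can be sketched as a small rule applied to the sequence of validation losses. The function name `early_stopping_epoch`, the `patience` parameter, and the loss values below are hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model from `best_epoch` would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improving until epoch 2, then getting worse.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.5]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

A patience larger than one avoids stopping on a single noisy epoch, at the cost of a few extra epochs of training.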

Bias and variance using bulls-eye diagram
Given the simplicity of its mapping of the response, your algorithm tends to systematically overestimate or underestimate the real rules behind the data; this systematic error represents its bias. Bias is characteristic of simpler algorithms that can't express complex mathematical formulations.

Model Assessment
• The generalization performance of a machine learning method relates to its prediction capability on independent test sets.
• Assessment of this performance is extremely important in practice, since it guides the choice of the machine learning method or model.
• Further, this gives us a measure of the quality of the ultimately chosen model.

Model Assessment cont..
If we are in a data-rich situation, the best approach for both model selection and model assessment is to randomly divide the dataset into three parts: training set, validation set, and test set.
The training set is used to fit the models. The validation set is used to estimate prediction error for model selection. The test set is used for assessment of the prediction error of the final chosen model.
A typical split might be 50% for training, and 25% each for validation and testing.
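A minimal sketch of such a random three-way split, using only the Python standard library. The function name `train_val_test_split` and its parameters are hypothetical; the default fractions follow the 50/25/25 split mentioned above.

```python
import random


def train_val_test_split(items, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle the items and cut them into training, validation and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)   # random division, reproducible via seed
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # whatever remains
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 50 25 25
```

The test part should be touched only once, after model selection on the validation part is finished.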
Best solution: use a large designated test set, which is often not available. For the methods presented here, there is insufficient data to split it into three parts.
There are some methods that make mathematical adjustments to the training error rate in order to estimate the test error rate.

• Suppose that we would like to find a set of variables that gives the lowest validation error rate (an estimate of the test error rate).
• If we have a large data set, we can achieve this goal by randomly splitting the data into separate training and validation data sets.
• Then, we use the training data set to build each possible model and select the model that gives the lowest error rate when applied to the validation data set.
[Figure: the data set split into training data and validation data]

Validation Set Approach
Left Panel: Validation error estimates for a single split into training and validation data sets.
Right Panel: Validation error estimates for multiple splits; shows that the estimated test error rate is highly variable.
Advantages:
Conceptually simple and easy to implement.
Drawbacks:
The validation set error rate (MSE) can be highly variable.
Only a subset of the observations (those in the training set) are used to fit the model.
Machine learning methods tend to perform worse when trained on fewer observations.
Thus, the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Cross Validation
Cross-validation is a resampling technique that estimates how efficient and accurate our model will be on unseen data.
It is a method for evaluating machine learning models by training several models on subsets of the available input data set and evaluating each of them on the complementary subset that was held out.

Steps in Cross Validation
1. Divide the cleaned data set into K partitions (folds) of equal size.
2. Treat Fold 1 as the test fold and the other K-1 folds as train folds, and compute the score on the test fold.
3. Repeat step 2 for all folds, each time taking a different fold as the test fold and the remaining folds as train folds.
4. Take the average of the scores of all the folds.
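The steps above can be sketched in a few lines of Python. This is an illustration, not a reference implementation: `k_fold_splits` and `cross_val_mse` are hypothetical names, and the "model" is simply the mean of the training targets, chosen so that the example stays self-contained.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # step 1: k (near-)equal folds
    for i in range(k):                                 # steps 2 and 3
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_mse(ys, k):
    """Average test-fold MSE of a mean predictor over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(ys), k):
        mean = sum(ys[j] for j in train_idx) / len(train_idx)   # "fit"
        fold_mse = sum((ys[j] - mean) ** 2 for j in test_idx) / len(test_idx)
        scores.append(fold_mse)
    return sum(scores) / len(scores)                   # step 4: average the scores


ys = [float(v) for v in range(10)]
cv_score = cross_val_mse(ys, k=5)
print(cv_score)
```

Any model with a fit and a predict step could replace the mean predictor without changing the splitting logic.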

Types of Cross-Validation
1. Holdout Method: This technique removes a part of the training data set and sends it to a model that was trained on the rest of the data set to get predictions. We then calculate the error estimate, which tells us how our model is doing on unseen data. This is known as the Holdout Method.
Pros
• The held-out data is fully independent of the training data.
• This method only needs to be run once, so it has lower computational costs.
Cons
• The performance estimate is subject to higher variance, given the smaller size of the data.

Types of cross Validation cont..
2. LOOCV (Leave-One-Out Cross-Validation):
In this approach we leave 1 data point out of the total n data points; the remaining n-1 samples are used to train the model and the 1 left-out point is used as the validation set. This is repeated for every data point, and then the error is averaged.
Pros
It has zero randomness.
The bias will be lower.
Cons
This method is exhaustive: the model is trained n times, which becomes computationally infeasible for large data sets.
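A small sketch of LOOCV, again with a mean predictor standing in for a real model (an assumption made to keep the example short). The function name `loocv_mse` and the data are hypothetical.

```python
def loocv_mse(ys):
    """Leave-one-out estimate of the MSE of a mean predictor."""
    errors = []
    for i in range(len(ys)):
        train = ys[:i] + ys[i + 1:]            # n-1 points for training
        prediction = sum(train) / len(train)   # "fit" on the training points
        errors.append((ys[i] - prediction) ** 2)   # validate on the 1 left-out point
    return sum(errors) / len(errors)


ys = [1.0, 2.0, 3.0, 4.0]
score = loocv_mse(ys)
print(score)
```

There is no shuffling and no random seed, so repeated runs give the same estimate, and the loop body runs once per data point, which is where the cost comes from.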

3. K-Fold Cross-Validation: The data is divided into k subsets; we can think of it as the holdout method repeated k times, such that each time one of the k subsets is used as the validation set and the other k-1 subsets as the training set. The error is averaged over all k trials to get the overall performance estimate of our model.
We can see that each data point will be in a validation set exactly once and in a training set k-1 times. This helps us reduce bias, as we are using most of the data for fitting, and reduces variance, as every data point is also eventually used for validation.
Pros
It needs far less computational power than LOOCV, since only k models are trained.
The estimate may not be affected much if an outlier is present in the data.
It helps us overcome the problem of variability.
Cons
Imbalanced data sets will distort the estimate, because a fold may contain too few examples of a class.

4. Stratified K-Fold Cross-Validation: The K-Fold cross-validation technique will not work as expected for an imbalanced data set. When we have an imbalanced data set, we need a slight change to the K-Fold technique, such that each fold contains approximately the same proportion of samples of each output class as the complete data set. This variation, which uses strata in K-Fold cross-validation, is known as Stratified K-Fold Cross-Validation.

Pros
It can improve different models using hyper-parameter
tuning.
Helps us compare models.
It helps in reducing both Bias and Variance.
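One simple way to build stratified folds is to deal the indices of each class into the folds in turn. The sketch below assumes that approach; the function name `stratified_folds` and the example labels are hypothetical, and shuffling within each class is omitted for brevity.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)     # deal each class round-robin
    return folds


# Imbalanced toy labels: eight examples of class 0, four of class 1.
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)
for fold in folds:
    print([labels[i] for i in fold])       # every fold holds two 0s and one 1
```

Each fold can then serve as the validation set in turn, exactly as in ordinary K-Fold.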

Biological Neural Systems
• Neuron switching time: > $10^{-3}$ secs
• Number of neurons in the human brain: ~$10^{10}$
• Connections (synapses) per neuron: ~$10^{4}$–$10^{5}$
• Face recognition: 0.1 secs
• High degree of distributed and parallel computation
• Highly fault tolerant
• Highly efficient
• Learning is key

Decision Tree Learning
Introduction to Decision Tree.
Steps for generating a decision tree
Classification and regression trees
CART Model Representation
Decision Tree Example

Introduction to Decision Tree

• Decision trees are a collection of divide-and-conquer problem-solving strategies that use a tree-like structure to predict the outcome of a variable.
• It is one of the most powerful predictive analytics techniques, used for generating business rules.
• Decision trees are effective for solving classification problems, in which the response variable takes discrete values.
• Decision trees employ a divide-and-conquer strategy in which the original data is divided into multiple groups or subsets.
• The strategy is to establish groups such that the data within each group is homogeneous.

Steps for generating a decision tree
Start with the root node, in which all the data is present.

Decide on a splitting criterion and a stopping criterion. Using the splitting criterion, the root node is split into two or more subsets, leading to tree branches. A node that is split further is called an internal node; each internal node has exactly one incoming edge.
Further divide each node until no further splitting is possible or the stopping criterion is met.
Terminal nodes are used for generating business rules.
Tree pruning (a process of restricting the size of the tree) is used to avoid large trees and overfitting the data; pruning is achieved through different stopping criteria.
Classification and regression trees
• Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree.
• Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost.
• Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values.

CART Model Representation
• The representation for the CART model is a binary tree.
• This is the binary tree from algorithms and data structures. Each internal node (including the root) represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
• The leaf nodes of the tree contain an output variable (y), which is used to make a prediction.
• Given a dataset with two inputs (x), height in centimeters and weight in kilograms, and the output of sex as male or female, below is a crude example of a binary decision tree (completely fictitious, for demonstration purposes only).
Decision Tree Example
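A binary tree of this kind can be stored as nested nodes and traversed from the root to a leaf. In the sketch below the split points (180 cm and 80 kg), the dictionary layout, and the function name `predict` are made-up assumptions for illustration; they are not values taken from the slides.

```python
# Each internal node holds an input variable and a split point;
# each leaf holds the output value used as the prediction.
TREE = {
    "feature": "height_cm", "threshold": 180,        # hypothetical split point
    "right": {"leaf": "male"},
    "left": {
        "feature": "weight_kg", "threshold": 80,     # hypothetical split point
        "right": {"leaf": "male"},
        "left": {"leaf": "female"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going right when value > threshold."""
    while "leaf" not in node:
        goes_right = sample[node["feature"]] > node["threshold"]
        node = node["right"] if goes_right else node["left"]
    return node["leaf"]


print(predict(TREE, {"height_cm": 185, "weight_kg": 70}))  # male
print(predict(TREE, {"height_cm": 165, "weight_kg": 60}))  # female
```

Reading off each root-to-leaf path gives one rule, which is how terminal nodes turn into business rules.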

Entropy
Entropy is an information theory metric that measures
the impurity or uncertainty in a group of observations.
It determines how a decision tree chooses to split data.
The image below gives a better description of the
purity of a set.
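For a set whose classes occur with proportions $p_i$, the standard definition is $H = -\sum_i p_i \log_2 p_i$: a pure set has entropy 0 and an evenly mixed two-class set has entropy 1. The sketch below computes it, along with the information gain of a candidate split (the drop in entropy from parent to children, weighted by child size). The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ["yes", "yes", "no", "no"]
print(entropy(mixed))                                   # 1.0
print(entropy(["yes", "yes", "yes"]))                   # 0.0
print(information_gain(mixed, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A tree that uses entropy as its splitting criterion would pick, at each node, the split with the largest information gain.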

A Neuron
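[Figure: unit $j$ receives activations $a_k$ over input links with weights $W_{kj}$, computes $in_j$, and sends its output $a_j$ over the output links]
$in_j = \sum_k W_{kj} \, a_k$
$a_j = \mathrm{output}(in_j)$
• Computation:
input signals → input function (linear) → activation function (nonlinear) → output signal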
Part 1. Perceptrons: Simple NN
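[Figure: inputs $x_1, \dots, x_n$, each in the range $[0, 1]$, enter the unit through weights $w_1, \dots, w_n$; the unit computes an activation $a$ and an output $y$]
$a = \sum_{i=1}^{n} w_i x_i$
$y = \begin{cases} 1 & \text{if } a \ge \theta \\ 0 & \text{if } a < \theta \end{cases}$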
Decision Surface of a Perceptron
[Figure: points labelled 0 and 1 in the $(x_1, x_2)$ plane, separated by the decision line]
Decision line: $w_1 x_1 + w_2 x_2 = \theta$
Linear Separability
Logical AND: $w_1 = 1$, $w_2 = 1$, $\theta = 1.5$
Logical XOR: $w_1 = ?$, $w_2 = ?$, $\theta = ?$ (no single decision line separates the two classes)

Logical AND
x1  x2  a  y
0   0   0  0
0   1   1  0
1   0   1  0
1   1   2  1

Logical XOR
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0
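The two tables can be checked with a few lines of code. The sketch below evaluates a threshold unit with the AND weights given above and then searches a coarse grid of weights and thresholds for XOR. The grid (values from -4 to 4 in steps of 0.5) and the function names are assumptions made for illustration; an empty search result on a grid is only suggestive, though for XOR it agrees with the known fact that no separating line exists.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]
XOR_TARGETS = [0, 1, 1, 0]


def perceptron(x, w, theta):
    """Threshold unit: output 1 if the weighted sum reaches theta, else 0."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a >= theta else 0


def search(targets, grid):
    """Return the first (w1, w2, theta) on the grid that reproduces targets."""
    for w1, w2, theta in product(grid, repeat=3):
        if [perceptron(x, (w1, w2), theta) for x in INPUTS] == targets:
            return w1, w2, theta
    return None


and_outputs = [perceptron(x, (1, 1), 1.5) for x in INPUTS]
print(and_outputs)                     # [0, 0, 0, 1]

grid = [i / 2 for i in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
and_solution = search(AND_TARGETS, grid)
xor_solution = search(XOR_TARGETS, grid)
print(and_solution is not None)        # True
print(xor_solution)                    # None
```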
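Threshold as Weight: W0
The threshold is folded into the weights as $w_0 = \theta$, with a constant extra input $x_0 = -1$:
$a = \sum_{i=0}^{n} w_i x_i$
$y = \begin{cases} 1 & \text{if } a \ge 0 \\ 0 & \text{if } a < 0 \end{cases}$
Thus, $y = \mathrm{sgn}(a)$, taking the value 0 or 1.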
Training the Perceptron
• Training set S of examples {x, t}
• x is an input vector and
• t the desired target vector
• Example: Logical AND
S = {(0,0),0}, {(0,1),0}, {(1,0),0}, {(1,1),1}

• Iterative process
• Present a training example x, compute network output y, compare output y with target t, adjust weights and thresholds
• Learning rule
Specifies how to change the weights w and thresholds $\theta$ of the network as a function of the inputs x, output y and target t.
Perceptron Learning Rule
• $w' = w + \alpha (t - y) x$
$w_i := w_i + \Delta w_i = w_i + \alpha (t - y) x_i \quad (i = 1..n)$
• The parameter $\alpha$ is called the learning rate.
• In Han's book it is lower case L.
It determines the magnitude of the weight updates $\Delta w_i$.
• If the output is correct ($t = y$), the weights are not changed ($\Delta w_i = 0$).
• If the output is incorrect ($t \ne y$), the weights $w_i$ are changed so that the output of the perceptron for the new weights $w'_i$ moves closer to the target: the weight vector moves toward the input $x$ when $t = 1$ and away from it when $t = 0$.
Perceptron Training Algorithm
Repeat
    for each training vector pair (x, t)
        evaluate the output y when x is the input
        if y ≠ t then
            form a new weight vector w' according to w' = w + α (t - y) x
        else
            do nothing
        end if
    end for
Until y = t for all training vector pairs or # iterations > k
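The algorithm can be sketched in Python on the Logical AND training set. Assumptions made here for illustration: the threshold is treated as weight w0 with constant input x0 = -1 (as on the earlier slide), the weights start at zero, the learning rate is 1, and the function names and the `max_iterations` parameter are hypothetical.

```python
def predict(w, x):
    """Output 1 if the weighted sum (including w0 * x0 with x0 = -1) is >= 0."""
    augmented = (-1,) + tuple(x)
    a = sum(wi * xi for wi, xi in zip(w, augmented))
    return 1 if a >= 0 else 0


def train_perceptron(samples, alpha=1, max_iterations=100):
    """Apply w' = w + alpha * (t - y) * x until every example is correct."""
    n_inputs = len(samples[0][0])
    w = [0] * (n_inputs + 1)               # w[0] plays the role of the threshold
    for _ in range(max_iterations):
        all_correct = True
        for x, t in samples:
            y = predict(w, x)
            if y != t:
                all_correct = False
                augmented = (-1,) + tuple(x)
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, augmented)]
        if all_correct:                    # y = t for all training pairs
            break
    return w


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND_SAMPLES)
print(weights)
print([predict(weights, x) for x, _ in AND_SAMPLES])   # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop ends before the iteration limit; on XOR the same loop would run until `max_iterations` without ever classifying all four examples correctly.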
Thank you
