Foundations of Machine Learning

(CSE-4132)
Lecture – 5
Statistical Learning - II

Ajit K Nayak, Ph.D.


Prof., Dept. of CSIT
Room# C-118
ajitnayak@soa.ac.in
9338749992
Statistical Learning - II
• In general, the regression function for a vector X is
– f(X) = f(x1, x2, . . ., xp) = E(Y | X1 = x1, X2 = x2, . . ., Xp = xp)
• f is estimated for two purposes
– Prediction
– Inference
• Prediction
– Given a set of inputs X, predict Y using $\hat{Y} = \hat{f}(X)$
– $\hat{f}$ is the estimate for f and $\hat{Y}$ is the resulting prediction for Y
– the error term averages to zero
– The accuracy of $\hat{Y}$ as a prediction for Y depends on two
quantities
• The reducible error and the irreducible error.
The Goal for Prediction
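• The quantity of interest is the average, or expected value, of the
squared difference between the predicted and actual value of Y,
for a fixed estimate $\hat{f}$ and a fixed input x:

$$E(Y - \hat{Y})^2 = E[f(x) + \epsilon - \hat{f}(x)]^2 = \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}$$
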
• The aim is to study techniques for estimating f that
minimize the reducible error.
– The irreducible error always places an upper bound on the
accuracy of our prediction for Y, however well f is estimated
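
The split into reducible and irreducible error can be checked numerically. The sketch below is only an illustration: the true function `f`, the estimate `f_hat`, the noise level `SIGMA`, and the evaluation point `X0` are made-up assumptions, not values from the lecture.

```python
import random
import statistics

# Hypothetical setup, assumed for illustration only.
SIGMA = 0.5   # standard deviation of the error term epsilon
X0 = 2.0      # fixed input at which the prediction is made


def f(x):
    """Assumed true regression function."""
    return 2.0 * x + 1.0


def f_hat(x):
    """Assumed (imperfect) estimate of f."""
    return 1.8 * x + 1.2


rng = random.Random(0)

# Simulate many responses Y = f(X0) + epsilon and score the fixed prediction.
squared_errors = []
for _ in range(100_000):
    y = f(X0) + rng.gauss(0.0, SIGMA)
    squared_errors.append((y - f_hat(X0)) ** 2)

expected_sq_error = statistics.fmean(squared_errors)
reducible = (f(X0) - f_hat(X0)) ** 2   # [f(x) - f_hat(x)]^2
irreducible = SIGMA ** 2               # Var(epsilon)

print(f"simulated E(Y - Y_hat)^2 : {expected_sq_error:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

The two printed numbers should agree closely. A better `f_hat` can shrink the first term toward zero, but no choice of `f_hat` changes Var(ε).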
Inference
• Goal is to understand the way that Y is affected as X1, . . . ,
Xp change.
– We estimate f, but our goal is not necessarily to make predictions
for Y . We instead want to understand the relationship between X
and Y , or more specifically, to understand how Y changes as a
function of X1, . . .,Xp.
• Example questions
– Which predictors are associated with the response?
– What is the relationship between the response and each predictor?
– Can the relationship between Y and each predictor be adequately
summarized using a linear equation, or is the relationship more
complicated?
Estimating f
• Prerequisite
– Observe a set of n different data points called training data (X,
Y)
• Goal
– Apply a statistical learning method to training data to estimate
the unknown function f
– i.e. find $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation (X, Y)
• Approaches
– Parametric Methods
– Non-Parametric Methods
Parametric Methods
• Model Selection: Assumption about f’s functional
form
– Example: f linear (linear regression)
– $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
– Need to estimate p+1 coefficients ($\beta_0, \dots, \beta_p$)
• Training procedure: Use training data to fit / train the
model
– Find values for ($\beta_0, \dots, \beta_p$) such that
– $Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
– Example: least squares, and other approaches
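
As a minimal sketch of the training step, the code below fits the one-predictor case of the linear model by least squares using the closed-form solution. The function name `fit_simple_least_squares` and the training data are hypothetical, chosen only to keep the example small.

```python
def fit_simple_least_squares(xs, ys):
    """Return (b0, b1) minimizing sum((y - b0 - b1 * x)^2) for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Made-up training data lying exactly on y = 1 + 2x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_simple_least_squares(train_x, train_y)
print(b0, b1)  # prints 1.0 2.0
```

With p predictors the same idea applies, but the p+1 coefficients are usually found with a linear algebra routine instead of these hand-written sums.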
Non-Parametric Methods
• No explicit assumptions about the functional form of f
• Instead they seek an estimate of f that gets as close to
the data points as possible.
• Advantage
– they have the potential to accurately fit a wider range of
possible shapes for f.
• Disadvantage
– a very large number of observations is required in order to
obtain an accurate estimate for f.
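
One common non-parametric method is k-nearest-neighbours regression. The slides do not name it, so it is used here only as an assumed example: the prediction at a point is the average response of the k closest training points, with no functional form imposed on f. The function `knn_regress` and the data are hypothetical.

```python
def knn_regress(train_x, train_y, x0, k=3):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    # Sort training pairs by distance from x0 and keep the k closest.
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x0))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up training data following y = x^2, which the method is never told.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

prediction = knn_regress(train_x, train_y, x0=2.1, k=3)
print(round(prediction, 3))  # average of the responses at x = 1, 2, 3
```

With only five observations the estimate is coarse, which illustrates the disadvantage above: many observations are needed before such a method tracks f accurately.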
Prediction vs Interpretation
• Some methods are less flexible, or more restrictive
– they can produce only a small range of shapes to estimate f
– they are more interpretable and hence good for inference
• Other methods are more flexible
– they can generate a much wider range of possible shapes
– they are suitable for prediction, at the cost of interpretability
Regression vs Classification
• Regression
– Problems with a quantitative response
– Example: Predict wage, value of a house, price of a stock . . .
• Classification
– Problems with a qualitative response
– i.e. the response takes on values in one of K different classes, or categories.
– Example: Predict a person's gender, brand of product purchased, cancer
diagnosis . . .
Model Accuracy - Quality of Fit
• Quality of fit: how well a method's predictions actually match
the observed data.
• Mean squared error (MSE) for training data

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$$

• MSE for test data

$$\mathrm{Ave}(\hat{f}(x_0) - y_0)^2$$

– Where (x0, y0) is a previously unseen test observation not used
to train the model.
• Goal: minimize test MSE
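
The two MSE formulas differ only in which observations they average over. The following sketch computes both for a hypothetical fitted model `f_hat`; the model and all data values are invented for illustration.

```python
def mse(ys, predictions):
    """Mean squared error between observed responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


def f_hat(x):
    """Hypothetical fitted model, assumed to have been trained already."""
    return 2.0 * x


# Made-up observations: the training set was used to fit f_hat,
# the test set was not.
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.0]
test_x, test_y = [4.0, 5.0], [8.5, 9.4]

training_mse = mse(train_y, [f_hat(x) for x in train_x])
test_mse = mse(test_y, [f_hat(x) for x in test_x])

print(f"training MSE: {training_mse:.4f}")
print(f"test MSE    : {test_mse:.4f}")
```

In this made-up example the test MSE is much larger than the training MSE, which is why a low training MSE alone is not evidence of a good model.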
Model Accuracy – Bias vs Variance - I
• Variance
– refers to the amount by which $\hat{f}$ would change if we estimated
it using a different training data set.
– Goal: the estimate for f should not vary too much between
training sets.
– more flexible methods have higher variance
• Bias
– refers to the error that is introduced by approximating a real-life
problem, which may be very complicated, by a much simpler model.
– more flexible methods have less bias
• As flexibility increases, the variance will
increase and the bias will decrease
• Test MSE starts to increase once the growth in variance
outweighs the reduction in bias
Model Accuracy – Bias vs Variance - II

$$E(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$$

• Where $E(y_0 - \hat{f}(x_0))^2$ is the expected test MSE at x0, i.e. the average test MSE obtained by
repeatedly estimating f using a large number of training sets and
testing each estimate at x0.

[Figure annotation: flexibility level corresponding to the smallest test MSE]
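
The decomposition can be approximated by simulation: draw many training sets, refit the model each time, and evaluate at a fixed x0. In the sketch below everything is an assumption made for illustration: the true function `f`, the noise level `SIGMA`, the sample sizes, and the deliberately inflexible estimator that predicts the mean training response everywhere.

```python
import random
import statistics

# Hypothetical setup, assumed for illustration only.
SIGMA = 0.3         # standard deviation of the error term epsilon
N_TRAIN = 20        # observations per training set
N_REPEATS = 20_000  # number of independent training sets
X0 = 1.0            # test point


def f(x):
    """Assumed true regression function."""
    return x ** 2


def fit_constant_model(rng):
    """Draw a fresh training set; return f_hat(x0) for a constant model."""
    ys = [f(rng.uniform(-1.0, 1.0)) + rng.gauss(0.0, SIGMA) for _ in range(N_TRAIN)]
    return statistics.fmean(ys)  # the constant model predicts this everywhere


rng = random.Random(1)
predictions, squared_errors = [], []
for _ in range(N_REPEATS):
    pred = fit_constant_model(rng)
    y0 = f(X0) + rng.gauss(0.0, SIGMA)  # fresh test response at x0
    predictions.append(pred)
    squared_errors.append((y0 - pred) ** 2)

variance = statistics.pvariance(predictions)
bias_squared = (statistics.fmean(predictions) - f(X0)) ** 2
expected_test_mse = statistics.fmean(squared_errors)
decomposition = variance + bias_squared + SIGMA ** 2

print(f"Var(f_hat(x0))    : {variance:.4f}")
print(f"Bias(f_hat(x0))^2 : {bias_squared:.4f}")
print(f"Var(eps)          : {SIGMA ** 2:.4f}")
print(f"sum of the three  : {decomposition:.4f}")
print(f"expected test MSE : {expected_test_mse:.4f}")
```

For this rigid model the squared bias dominates and the variance is small. A more flexible estimator would be expected to shift the balance the other way.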
Classification Model Accuracy
• The accuracy measure needs to be modified for
classification problems, as yi is no longer numerical but
categorical.
• Training error rate

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i), \quad \text{where } I(y_i \neq \hat{y}_i) = \begin{cases} 1, & y_i \neq \hat{y}_i \\ 0, & \text{otherwise} \end{cases}$$

• Test error rate

$$\mathrm{Ave}(I(y_0 \neq \hat{y}_0))$$

• $\hat{y}_0$ is the predicted class label that results from applying
the classifier to the test observation with predictor x0
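
The error rate is simply the fraction of misclassified observations. A minimal sketch follows; the function name `error_rate` and the labels are invented for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true one."""
    # Each mismatch contributes I(y != y_hat) = 1, each match contributes 0.
    mistakes = sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)
    return mistakes / len(y_true)


# Made-up class labels and predictions.
y_true = ["orange", "blue", "orange", "blue"]
y_pred = ["orange", "orange", "orange", "blue"]

rate = error_rate(y_true, y_pred)
print(rate)  # prints 0.25
```

Applied to training labels this gives the training error rate. Applied to observations not used for fitting, it gives the test error rate.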
Model Accuracy- Bayes Classifier
• Pr(Y = j | X = x0)
– Probability that Y = j, given the observed test predictor x0.
– The Bayes classifier assigns each observation to the class j for which this probability is largest.
• Example: (two class problem)
– If Pr(Y = 1 | X = x0) > 0.5 => class 1, otherwise class 2
• The orange shaded region is the set of points for which
– Pr(Y = orange | X) > 50%
– Pr(Y = blue | X) < 50%
• Points where the probability is exactly 50% form the Bayes decision
boundary
• Bayes Error Rate
– $1 - E(\max_j \Pr(Y = j \mid X))$
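
When the conditional probabilities are known, both the Bayes classifier and the Bayes error rate can be computed exactly. The sketch below assumes a toy two-class problem in which X takes three values. The probability tables `prob_x` and `prob_orange_given_x` are invented for illustration, since in real problems these probabilities are unknown.

```python
# Hypothetical distribution of X and of Pr(Y = orange | X = x).
prob_x = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
prob_orange_given_x = {0: 0.9, 1: 0.5, 2: 0.2}


def bayes_classify(x):
    """Assign the class with the larger conditional probability at x."""
    p_orange = prob_orange_given_x[x]
    # p_orange == 0.5 lies on the decision boundary; this sketch
    # arbitrarily breaks the tie in favour of "blue".
    return "orange" if p_orange > 0.5 else "blue"


def bayes_error_rate():
    """Compute 1 - E(max_j Pr(Y = j | X)), averaging over the values of X."""
    expected_max = sum(
        prob_x[x] * max(p_orange, 1.0 - p_orange)
        for x, p_orange in prob_orange_given_x.items()
    )
    return 1.0 - expected_max


print([bayes_classify(x) for x in sorted(prob_x)])
print(round(bayes_error_rate(), 4))
```

No classifier can achieve a lower expected error rate than this on the assumed distribution, which makes the Bayes error rate the classification analogue of the irreducible error.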
Thank You
