Foundations of Machine Learning
(CSE-4132)
Lecture – 5
Statistical Learning - II
E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)
where [f(X) − f̂(X)]² is the reducible error and Var(ε) is the irreducible error.
• To study techniques for estimating f with the aim of
minimizing the reducible error.
– Note that the irreducible error always provides an upper bound on the
accuracy of our prediction for Y
Inference
• Goal is to understand the way that Y is affected as X1, . . . ,
Xp change.
– We estimate f, but our goal is not necessarily to make predictions
for Y . We instead want to understand the relationship between X
and Y , or more specifically, to understand how Y changes as a
function of X1, . . .,Xp.
• Example
– Which predictors are associated with the response?
– What is the relationship between the response and each predictor?
– Can the relationship between Y and each predictor be adequately
summarized using a linear equation, or is the relationship more
complicated?
Estimating f
• Prerequisite
– Observe a set of n different data points called training data (X,
Y)
• Goal
– Apply a statistical learning method to training data to estimate
the unknown function f
– i.e. find f̂ such that Y ≈ f̂(X) for any observation (X, Y)
• Approaches
– Parametric Methods
– Non-Parametric Methods
Parametric Methods
• Model Selection: Assumption about f’s functional
form
– Example: f linear (linear regression)
– f(X) = β0 + β1X1 + β2X2 + . . . + βpXp
– Need to estimate p+1 coefficients (β0, . . ., βp)
• Training procedure: Use training data to fit / train the
model
– Find values for (β0, . . ., βp) such that
– Y ≈ β0 + β1X1 + β2X2 + . . . + βpXp
– Example: least squares, and other approaches
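As a sketch of this training step, ordinary least squares can be carried out directly with NumPy. The two-predictor model, the synthetic data, and the coefficient values below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Illustrative sketch: fit f(X) = b0 + b1*X1 + b2*X2 by least squares.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                # two predictors X1, X2
beta_true = np.array([1.0, 2.0, -0.5])     # b0, b1, b2 (assumed for the demo)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column; solve min ||y - A b||^2.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)   # close to [1.0, 2.0, -0.5]
```

With a correct model form and enough data, the fitted coefficients recover the assumed ones up to noise, which is exactly the parametric promise: estimating f reduces to estimating p+1 numbers.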
Non-Parametric Methods
• No explicit assumptions about the functional form of f
• Instead they seek an estimate of f that gets as close to
the data points as possible.
• Advantage
– they have the potential to accurately fit a wider range of
possible shapes for f.
• Disadvantage
– a very large number of observations is required in order to
obtain an accurate estimate for f.
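A minimal non-parametric sketch is k-nearest-neighbours regression: estimate f(x0) by averaging the responses of the k training points closest to x0. The one-dimensional sine data and the choice k = 5 below are assumptions for illustration.

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """Non-parametric estimate of f at x0: average the k nearest responses."""
    dist = np.abs(X - x0)            # distances to every training point
    nearest = np.argsort(dist)[:k]   # indices of the k closest points
    return y[nearest].mean()

# Illustrative data: f(x) = sin(x) with noise (assumed for the demo).
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=500)
y = np.sin(X) + rng.normal(scale=0.1, size=500)
print(knn_regress(2.0, X, y))   # roughly sin(2.0) ≈ 0.91
```

No functional form is assumed anywhere, but note the disadvantage stated above: the estimate is only good because 500 observations crowd the neighbourhood of x0.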
Prediction vs Interpretation
• Some methods are less flexible, or more restrictive
– can generate only a limited range of shapes to estimate f
– more interpretable, and hence well suited for inference
• Other methods are more flexible
– can generate a much wider range of possible shapes
– well suited for prediction
Regression vs Classification
• Regression
– Problems with a quantitative response
– Example: Predict wage, value of a house, price of a stock . . .
• Classification
– Problems with a qualitative response
– i.e. the response takes values in one of K different classes, or categories.
– Example: predict a person's gender; the brand of product purchased; a
cancer diagnosis . . .
Model Accuracy - Quality of Fit
• How well its predictions actually match the observed
data.
• Mean squared error (MSE) for training data
MSE = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²
• MSE for test data
Ave(y0 − f̂(x0))²
• The expected test MSE at x0 decomposes as
E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
• where E(y0 − f̂(x0))² is the averaged test MSE obtained by repeatedly
estimating f using a large number of training sets, each evaluated at x0.
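The training-data MSE above is straightforward to compute; the small arrays below are hypothetical numbers used only to exercise the formula.

```python
import numpy as np

def mse(y, f_hat):
    """Mean squared error: average of (y_i - f̂(x_i))^2."""
    return np.mean((y - f_hat) ** 2)

# Hypothetical observed values and predictions.
y_train = np.array([3.0, 1.5, 4.0])
pred    = np.array([2.5, 1.5, 5.0])
print(mse(y_train, pred))   # (0.25 + 0 + 1) / 3 = 0.41666...
```

The same function computes the test MSE when given held-out observations and their predictions instead of the training data.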
[Figure: test MSE versus model flexibility; the marked point is the
flexibility level corresponding to the smallest test MSE]
Classification Model Accuracy
• The accuracy measure needs to be modified for classification problems,
since yi is no longer numerical but categorical.
• Training error rate
(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi), where I(yi ≠ ŷi) = 1 if yi ≠ ŷi and 0 otherwise
• Test error rate
Ave(I(y0 ≠ ŷ0))
• ŷ0 is the predicted class label that results from applying the classifier
to a test observation with predictor x0
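The training error rate reduces to the fraction of misclassified labels. A sketch, with made-up labels used only to exercise the formula:

```python
import numpy as np

def error_rate(y, y_hat):
    """Fraction of misclassified observations: (1/n) * sum of I(y_i != ŷ_i)."""
    return np.mean(np.asarray(y) != np.asarray(y_hat))

# Hypothetical true and predicted class labels.
y_true = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "cat"]
print(error_rate(y_true, y_pred))   # 2 misclassified out of 5 -> 0.4
```

Applied to held-out observations, the same function gives the test error rate Ave(I(y0 ≠ ŷ0)).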
Model Accuracy - Bayes Classifier
• Pr(Y = j | X = x0)
– Probability that Y = j, given the observed test predictor x0.
• In general, assign a test observation x0 to the class j for which
Pr(Y = j | X = x0) is largest.
• Example (two-class problem)
– If Pr(Y = 1 | X = x0) > 0.5, predict class 1; otherwise predict class 2.
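The two-class rule can be sketched as a one-line threshold on the conditional probability; the posterior values fed to it below are hypothetical.

```python
def bayes_predict(p_class1):
    """Two-class Bayes rule: predict class 1 when Pr(Y=1 | X=x0) > 0.5."""
    return 1 if p_class1 > 0.5 else 2

# Hypothetical posterior probabilities Pr(Y=1 | X=x0) at three test points.
for p in (0.8, 0.5, 0.3):
    print(p, "->", bayes_predict(p))   # 0.8 -> 1, 0.5 -> 2, 0.3 -> 2
```

In practice Pr(Y = j | X = x0) is unknown, so the Bayes classifier is a gold standard that real classifiers try to approximate.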