Foundations of Machine Learning
(CSE-4132)
Lecture – 5
Statistical Learning - II
E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)
where [f(X) − f̂(X)]² is the reducible error and Var(ε) is the irreducible error.
• To study techniques for estimating f with the aim of
minimizing the reducible error.
– Note that the irreducible error always provides an upper bound on the
accuracy of our prediction for Y
Inference
• Goal is to understand the way that Y is affected as X1, . . . ,
Xp change.
– We estimate f, but our goal is not necessarily to make predictions
for Y . We instead want to understand the relationship between X
and Y , or more specifically, to understand how Y changes as a
function of X1, . . .,Xp.
• Example
– Which predictors are associated with the response?
– What is the relationship between the response and each predictor?
– Can the relationship between Y and each predictor be adequately
summarized using a linear equation, or is the relationship more
complicated?
Estimating f
• Prerequisite
– Observe a set of n different data points called training data (X,
Y)
• Goal
– Apply a statistical learning method to training data to estimate
the unknown function f
– i.e. find f̂ such that Y ≈ f̂(X) for any observation (X, Y)
• Approaches
– Parametric Methods
– Non-Parametric Methods
Parametric Methods
• Model Selection: Assumption about f’s functional
form
– Example: f linear (linear regression)
– f(X) = β0 + β1X1 + β2X2 + . . . + βpXp
– Need to estimate p+1 coefficients (β0, . . ., βp)
• Training procedure: Use training data to fit / train the
model
– Find values for (β0, . . ., βp) such that
– Y ≈ β0 + β1X1 + β2X2 + . . . + βpXp
– Example: least squares, and other approaches
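As a sketch of this training step, ordinary least squares can be carried out directly with NumPy. The two-predictor model, the synthetic data, and the coefficient values below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Illustrative sketch: fit f(X) = b0 + b1*X1 + b2*X2 by least squares.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                # two predictors X1, X2
beta_true = np.array([1.0, 2.0, -0.5])     # b0, b1, b2 (assumed for the demo)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column; solve min ||y - A b||^2.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)   # close to [1.0, 2.0, -0.5]
```

With a correct model form and enough data, the fitted coefficients recover the assumed ones up to noise, which is exactly the parametric promise: estimating f reduces to estimating p+1 numbers.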
Non-Parametric Methods
• No explicit assumptions about the functional form of f
• Instead they seek an estimate of f that gets as close to
the data points as possible.
• Advantage
– they have the potential to accurately fit a wider range of
possible shapes for f.
• Disadvantage
– a very large number of observations is required in order to
obtain an accurate estimate for f.
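A minimal non-parametric sketch is k-nearest-neighbours regression: estimate f(x0) by averaging the responses of the k training points closest to x0. The one-dimensional sine data and the choice k = 5 below are assumptions for illustration.

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """Non-parametric estimate of f at x0: average the k nearest responses."""
    dist = np.abs(X - x0)            # distances to every training point
    nearest = np.argsort(dist)[:k]   # indices of the k closest points
    return y[nearest].mean()

# Illustrative data: f(x) = sin(x) with noise (assumed for the demo).
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=500)
y = np.sin(X) + rng.normal(scale=0.1, size=500)
print(knn_regress(2.0, X, y))   # roughly sin(2.0) ≈ 0.91
```

No functional form is assumed anywhere, but note the disadvantage stated above: the estimate is only good because 500 observations crowd the neighbourhood of x0.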
Prediction vs Interpretation
• Some methods are less flexible, or more restrictive
– can generate only a limited range of shapes to estimate f
– more interpretable, and hence well suited for inference
• Other methods are more flexible
– can generate a much wider range of possible shapes
– well suited for prediction
Regression vs Classification
• Regression
– Problems with a quantitative response
– Example: Predict wage, value of a house, price of a stock . . .
• Classification
– Problems with a qualitative response
– i.e. the response takes values in one of K different classes, or categories.
– Example: predict a person's gender; the brand of product purchased; a
cancer diagnosis . . .
Model Accuracy - Quality of Fit
• How well its predictions actually match the observed
data.
• Mean squared error (MSE) for training data
MSE = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²
• MSE for test data
Ave(y0 − f̂(x0))²
• The expected test MSE at x0 decomposes as
E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
• where E(y0 − f̂(x0))² is the averaged test MSE obtained by repeatedly
estimating f using a large number of training sets, each evaluated at x0.
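The training-data MSE above is straightforward to compute; the small arrays below are hypothetical numbers used only to exercise the formula.

```python
import numpy as np

def mse(y, f_hat):
    """Mean squared error: average of (y_i - f̂(x_i))^2."""
    return np.mean((y - f_hat) ** 2)

# Hypothetical observed values and predictions.
y_train = np.array([3.0, 1.5, 4.0])
pred    = np.array([2.5, 1.5, 5.0])
print(mse(y_train, pred))   # (0.25 + 0 + 1) / 3 = 0.41666...
```

The same function computes the test MSE when given held-out observations and their predictions instead of the training data.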
[Figure: test MSE versus model flexibility; the marked point is the
flexibility level corresponding to the smallest test MSE]
Classification Model Accuracy
• The accuracy measure needs to be modified for classification problems,
since yi is no longer numerical but categorical.
• Training error rate
(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi), where I(yi ≠ ŷi) = 1 if yi ≠ ŷi and 0 otherwise
• Test error rate
Ave(I(y0 ≠ ŷ0))
• ŷ0 is the predicted class label that results from applying the classifier
to a test observation with predictor x0
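The training error rate reduces to the fraction of misclassified labels. A sketch, with made-up labels used only to exercise the formula:

```python
import numpy as np

def error_rate(y, y_hat):
    """Fraction of misclassified observations: (1/n) * sum of I(y_i != ŷ_i)."""
    return np.mean(np.asarray(y) != np.asarray(y_hat))

# Hypothetical true and predicted class labels.
y_true = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "cat"]
print(error_rate(y_true, y_pred))   # 2 misclassified out of 5 -> 0.4
```

Applied to held-out observations, the same function gives the test error rate Ave(I(y0 ≠ ŷ0)).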
Model Accuracy - Bayes Classifier
• Pr(Y = j | X = x0)
– Probability that Y = j, given the observed test predictor x0.
• In general, assign a test observation x0 to the class j for which
Pr(Y = j | X = x0) is largest.
• Example (two-class problem)
– If Pr(Y = 1 | X = x0) > 0.5, predict class 1; otherwise predict class 2.
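The two-class rule can be sketched as a one-line threshold on the conditional probability; the posterior values fed to it below are hypothetical.

```python
def bayes_predict(p_class1):
    """Two-class Bayes rule: predict class 1 when Pr(Y=1 | X=x0) > 0.5."""
    return 1 if p_class1 > 0.5 else 2

# Hypothetical posterior probabilities Pr(Y=1 | X=x0) at three test points.
for p in (0.8, 0.5, 0.3):
    print(p, "->", bayes_predict(p))   # 0.8 -> 1, 0.5 -> 2, 0.3 -> 2
```

In practice Pr(Y = j | X = x0) is unknown, so the Bayes classifier is a gold standard that real classifiers try to approximate.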