
Introduction to Statistical Learning

(ISLR 2.1)

Yingbo Li

Clemson University

MATH 8050



Outline

1 Why Estimate f

2 How to Estimate f

3 Trade-Off: Prediction Accuracy and Model Interpretability

4 Supervised vs Unsupervised Learning

5 Regression vs Classification



Why Estimate f

The Advertising data


For each of n = 200 different markets, we record:
Sales: sales of the product in this market (Y)
TV: advertising budget for TV (X1)
Radio: advertising budget for radio (X2)
Newspaper: advertising budget for newspaper (X3)
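As a minimal way to look at this data layout, the sketch below loads a local copy with pandas; the file name Advertising.csv and the exact column names are assumptions, not something specified on the slide.

    import pandas as pd

    # Load a local copy of the Advertising data (file name assumed).
    ads = pd.read_csv("Advertising.csv")

    print(ads.shape)   # expected to be about (200, 4)
    print(ads.head())  # a few markets with their TV, Radio, Newspaper, Sales values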




We believe that there is a relationship between Y and X


Y : output variable, response
X = (X1 , X2 , X3 ): input variables, predictors
[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets.]




Model the relationship between Y and X


The regression function:

Y = f(X) + ε

f: unknown function
ε: random error with mean zero, i.e., E(ε) = 0.
In the Advertising example:

f (X1 , X2 , X3 ) = E(Y | X1 , X2 , X3 )

Statistical learning, and this course, are all about how to estimate f .
Why?
Prediction.
Inference.
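To make the model Y = f(X) + ε concrete, here is a small simulation sketch; the "true" f and the noise level are made-up assumptions chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # A made-up "true" regression function, for illustration only.
        return 5 + 0.05 * x

    x = rng.uniform(0, 300, size=200)                # e.g., TV-like budgets
    eps = rng.normal(loc=0.0, scale=2.0, size=200)   # random error with mean zero
    y = f(x) + eps                                   # Y = f(X) + epsilon

    # On average the noise cancels out, so y is centered on f(x).
    print(y.mean(), f(x).mean())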




Prediction
If we can get a good estimate for f, we can make accurate predictions for
the response Y based on a new value of X.
For a new market, given the three media budgets, what will the sales be?
We just want to predict sales, not to determine which medium is more
important.
Suppose our estimate for f is f̂; then the output Y for input X is predicted as

Ŷ = f̂(X)

Mean squared error:

E(Y − Ŷ)² = E[f(X) − f̂(X)]² + Var(ε)

The first term is the reducible error (it shrinks as we estimate f better);
the second term, Var(ε), is the irreducible error.
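A small simulation sketch of this decomposition (the true f, the deliberately imperfect f̂, and the noise level are all assumptions chosen for illustration): the test MSE roughly equals the reducible part plus Var(ε).

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 2.0                                   # sd of the irreducible error

    def f(x):                                     # assumed true function
        return 5 + 0.05 * x

    def f_hat(x):                                 # a deliberately imperfect estimate
        return 4 + 0.06 * x

    x = rng.uniform(0, 300, size=100_000)
    y = f(x) + rng.normal(0.0, sigma, size=x.size)
    y_hat = f_hat(x)

    mse = np.mean((y - y_hat) ** 2)               # E(Y - Y_hat)^2, estimated by simulation
    reducible = np.mean((f(x) - f_hat(x)) ** 2)   # error from estimating f imperfectly
    irreducible = sigma ** 2                      # Var(epsilon)

    print(mse, reducible + irreducible)           # the two sides roughly agree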




Inference
We are often interested in understanding the relationship between Y
and each of X1, . . . , Xp. For example:
1 Which predictors actually affect the response?
2 Is the relationship positive or negative?
3 Is the relationship a simple linear one or is it more complicated?

How much impact does the TV budget have on sales?

Which medium generates the biggest boost in sales?



How to Estimate f

Use the training data and a statistical method to estimate f .
We have observed a set of training data

{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )},

where each xi = (xi,1, xi,2, . . . , xi,p)′ is a p-dimensional vector of predictors, and yi is a scalar response.


Statistical learning methods:
- parametric
- non-parametric
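In code, training data of this form is typically stored as an n × p matrix of predictors and a length-n response vector; the sketch below is a generic illustration with arbitrary simulated numbers.

    import numpy as np

    n, p = 200, 3
    rng = np.random.default_rng(2)

    X = rng.normal(size=(n, p))   # row i holds x_i = (x_{i,1}, ..., x_{i,p})
    y = rng.normal(size=n)        # y_i is a scalar response

    print(X.shape, y.shape)       # (200, 3) (200,)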




Income vs Education, Seniority

[Figure: 3D plot of Income against Years of Education and Seniority.]




Parametric methods
Parametric methods reduce the problem of estimating f to one of estimating a
(finite) set of parameters, using a two-step, model-based approach:
1 Come up with a model (some functional form assumption about f ).
The most common example is a linear model.

f (X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

- We only need to estimate p + 1 parameters β0, β1, . . . , βp.
- Although it is almost never exactly correct, a linear model often serves as a
good and interpretable approximation to the unknown true f(X).
2 Use the training data to fit the model, i.e., estimate the unknown
parameters by β̂0, β̂1, . . . , β̂p.
- The most common approach is ordinary least squares (OLS); a minimal
fitting sketch is given below.
- We will see later that there are other, sometimes superior, approaches.
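A minimal OLS sketch using numpy on simulated data (the true coefficients and noise level are assumptions for illustration; in practice one would fit the Advertising data itself):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 200, 3
    X = rng.uniform(0, 100, size=(n, p))            # budget-like predictors
    beta_true = np.array([3.0, 0.05, 0.2, 0.01])    # assumed: intercept, then p slopes
    y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1.5, size=n)

    # Ordinary least squares: prepend a column of ones and solve the least squares problem.
    X1 = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

    print(beta_hat)   # estimates of beta_0, beta_1, ..., beta_p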




A linear model f̂L(X) = β0 + β1X gives a reasonable fit here.

A quadratic model f̂Q(X) = β0 + β1X + β2X² fits slightly better.
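A quick sketch comparing the two fits on simulated data whose true relationship is mildly curved (the data-generating function is an assumption; the slide's own figure is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=200)
    y = 1 + 2 * x + 0.3 * x**2 + rng.normal(0, 2, size=x.size)   # mildly curved "truth"

    lin = np.polyfit(x, y, deg=1)    # f_hat_L(X) = beta_0 + beta_1 X
    quad = np.polyfit(x, y, deg=2)   # f_hat_Q(X) = beta_0 + beta_1 X + beta_2 X^2

    mse_lin = np.mean((y - np.polyval(lin, x)) ** 2)
    mse_quad = np.mean((y - np.polyval(quad, x)) ** 2)
    print(mse_lin, mse_quad)         # the quadratic model fits the training data a bit better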




A linear regression fit to the Income data

[Figure: a linear regression fit (a plane) to Income as a function of Years of Education and Seniority.]

Income = β0 + β1 × Education + β2 × Seniority




Non-parametric methods
Non-parametric methods do not make explicit assumptions about the functional form of f.
Advantage: they can accurately fit a much wider range of possible shapes of f.
Disadvantage: they require a large n to obtain an accurate estimate of f.

[Figure: two thin-plate spline fits to the Income data: a smooth thin-plate spline fit (flexible) and a rough thin-plate spline fit (overfitting).]
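SciPy's RBFInterpolator supports a thin-plate-spline kernel; the sketch below fits both a rough (exact-interpolation) surface and a smoother (regularized) surface to simulated Income-like data. The data-generating function and the smoothing values are assumptions for illustration only.

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    rng = np.random.default_rng(5)
    n = 100
    X = rng.uniform(0, 1, size=(n, 2))        # rescaled (education, seniority) pairs
    income = 50 + 30 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, size=n)

    # smoothing=0 interpolates the training points exactly (rough, prone to overfitting);
    # a larger smoothing value gives a smoother, more regularized surface.
    rough = RBFInterpolator(X, income, kernel="thin_plate_spline", smoothing=0.0)
    smooth = RBFInterpolator(X, income, kernel="thin_plate_spline", smoothing=50.0)

    new_points = rng.uniform(0, 1, size=(5, 2))
    print(rough(new_points))
    print(smooth(new_points))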



Trade-Off: Prediction Accuracy and Model Interpretability

Some trade-offs
Prediction accuracy vs model interpretability
Linear models are easy to interpret; thin-plate splines are not.
Good fit vs over-fit
A model that overfits the training data may not predict well.
[Figure: trade-off between flexibility (horizontal axis) and interpretability (vertical axis). From high interpretability and low flexibility to low interpretability and high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]
Supervised vs Unsupervised Learning



Supervised learning: both X and Y are available
Unsupervised learning: only X is available; there is no Y .
- Example: market segmentation, where we try to divide potential
customers into groups based on their characteristics.
- A common approach is clustering; a minimal clustering sketch follows the
figure below.
[Figure: two scatterplots of X2 against X1 showing unlabeled observations to be grouped by clustering.]
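A minimal clustering sketch with scikit-learn's KMeans on simulated, unlabeled data (the number of groups and the data-generating process are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(6)
    # Unlabeled observations (X1, X2): three simulated groups, and no response Y.
    centers = np.array([[2.0, 2.0], [8.0, 3.0], [5.0, 9.0]])
    X = np.vstack([c + rng.normal(0, 0.8, size=(50, 2)) for c in centers])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))   # roughly 50 observations assigned to each cluster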



Regression vs Classification

Regression: Y is continuous (quantitative).
- Predicting the value of the Dow in 6 months.
- Predicting the value of a given house based on various inputs.
Classification: Y is categorical (qualitative).
- Will the Dow be up (U) or down (D) in 6 months?
- Is this email spam or not? (A small code contrast of the two settings follows below.)
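As a minimal illustration of the distinction (the data are simulated and the model choices are assumptions): a regression method predicts a quantitative response, while a classification method predicts a qualitative label.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 2))
    y_quant = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=200)  # continuous
    y_qual = (y_quant > 1.0).astype(int)                                      # categorical (e.g., up/down)

    reg = LinearRegression().fit(X, y_quant)    # regression: quantitative response
    clf = LogisticRegression().fit(X, y_qual)   # classification: qualitative response

    print(reg.predict(X[:3]))   # numeric predictions
    print(clf.predict(X[:3]))   # predicted class labels (0 or 1)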

