
Introduction to Statistical Learning

(ISLR 2.1)

Yingbo Li

Clemson University

MATH 8050



Outline

1 Why Estimate f

2 How to Estimate f

3 Trade-Off: Prediction Accuracy and Model Interpretability

4 Supervised vs Unsupervised Learning

5 Regression vs Classification



Why Estimate f

The Advertising data


For each of n = 200 different markets, we record:
Sales: sales of the product in this market (Y)
TV: advertising budget for TV (X1)
Radio: advertising budget for radio (X2)
Newspaper: advertising budget for newspaper (X3)
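As a minimal way to look at this data layout, the sketch below loads a local copy with pandas; the file name Advertising.csv and the exact column names are assumptions, not something specified on the slide.

    import pandas as pd

    # Load a local copy of the Advertising data (file name assumed).
    ads = pd.read_csv("Advertising.csv")

    print(ads.shape)   # expected to be about (200, 4)
    print(ads.head())  # a few markets with their TV, Radio, Newspaper, Sales values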




We believe that there is a relationship between Y and X


Y : output variable, response
X = (X1 , X2 , X3 ): input variables, predictors
[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets.]




Model the relationship between Y and X


The regression function:

Y = f(X) + ε

f: unknown function
ε: random error with mean zero, i.e., E(ε) = 0.
In the Advertising example:

f (X1 , X2 , X3 ) = E(Y | X1 , X2 , X3 )

Statistical learning, and this course, are all about how to estimate f .
Why?
Prediction.
Inference.
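To make the model Y = f(X) + ε concrete, here is a small simulation sketch; the "true" f and the noise level are made-up assumptions chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # A made-up "true" regression function, for illustration only.
        return 5 + 0.05 * x

    x = rng.uniform(0, 300, size=200)                # e.g., TV-like budgets
    eps = rng.normal(loc=0.0, scale=2.0, size=200)   # random error with mean zero
    y = f(x) + eps                                   # Y = f(X) + epsilon

    # On average the noise cancels out, so y is centered on f(x).
    print(y.mean(), f(x).mean())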




Prediction
If we can get a good estimate for f, we can make accurate predictions for
the response Y based on a new value of X.
For a new market, given the three media budgets, what will the sales be?
We just want to predict sales, not to determine which medium is more
important.
Suppose our estimate for f is f̂; then the output Y for input X is predicted as

Ŷ = f̂(X)

Mean squared error:

E(Y − Ŷ)² = E[f(X) − f̂(X)]² + Var(ε)

The first term is the reducible error (it shrinks as we estimate f better);
the second term, Var(ε), is the irreducible error.
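A small simulation sketch of this decomposition (the true f, the deliberately imperfect f̂, and the noise level are all assumptions chosen for illustration): the test MSE roughly equals the reducible part plus Var(ε).

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 2.0                                   # sd of the irreducible error

    def f(x):                                     # assumed true function
        return 5 + 0.05 * x

    def f_hat(x):                                 # a deliberately imperfect estimate
        return 4 + 0.06 * x

    x = rng.uniform(0, 300, size=100_000)
    y = f(x) + rng.normal(0.0, sigma, size=x.size)
    y_hat = f_hat(x)

    mse = np.mean((y - y_hat) ** 2)               # E(Y - Y_hat)^2, estimated by simulation
    reducible = np.mean((f(x) - f_hat(x)) ** 2)   # error from estimating f imperfectly
    irreducible = sigma ** 2                      # Var(epsilon)

    print(mse, reducible + irreducible)           # the two sides roughly agree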




Inference
We are often interested in understanding the relationship between Y
and each of X1, . . . , Xp. For example:
1 Which predictors actually affect the response?
2 Is the relationship positive or negative?
3 Is the relationship a simple linear one or is it more complicated?

How much impact does the TV budget have on sales?

Which medium generates the biggest boost in sales?



How to Estimate f

Use the training data and a statistical method to estimate f .
We have observed a set of training data

{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )},

where each xi = (xi,1, xi,2, . . . , xi,p)′ is a p-dimensional vector of predictors, and yi is a scalar response.


Statistical learning methods:
- parametric
- non-parametric
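In code, training data of this form is typically stored as an n × p matrix of predictors and a length-n response vector; the sketch below is a generic illustration with arbitrary simulated numbers.

    import numpy as np

    n, p = 200, 3
    rng = np.random.default_rng(2)

    X = rng.normal(size=(n, p))   # row i holds x_i = (x_{i,1}, ..., x_{i,p})
    y = rng.normal(size=n)        # y_i is a scalar response

    print(X.shape, y.shape)       # (200, 3) (200,)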




Income vs Education, Seniority

[Figure: 3D plot of Income against Years of Education and Seniority.]




Parametric methods
Parametric methods reduce the problem of estimating f to one of estimating a
(finite) set of parameters, using a two-step, model-based approach:
1 Come up with a model (some functional form assumption about f ).
The most common example is a linear model.

f (X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

- We only need to estimate p + 1 parameters β0, β1, . . . , βp.
- Although it is almost never exactly correct, a linear model often serves as a
good and interpretable approximation to the unknown true f(X).
2 Use the training data to fit the model, i.e., estimate the unknown
parameters by β̂0, β̂1, . . . , β̂p.
- The most common approach is ordinary least squares (OLS); a minimal
fitting sketch is given below.
- We will see later that there are other, sometimes superior, approaches.
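A minimal OLS sketch using numpy on simulated data (the true coefficients and noise level are assumptions for illustration; in practice one would fit the Advertising data itself):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 200, 3
    X = rng.uniform(0, 100, size=(n, p))            # budget-like predictors
    beta_true = np.array([3.0, 0.05, 0.2, 0.01])    # assumed: intercept, then p slopes
    y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1.5, size=n)

    # Ordinary least squares: prepend a column of ones and solve the least squares problem.
    X1 = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

    print(beta_hat)   # estimates of beta_0, beta_1, ..., beta_p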




A linear model f̂L(X) = β0 + β1X gives a reasonable fit here.

A quadratic model f̂Q(X) = β0 + β1X + β2X² fits slightly better.
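A quick sketch comparing the two fits on simulated data whose true relationship is mildly curved (the data-generating function is an assumption; the slide's own figure is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=200)
    y = 1 + 2 * x + 0.3 * x**2 + rng.normal(0, 2, size=x.size)   # mildly curved "truth"

    lin = np.polyfit(x, y, deg=1)    # f_hat_L(X) = beta_0 + beta_1 X
    quad = np.polyfit(x, y, deg=2)   # f_hat_Q(X) = beta_0 + beta_1 X + beta_2 X^2

    mse_lin = np.mean((y - np.polyval(lin, x)) ** 2)
    mse_quad = np.mean((y - np.polyval(quad, x)) ** 2)
    print(mse_lin, mse_quad)         # the quadratic model fits the training data a bit better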




A linear regression fit to the Income data

[Figure: a linear regression fit (a plane) to Income as a function of Years of Education and Seniority.]

Income = β0 + β1 × Education + β2 × Seniority




Non-parametric methods
Non-parametric methods do not make explicit assumptions about the functional form of f.
Advantage: they can accurately fit a much wider range of possible shapes of f.
Disadvantage: they require a large n to obtain an accurate estimate of f.

[Figure: two thin-plate spline fits to the Income data: a smooth thin-plate spline fit (flexible) and a rough thin-plate spline fit (overfitting).]
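SciPy's RBFInterpolator supports a thin-plate-spline kernel; the sketch below fits both a rough (exact-interpolation) surface and a smoother (regularized) surface to simulated Income-like data. The data-generating function and the smoothing values are assumptions for illustration only.

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    rng = np.random.default_rng(5)
    n = 100
    X = rng.uniform(0, 1, size=(n, 2))        # rescaled (education, seniority) pairs
    income = 50 + 30 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, size=n)

    # smoothing=0 interpolates the training points exactly (rough, prone to overfitting);
    # a larger smoothing value gives a smoother, more regularized surface.
    rough = RBFInterpolator(X, income, kernel="thin_plate_spline", smoothing=0.0)
    smooth = RBFInterpolator(X, income, kernel="thin_plate_spline", smoothing=50.0)

    new_points = rng.uniform(0, 1, size=(5, 2))
    print(rough(new_points))
    print(smooth(new_points))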



Trade-Off: Prediction Accuracy and Model Interpretability

Some trade-offs
Prediction accuracy vs model interpretability
Linear models are easy to interpret; thin-plate splines are not.
Good fit vs over-fit
A model that overfits the training data may not predict well.
[Figure: trade-off between flexibility (horizontal axis) and interpretability (vertical axis). From high interpretability and low flexibility to low interpretability and high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]
Supervised vs Unsupervised Learning



Supervised learning: both X and Y are available
Unsupervised learning: only X is available; there is no Y .
- Example: market segmentation, where we try to divide potential
customers into groups based on their characteristics.
- A common approach is clustering; a minimal clustering sketch follows the
figure below.
[Figure: two scatterplots of X2 against X1 showing unlabeled observations to be grouped by clustering.]
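A minimal clustering sketch with scikit-learn's KMeans on simulated, unlabeled data (the number of groups and the data-generating process are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(6)
    # Unlabeled observations (X1, X2): three simulated groups, and no response Y.
    centers = np.array([[2.0, 2.0], [8.0, 3.0], [5.0, 9.0]])
    X = np.vstack([c + rng.normal(0, 0.8, size=(50, 2)) for c in centers])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))   # roughly 50 observations assigned to each cluster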



Regression vs Classification

Regression: Y is continuous (quantitative).
- Predicting the value of the Dow in 6 months.
- Predicting the value of a given house based on various inputs.
Classification: Y is categorical (qualitative).
- Will the Dow be up (U) or down (D) in 6 months?
- Is this email spam or not? (A small code contrast of the two settings follows below.)
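As a minimal illustration of the distinction (the data are simulated and the model choices are assumptions): a regression method predicts a quantitative response, while a classification method predicts a qualitative label.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 2))
    y_quant = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=200)  # continuous
    y_qual = (y_quant > 1.0).astype(int)                                      # categorical (e.g., up/down)

    reg = LinearRegression().fit(X, y_quant)    # regression: quantitative response
    clf = LogisticRegression().fit(X, y_qual)   # classification: qualitative response

    print(reg.predict(X[:3]))   # numeric predictions
    print(clf.predict(X[:3]))   # predicted class labels (0 or 1)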

