Download as pdf or txt
Download as pdf or txt
You are on page 1of 38

Statistical learning

Lecture 1: Introduction to Statistical Learning

Gaofeng Da

College of Economics and Management


Nanjing University of Aeronautics and Astronautics
dagfvc@gmail.com
Statistical learning

I. An overview statistical learning?

Statistical Learning: tools, methodologies, and theories for revealing


patterns in data-a critical step in knowledge discovery.
Typical Problems:
• Cancer diagnosis. Given data on expression levels of genes, classify the type of a tumor.
Discover categories of tumors having different characteristics.
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
The prediction is to be based on demographic, diet and clinical measurements for that
patient.
• Marketing. Given data on age, income, etc., predict how much each customer spends.
Discover how the spending behaviors of different customers are related.
• Predict the price of a stock in 6 months from now, on the basis of company performance
measures and economic data.
• Document search: Given counts of words in a document, determine what its topic is. Group
documents by topic without a pre-specified list of topics.
• Cyber security: Detecting malicious websites by learning IP address features; spam by
percentage of words or characters in emails.

Much more(image/speech recognition)....


Statistical learning

Generally, statistical learning includes


• Supervised learning: involves building a statistical model for
predicting, or estimating, an output based on one or more inputs.
• Unsupervised learning: there are inputs but no supervising
output; nevertheless we can learn relationships and structure
from such data
Statistical learning

Supervised Learning Problems


A supervised learning problem has these characteristics:
• We are primarily interested in prediction.
• We are interested in predicting only one thing.
• The possible values of what we want to predict are specified, and
we have some training cases for which its value is known.
The thing we want to predict is called the target or the response
variable.
• For a classification problem, we want to predict the class of an
item-the topic of a document, the type of a tumor, whether a
customer will purchase a product.
• For a regression problem, we want to predict a numerical
quantity such as the amount a customer spends, the blood
pressure of a patient, the price of a stock etc.
To help us make our predictions, we have various inputs (also called
predictors, or features)- eg, gene expression levels for predicting
tumor type, age and income for predicting amount spent. We use
these inputs, but don’t try to predict them.
Statistical learning

General idea
Statistical learning

Unsupervised Learning Problems


For an unsupervised learning problem, we do not focus on prediction
of any particular thing, but rather try to find interesting aspects of the
data.
• One non-statistical formulation: We try to find clusters of
similar items, or to reduce the dimensionality of the data.
Examples: We may find clusters of patients with similar
symptoms, which we call “diseases". Retail companies often use
clustering to identify groups of households that are similar to
each other and then send personalized advertisements
according to characterizations of clusters.
• One statistical formulation: We try to learn the probability
distribution of all the quantities, often using latent (also called
hidden) variables. These formulations are related, since the
latent variables may identify clusters or correspond to
low-dimensional representations of the data.
Statistical learning

II. What is statistical learning?


• Suppose we observe response Y and p different predictors
X1 , X2 , . . . , Xp .
• We believe that there is a relationship between Y and
X = (X1 , . . . , Xp )
• We model the relationship as
Y = f (X ) + ε,
where f is a unknown function and ε is a random error with mean
0.
• We estimate f by data, f̂
Statistical learning
Statistical learning

Why do we estimate f ?

Why do we care about estimating f ?


• Prediction: if we can produce a good estimate for f (and the
variance of ε is not too large) we can make accurate predictions
for the response Y , based on a new value of X .
• Inference: Alternatively, we may also be interested in the type of
relationship between Y and the X .
• Which particular predictors actually affect the response?
• Is the relationship positive or negative?
• Is the relationship a simple linear one or is it more complicated
etc.?
Statistical learning

How do we estimate f ?

Generally,
• Statistical Learning, and this course, are all about how to
estimate f
• The term statistical learning refers to using the data to “learn" f
We will assume we have observed a set of training data

{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}

We must then use the training data and a statistical method to


estimate f .
Statistical Learning Methods:
• Parametric Methods
• Non-parametric Methods
Statistical learning

Parametric methods

• It reduces the problem of estimating f down to one of estimating


a set of parameters.
• They involve a two-step model based approach

- Step 1. Make some assumption about the functional form of f, i.e.


come up with a model. The most common example is a linear
model i.e.
Y = β0 + β1 X1 + . . . + βp Xp + ε
However, in this course we will examine far more complicated,
and flexible, models for f . In a sense the more flexible the model
the more realistic it is.
- Step 2. Use the training data to fit the model i.e. estimate f or
equivalently the unknown parameters β
Statistical learning

Example: A linear regression model


f = β0 + β1 × Education + . . . + βp × Senority

Recall Figure 2.3, it can be found that even if the standard deviation is
low we will still get a bad answer if we use the wrong model.
Statistical learning

Non-parametric Methods

• They do not make explicit assumptions about the functional form


of f .
- Advantage: They accurately fit a wider range of possible shapes
of f .
- Disadvantage: A very large number of observations is required
to obtain an accurate estimate of f .
Statistical learning

Example: A Thin-Plate Spline Estimate

Non-linear regression methods are more flexible (realistic) and can


potentially provide more accurate estimates.
Statistical learning

Tradeoff Between Prediction Accuracy and Model


Interpretability

Why not just use a more flexible method if it is more realistic?


• A simple method such as linear regression produces a model
which is much easier to interpret (the Inference part is better).
For example, in a linear model, βj is the average increase in Y for
a one unit increase in Xj holding all other variables constant.
• Even if you are only interested in prediction, so the first reason is
not relevant, it is often possible to get more accurate predictions
with a simple, instead of a complicated, model. This seems
counter intuitive but has to do with the fact that it is harder to fit a
more flexible model.
Statistical learning

A poor estimate

Non-linear regression methods can also be too flexible and produce


poor estimates for f (overfitting the data).
Statistical learning

Supervised vs. Unsupervised Learning

Again, we can divide all learning problems into Supervised and


Unsupervised situations
• Supervised: both the predictors, Xi , and the response, Yi , are
observed.
• Unsupervised: only the Xi0 s are observed; We need to use the
Xi s to guess what Y would have been and build a model from
there.
Most of this course will also deal with supervised learning; we will
consider unsupervised learning at the end of this course.
Statistical learning

Regression vs. Classification

Supervised learning problems can be further divided into regression


and classification problems.
• Regression covers situations where Y is continuous/numerical
• Classification covers situations where Y is categorical
Statistical learning

III. Assessing model accuracy

• In this course, we will introduce many different statistical learning


methods which extend far beyond the linear regression.
• However, no one method dominates all others over all possible
data sets. On a particular data set, one specific method may
work best, but some other may work better than it on another
data set.
Hence it is important to decide for any given data set which method
produces the best results. Now let us address this problem by
discussing assessing model accuracy.
• Measuring the Quality of Fit
• The Bias-Variance Trade-off
• The Classification Setting
Statistical learning

Measuring the Quality of Fit

• To evaluate the performance of a SL method on a given data set,


we need some way to measure how well its predictions match
the observations.
• For a regression problem, we use mean squared error (MSE)

n
1X
MSE = (yi − ŷi )2 ,
n
i=1

where ŷi = f̂ (xi ) is the prediction our method gives for ith
observation in our training data.

• In either case our method has generally been designed to make


(training) MSE small on the training data, e.g. with linear
regression we choose the line such that MSE is minimized.
Statistical learning

Measuring the Quality of Fit (cont)

• However, what we really care about is how well the method


works on new data(“Test Data").
Mathematically, suppose we fit a SLM on training observations
{(x1 , y1 ), . . . , (xn , yn )} and we obtain f̂ . We aren’t really
interested in if
yi ≈ f̂ (xi ),
instead, if
y0 ≈ f̂ (x0 ),
where (x0 , y0 ) is a (previously unseen) test observation not used
to train the SLM.
• We want to choose a SLM giving the lowest test MSE: for a large
number of test observations, the test MSE is given by

Ave(y0 − f̂ (x0 ))2


Statistical learning

Training vs. Test MSE0 s

• There is no guarantee that the method with the smallest training


MSE will have the smallest test (i.e. new data) MSE.
• In general the more flexible a method is the lower its training
MSE will be i.e. it will “fit" or explain the training data very well.
- Side note: More Flexible methods (such as splines) can generate a
wider range of possible shapes to estimate f as compared to less
flexible and more restrictive methods (such as linear regression).
The less flexible the method, the easier to interpret the model.
Thus, there is a trade-off between flexibility and model
interpretability.
• However, the test MSE may in fact be higher for a more flexible
method than for a simple approach like linear regression.
Statistical learning

Example 1 with Different Levels of Flexibility

• Simulated data from f : yi = f (xi ) + εi .


• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline
(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash- Minimum
possible test MSE (irreducible error Var [ε]); for a small training
MSE but a large test MSE, we are said to be overfitting the data.
Statistical learning

Example 2 with Different Levels of Flexibility

• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline


(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash-Minimum
possible test MSE (irreducible error)
Statistical learning

Example 3 with Different Levels of Flexibility

• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline


(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash-Minimum
possible test MSE (irreducible error)
Statistical learning

Bias / Variance Tradeoff

• We find that the test MSE has U-shape in previous examples.


• It can be shown that the U-shape is the result of two competing
properties of SLM: bias and variance.
For any given, X = x0 , the expected test MSE will be equal to

Expected Test MSE = E[(y0 − f̂ (x0 ))2 ]


= Var [f̂ (x0 )] + (E[f (x0 ) − f̂ (x0 )])2 + Var [ε]
= Var [f̂ (x0 )] + Bias2 (f̂ (x0 )) + Var [ε],

which refers to the average test MSE that we would obtain if we


repeatedly estimate f using a large number of training data and
tested each at x0 . The overall expected test MSE can be
obtained by averaging the expected test MSE over all possible x0
in the test set.
Statistical learning

Bias / Variance Tradeoff(cont)

What do we mean by the bias and variance of a SLM?


• Bias: refers to the error that is introduced by modeling a real life
problem (that is usually extremely complicated) by a much
simpler problem. For example, f usually non-linear, assume that
it is linear, then large bias. The more flexible or complex a
method is the less bias it will generally have.
• Variance: refers to how much your estimate for f would change
by if you had a different training data set. Generally, the more
flexible a method is the more variance it has.
• What this means is that as a method gets more complex the bias
will decrease and the variance will increase but expected test
MSE may go up or down (U-shape)!
Statistical learning

Test MSE, Bias and Variance


Statistical learning

The classification setting


• For a regression problem, we used the MSE to assess the
accuracy of the statistical learning method.
• For a classification problem we can use the error rate i.e.

n
1X
Training error rate = I(yi 6= ŷi ),
n
i=1

where I(·) is an indicator function, which represents the fraction


of incorrect classifications, or misclassifications.
• The test error rate associated with a set of test observations of
the form of (x0 , y0 ) is given by

Ave(I(y0 6= ŷ0 )),

where ŷ0 is the predicted class label that results from applying
the classifier to the test observations with predictor x0 . A good
classifier is one for which the test error rate is smallest.
Statistical learning

The Bayes classifier


• It can be proved that the test error rate can be minimized by The
Bayes Classifier: Assigns each observation to the most likely
class, given its predictor values, i.e., to the class j for which

P(Y = j|X = x0 )

is largest.
• For a two-class problem (Class 1 and Class 2), the Bayes
Classifier predicts Class 1 if P(Y = 1|X = x0 ) > 0.5, Class 2
otherwise.
• The Bayes classifier produces the lowest test error rate, called
Bayes error rate. The overall Bayes error rate is given by

1 − E[maxj P(Y = j|X )],

where the expectation is for random variable X .


• On test data, no classifier can get lower error rates than the
Bayes error rate, which is analogous to the irreducible error.
Statistical learning

Bayes Optimal Classifier: A simulated data set

• Two predictors, X1 and X2 ; two-class, orange and blue;


circles-training data; dash-P(Y = orange|X0 = x0 ) = 0.5 Bayes
decision boundary; the Bayes error rate is 0.1304.
• Note that it is a simulated data set, we know the conditional
prob..Of course in real life problems the Bayes error rate can not
be calculated exactly. For a real problem, what should we do?
Yes, Estimate the conditional prob.
Statistical learning

K-Nearest Neighbors (KNN)

• As the name suggests, KNN estimates the conditional prob. by


using the fraction of points in N0 (the set of K points in the
training data which are closest to x0 ) of belonging to class j, i.e.,

1 X
P(Y = j|X0 = x0 ) ≈ I(yi = j)
K
i∈N0

• KNN is a flexible approach to estimate the Bayes Classifier.


• For any given X we find the k closest neighbors to X in the
training data, and examine their corresponding Y .
• If the majority of the Y 0s are orange we predict orange otherwise
guess blue
• The smaller that k is the more flexible the method will be.
Statistical learning

KNN Example with K = 3


Statistical learning

Simulated Data (ealier): K = 10

• ERKNN = 0.1363, very close to ERBayes = 0.1304


Statistical learning

K = 1 and K = 100

• K = 1, ERKNN = 0.1695, boundary is overly flexible, the classifier


low bias and high variance
• K = 100, ERKNN = 0.1925, boundary is not sufficiently flexible,
has high bias and low variance
Statistical learning

Training vs. Test Error Rates on the Simulated Data

• Notice that training error rates keep going down as K decreases


or equivalently as the flexibility increases.
• However, the test error rate at first decreases but then starts to
increase again.
• The choice of K is critical.
Statistical learning

A Fundamental Picture
For both regression and classification, choosing the correct level of
flexibility is critical to the success of any statistical learning method.

Figure: We must always keep this picture in mind when choosing a learning
method. More flexible/complicated is not always better!

• In general training errors will always decline.


• However, test errors will decline at first (as reductions in bias
dominate) but will then start to increase again (as increases in
variance dominate)-U shape. How to the optimal flex.? Chapter
5.
Statistical learning

R Language
R: a programming language and free software environment for
statistical computing and graphics

• How to read data from a file?


• How to save R output?
• How to generate numbers? Random numbers?
• Basic vector and matrix operations
• Plot
• Loops (for, while, if)
• Write your own functions
• RStudio (IDE for R).

You might also like