Lec1-Lecture Advance Statistics

Statistical learning
Lecture 1: Introduction to Statistical Learning
Gaofeng Da
College of Economics and Management

Nanjing University of Aeronautics and Astronautics
dagfvc@gmail.com
I. An overview of statistical learning
Statistical Learning: tools, methodologies, and theories for revealing
patterns in data, a critical step in knowledge discovery.
Typical Problems:
• Cancer diagnosis. Given data on expression levels of genes, classify the type of a tumor.
Discover categories of tumors having different characteristics.
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
The prediction is to be based on demographic, diet and clinical measurements for that
patient.
• Marketing. Given data on age, income, etc., predict how much each customer spends.
Discover how the spending behaviors of different customers are related.
• Predict the price of a stock 6 months from now, on the basis of company performance
measures and economic data.
• Document search: Given counts of words in a document, determine what its topic is. Group
documents by topic without a pre-specified list of topics.
• Cyber security: Detecting malicious websites by learning IP address features; spam by
percentage of words or characters in emails.
Much more (image/speech recognition, ...).

Generally, statistical learning includes

• Supervised learning: involves building a statistical model for
predicting, or estimating, an output based on one or more inputs.
• Unsupervised learning: there are inputs but no supervising
output; nevertheless we can learn relationships and structure
from such data
Supervised Learning Problems

A supervised learning problem has these characteristics:
• We are primarily interested in prediction.
• We are interested in predicting only one thing.
• The possible values of what we want to predict are specified, and
we have some training cases for which its value is known.
The thing we want to predict is called the target or the response
variable.
• For a classification problem, we want to predict the class of an
item: the topic of a document, the type of a tumor, whether a
customer will purchase a product.
• For a regression problem, we want to predict a numerical
quantity such as the amount a customer spends, the blood
pressure of a patient, the price of a stock etc.
To help us make our predictions, we have various inputs (also called
predictors, or features), e.g., gene expression levels for predicting
tumor type, age and income for predicting amount spent. We use
these inputs, but don’t try to predict them.
General idea
Unsupervised Learning Problems

For an unsupervised learning problem, we do not focus on prediction
of any particular thing, but rather try to find interesting aspects of the
data.
• One non-statistical formulation: We try to find clusters of
similar items, or to reduce the dimensionality of the data.
Examples: We may find clusters of patients with similar
symptoms, which we call “diseases”. Retail companies often use
clustering to identify groups of households that are similar to
each other and then send personalized advertisements
according to characterizations of clusters.
• One statistical formulation: We try to learn the probability
distribution of all the quantities, often using latent (also called
hidden) variables. These formulations are related, since the
latent variables may identify clusters or correspond to
low-dimensional representations of the data.
II. What is statistical learning?

• Suppose we observe response Y and p different predictors
X1 , X2 , . . . , Xp .
• We believe that there is a relationship between Y and
X = (X1 , . . . , Xp )
• We model the relationship as
Y = f (X ) + ε,
where f is an unknown function and ε is a random error with mean
0.
• We estimate f from data; the estimate is denoted f̂ .
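As a rough illustration of the model Y = f(X) + ε, the sketch below simulates data from it. The function `true_f`, the noise level `sigma` and the sample size are assumptions made up for this example; in a real problem f is unknown and only the pairs (x, y) are observed.

```python
import random

def true_f(x):
    # Hypothetical "unknown" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def simulate(n, sigma=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, where eps has mean 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [true_f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(10000)

# The sample mean of the errors should be close to 0.
mean_eps = sum(eps) / len(eps)
print("mean of simulated errors:", mean_eps)
```

Statistical learning then amounts to recovering something close to `true_f` from `xs` and `ys` alone.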
Why do we estimate f ?
Why do we care about estimating f ?

• Prediction: if we can produce a good estimate for f (and the
variance of ε is not too large) we can make accurate predictions
for the response Y , based on a new value of X .
• Inference: Alternatively, we may also be interested in the type of
relationship between Y and the X .
• Which particular predictors actually affect the response?
• Is the relationship positive or negative?
• Is the relationship a simple linear one or is it more complicated
etc.?
How do we estimate f ?
Generally,
• Statistical learning, and this course, are all about how to
estimate f .
• The term statistical learning refers to using the data to “learn” f .
We will assume we have observed a set of training data
{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}.
We must then use the training data and a statistical method to
estimate f .
Statistical Learning Methods:
• Parametric Methods
• Non-parametric Methods
Parametric methods
• They reduce the problem of estimating f to one of estimating
a set of parameters.
• They involve a two-step, model-based approach:
- Step 1. Make some assumption about the functional form of f , i.e.,
come up with a model. The most common example is a linear
model, i.e.,
Y = β0 + β1 X1 + . . . + βp Xp + ε.
However, in this course we will examine far more complicated,
and flexible, models for f . In a sense, the more flexible the model,
the more realistic it is.
- Step 2. Use the training data to fit the model, i.e., estimate f or,
equivalently, the unknown parameters β0 , β1 , . . . , βp .
Example: A linear regression model

f = β0 + β1 × Education + . . . + βp × Seniority
Recall Figure 2.3: even if the standard deviation is low, we will
still get a bad answer if we use the wrong model.
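The two-step parametric approach can be sketched for the simplest case of one predictor, where the least squares estimates have a closed form. The function name `fit_simple_linear` and all numbers below are made up for illustration; they are not the data behind the figures in this lecture.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Step 1: assume a linear form. Step 2: estimate the parameters from training data.
# Made-up data lying exactly on y = 1 + 2x.
education = [10, 12, 14, 16, 18]
income = [21, 25, 29, 33, 37]
beta0, beta1 = fit_simple_linear(education, income)
print(beta0, beta1)

# Wrong model: noiseless data from y = x^2, still fitted with a straight line.
xq = [-2, -1, 0, 1, 2]
yq = [x * x for x in xq]
b0q, b1q = fit_simple_linear(xq, yq)
print(b0q, b1q)
```

In the second fit there is no noise at all, yet the fitted line is flat and predicts 2 at x = 2 where the truth is 4. This is the sense in which a wrong model gives a bad answer even when the standard deviation is low.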
Non-parametric Methods
• They do not make explicit assumptions about the functional form
of f .
- Advantage: They can accurately fit a wider range of possible shapes
of f .
- Disadvantage: A very large number of observations is required
to obtain an accurate estimate of f .
Example: A Thin-Plate Spline Estimate
Non-linear regression methods are more flexible (realistic) and can
potentially provide more accurate estimates.
Tradeoff Between Prediction Accuracy and Model Interpretability
Why not just use a more flexible method if it is more realistic?

• A simple method such as linear regression produces a model
which is much easier to interpret (the Inference part is better).
For example, in a linear model, βj is the average increase in Y for
a one unit increase in Xj holding all other variables constant.
• Even if you are only interested in prediction, so the first reason is
not relevant, it is often possible to get more accurate predictions
with a simple, instead of a complicated, model. This seems
counterintuitive but has to do with the fact that it is harder to fit a
more flexible model.
A poor estimate
Non-linear regression methods can also be too flexible and produce
poor estimates for f (overfitting the data).
Supervised vs. Unsupervised Learning
Again, we can divide all learning problems into Supervised and
Unsupervised situations.
• Supervised: both the predictors, Xi , and the response, Yi , are
observed.
• Unsupervised: only the Xi's are observed; we need to use the
Xi's to guess what Y would have been and build a model from
there.
Most of this course will deal with supervised learning; we will
consider unsupervised learning at the end of this course.
Regression vs. Classification
Supervised learning problems can be further divided into regression
and classification problems.
• Regression covers situations where Y is continuous/numerical
• Classification covers situations where Y is categorical
III. Assessing model accuracy
• In this course, we will introduce many different statistical learning
methods which extend far beyond linear regression.
• However, no one method dominates all others over all possible
data sets. On a particular data set, one specific method may
work best, but some other method may work better on another
data set.
Hence it is important to decide, for any given data set, which method
produces the best results. Let us address this problem by
discussing how to assess model accuracy.
• Measuring the Quality of Fit
• The Bias-Variance Trade-off
• The Classification Setting
Measuring the Quality of Fit
• To evaluate the performance of a statistical learning (SL) method on a given data set,
we need some way to measure how well its predictions match
the observations.
• For a regression problem, we use the mean squared error (MSE)
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where ŷi = f̂ (xi ) is the prediction our method gives for the ith
observation in our training data.
• Our method has generally been designed to make the
(training) MSE small on the training data, e.g. with linear
regression we choose the line such that the training MSE is minimized.
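As a small worked example of the MSE formula, the sketch below computes the training MSE for four made-up observations and predictions (the numbers are assumptions chosen for easy arithmetic).

```python
def mse(ys, preds):
    """Mean squared error: the average of (y_i - yhat_i)^2."""
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n

y = [3.0, 5.0, 7.0, 9.0]         # observed responses (made up)
y_hat = [2.5, 5.5, 7.0, 10.0]    # predictions from some fitted method (made up)

# Squared errors are 0.25, 0.25, 0 and 1, so the MSE is 1.5 / 4 = 0.375.
train_mse = mse(y, y_hat)
print(train_mse)
```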
Measuring the Quality of Fit (cont)
• However, what we really care about is how well the method
works on new data (“test data”).
Mathematically, suppose we fit a statistical learning method (SLM) on training observations
{(x1 , y1 ), . . . , (xn , yn )} and we obtain f̂ . We are not really
interested in whether
yi ≈ f̂ (xi );
instead, we want to know whether
y0 ≈ f̂ (x0 ),
where (x0 , y0 ) is a (previously unseen) test observation not used
to train the SLM.
• We want to choose the SLM giving the lowest test MSE: for a large
number of test observations, the test MSE is given by
$$\mathrm{Ave}\,(y_0 - \hat{f}(x_0))^2.$$

Training vs. Test MSEs
• There is no guarantee that the method with the smallest training
MSE will have the smallest test (i.e. new data) MSE.
• In general, the more flexible a method is, the lower its training
MSE will be, i.e. it will “fit” or explain the training data very well.
- Side note: More flexible methods (such as splines) can generate a
wider range of possible shapes to estimate f as compared to less
flexible and more restrictive methods (such as linear regression).
The less flexible the method, the easier to interpret the model.
Thus, there is a trade-off between flexibility and model
interpretability.
• However, the test MSE may in fact be higher for a more flexible
method than for a simple approach like linear regression.
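The gap between training and test MSE can be seen in a small simulation. The sketch below does not use splines; it swaps in K-nearest-neighbour averaging as the flexible method because it is easy to write from scratch (smaller K means more flexible). The true function, noise level, sample sizes and seed are all assumptions for illustration.

```python
import random

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return x * x

def make_data(n, rng, sigma=0.5):
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def knn_predict(x0, xs, ys, k):
    # Average the responses of the k training points closest to x0.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    errs = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
            for x, y in zip(xs_eval, ys_eval)]
    return sum(errs) / len(errs)

rng = random.Random(1)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(500, rng)

results = {}
for k in (1, 5, 20, 100):
    train = mse(x_train, y_train, x_train, y_train, k)
    test = mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train, test)
    print("K =", k, "training MSE =", train, "test MSE =", test)
```

With K = 1 each training point is its own nearest neighbour, so the training MSE is zero, while the test MSE is not. With K = 100 (all training points) the method just predicts the overall mean and is too inflexible.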
Example 1 with Different Levels of Flexibility
• Simulated data from f : yi = f (xi ) + εi .
• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline
(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash-Minimum
possible test MSE (irreducible error Var [ε]). When a method gives a small training
MSE but a large test MSE, we are said to be overfitting the data.
Example 2 with Different Levels of Flexibility
• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline
(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash-Minimum
possible test MSE (irreducible error)
Example 3 with Different Levels of Flexibility
• Left: Black-Truth; Orange-Linear; Blue-Spline; Green-Spline
(more flexible)
• Right: Red-Test MSE; Grey-Training MSE; Dash-Minimum
possible test MSE (irreducible error)
Bias / Variance Tradeoff
• We find that the test MSE has a U-shape in the previous examples.
• It can be shown that the U-shape is the result of two competing
properties of SLMs: bias and variance.
For any given X = x0 , the expected test MSE is equal to
$$E[(y_0 - \hat{f}(x_0))^2] = \mathrm{Var}[\hat{f}(x_0)] + (E[f(x_0) - \hat{f}(x_0)])^2 + \mathrm{Var}[\varepsilon]$$
$$= \mathrm{Var}[\hat{f}(x_0)] + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}[\varepsilon],$$
which refers to the average test MSE that we would obtain if we
repeatedly estimated f using a large number of training sets and
tested each estimate at x0 . The overall expected test MSE can be
obtained by averaging the expected test MSE over all possible x0
in the test set.
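The decomposition can be checked numerically at a single point x0. The sketch below uses a deliberately biased toy estimator (0.8 times the mean of n noisy observations taken at x0); this estimator and every constant are assumptions chosen so that the three terms are easy to compute by hand: variance 0.8² × 1/5 = 0.128, squared bias (2 − 1.6)² = 0.16, and Var[ε] = 1.

```python
import random

rng = random.Random(0)
f_x0 = 2.0     # true value f(x0) (hypothetical)
sigma = 1.0    # standard deviation of the error term
n = 5          # training observations per training set, all taken at x0
shrink = 0.8   # makes the estimator biased on purpose
reps = 20000   # number of simulated training sets

f_hats, sq_errs = [], []
for _ in range(reps):
    train = [f_x0 + rng.gauss(0.0, sigma) for _ in range(n)]
    f_hat = shrink * sum(train) / n        # estimate of f(x0) from this training set
    y0 = f_x0 + rng.gauss(0.0, sigma)      # fresh test observation at x0
    f_hats.append(f_hat)
    sq_errs.append((y0 - f_hat) ** 2)

mean_f_hat = sum(f_hats) / reps
variance = sum((v - mean_f_hat) ** 2 for v in f_hats) / reps
bias_sq = (f_x0 - mean_f_hat) ** 2
expected_test_mse = sum(sq_errs) / reps
decomposition = variance + bias_sq + sigma ** 2

print("simulated expected test MSE:", expected_test_mse)
print("variance + bias^2 + Var[eps]:", decomposition)
```

Both printed values should be close to the theoretical 1.288, up to simulation error.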
Bias / Variance Tradeoff (cont)
What do we mean by the bias and variance of an SLM?

• Bias: refers to the error that is introduced by modeling a real-life
problem (that is usually extremely complicated) by a much
simpler problem. For example, f is usually non-linear; if we assume that
it is linear, the bias will be large. The more flexible or complex a
method is, the less bias it will generally have.
• Variance: refers to how much your estimate for f would change
if you had a different training data set. Generally, the more
flexible a method is, the more variance it has.
• What this means is that as a method gets more complex the bias
will decrease and the variance will increase, but the expected test
MSE may go up or down (U-shape)!
Test MSE, Bias and Variance

The classification setting

• For a regression problem, we used the MSE to assess the
accuracy of the statistical learning method.
• For a classification problem we can use the error rate, i.e.
$$\text{Training error rate} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i),$$
where I(·) is an indicator function. The error rate is the fraction
of incorrect classifications, or misclassifications.
• The test error rate associated with a set of test observations of
the form (x0 , y0 ) is given by
$$\mathrm{Ave}\,(I(y_0 \neq \hat{y}_0)),$$
where ŷ0 is the predicted class label that results from applying
the classifier to the test observation with predictor x0 . A good
classifier is one for which the test error rate is smallest.
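A minimal sketch of the error rate computation; the class labels and predictions below are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    wrong = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return wrong / len(y_true)

labels = ["orange", "blue", "blue", "orange", "blue"]
predicted = ["orange", "blue", "orange", "orange", "orange"]

# The third and fifth observations are misclassified: 2 / 5 = 0.4.
rate = error_rate(labels, predicted)
print(rate)
```

The same function gives the training or the test error rate, depending on whether the labels come from the training data or from held-out test data.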
The Bayes classifier

• It can be proved that the test error rate is minimized by the
Bayes classifier, which assigns each observation to the most likely
class, given its predictor values, i.e., to the class j for which
P(Y = j|X = x0 )
is largest.
• For a two-class problem (Class 1 and Class 2), the Bayes
classifier predicts Class 1 if P(Y = 1|X = x0 ) > 0.5, and Class 2
otherwise.
• The Bayes classifier produces the lowest possible test error rate, called
the Bayes error rate. The overall Bayes error rate is given by
$$1 - E[\max_j P(Y = j \mid X)],$$
where the expectation is taken over the distribution of X .

• On test data, no classifier can get a lower error rate than the
Bayes error rate, which is analogous to the irreducible error.
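When the conditional probabilities are known, both the Bayes classifier and the Bayes error rate can be computed directly. The sketch below assumes a toy setting in which X takes only three values; the names `p_x` and `p_y_given_x` and all probabilities are made up for illustration.

```python
# Hypothetical distribution of X and conditional class probabilities P(Y = j | X = x).
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
p_y_given_x = {
    "a": {1: 0.9, 2: 0.1},
    "b": {1: 0.4, 2: 0.6},
    "c": {1: 0.2, 2: 0.8},
}

def bayes_classify(x):
    """Assign x to the class j with the largest P(Y = j | X = x)."""
    probs = p_y_given_x[x]
    return max(probs, key=probs.get)

# Bayes error rate: 1 - E[max_j P(Y = j | X)], with the expectation over X.
# Here: 1 - (0.5 * 0.9 + 0.3 * 0.6 + 0.2 * 0.8) = 1 - 0.79 = 0.21.
bayes_error = 1 - sum(p_x[x] * max(p_y_given_x[x].values()) for x in p_x)

print([bayes_classify(x) for x in ("a", "b", "c")])
print(bayes_error)
```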
Bayes Optimal Classifier: A simulated data set
• Two predictors, X1 and X2 ; two classes, orange and blue;
circles: training data; dashed line: P(Y = orange|X = x0 ) = 0.5, the Bayes
decision boundary; the Bayes error rate is 0.1304.
• Note that because this is a simulated data set, we know the conditional
probabilities. Of course, in real-life problems the Bayes error rate cannot
be calculated exactly. For a real problem, what should we do?
Estimate the conditional probabilities.
K-Nearest Neighbors (KNN)
• As the name suggests, KNN estimates the conditional probability by
using the fraction of points in N0 (the set of K points in the
training data that are closest to x0 ) that belong to class j, i.e.,
$$P(Y = j \mid X = x_0) \approx \frac{1}{K} \sum_{i \in N_0} I(y_i = j).$$
• KNN is a flexible approach to estimating the Bayes classifier.
• For any given X we find the K closest neighbors to X in the
training data, and examine their corresponding Y's.
• If the majority of the Y's are orange we predict orange; otherwise
we predict blue.
• The smaller K is, the more flexible the method will be.
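The KNN rule described above can be sketched in a few lines. The six training points, the query points and the helper names `knn_class_probs` and `knn_predict` are assumptions for illustration; Euclidean distance is used as the notion of "closest", which the lecture does not specify.

```python
import math
from collections import Counter

def knn_class_probs(x0, train_x, train_y, k):
    """Estimate P(Y = j | X = x0) as the fraction of the k nearest neighbours in class j."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x0))
    counts = Counter(train_y[i] for i in order[:k])
    return {label: c / k for label, c in counts.items()}

def knn_predict(x0, train_x, train_y, k):
    """Predict the class with the largest estimated probability."""
    probs = knn_class_probs(x0, train_x, train_y, k)
    return max(probs, key=probs.get)

# Made-up training data with two predictors and two classes.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

pred_near_blue = knn_predict((1.0, 1.0), train_x, train_y, k=3)
probs5 = knn_class_probs((3.0, 2.0), train_x, train_y, k=5)
pred5 = knn_predict((3.0, 2.0), train_x, train_y, k=5)
print(pred_near_blue, probs5, pred5)
```

For the query point (3, 2) with K = 5, three of the five nearest neighbours are orange, so the estimated probability of orange is 0.6 and the prediction is orange.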
KNN Example with K = 3

Simulated Data (earlier): K = 10
• ER_KNN = 0.1363, very close to ER_Bayes = 0.1304

K = 1 and K = 100
• K = 1: ER_KNN = 0.1695; the boundary is overly flexible, and the classifier
has low bias and high variance.
• K = 100: ER_KNN = 0.1925; the boundary is not sufficiently flexible, and the classifier
has high bias and low variance.
Training vs. Test Error Rates on the Simulated Data
• Notice that training error rates keep going down as K decreases
or, equivalently, as the flexibility increases.
• However, the test error rate at first decreases but then starts to
increase again.
• The choice of K is critical.
A Fundamental Picture
For both regression and classification, choosing the correct level of
flexibility is critical to the success of any statistical learning method.
Figure: We must always keep this picture in mind when choosing a learning
method. More flexible/complicated is not always better!
• In general, training errors will always decline as flexibility increases.
• However, test errors will decline at first (as reductions in bias
dominate) but will then start to increase again (as increases in
variance dominate), giving a U-shape. How do we find the optimal
flexibility? See Chapter 5.
R Language
R: a programming language and free software environment for
statistical computing and graphics
• How to read data from a file?

• How to save R output?
• How to generate numbers? Random numbers?
• Basic vector and matrix operations
• Plot
• Loops (for, while, if)
• Write your own functions
• RStudio (IDE for R).
