ISLR Chap 1 & 2

Introduction to Statistical
Learning
By
Arsalan Javed
Syed Shaheryar Zahur
Convergent Business Technologies.

Overview
• Statistical learning refers to a vast set of tools for understanding data
• Statistical learning is a fundamental ingredient in the training of a
modern data scientist.
• Classified into supervised and unsupervised
• Supervised SL
Involves predicting outputs (Y) from inputs (X)
I. Regression
II. Classification
• Unsupervised SL
Involves determining relationships among the inputs (X), with no output (Y)
I. Clustering
II. Association

What is Statistical Learning

Notation
• Input variables (predictors, independent variables) denoted by X
• Output variable (response, dependent variable) denoted by Y
• ε - random error term in the model Y = f(X) + ε

Why estimate f?
• Two main reasons
a. Prediction
b. Inference
Prediction:
Estimating the output Y from the inputs X
• Accuracy depends upon reducible and irreducible error
• Never fully accurate: the error introduced by ε cannot be
reduced, however well f is estimated
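
The split into reducible and irreducible error can be written out as a short derivation. This is a sketch assuming the model Y = f(X) + ε, a prediction Ŷ = f̂(X), and treating both f̂ and X as fixed, so that the only randomness comes from ε:

```latex
\begin{aligned}
E\big(Y - \hat{Y}\big)^2
  &= E\big[f(X) + \epsilon - \hat{f}(X)\big]^2 \\
  &= \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
   + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}
\end{aligned}
```

A better estimate of f shrinks the first term; the second term is a lower bound on the expected squared prediction error.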

Why estimate f?
• Two main reasons
a. Prediction
b. Inference
Inference:
Understanding the relationship between the inputs X and the output Y

How to estimate f
Parametric Method:
Involves a two-step, model-based approach
• First, make an assumption about the functional form of f
• Then use a procedure that fits (trains) the model on the
training data
Advantages:
• Simplifies the problem by assuming a form for f
• Only a set of coefficients has to be found, not an arbitrary function f
Disadvantages:
• The assumed form may not match the true f
• Making the form of f more complex to compensate can result in overfitting
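
The two steps above can be sketched in code. This is a minimal illustration, assuming a linear form for f and ordinary least squares as the fitting procedure; the helper names `fit_linear` and `predict_linear` and the synthetic data are hypothetical, not part of the slides.

```python
import numpy as np

def fit_linear(X, y):
    """Step 2: estimate the coefficients of the assumed linear form by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Evaluate the fitted linear model at new inputs."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef

# Step 1 (assumption): f(X) = b0 + b1 * X.
# Synthetic training data generated from y = 2 + 3x + noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 2.0 + 3.0 * x[:, 0] + rng.normal(0, 1.0, size=200)

coef = fit_linear(x, y)  # coef[0] estimates the intercept, coef[1] the slope
```

Only two numbers are estimated here, which is what makes the parametric approach simple; if the true f were far from linear, no choice of those two numbers would fit it well.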

Non-Parametric Method:
• We make no assumption about the form of f; instead we seek an
estimate of f that gets as close as possible to the data points
Advantages:
• Potentially more accurate, because there is no restriction or assumption about the shape of f
• No coefficients of an assumed form to be calculated
Disadvantages:
• A large number of observations is required to estimate the shape of f
• Increasing flexibility (reducing smoothness) too far results in overfitting
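
One simple non-parametric estimate is nearest-neighbour averaging: predict at a point by averaging the responses of the closest training points. This technique is chosen here only for illustration (the slides mention thin-plate splines); the function name `knn_regress` and the sine-curve data are hypothetical.

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Estimate f(x0) as the mean response of the k training points nearest to x0."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Indices of the k training inputs closest to x0 (one-dimensional inputs).
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return float(np.mean(y_train[nearest]))

# Illustrative data: noiseless samples of a sine curve.
x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train)

# No functional form is assumed; the estimate follows the data locally.
estimate = knn_regress(x_train, y_train, np.pi / 2, k=3)
```

A small k follows the data closely (flexible, rough); a large k averages over more points (smooth, less flexible).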

Trade-offs
• Prediction accuracy versus interpretability.
Why choose a restrictive method instead of a flexible one?
• Restrictive models are usually easy to interpret; flexible
models (e.g. thin-plate splines) are not.

Regression vs Classification Problems
• Quantitative variables take on numerical values. Problems
with a quantitative response are often referred to as
regression problems.
• Qualitative variables, whether a predictor or a response, take
on values in one of K different classes or categories.
Problems with a qualitative response are often referred to
as classification problems.
The best-suited method depends on whether the response is quantitative or qualitative

Assessing Model Accuracy:
No one method dominates all others over all data sets, so a variety of methods is needed
Measuring Quality of Fit:
• It is necessary to evaluate how well the model’s predictions match the observed
data,
• i.e. to quantify how close the predicted response is to the true response
Mean squared error is one common measure in the regression setting.
Mean squared error will be:

• small when the predicted responses are close to the true responses
• large if there’s a substantial difference between the predicted response and
the observed response
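
As a small sketch of the measure described above, the mean squared error averages the squared differences between observed and predicted responses. The function name `mean_squared_error` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """MSE = (1/n) * sum_i (y_i - f_hat(x_i))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

observed = [3.0, 5.0, 7.0]

# Predictions close to the observed responses give a small MSE...
close = mean_squared_error(observed, [3.1, 4.9, 7.2])

# ...while predictions far from them give a large MSE.
far = mean_squared_error(observed, [6.0, 1.0, 12.0])
```

Computed on the training data this is the training MSE; the quantity of real interest is the same average over previously unseen test observations.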

The Bias-Variance Trade-Off
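Bias - error from approximating the true f with a simpler model (difference between the average prediction and the correct value)
Variance - amount by which the estimate of f would change if it were fitted on a different training set
• As a method's flexibility increases, its variance increases and its bias decreases.
• Choosing the flexibility that minimizes the average test error amounts to a
bias-variance trade-off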
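
The trade-off can be stated as a decomposition of the expected test error at a point x_0. This is the standard form, written here assuming the model Y = f(X) + ε and averaging over many training sets used to estimate f̂:

```latex
E\big(y_0 - \hat{f}(x_0)\big)^2
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\epsilon)
```

The first two terms are non-negative and move in opposite directions as flexibility changes; the last term is the irreducible error, so the expected test error can never fall below Var(ε).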

Assessing Classification Accuracy
The most common means of quantifying accuracy is the training error rate, but
• a good classifier is one for which the test error rate is smallest
• The Bayes classifier assigns each observation to the class with the greatest conditional probability given its predictors
• It produces the lowest possible test error rate, called the Bayes error rate
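
The error rate itself is simply the fraction of observations whose predicted class differs from the true class. A minimal sketch, with a hypothetical helper `error_rate` and made-up labels:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations with y_i != y_hat_i (average of the indicator I(y_i != y_hat_i))."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Illustrative labels: one of the four predictions is wrong.
y_true = ["blue", "orange", "blue", "orange"]
y_pred = ["blue", "orange", "orange", "orange"]
rate = error_rate(y_true, y_pred)
```

Applied to training observations this gives the training error rate; applied to held-out observations it gives the test error rate.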

K-Nearest Neighbours
Classifier - popular method for estimating the conditional probability
• Identifies the K points in the training data closest to the test observation
• Estimates the conditional probability for each class as the fraction of those K points in that class
• Applies the Bayes rule to classify the test observation to the class
with the largest probability
The choice of K is important

• Lower values: more flexible (low bias, high variance)
• Higher values: less flexible (more bias, lower variance)
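
The three steps above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and a tiny made-up training set; the function name `knn_predict` is hypothetical, and ties between classes are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    # Step 1: find the k training points closest to x0 (Euclidean distance).
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: estimate each class probability as its fraction among those k points.
    votes = Counter(y_train[i] for i in nearest)
    probs = {label: count / k for label, count in votes.items()}
    # Step 3: assign the class with the largest estimated probability.
    best = max(probs, key=probs.get)
    return best, probs

# Illustrative training data: two well-separated groups.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["blue", "blue", "blue", "orange", "orange", "orange"]

label_k3, probs_k3 = knn_predict(X_train, y_train, [0.5, 0.5], k=3)
label_k5, probs_k5 = knn_predict(X_train, y_train, [0.5, 0.5], k=5)
```

With k=3 only nearby points vote; with k=5 two distant points from the other group join the vote, which shows how a larger K smooths the estimate at the cost of more bias.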

Thank You
Any Questions?
