
Introduction to Statistical Learning
By
Arsalan Javed
Syed Shaheryar Zahur

Convergent Business Technologies.


Overview
• Statistical learning refers to a vast set of tools for understanding data
• Statistical learning is a fundamental ingredient in the training of a
modern data scientist.
• Classified into supervised and unsupervised
• Supervised SL
Involves predicting outputs (Y) from inputs (X)
I. Regression
II. Classification
• Unsupervised SL
Involves determining relationships and structure from the inputs (X) alone, with no output (Y)
I. Clustering
II. Association


What is Statistical Learning


Notation
• Predictor variables - independent variables denoted by X
• Output variables denoted by Y
• ∊ - error
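
The slide itself does not show how these symbols fit together. Assuming the usual statistical learning setup, they are tied together by the model below, where f is a fixed but unknown function of the predictors and the error term (written ∊ on the slide, \epsilon here) is random, independent of X, and has mean zero:

```latex
Y = f(X) + \epsilon
```

Statistical learning is then the set of approaches for estimating f from observed pairs of X and Y.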


Why estimate f?
• Two main reasons
a. Prediction
b. Inference

Prediction:
Estimating the output Y based on the input X
• Accuracy depends on both the reducible and the irreducible error
• Predictions are never fully accurate, because the error introduced
by ∊ cannot be reduced
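
The split into reducible and irreducible error can be written out as follows. This is a sketch under the standard assumptions that the estimate \hat{f} and the input X are held fixed, and that the prediction is \hat{Y} = \hat{f}(X); the symbols \hat{f} and \hat{Y} are not on the slide and are introduced here for illustration:

```latex
E\big(Y - \hat{Y}\big)^2
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}
```

A better estimate of f shrinks the first term; the second term stays no matter how well f is estimated.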

Inference:
Understanding the relationship between the inputs X and the output Y


How to estimate f
Parametric Method:
Involves a two-step, model-based approach
• First, make an assumption about the functional form of f
• Then, apply a procedure that uses the training data to
fit or train the model

Advantages:
• Simplifies the problem by assuming a form for f
• Only a set of coefficients has to be estimated, rather than an arbitrary function f

Disadvantages:
• The assumed form may not match the true f
• Making the model more flexible to compensate makes f more complex, which can result in overfitting
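
The two parametric steps can be sketched as follows. This is a minimal illustration, not a method prescribed by the slides: it assumes a linear form for f, uses ordinary least squares as the fitting procedure, and runs on made-up data (the names `x`, `y`, `b0`, `b1` and `f_hat` are hypothetical).

```python
import numpy as np

# Hypothetical training data: here the true f happens to be linear, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Step 1: assume a functional form, f(X) = b0 + b1 * X.
# Step 2: use the training data to estimate the coefficients (least squares).
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1 = coef


def f_hat(x_new):
    """Predict with the fitted linear model."""
    return b0 + b1 * np.asarray(x_new, dtype=float)


print(f"estimated intercept={b0:.2f}, slope={b1:.2f}")
```

Only two numbers are estimated here, which is the simplification the parametric approach buys; if the true f were strongly non-linear, this assumed form would fit poorly.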


Non-Parametric Method:
• We make no explicit assumption about the form of f; instead we seek an estimate of f
that gets as close as possible to the data points

Advantages:
• Can accurately fit a wider range of shapes for f, because no form is assumed for f
• No coefficients of an assumed form have to be calculated

Disadvantages:
• A large number of observations is required to estimate the shape of f accurately
• As flexibility increases (and smoothness decreases), the fit can result in overfitting
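
As one possible illustration of a non-parametric estimate, the sketch below predicts by averaging the responses of the k nearest training points (nearest-neighbour regression). The slides do not name a specific non-parametric method here, so this choice, the function name `knn_regress` and the toy data are all assumptions.

```python
import numpy as np


def knn_regress(x_train, y_train, x_new, k=3):
    """Estimate f at each new point as the mean response of its k nearest
    training points. No functional form for f is assumed."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x0 in np.atleast_1d(np.asarray(x_new, dtype=float)):
        nearest = np.argsort(np.abs(x_train - x0))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)


# Hypothetical one-dimensional training data (y = x^2, no noise).
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.0, 1.0, 4.0, 9.0, 16.0]

smooth = knn_regress(x_train, y_train, [2.0], k=3)    # averages 3 neighbours
flexible = knn_regress(x_train, y_train, [2.0], k=1)  # follows the data exactly
```

With k=1 the estimate passes through every training point, which is the flexible, overfitting-prone end of the range; larger k gives a smoother estimate.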


Trade-offs
• Prediction accuracy versus model interpretability
Why choose a restrictive method instead of a flexible one?
• Restrictive models are usually easy to interpret; flexible
models (e.g. thin-plate splines) are not.


Regression vs Classification Problems
• Quantitative variables take on numerical values. Problems
with a quantitative response are often referred to as
regression problems.

• Qualitative variables, whether a predictor or a response, take
on values in one of K different classes or categories.
Problems with a qualitative response are often referred to
as classification problems.

The best-suited method depends on whether the response is quantitative or qualitative


Assessing Model Accuracy:
No one method dominates all others over all data sets, so a variety of models is needed
Measuring Quality of Fit:
• It is necessary to evaluate how well the model’s predictions match the observed
data,
• i.e. to quantify the extent to which the predicted response is close to the true response

Mean squared error is one common measure in the regression setting.

Mean squared error will be:


• small when the predicted responses are close to the true responses
• large if there’s a substantial difference between the predicted response and
the observed response
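
The behaviour described above follows from the usual definition of mean squared error, the average of the squared differences between observed and predicted responses. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_obs = [3.0, 5.0, 7.0]  # hypothetical observed responses

# Predictions close to the observations give a small MSE ...
close = mean_squared_error(y_obs, [3.1, 4.9, 7.0])
# ... while predictions far from them give a large one.
far = mean_squared_error(y_obs, [1.0, 8.0, 4.0])
```

Computed on the training data this is the training MSE; the quantity of real interest is the same average taken over previously unseen test observations.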


The Bias-Variance Trade-Off
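Bias - error introduced by approximating the true f with a simpler model (the difference between the average prediction and the correct value)
Variance - the amount by which the estimate of f would change if it were fit on a different training data set
• As a method's flexibility increases, its variance increases and its bias decreases.
• Choosing the flexibility that gives the lowest average test error amounts to a bias-variance trade-off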
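
The trade-off can be made explicit with the standard decomposition of the expected test error at a point x_0. The notation (x_0, y_0, \hat{f}) is assumed here for illustration and does not appear on the slide:

```latex
E\big(y_0 - \hat{f}(x_0)\big)^2
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\epsilon)
```

The first two terms move in opposite directions as flexibility changes, while the last term is the irreducible error and sets a floor on the expected test error.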


Assessing Classification Accuracy
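The most common means of quantifying a classifier's accuracy is the training error rate: the fraction of training observations that are misclassified

• A good classifier is one for which the test error rate is smallest
• The Bayes classifier assigns each observation to the class with the greatest conditional probability given its predictor values
• It produces the lowest possible test error rate, called the Bayes error rate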
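
An error rate is simply the fraction of observations whose predicted class differs from the true class. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations that are misclassified."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)


# Hypothetical true classes and classifier predictions.
labels = ["a", "b", "a", "b"]
preds = ["a", "a", "a", "b"]
rate = error_rate(labels, preds)  # one mistake out of four observations
```

Applied to the training data this gives the training error rate; applied to held-out observations it gives the test error rate, which is the one a good classifier should minimise.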


K-Nearest Neighbours
Classifier - a popular method for estimating the conditional probability of each class
• Identifies the K points in the training data closest to the test observation
• Estimates the conditional probability for each class as the fraction of those K points belonging to that class
• Applies Bayes' rule to classify the test observation to the class
with the largest probability

The choice of K is important
• Lower values of K: more flexible (low bias, high variance)
• Higher values of K: less flexible (more bias, low variance)
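
The steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions (Euclidean distance, ties between classes broken by whichever class was counted first); the function name `knn_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, x0, k=3):
    """Classify x0 from its k nearest training points.

    Returns the predicted class and the estimated conditional
    probability of each class present among the neighbours.
    """
    # 1. Identify the k training points closest to the test observation.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], x0))
    nearest = order[:k]
    # 2. Estimate each class probability as a fraction of the k neighbours.
    counts = Counter(train_labels[i] for i in nearest)
    probs = {label: n / k for label, n in counts.items()}
    # 3. Assign the class with the largest estimated probability.
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical two-class training data in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
classes = ["blue", "blue", "blue", "orange", "orange", "orange"]

label_a, probs_a = knn_classify(points, classes, (0.5, 0.5), k=3)
label_b, probs_b = knn_classify(points, classes, (4.0, 4.0), k=5)
```

With k=3 the first test point has only "blue" neighbours; with k=5 the second point's neighbourhood reaches into the other cluster, so the estimated probability is spread across both classes.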


Thank You

Any Questions?
