
Introduction to Statistical Learning

Arsalan Javed
Syed Shaheryar Zahur

Convergent Business Technologies.

• Statistical learning refers to a vast set of tools for understanding data
• Statistical learning is a fundamental ingredient in the training of a
modern data scientist.
• Methods are classified as supervised or unsupervised
• Supervised statistical learning
Involves predicting outputs (Y) from inputs (X)
I. Regression
II. Classification
• Unsupervised statistical learning
Involves discovering relationships and structure in the inputs, with no output to supervise
I. Clustering
II. Association


What is Statistical Learning


• Predictor variables: the inputs (independent variables), denoted by X
• Output variable: the response (dependent variable), denoted by Y
• ε: a random error term
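
The three symbols above are usually tied together by a single model equation. The slide's own formula did not survive extraction, so the block below is the standard textbook form rather than a copy of the original; f denotes the fixed but unknown function that statistical learning tries to estimate.

```latex
% General statistical learning model (standard form, assumed here):
% Y is the response, X = (X_1, ..., X_p) the predictors,
% f the unknown systematic relationship, \epsilon a random error term
% with mean zero that is independent of X.
Y = f(X) + \epsilon
```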


Why estimate f?
• Two main reasons
a. Prediction
b. Inference

Prediction: estimating the output based on the input
• Accuracy depends on both the reducible and the irreducible error
• Predictions are never fully accurate, because the error introduced
by ε cannot be reduced
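
The split into reducible and irreducible error can be written out explicitly. The block below is the standard decomposition, given as a sketch: the symbols \hat{f} (the estimate of f) and \hat{Y} = \hat{f}(X) (the prediction) are notation introduced here, and both \hat{f} and X are treated as fixed.

```latex
% Expected squared prediction error for a fixed estimate \hat{f} and fixed X.
% The first term shrinks as \hat{f} improves (reducible);
% the second term is the variance of the error term (irreducible).
E\big(Y - \hat{Y}\big)^2
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}
```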


Inference: understanding the relationship between the inputs X and the output Y


How to estimate f
Parametric Method:
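Involves a two-step, model-based approach
• First, make an assumption about the functional form of f (for example, that f is linear in X)
• Then, use a procedure that fits (trains) the model on the training data

Advantages:
• Simplifies the problem by assuming a form for f
• Only a set of coefficients has to be estimated, rather than an entirely arbitrary function f

Disadvantages:
• The assumed form may not match the true f
• Choosing a more flexible form to compensate means estimating more coefficients, which can result in overfitting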
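
The two parametric steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the synthetic data, the linear form, and the names `design`, `b0`, `b1`, and `f_hat` are all assumptions made for the example.

```python
import numpy as np

# Synthetic training data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)  # the true f is linear here

# Step 1: assume a functional form, f(X) = b0 + b1 * X.
design = np.column_stack([np.ones_like(x), x])

# Step 2: use the training data to fit the model (ordinary least squares).
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1 = coef


def f_hat(x_new):
    """Predict with the fitted linear model."""
    return b0 + b1 * np.asarray(x_new, dtype=float)
```

Estimating f has been reduced to estimating two numbers, `b0` and `b1`. That is the advantage listed above, and it only pays off when the assumed form is close to the true f.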


Non-Parametric Method:
• No explicit assumption is made about the functional form of f; instead, seek an estimate of f
that gets as close as possible to the data points

Advantages:
• Can be more accurate, because no restrictive assumption is made about the shape of f
• No fixed set of coefficients to be calculated

Disadvantages:
• A large number of observations is required to estimate the shape of f accurately
• Too much flexibility (too little smoothness) results in overfitting
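
As a rough sketch of a non-parametric estimate, the function below predicts the response at a new point by averaging the responses of its nearest training points (k-nearest-neighbours regression). The deck does not name this particular method here; it is used only as a simple illustration, and `knn_regress` and the sine-curve data are hypothetical.

```python
import numpy as np


def knn_regress(x_train, y_train, x_new, k=5):
    """Estimate f(x_new) as the mean response of the k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.abs(x_train - x_new)      # one-dimensional inputs
    nearest = np.argsort(distances)[:k]      # indices of the k closest points
    return y_train[nearest].mean()


# Illustrative data: a sine curve, noise-free for simplicity.
x_train = np.linspace(0, 2 * np.pi, 200)
y_train = np.sin(x_train)
estimate = knn_regress(x_train, y_train, x_new=np.pi / 2, k=5)
```

No shape is assumed for f, so the estimate follows the data wherever they go. With few observations, or with a very small `k` on noisy data, the same freedom leads to overfitting.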


• Prediction accuracy versus interpretability
Why choose a restrictive method instead of a flexible one?
• Restrictive models are usually easy to interpret; highly flexible
models (e.g. thin-plate splines) are not.


Regression vs Classification Problems
• Quantitative variables take on numerical values. Problems
with a quantitative response are often referred to as
regression problems.

• Qualitative variables, whether a predictor or the response, take
on values in one of K different classes or categories.
Problems with a qualitative response are often referred to
as classification problems.

The best-suited method depends on whether the response is quantitative or qualitative


Assessing Model Accuracy:
No single method dominates across all data sets, so a variety of models needs to be considered
Measuring Quality of Fit:
• It is necessary to evaluate how well the model's predictions match the observed data
• This means quantifying how close the predicted response is to the true response

Mean squared error is one common measure in the regression setting.

Mean squared error will be:

• small when the predicted responses are close to the true responses
• large if there’s a substantial difference between the predicted response and
the observed response
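
For n observations, the mean squared error is MSE = (1/n) Σ (y_i − f̂(x_i))², the average squared gap between each observed response and its prediction. A minimal sketch in Python follows; the function name and the toy numbers are assumptions chosen only to show the small and large cases described above.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted responses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


observed = [1.0, 2.0, 3.0]
close = mean_squared_error(observed, [1.1, 1.9, 3.0])  # predictions near the truth
far = mean_squared_error(observed, [3.0, 0.0, 6.0])    # predictions far from the truth
```

Here `close` is much smaller than `far`. Computed on the training data this is the training MSE; the quantity that matters for choosing a model is the same average computed on observations that were not used for fitting.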


The Bias-Variance Trade-Off
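Bias: the error introduced by approximating the true relationship with a simpler model, i.e. the difference between the average prediction and the correct value
Variance: the amount by which the estimate of f would change if it were fitted on a different training data set
• As a method's flexibility increases, its variance increases and its bias decreases.
• Choosing the level of flexibility that minimizes the average test error amounts to a
bias-variance trade-off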
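
The trade-off follows from the standard decomposition of the expected test error at a point x_0, sketched below. The notation (\hat{f} for the fitted model, y_0 for the response at x_0) is introduced here and is not taken from the slides.

```latex
% Expected test MSE at x_0, averaged over training sets.
% Variance and squared bias move in opposite directions as flexibility
% changes; Var(\epsilon) is the irreducible floor.
E\big(y_0 - \hat{f}(x_0)\big)^2
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon)
```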



Assessing Classification Accuracy
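The most common way of quantifying the accuracy of a classifier is the training error rate: the proportion of training observations that are misclassified

• A good classifier is one for which the test error rate is smallest
• The Bayes classifier assigns each observation to the class with the greatest conditional probability given its predictor values
• It produces the lowest possible test error rate, called the Bayes error rate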
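
The error rate itself is just the fraction of observations whose predicted class differs from the true class. A minimal sketch, with a hypothetical function name and made-up labels:

```python
import numpy as np


def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)


# One of the four labels is misclassified.
rate = error_rate(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```

Applied to the training observations this gives the training error rate; applied to held-out observations it gives the test error rate.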


K-Nearest Neighbours
Classifier: a popular method for estimating the conditional probability of each class
• Identifies the K points in the training data that are closest to the test observation
• Estimates the conditional probability of each class as the fraction of those K points that belong to it
• Applies Bayes rule to classify the test observation to the class
with the largest estimated probability

The choice of K is important

• Lower values of K: more flexible (low bias, high variance)
• Higher values of K: less flexible (more bias, lower variance)
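
The three KNN steps can be sketched as follows. This is an illustrative implementation under stated assumptions (Euclidean distance, a tiny made-up data set, and the hypothetical name `knn_predict`), not a reference one.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: find the k training points closest to the test observation.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 2: estimate each class probability as its fraction of the k points.
    counts = Counter(y_train[i] for i in nearest)
    probs = {label: n / k for label, n in counts.items()}
    # Step 3: assign the class with the largest estimated probability.
    return max(probs, key=probs.get), probs


X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["blue", "blue", "blue", "orange", "orange", "orange"]
label, probs = knn_predict(X_train, y_train, [0.5, 0.5], k=3)
```

With `k=3` all three neighbours are blue, so the estimated probability of blue is 1. Raising `k` to 5 pulls in two distant orange points and lowers that estimate to 0.6, which shows how larger values of K smooth the decision at the cost of more bias.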


Thank You

Any Questions?

