Data Science

Supervised Learning

erico.souza.teixeira
est@cesar.school

Algorithms
Supervised Learning

• Approximating a function y ≈ f(x)


• From paired training samples {(xi, yi)}, i = 1, …, n
• Regression: y is a real scalar
• Classification: y is categorical

Why not Linear Regression?

f(x) = 1 if stroke, 2 if drug overdose, 3 if epileptic seizure

Coding the classes as 1, 2, 3 imposes an ordering and an equal spacing between outcomes that the categories do not have, so a linear regression fit to this coding is not meaningful for classification.
Conditional Probability

P(A | B) = P(A ∩ B) / P(B)
Logistic Regression

• Pr(Y = 1 | X) = p(X)
• Y = 1 if p(X) > 0.5
• Be conservative: Y = 1 if p(X) > 0.1

Logistic Model

p(X) = e^(θ0 + θ1X) / (1 + e^(θ0 + θ1X))
Logistic Regression Coefficients

log( p(X) / (1 − p(X)) ) = θ0 + θ1X

ℓ(θ0, θ1) = ∏_{i: yi = 1} p(xi) · ∏_{i′: yi′ = 0} (1 − p(xi′))



GO TO NOTEBOOK
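The course notebook is not reproduced here; the sketch below, which assumes scikit-learn and a synthetic dataset (my choices, not from the slides), illustrates fitting the logistic model and applying the 0.5 and conservative 0.1 thresholds from the previous slides.

# Minimal logistic regression sketch (not the course notebook):
# fit the logistic model and apply a decision threshold to p(X).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("theta_0:", model.intercept_, "theta:", model.coef_)

# Predicted probabilities p(X) and two thresholds: the usual 0.5
# and the conservative 0.1 mentioned on the slide.
p = model.predict_proba(X_test)[:, 1]
y_pred_default = (p > 0.5).astype(int)
y_pred_conservative = (p > 0.1).astype(int)
print("positives at 0.5:", y_pred_default.sum(),
      "positives at 0.1:", y_pred_conservative.sum())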

Multiple Logistic Regression

log( p(X) / (1 − p(X)) ) = θ0 + θ1X1 + … + θpXp

p(X) = e^(θ0 + θ1X1 + … + θpXp) / (1 + e^(θ0 + θ1X1 + … + θpXp))
Evaluation

Confusion Matrix

Accuracy Paradox
GO TO NOTEBOOK
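A minimal illustration of the accuracy paradox (synthetic labels, my own example, not from the notebook): on a 95/5 imbalanced sample, always predicting the majority class scores 95% accuracy while detecting no positives, which the confusion matrix makes visible.

# Confusion matrix and the accuracy paradox: a classifier that always
# predicts the majority class reaches 95% accuracy yet finds no positives.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([0] * 95 + [1] * 5)
y_always_majority = np.zeros_like(y_true)

print(confusion_matrix(y_true, y_always_majority))
print("accuracy:", accuracy_score(y_true, y_always_majority))  # 0.95, yet useless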

ROC Curve
GO TO NOTEBOOK
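A hedged sketch, assuming scikit-learn and matplotlib rather than the course notebook, of how the ROC curve is traced by sweeping the classification threshold over predicted probabilities, with the AUC as a one-number summary.

# ROC curve sketch: sweep the threshold on predicted probabilities and
# plot true positive rate against false positive rate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

p = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, p)
print("AUC:", roc_auc_score(y_test, p))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()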

K-Nearest Neighbors - KNN

Euclidean = √( ∑_{i=1}^{k} (xi − yi)² )

Manhattan = ∑_{i=1}^{k} | xi − yi |

Minkowski = ( ∑_{i=1}^{k} | xi − yi |^q )^(1/q)
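A small NumPy sketch (my own, not from the slides) of the three distances above; Minkowski with q = 2 reduces to Euclidean and q = 1 to Manhattan.

# The three distances written directly with NumPy.
import numpy as np

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
print("Euclidean:", minkowski(x, y, 2))    # sqrt(1 + 4 + 1)
print("Manhattan:", minkowski(x, y, 1))    # 1 + 2 + 1
print("Minkowski q=3:", minkowski(x, y, 3))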
GO TO NOTEBOOK
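A short KNN sketch under the same assumptions (scikit-learn, synthetic data, not the course notebook); feature scaling is included because the method is distance-based, and k and the metric are the main hyperparameters.

# KNN: classify by majority vote among the k nearest training points;
# the distance metric is controlled by the Minkowski parameter p.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters because KNN relies on distances between feature vectors.
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, p=2))  # p=2: Euclidean
    print(k, knn.fit(X_train, y_train).score(X_test, y_test))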

Tree-based Methods

Stratification of the Feature Space

• Error rate: E = 1 − max_k( p̂mk )
• Gini index: G = ∑_{k=1}^{K} p̂mk (1 − p̂mk)
• Cross-entropy: D = − ∑_{k=1}^{K} p̂mk log p̂mk
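A quick check of the three measures on a hypothetical node whose estimated class proportions p̂mk are 0.7, 0.2 and 0.1 (values made up for illustration):

# Node impurity measures from the formulas above.
import numpy as np

p = np.array([0.7, 0.2, 0.1])
error_rate = 1 - p.max()
gini = np.sum(p * (1 - p))
cross_entropy = -np.sum(p * np.log(p))
print(error_rate, gini, cross_entropy)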

Out-of-Bag Error Estimation


GO TO NOTEBOOK
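A hedged sketch of what a tree-based notebook might contain (scikit-learn and synthetic data are my assumptions): a single tree grown with the Gini criterion, then a random forest whose out-of-bag samples provide the error estimate mentioned above.

# Tree-based methods: a decision tree, then a random forest with
# out-of-bag (OOB) error estimation enabled.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
print("tree test accuracy:", tree.fit(X_train, y_train).score(X_test, y_test))

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
print("OOB accuracy:", forest.oob_score_)  # estimated without a separate test set
print("OOB error:", 1 - forest.oob_score_)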

Linear Discriminant Analysis

• A distribution of X for each class


• Bayes’ theorem
• Estimate Pr(Y = k | X = x)

Bayes’ Theorem

• Diagnosis test
• P(S | V ) = 0.95
• P(S̄ | V̄ ) = 0.99
• P(V | S) =?

Bayes’ Theorem

P(V | S) = P(S | V) P(V) / P(S)
Law of Total Probability

Theorem 2.13: If the events B1, B2, …, Bk constitute a partition of the sample space S such that P(Bi) ≠ 0 for i = 1, 2, …, k, then for any event A of S,

P(A) = ∑_{i=1}^{k} P(Bi ∩ A) = ∑_{i=1}^{k} P(Bi) P(A | Bi)

[Figure 2.14: Partitioning the sample space S.]

Bi ∩ Bj = ∅
A = (A ∩ B1) ∪ (A ∩ B2) ∪ …
B1 ∪ B2 ∪ … = Ω
Law of Total Probability

[Diagram: the sample space partitioned into V and V̄, with the event S.]
Bayes’ Theorem

P(V | S) = P(S | V) P(V) / ( P(S ∩ V) + P(S ∩ V̄) )

Bayes’ Theorem

P(V | S) = P(S | V) P(V) / ( P(S | V) P(V) + P(S | V̄) P(V̄) )
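The slides give the sensitivity P(S | V) = 0.95 and specificity P(S̄ | V̄) = 0.99 but not the prevalence P(V). Assuming, purely for illustration, a prevalence of P(V) = 0.01, the posterior works out to roughly 0.49:

# Worked Bayes' theorem example. P(V) = 0.01 is an assumed prevalence,
# not a value from the slides.
p_s_given_v = 0.95           # sensitivity, P(S | V)
p_not_s_given_not_v = 0.99   # specificity, P(S̄ | V̄)
p_v = 0.01                   # assumed prevalence P(V)

p_s = p_s_given_v * p_v + (1 - p_not_s_given_not_v) * (1 - p_v)
p_v_given_s = p_s_given_v * p_v / p_s
print(p_v_given_s)  # ≈ 0.49: a positive test is far from conclusive at low prevalence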
Bayes’ Theorem for Classification

Pr(Y = k | X = x) = πk fk(x) / ∑_{l=1}^{K} πl fl(x)
Bayes’ Theorem for Classification

• Only one predictor


• Estimate for fk(x)
• Assumptions

fk(x) = ( 1 / (√(2π) σk) ) exp( −(1 / (2σk²)) (x − μk)² ),  with σ1² = … = σK² = σ²

δk(x) = x (μk / σ²) − μk² / (2σ²) + log(πk)

Linear Discriminant Analysis


μ̂k = (1 / nk) ∑_{i: yi = k} xi    π̂k = nk / n

σ̂² = (1 / (n − K)) ∑_{k=1}^{K} ∑_{i: yi = k} (xi − μ̂k)²

δ̂k(x) = x (μ̂k / σ̂²) − μ̂k² / (2σ̂²) + log(π̂k)
Linear Discriminant Analysis

• Multiple predictors
• Common covariance matrix

σXY = ∑_x ∑_y (x − μX)(y − μY) f(x, y)

σXY = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μX)(y − μY) f(x, y) dx dy

δk(x) = xᵀ Σ⁻¹ μk − (1/2) μkᵀ Σ⁻¹ μk + log(πk)

Quadratic Discriminant Analysis

• Covariance matrix for each class


• With p predictors, each per-class covariance matrix has p(p + 1)/2 parameters
• LDA has lower variance

Linear Discriminant Analysis

• 50% threshold
• Common covariance matrix

Pr(default = Yes | X = x) > 0.5


GO TO NOTEBOOK
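A minimal LDA/QDA sketch, again assuming scikit-learn and synthetic data rather than the course notebook; it shows the estimated priors π̂k and compares the pooled-covariance fit (LDA) with the per-class-covariance fit (QDA).

# LDA pools one covariance matrix across classes; QDA estimates one per
# class (more flexible, higher variance).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
print("LDA priors (pi_k):", lda.priors_)
print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))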

Support Vector Machines

• Maximal Margin Classifier


• hyperplane
• θ0 + θ1X1 + θ2X2 + … + θpXp = 0

Maximal Margin Hyperplane

Support Vector Classifiers

[ISLR Figure 9.5: Left: two classes of observations shown in blue and purple, along with the maximal margin hyperplane. Right: an additional blue observation has been added, leading to a dramatic shift in the maximal margin hyperplane (solid line); the dashed line indicates the maximal margin hyperplane obtained in the absence of this additional point.]

maximize M over θ0, …, θp, ϵ1, …, ϵn

subject to yi(θ0 + θ1xi1 + … + θpxip) ≥ M(1 − ϵi),  ϵi ≥ 0,  ∑_{i=1}^{n} ϵi ≤ C

• Greater robustness to individual observations, and
• Better classification of most of the training observations.
Support Vector Machines

yi( θ0 + ∑_{j=1}^{p} θj1 xij + ∑_{j=1}^{p} θj2 xij² ) ≥ M(1 − ϵi)
Linear and Polynomial Kernel


K(x1, x2) = ∑_{j=1}^{p} x1j x2j

K(x1, x2) = ( 1 + ∑_{j=1}^{p} x1j x2j )^d
Radial Basis Function Kernel
K(x1, x2) = exp( −γ || x1 − x2 ||² )
GO TO NOTEBOOK
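A short sketch, assuming scikit-learn and a synthetic dataset rather than the course notebook, of the soft-margin classifier with the three kernels above; C plays the role of the budget on the slack variables ϵi.

# SVM with linear, polynomial and RBF kernels; features are scaled
# because the kernels depend on inner products and distances.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": "scale"})]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, **params))
    print(kernel, clf.fit(X_train, y_train).score(X_test, y_test))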

Classification

Compare Classification

• Logistic Regression
  Pros: probabilistic approach, gives information about the statistical significance of features.
  Cons: relies on the logistic regression assumptions.
• K-NN
  Pros: simple to understand, fast and efficient.
  Cons: need to choose the number of neighbours k.
• SVM
  Pros: performant, not biased by outliers, not sensitive to overfitting.
  Cons: not appropriate for non-linear problems, not the best choice for a large number of features.
• Kernel SVM
  Pros: high performance on non-linear problems, not biased by outliers, not sensitive to overfitting.
  Cons: not the best choice for a large number of features, more complex.
• Naive Bayes
  Pros: efficient, not biased by outliers, works on non-linear problems, probabilistic approach.
  Cons: based on the assumption that features have the same statistical relevance.
• Decision Tree Classification
  Pros: interpretability, no need for feature scaling, works on both linear and non-linear problems.
  Cons: poor results on too-small datasets, overfitting can easily occur.
• Random Forest Classification
  Pros: powerful and accurate, good performance on many problems, including non-linear ones.
  Cons: no interpretability, overfitting can easily occur, need to choose the number of trees.

Machine Learning A-Z © SuperDataScience
