Data Science

Supervised Learning

erico.souza.teixeira
est@cesar.school

Algorithms
Supervised Learning

• Approximating a function y ≈ f(x)


• From paired training samples {(xi, yi)}, i = 1, …, n
• Regression: y is a real scalar
• Classification: y is categorical

Why not Linear Regression?

f(x) = 1 if stroke, 2 if drug overdose, 3 if epileptic seizure

Coding the classes as 1, 2, 3 imposes an ordering and an equal spacing between outcomes that the categories do not have, so a linear regression fit to this coding is not meaningful for classification.
Conditional Probability

P(A | B) = P(A ∩ B) / P(B)
Logistic Regression

• Pr(Y = 1 | X) = p(X)
• Y = 1 if p(X) > 0.5
• Be conservative: Y = 1 if p(X) > 0.1

Logistic Model

p(X) = e^(θ0 + θ1X) / (1 + e^(θ0 + θ1X))
Logistic Regression Coefficients

log( p(X) / (1 − p(X)) ) = θ0 + θ1X

ℓ(θ0, θ1) = ∏_{i: yi = 1} p(xi) · ∏_{i′: yi′ = 0} (1 − p(xi′))



GO TO NOTEBOOK
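The course notebook is not reproduced here; the sketch below, which assumes scikit-learn and a synthetic dataset (my choices, not from the slides), illustrates fitting the logistic model and applying the 0.5 and conservative 0.1 thresholds from the previous slides.

# Minimal logistic regression sketch (not the course notebook):
# fit the logistic model and apply a decision threshold to p(X).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("theta_0:", model.intercept_, "theta:", model.coef_)

# Predicted probabilities p(X) and two thresholds: the usual 0.5
# and the conservative 0.1 mentioned on the slide.
p = model.predict_proba(X_test)[:, 1]
y_pred_default = (p > 0.5).astype(int)
y_pred_conservative = (p > 0.1).astype(int)
print("positives at 0.5:", y_pred_default.sum(),
      "positives at 0.1:", y_pred_conservative.sum())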

Multiple Logistic Regression

log( p(X) / (1 − p(X)) ) = θ0 + θ1X1 + … + θpXp

p(X) = e^(θ0 + θ1X1 + … + θpXp) / (1 + e^(θ0 + θ1X1 + … + θpXp))
Evaluation

Confusion Matrix

Accuracy Paradox
GO TO NOTEBOOK
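A minimal illustration of the accuracy paradox (synthetic labels, my own example, not from the notebook): on a 95/5 imbalanced sample, always predicting the majority class scores 95% accuracy while detecting no positives, which the confusion matrix makes visible.

# Confusion matrix and the accuracy paradox: a classifier that always
# predicts the majority class reaches 95% accuracy yet finds no positives.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([0] * 95 + [1] * 5)
y_always_majority = np.zeros_like(y_true)

print(confusion_matrix(y_true, y_always_majority))
print("accuracy:", accuracy_score(y_true, y_always_majority))  # 0.95, yet useless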

ROC Curve
GO TO NOTEBOOK
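A hedged sketch, assuming scikit-learn and matplotlib rather than the course notebook, of how the ROC curve is traced by sweeping the classification threshold over predicted probabilities, with the AUC as a one-number summary.

# ROC curve sketch: sweep the threshold on predicted probabilities and
# plot true positive rate against false positive rate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

p = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, p)
print("AUC:", roc_auc_score(y_test, p))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()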

K-Nearest Neighbors - KNN

Euclidean = √( ∑_{i=1}^{k} (xi − yi)² )

Manhattan = ∑_{i=1}^{k} | xi − yi |

Minkowski = ( ∑_{i=1}^{k} | xi − yi |^q )^(1/q)
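A small NumPy sketch (my own, not from the slides) of the three distances above; Minkowski with q = 2 reduces to Euclidean and q = 1 to Manhattan.

# The three distances written directly with NumPy.
import numpy as np

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
print("Euclidean:", minkowski(x, y, 2))    # sqrt(1 + 4 + 1)
print("Manhattan:", minkowski(x, y, 1))    # 1 + 2 + 1
print("Minkowski q=3:", minkowski(x, y, 3))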
GO TO NOTEBOOK
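A short KNN sketch under the same assumptions (scikit-learn, synthetic data, not the course notebook); feature scaling is included because the method is distance-based, and k and the metric are the main hyperparameters.

# KNN: classify by majority vote among the k nearest training points;
# the distance metric is controlled by the Minkowski parameter p.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters because KNN relies on distances between feature vectors.
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, p=2))  # p=2: Euclidean
    print(k, knn.fit(X_train, y_train).score(X_test, y_test))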

Tree-based Methods

Stratification of the Feature Space

• Error rate: E = 1 − max_k( p̂mk )
• Gini index: G = ∑_{k=1}^{K} p̂mk (1 − p̂mk)
• Cross-entropy: D = − ∑_{k=1}^{K} p̂mk log p̂mk
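A quick check of the three measures on a hypothetical node whose estimated class proportions p̂mk are 0.7, 0.2 and 0.1 (values made up for illustration):

# Node impurity measures from the formulas above.
import numpy as np

p = np.array([0.7, 0.2, 0.1])
error_rate = 1 - p.max()
gini = np.sum(p * (1 - p))
cross_entropy = -np.sum(p * np.log(p))
print(error_rate, gini, cross_entropy)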

Out-of-Bag Error Estimation


GO TO NOTEBOOK
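A hedged sketch of what a tree-based notebook might contain (scikit-learn and synthetic data are my assumptions): a single tree grown with the Gini criterion, then a random forest whose out-of-bag samples provide the error estimate mentioned above.

# Tree-based methods: a decision tree, then a random forest with
# out-of-bag (OOB) error estimation enabled.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
print("tree test accuracy:", tree.fit(X_train, y_train).score(X_test, y_test))

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
print("OOB accuracy:", forest.oob_score_)  # estimated without a separate test set
print("OOB error:", 1 - forest.oob_score_)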

Linear Discriminant Analysis

• A distribution of X for each class


• Bayes’ theorem
• Estimate Pr(Y = k | X = x)

Bayes’ Theorem

• Diagnosis test
• P(S | V ) = 0.95
• P(S̄ | V̄ ) = 0.99
• P(V | S) =?

Bayes’ Theorem

P(V | S) = P(S | V) P(V) / P(S)
Law of Total Probability

Theorem 2.13: If the events B1, B2, …, Bk constitute a partition of the sample space S such that P(Bi) ≠ 0 for i = 1, 2, …, k, then for any event A of S,

P(A) = ∑_{i=1}^{k} P(Bi ∩ A) = ∑_{i=1}^{k} P(Bi) P(A | Bi)

[Figure 2.14: Partitioning the sample space S.]

Bi ∩ Bj = ∅
A = (A ∩ B1) ∪ (A ∩ B2) ∪ …
B1 ∪ B2 ∪ … = Ω
Law of Total Probability

[Diagram: the sample space partitioned into V and V̄, with the event S.]
Bayes’ Theorem

P(V | S) = P(S | V) P(V) / ( P(S ∩ V) + P(S ∩ V̄) )

Bayes’ Theorem

P(V | S) = P(S | V) P(V) / ( P(S | V) P(V) + P(S | V̄) P(V̄) )
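The slides give the sensitivity P(S | V) = 0.95 and specificity P(S̄ | V̄) = 0.99 but not the prevalence P(V). Assuming, purely for illustration, a prevalence of P(V) = 0.01, the posterior works out to roughly 0.49:

# Worked Bayes' theorem example. P(V) = 0.01 is an assumed prevalence,
# not a value from the slides.
p_s_given_v = 0.95           # sensitivity, P(S | V)
p_not_s_given_not_v = 0.99   # specificity, P(S̄ | V̄)
p_v = 0.01                   # assumed prevalence P(V)

p_s = p_s_given_v * p_v + (1 - p_not_s_given_not_v) * (1 - p_v)
p_v_given_s = p_s_given_v * p_v / p_s
print(p_v_given_s)  # ≈ 0.49: a positive test is far from conclusive at low prevalence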
Bayes’ Theorem for Classification

Pr(Y = k | X = x) = πk fk(x) / ∑_{l=1}^{K} πl fl(x)
Bayes’ Theorem for Classification

• Only one predictor


• Estimate for fk(x)
• Assumptions

fk(x) = ( 1 / (√(2π) σk) ) exp( −(1 / (2σk²)) (x − μk)² ),  with σ1² = … = σK² = σ²

δk(x) = x (μk / σ²) − μk² / (2σ²) + log(πk)

Linear Discriminant Analysis


μ̂k = (1 / nk) ∑_{i: yi = k} xi    π̂k = nk / n

σ̂² = (1 / (n − K)) ∑_{k=1}^{K} ∑_{i: yi = k} (xi − μ̂k)²

δ̂k(x) = x (μ̂k / σ̂²) − μ̂k² / (2σ̂²) + log(π̂k)
Linear Discriminant Analysis

• Multiple predictors
• Common covariance matrix

σXY = ∑_x ∑_y (x − μX)(y − μY) f(x, y)

σXY = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μX)(y − μY) f(x, y) dx dy

δk(x) = xᵀ Σ⁻¹ μk − (1/2) μkᵀ Σ⁻¹ μk + log(πk)

Quadratic Discriminant Analysis

• Covariance matrix for each class


• With p predictors, each per-class covariance matrix has p(p + 1)/2 parameters
• LDA has lower variance

Linear Discriminant Analysis

• 50% threshold
• Common covariance matrix

Pr(default = Yes | X = x) > 0.5


GO TO NOTEBOOK
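A minimal LDA/QDA sketch, again assuming scikit-learn and synthetic data rather than the course notebook; it shows the estimated priors π̂k and compares the pooled-covariance fit (LDA) with the per-class-covariance fit (QDA).

# LDA pools one covariance matrix across classes; QDA estimates one per
# class (more flexible, higher variance).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
print("LDA priors (pi_k):", lda.priors_)
print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))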

Support Vector Machines

• Maximal Margin Classifier


• hyperplane
• θ0 + θ1X1 + θ2X2 + … + θpXp = 0

Maximal Margin Hyperplane

Support Vector Classifiers

[ISLR Figure 9.5: Left: two classes of observations shown in blue and purple, along with the maximal margin hyperplane. Right: an additional blue observation has been added, leading to a dramatic shift in the maximal margin hyperplane (solid line); the dashed line indicates the maximal margin hyperplane obtained in the absence of this additional point.]

maximize M over θ0, …, θp, ϵ1, …, ϵn

subject to yi(θ0 + θ1xi1 + … + θpxip) ≥ M(1 − ϵi),  ϵi ≥ 0,  ∑_{i=1}^{n} ϵi ≤ C

• Greater robustness to individual observations, and
• Better classification of most of the training observations.
Support Vector Machines

yi( θ0 + ∑_{j=1}^{p} θj1 xij + ∑_{j=1}^{p} θj2 xij² ) ≥ M(1 − ϵi)
Linear and Polynomial Kernel


K(x1, x2) = ∑_{j=1}^{p} x1j x2j

K(x1, x2) = ( 1 + ∑_{j=1}^{p} x1j x2j )^d
Radial Basis Function Kernel
K(x1, x2) = exp( −γ || x1 − x2 ||² )
GO TO NOTEBOOK
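A short sketch, assuming scikit-learn and a synthetic dataset rather than the course notebook, of the soft-margin classifier with the three kernels above; C plays the role of the budget on the slack variables ϵi.

# SVM with linear, polynomial and RBF kernels; features are scaled
# because the kernels depend on inner products and distances.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": "scale"})]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, **params))
    print(kernel, clf.fit(X_train, y_train).score(X_test, y_test))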

Classification

Compare Classification

• Logistic Regression
  Pros: probabilistic approach, gives information about the statistical significance of features.
  Cons: relies on the logistic regression assumptions.
• K-NN
  Pros: simple to understand, fast and efficient.
  Cons: need to choose the number of neighbours k.
• SVM
  Pros: performant, not biased by outliers, not sensitive to overfitting.
  Cons: not appropriate for non-linear problems, not the best choice for a large number of features.
• Kernel SVM
  Pros: high performance on non-linear problems, not biased by outliers, not sensitive to overfitting.
  Cons: not the best choice for a large number of features, more complex.
• Naive Bayes
  Pros: efficient, not biased by outliers, works on non-linear problems, probabilistic approach.
  Cons: based on the assumption that features have the same statistical relevance.
• Decision Tree Classification
  Pros: interpretability, no need for feature scaling, works on both linear and non-linear problems.
  Cons: poor results on too-small datasets, overfitting can easily occur.
• Random Forest Classification
  Pros: powerful and accurate, good performance on many problems, including non-linear ones.
  Cons: no interpretability, overfitting can easily occur, need to choose the number of trees.

Machine Learning A-Z © SuperDataScience
