Modulo1 L02
Module 1
1/ 81
Definition
Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some
sense or another) to each other than to those in other groups
(clusters).
2/ 81
Purpose of clustering
Understanding
▶ Understanding data ⇒ clusters are potential classes.
▶ In this context, cluster analysis is the study of techniques for
automatically finding classes.
Utility
▶ Cluster analysis provides an abstraction from individual data to
the clusters obtained ⇒ reduce the size of large data sets.
▶ In this context, cluster analysis is the study of techniques for
finding the most representative cluster prototypes.
3/ 81
Clustering in the real world
Customer segmentation
4/ 81
Cluster validity
5/ 81
Cluster validity
Clusters found in random data
6/ 81
Cluster validity
Correlation
▶ Two matrices
▶ Proximity matrix
▶ Incidence matrix
▶ One row and one column for each data point
▶ An entry is 1 if the associated pair of points belong to the same
cluster.
▶ An entry is 0 if the associated pair of points belong to different
clusters.
▶ Compute the correlation between the two matrices.
▶ High correlation indicates that points that belong to the same
cluster are close to each other.
▶ Not a good measure for some density-based clusters.
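The incidence-correlation idea above can be sketched in plain Python (illustrative only; the course exercises use KNIME). The helper names `pearson` and `validity_correlation`, and the similarity transform 1/(1 + d), are assumptions for this sketch, not part of the course material:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def validity_correlation(points, labels):
    """Correlate the proximity and incidence matrices (upper triangles only)."""
    sim, inc = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            sim.append(1.0 / (1.0 + d))                         # proximity entry
            inc.append(1.0 if labels[i] == labels[j] else 0.0)  # incidence entry
    return pearson(sim, inc)

# Two well-separated blobs -> correlation close to 1
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
lab = [0, 0, 0, 1, 1, 1]
print(validity_correlation(pts, lab))
```

With good clusterings the within-cluster similarities dominate wherever the incidence matrix is 1, so the correlation approaches 1; on random labels it drops toward 0.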
7/ 81
Cluster validity
Correlation
8/ 81
Cluster validity
Similarity matrix and Correlation
Order the similarity matrix with respect to cluster labels and inspect
visually.
9/ 81
Cluster validity
Similarity matrix and Correlation
10/ 81
E01 - Similarity matrix
Extensions
▶ KNIME Core.
▶ KNIME JavaScript Views.
▶ KNIME Interactive R Statistics Integration.
▶ R
▶ ggplot, Rserve
11/ 81
Variance
Low variance filter
Variance
Variance is the expectation of the squared deviation of a random
variable from its mean.
Variance is a measure of dispersion, meaning it is a measure of how
far a set of numbers is spread out from its average value:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

with

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
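These two formulas, and the low-variance filter built on them, can be checked in plain Python (a sketch for illustration; the exercise itself is a KNIME workflow, and the threshold value is an assumption):

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def low_variance_filter(columns, threshold):
    """Keep only the features whose variance exceeds the threshold."""
    return {name: col for name, col in columns.items() if variance(col) > threshold}

data = {"constant": [5, 5, 5, 5], "spread": [1, 3, 5, 7]}
kept = low_variance_filter(data, 0.1)   # the constant column carries no information
```

A constant feature has variance 0 and is dropped; the spread-out feature survives.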
12/ 81
Correlation
Highly feature correlated filter: Pearson
13/ 81
Correlation
Highly feature correlated filter: χ2
14/ 81
Correlation
Highly feature correlated filter: χ2
15/ 81
Feature construction
PCA
16/ 81
E02 - Low variance filter
Extensions
▶ KNIME Core.
17/ 81
E03 - PCA
Extensions
▶ KNIME Core.
18/ 81
K-means
19/ 81
K-means
Pseudocode
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until centroids do not change.
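The pseudocode above maps almost line for line onto a minimal Python implementation (an illustrative sketch, not the KNIME node; the seed and iteration cap are assumptions):

```python
import math, random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # 1: select K initial centroids
    for _ in range(iters):                         # 2: repeat
        clusters = [[] for _ in range(k)]
        for p in points:                           # 3: assign each point to its
            i = min(range(k),                      #    closest centroid
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new = [tuple(sum(dim) / len(members) for dim in zip(*members))
               if members else centroids[i]        # 4: recompute each centroid
               for i, members in enumerate(clusters)]
        if new == centroids:                       # 5: until centroids do not change
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, 2)
```

On these two well-separated blobs the centroids converge to the blob means regardless of which two points are sampled first.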
20/ 81
K-means and different types of clusters
21/ 81
Strengths and weaknesses
Strengths
▶ Simple and efficient.
▶ There exist K-means variants that are more efficient and less
susceptible to initialization problems.
Weaknesses
▶ Not suitable for all types of data.
▶ It cannot handle non-globular clusters or clusters of different
sizes and densities.
▶ It has trouble if the data contains outliers.
22/ 81
Agglomerative Hierarchical Clustering
▶ Agglomerative clustering
▶ First merge very similar instances.
▶ Incrementally build larger clusters
out of smaller clusters.
▶ Algorithm:
▶ Starts with individual points as
clusters.
▶ Repeat:
▶ Pick the two closest clusters.
▶ Merge them into a new cluster.
▶ Stop when there is only one cluster
left.
▶ Produces a family of clusterings
represented by a dendrogram.
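The algorithm in the bullets above can be sketched in a few lines of Python, here with single-link (closest-member) proximity as an assumed choice of cluster distance (the slides discuss several proximity definitions; this is only one of them):

```python
import math

def agglomerative(points, num_clusters=1):
    """Single-link agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]            # start: each point is its own cluster
    while len(clusters) > num_clusters:         # repeat until the target count is reached
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the closest members of the two clusters
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
result = agglomerative(pts, num_clusters=2)
```

Stopping at `num_clusters=1` reproduces the full merge sequence of the dendrogram; stopping earlier cuts the dendrogram at a chosen level.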
23/ 81
Agglomerative Hierarchical Clustering
24/ 81
Agglomerative Hierarchical Clustering
25/ 81
Agglomerative Hierarchical Clustering
▶ Merge the two closest clusters (C2 and C5) and update the
distance matrix.
26/ 81
Agglomerative Hierarchical Clustering
27/ 81
Defining proximity between clusters
28/ 81
Time and space complexity
▶ n ≡ number of points.
▶ d ≡ number of features.
29/ 81
E04 - K-means
Extensions
▶ KNIME Core.
▶ KNIME Data Generation.
30/ 81
E05 - Hierarchical clustering
Extensions
▶ KNIME Core.
▶ KNIME Data Generation.
31/ 81
What is classification?
Definition
Given a collection of records E, classification is the task of learning a
target function g that maps each feature set x to one of the
predefined class labels y (g : X → Y ).
Goal
Classify new unlabeled records.
32/ 81
Workflow
33/ 81
E06 - Classification workflow
Extensions
▶ KNIME Core.
34/ 81
Empirical error
35/ 81
Empirical error
Empirical error \hat{\epsilon}(g, f):

\hat{\epsilon}(g, f) = \frac{1}{n} \sum_{x \in S} (1 - \delta(g(x), y))

where y = f(x).
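The formula above is just the fraction of samples the classifier gets wrong (0/1 loss averaged over S). A minimal Python check, with a toy threshold classifier as an assumed example:

```python
def empirical_error(g, samples):
    """Fraction of labeled samples (x, y) that classifier g gets wrong (0/1 loss)."""
    return sum(1 for x, y in samples if g(x) != y) / len(samples)

g = lambda x: 1 if x >= 0 else 0                 # toy threshold classifier
S = [(-2, 0), (-1, 0), (1, 1), (2, 1), (3, 0)]   # the last sample is misclassified
err = empirical_error(g, S)                      # 1 error out of 5 samples
```

The Kronecker delta δ(g(x), y) is 1 on a correct prediction, so 1 − δ contributes exactly one unit per error, which the `!=` comparison reproduces.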
36/ 81
Model complexity
37/ 81
Model complexity
Underfitting
Your model is underfitting the training data when the model performs
poorly on the training data.
Causes
▶ Trying to create a linear model with non-linear data.
▶ Having too little data to build an accurate model.
▶ Model is too simple, has too few features.
Remedies
▶ Add more features during Feature Selection.
▶ Engineer additional features within the scope of your problem
that make sense.
⇒ Learners tend to have low variance but high bias.
⇒ They cannot capture the relationships in the training data →
inaccurate predictions on the training data.
38/ 81
Model complexity
Overfitting
Your model is overfitting your training data when you see that the
model performs well on the training data but does not perform well on
the evaluation data.
Causes
▶ The model captures the noise of the data.
Remedies
▶ Apply k-fold cross validation.
▶ Train with more data.
▶ Remove features.
⇒ Learners tend to have high variance and low bias.
⇒ Models tend to be excessively complicated.
39/ 81
Evaluating classification models
40/ 81
Scoring metrics
▶ Confusion matrix
▶ Accuracy
▶ Sensitivity and specificity
▶ Precision and recall
▶ F-measure
▶ ROC curve
41/ 81
Why different scoring metrics?
42/ 81
Confusion matrix
                       Predicted
                   Class=0   Class=1
Actual  Class=0      TN        FP
        Class=1      FN        TP
43/ 81
Accuracy
Accuracy: Probability of classifying a positive OR negative class
event correctly.
                       Predicted
                   Healthy   Disease
Actual  Healthy      20         4
        Disease       1         3

Accuracy = #correct predictions / #predictions
         = (TN + TP) / (TN + FN + TP + FP) = 23/28 = 0.82
44/ 81
Sensitivity and specificity
Sensitivity: Are ALL positive class events found by the model?
Specificity: Are ALL negative class events found by the model?
                       Predicted
                   Healthy   Disease
Actual  Healthy      20         4
        Disease       1         3

Sensitivity = #TP / #positives = TP / (TP + FN) = 3 / (3 + 1) = 0.75
Specificity = #TN / #negatives = TN / (TN + FP) = 20 / (20 + 4) = 0.83
45/ 81
Precision and recall
Precision: Are ONLY positive class events found by the model?
Recall: Are ALL positive class events found by the model?
                       Predicted
                   Healthy   Disease
Actual  Healthy      20         4
        Disease       1         3

Precision = TP / (TP + FP) = 3 / (3 + 4) = 0.43
Recall    = TP / (TP + FN) = 3 / (3 + 1) = 0.75
46/ 81
F-measure
Are ALL and ONLY positive class events found by the model?
                       Predicted
                   Healthy   Disease
Actual  Healthy      20         4
        Disease       1         3

F-measure = 2 · (precision · recall) / (precision + recall) = 0.55
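All of the metrics above derive from the same four confusion-matrix counts. A small Python helper checks the slide values (the function name `scores` is illustrative, not KNIME's Scorer node):

```python
def scores(tp, fp, fn, tn):
    """Compute the standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)                  # also called recall, or TPR
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f_measure

# Counts from the slides' table: TN=20, FP=4, FN=1, TP=3
acc, sens, spec, prec, f1 = scores(tp=3, fp=4, fn=1, tn=20)
```

Note how accuracy (0.82) looks good while precision (0.43) reveals that most "Disease" predictions are wrong, which is why no single metric suffices.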
47/ 81
Receiver Operating Characteristic (ROC) curve
The ROC curve shows the trade-off between sensitivity (or TPR) and
specificity (1 − FPR).
48/ 81
Receiver Operating Characteristic (ROC) curve
The ROC curve shows the trade-off between sensitivity (or TPR) and
specificity (1 − FPR).
49/ 81
Receiver Operating Characteristic (ROC) curve
50/ 81
Scoring multiclass classification models
51/ 81
Scoring multiclass classification models
52/ 81
Bias and variance
53/ 81
Bias
▶ Bias is the error of the best classifier in the concept class.
▶ Bias is high if the concept class cannot model the true data
distribution well.
54/ 81
Variance
55/ 81
Resubstitution
\hat{\epsilon} = \frac{1}{n} \sum_{i=1}^{n} (1 - \delta(y_i, g(x_i, E_n)))
56/ 81
Holdout
\hat{\epsilon} = \frac{1}{n - n_e} \sum_{x_i \in E_p} (1 - \delta(y_i, g(x_i, E_e)))
▶ En = Ee ∪ Ep , Ee ∩ Ep = ∅.
▶ Random split (1/2 − 1/2, 2/3 − 1/3).
▶ Bias ≡ low; variance ≡ high.
▶ Samples might not be representative ⇒ stratification.
57/ 81
k-fold cross-validation
\hat{\epsilon} = \frac{1}{k} \sum_{j=1}^{k} \hat{\epsilon}_j
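The k-fold average above can be sketched in Python (illustrative; the exercise uses KNIME's cross-validation loop). The fold split by every k-th index and the trivial majority-class learner are assumptions made for the example:

```python
def kfold_error(points, labels, k, train_and_score):
    """Average the per-fold error estimates eps_j over k folds."""
    n = len(points)
    fold_errors = []
    for j in range(k):
        test_idx = set(range(j, n, k))                # every k-th sample forms fold j
        train = [(points[i], labels[i]) for i in range(n) if i not in test_idx]
        test = [(points[i], labels[i]) for i in test_idx]
        fold_errors.append(train_and_score(train, test))
    return sum(fold_errors) / k

def majority_learner(train, test):
    """Toy learner: always predict the majority class of the training fold."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != guess) / len(test)

xs = list(range(10))
ys = [0] * 8 + [1] * 2
err = kfold_error(xs, ys, k=5, train_and_score=majority_learner)
```

Every sample is used exactly once for testing and k − 1 times for training, which is why the k-fold estimate has lower variance than a single holdout split.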
58/ 81
E07 - Classification workflow using hold-out and cross validation
Extensions
▶ KNIME Core.
59/ 81
E08 - Classification of rotogravure printing data
Extensions
▶ KNIME Core.
60/ 81
Question: Given two classifiers, which one performs better?
▶ Compare the estimated error.
▶ Problem: variance in estimate.
Question: Do the mean estimates differ significantly?
▶ Significance tests tell us how confident we can be that there
really is a difference.
▶ Student's t-test tells whether the means of two (small) samples
are significantly different.
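With the stdlib alone we can compute the paired t statistic for two classifiers' per-fold errors (a sketch; the fold errors and the 95% critical value for 4 degrees of freedom are illustrative, and a statistics package would normally supply the p-value):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired Student's t statistic for two matched samples (e.g. per-fold errors)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical per-fold error estimates of two classifiers on the same 5 folds
err_a = [0.12, 0.15, 0.11, 0.14, 0.13]
err_b = [0.18, 0.20, 0.17, 0.21, 0.19]
t = paired_t(err_a, err_b)
# |t| above the two-tailed critical value (2.776 for 4 d.o.f. at 95%)
# means the difference in means is statistically significant
```

Pairing by fold removes the fold-to-fold variance from the comparison, which is exactly the "variance in estimate" problem the slide raises.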
61/ 81
Naive Bayes Classifier (NBC)
Visual intuition
62/ 81
NBC
Visual intuition
63/ 81
NBC
Visual intuition
f(x|\sigma, \mu) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
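The Gaussian density above is what a Naive Bayes classifier uses as the class-conditional likelihood for a continuous feature. A direct Python transcription (illustrative sketch):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density f(x | sigma, mu), the class-conditional likelihood in NBC."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal: 1 / sqrt(2*pi) ~ 0.3989
peak = gaussian_pdf(0, mu=0, sigma=1)
```

In NBC one such density is fitted per class and per feature, and a new point is scored under each class's density.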
64/ 81
NBC
Visual intuition
p(grasshopper|x = 3)
p(katydids|x = 3)
65/ 81
NBC
Visual intuition
p(grasshopper|x = 3)
p(katydids|x = 3)
66/ 81
NBC
Visual intuition
p(grasshopper|x = 3) = 10 / (10 + 2) = 0.833
p(katydids|x = 3) = 2 / (10 + 2) = 0.167
67/ 81
NBC
Visual intuition
p(grasshopper|x = 7) = 3 / (3 + 9) = 0.250
p(katydids|x = 7) = 9 / (3 + 9) = 0.750
68/ 81
NBC
Visual intuition
p(grasshopper|x = 5) = 6 / (6 + 6) = 0.5
p(katydids|x = 5) = 6 / (6 + 6) = 0.5
69/ 81
NBC
Bayes’ theorem
70/ 81
NBC
Bayes’ theorem example: cancer at age 65
P(y|x) = \frac{P(x|y) P(y)}{P(x)}
▶ P(y) ≡ The prior probability of having cancer is 1% without prior
knowledge about the individual.
▶ If we assume that cancer and age are related and that:
▶ P(x) ≡ The probability of being 65 years old is 0.2%.
▶ P(x|y) ≡ The probability that a person with cancer is 65 years old is
0.5%.
P(y|x) = \frac{P(x|y) P(y)}{P(x)} = \frac{0.5\% \times 1\%}{0.2\%} = 2.5\%
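The cancer-at-65 arithmetic above is a one-line application of Bayes' theorem, easy to verify in Python (the function name `bayes` is illustrative):

```python
def bayes(p_x_given_y, p_y, p_x):
    """Posterior P(y|x) from Bayes' theorem: P(x|y) * P(y) / P(x)."""
    return p_x_given_y * p_y / p_x

# Slide values: P(x|y) = 0.5%, P(y) = 1%, P(x) = 0.2%
posterior = bayes(p_x_given_y=0.005, p_y=0.01, p_x=0.002)   # 2.5%
```

Knowing the person's age multiplies the 1% prior by the likelihood ratio 0.005/0.002 = 2.5, yielding the 2.5% posterior.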
71/ 81
NBC
Maximum a posteriori (MAP) hypothesis
P(h|D) = \frac{P(D|h) P(h)}{P(D)}

The most probable hypothesis given the training data is the maximum
a posteriori hypothesis h_{MAP}:

h_{MAP} = \operatorname*{argmax}_{h \in H} P(h|D)
        = \operatorname*{argmax}_{h \in H} \frac{P(D|h) P(h)}{P(D)}
        = \operatorname*{argmax}_{h \in H} P(D|h) P(h)

Note that we can drop P(D), as the probability of the data is constant
and independent of the hypothesis.
72/ 81
NBC
Properties
73/ 81
Logistic Regression (LR)
Visual intuition
Linear regression → ?
▶ Y should follow a normal distribution
▶ Y and X must have a linear relationship
Map from (−∞, ∞) to [0, 1] ⇒ logit transformation
74/ 81
LR
Logistic transformation
Link function: the logit function, where p ∈ [0, 1]:

logit(p) = \log \frac{p}{1 - p} = \beta_0 + \beta_1 \cdot x

p = \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} = \frac{1}{1 + e^{-\hat{y}}}
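The logit and its inverse, the logistic function, are a mutually inverse pair, which is easy to confirm numerically (an illustrative sketch in Python):

```python
import math

def logit(p):
    """Link function: maps a probability p in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def logistic(y_hat):
    """Inverse of the logit: maps any real y_hat back into (0, 1)."""
    return 1 / (1 + math.exp(-y_hat))

# Round trip: applying logit after logistic recovers the original value
p = logistic(0.7)
```

Logistic regression fits beta_0 + beta_1 * x on the logit scale, then pushes the linear prediction through `logistic` to obtain a probability.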
75/ 81
LR
Estimation of parameters
p = \frac{1}{1 + e^{-\hat{y}}}
76/ 81
LR
Properties
77/ 81
Decision Tree (DT)
Visual intuition
78/ 81
DT
Visual intuition
79/ 81
DT
Visual intuition
80/ 81
E09 - Compare classification models
Extensions
▶ KNIME Core.
81/ 81