
Business Analytics

Module 1

Big Data Diploma Program


Universidad Nacional de Asunción

Miguel García Torres


Data Science & Big Data Research Lab
Lenguajes y Sistemas Informáticos, Universidad Pablo de Olavide
Definition
Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some
sense or another) to each other than to those in other groups
(clusters).

Purpose of clustering

Understanding
▶ Understanding data ⇒ clusters are potential classes.
▶ In this context, cluster analysis is the study of techniques for
automatically finding classes.

Utility
▶ Cluster analysis provides an abstraction from individual data objects
to the clusters obtained ⇒ it reduces the size of large data sets.
▶ In this context, cluster analysis is the study of techniques for
finding the most representative cluster prototypes.
Clustering in the real world
Customer segmentation

Cluster validity

▶ How do we evaluate the goodness of the resulting clusters?

▶ Why evaluate clusters?
▶ To avoid finding patterns in noise.
▶ To compare clustering algorithms.
▶ To compare two sets of clusters.
▶ To compare two clusters.
Cluster validity
Clusters found in random data

Cluster validity
Correlation

▶ Two matrices:
▶ Proximity matrix.
▶ Incidence matrix:
▶ One row and one column for each data point.
▶ An entry is 1 if the associated pair of points belongs to the same cluster.
▶ An entry is 0 if the associated pair of points belongs to different clusters.
▶ Compute the correlation between the two matrices (see the sketch below).
▶ High correlation indicates that points that belong to the same cluster
are close to each other.
▶ Not a good measure for some density-based clusters.
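A minimal Python sketch of this check (the toy data and scikit-learn's KMeans are illustrative assumptions, not part of the slides): build both matrices and correlate their upper triangles. Note that with a distance-based proximity matrix, a good clustering yields a strongly negative correlation, since same-cluster pairs have small distances.

```python
# Sketch: correlation between a proximity (distance) matrix and the
# incidence matrix induced by cluster labels.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

proximity = squareform(pdist(X))                    # pairwise distances
incidence = (labels[:, None] == labels[None, :]).astype(float)

# Correlate the upper triangles (the diagonal and duplicates are uninformative).
iu = np.triu_indices_from(proximity, k=1)
corr = np.corrcoef(proximity[iu], incidence[iu])[0, 1]
print(corr)   # strongly negative: same-cluster pairs have small distances
```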

Cluster validity
Correlation

Correlation of incidence and proximity matrices for the K-means
clusterings of the following two data sets.
Cluster validity
Similarity matrix and Correlation

Order the similarity matrix with respect to cluster labels and inspect
visually.
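A possible way to do this in Python (the toy blobs and the distance-to-similarity transform are assumptions): sort rows and columns by cluster label and show the matrix as an image; a good clustering produces a block-diagonal pattern.

```python
# Sketch: reorder a similarity matrix by cluster label and inspect it.
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 2, 4)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)                 # group points by cluster
dist = squareform(pdist(X))
similarity = 1 - dist / dist.max()         # simple similarity transform
plt.imshow(similarity[np.ix_(order, order)], cmap="viridis")
plt.colorbar()
plt.show()                                 # blocks on the diagonal = good clusters
```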

Cluster validity
Similarity matrix and Correlation

Clusters in random data.

E01 - Similarity matrix

Extensions
▶ KNIME Core.
▶ KNIME JavaScript Views.
▶ KNIME Interactive R Statistics Integration.
▶ R
▶ ggplot, Rserve
Variance
Low variance filter

Variance
Variance is the expectation of the squared deviation of a random
variable from its mean.
Variance is a measure of dispersion: it measures how far a set of
numbers is spread out from its average value:

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

with

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
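As a sketch, a low-variance filter can be written directly from this formula (the threshold and toy data are arbitrary; scikit-learn's VarianceThreshold offers the same behavior):

```python
# Sketch: drop features whose variance falls below a threshold.
import numpy as np

def low_variance_filter(X: np.ndarray, threshold: float) -> np.ndarray:
    variances = X.var(axis=0)   # population variance: (1/n) * sum((x - mu)^2)
    return X[:, variances > threshold]

X = np.array([[1.0, 0.0, 5.0],
              [2.0, 0.0, 5.1],
              [3.0, 0.0, 4.9]])
print(low_variance_filter(X, 0.01))   # the near-constant columns are removed
```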

Correlation
Highly correlated feature filter: Pearson

▶ Measures the linear relationship between features.
▶ For numeric features.
▶ No normalization necessary.
▶ Output value: [−1, 1].

$C(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
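A small illustrative implementation of this coefficient (the toy features are assumptions); in a highly-correlated-feature filter, a pair with |C| near 1 is a candidate for dropping one of the two features.

```python
# Sketch: Pearson's C(X, Y) written out from the formula above.
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # nearly a linear copy of a
print(pearson(a, b))                          # close to 1 -> drop one feature
```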

Correlation
Highly correlated feature filter: χ²

▶ Tests independence between two features.
▶ High values suggest dependence.
▶ For categorical features.

$\chi^2 = \sum_{ij} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
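A hedged sketch using SciPy's chi2_contingency on an assumed contingency table of two categorical features:

```python
# Sketch: chi-squared independence test between two categorical features.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],    # rows: categories of feature A
                     [10, 30]])   # cols: categories of feature B
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2 / small p -> the features look dependent
```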

Feature construction
PCA
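Since this slide relies on a figure, here is a minimal, assumed sketch of PCA as feature construction with scikit-learn (the KNIME PCA node used in E03 plays the same role):

```python
# Sketch: project the data onto its top principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # the constructed features
print(pca.explained_variance_ratio_)     # variance captured per component
```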

E02 - Low variance filter

Extensions
▶ KNIME Core.

E03 - PCA

Extensions
▶ KNIME Core.

K-means

▶ Prototype-based clustering technique.
▶ One-level partitioning of the data objects.
▶ K-means defines a prototype in terms of a centroid.
▶ A centroid is the mean of a cluster of points.
▶ A centroid almost never corresponds to an actual data point.
▶ K-means is usually applied to objects in a continuous
d-dimensional space.

K-means

Pseudocode
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until centroids do not change.
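A compact NumPy rendering of this pseudocode, as a sketch rather than the exact implementation behind the KNIME node (random initialization and the stopping test are assumptions):

```python
# Sketch: Lloyd's algorithm, following the numbered steps above.
import numpy as np

def kmeans(X: np.ndarray, k: int, seed: int = 0, max_iter: int = 100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its cluster
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # step 5
            break
        centroids = new_centroids
    return labels, centroids
```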

K-means and different types of clusters

Strengths and weaknesses

Strengths
▶ Simple and efficient.
▶ There exist K-means variants that are more efficient and less
susceptible to initialization problems.

Weaknesses
▶ Not suitable for all types of data.
▶ It cannot handle non-globular clusters or clusters of different
sizes and densities.
▶ It has trouble if the data contains outliers.

Agglomerative Hierarchical Clustering

▶ Agglomerative clustering:
▶ First merge very similar instances.
▶ Incrementally build larger clusters out of smaller clusters.
▶ Algorithm:
▶ Start with individual points as clusters.
▶ Repeat:
▶ Pick the two closest clusters.
▶ Merge them into a new cluster.
▶ Stop when there is only one cluster left.
▶ Produces a family of clusterings represented by a dendrogram.

Agglomerative Hierarchical Clustering

▶ Start with clusters of individual points and a distance matrix.

Agglomerative Hierarchical Clustering

▶ After some merging steps, we have some clusters.

Agglomerative Hierarchical Clustering
▶ Merge the two closest clusters (C2 and C5) and update the
distance matrix.

Agglomerative Hierarchical Clustering

▶ How to update the distance matrix?

Defining proximity between clusters

▶ Single link: the intercluster distance is equal to the minimum
distance between any two instances, one from each cluster.
▶ Complete link: the intercluster distance is equal to the furthest
distance between any two instances, one from each cluster.
▶ Group average: the intercluster distance is the average of the
distances between all instances in the two clusters (see the sketch below).
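A brief sketch of the three definitions via SciPy's linkage function, whose "single", "complete", and "average" methods match the bullets above (the toy data is an assumption):

```python
# Sketch: the three proximity definitions in hierarchical clustering.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)   # (n-1) merge steps with their distances
    print(method, Z[-1, 2])         # distance at the final merge
dendrogram(linkage(X, method="average"))
plt.show()
```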

Time and space complexity

▶ n ≡ number of points.
▶ d ≡ number of features.

Space complexity: O(n²)
▶ It requires storing the distance matrix.

Time complexity: O(n³)
▶ There are n steps, and at each step the n × n distance matrix
must be updated and searched.
▶ Complexity can be reduced to O(n² log(n)) if the algorithm uses
an appropriate data structure.

⇒ Not suitable for large data sets.

E04 - K-means

Extensions
▶ KNIME Core.
▶ KNIME Data Generation.

E05 - Hierarchical clustering

Extensions
▶ KNIME Core.
▶ KNIME Data Generation.

What is classification?

Definition
Given a collection of records E, classification is the task of learning a
target function g that maps each feature set x to one of the
predefined class labels y (g : X → Y ).

Goal
Classify new unlabeled records.

Workflow

E06 - Classification workflow

Extensions
▶ KNIME Core.

Empirical error

Empirical error

Empirical error $\hat{\epsilon}(g, f)$:

$\hat{\epsilon}(g, f) = \frac{1}{n} \sum_{x \in S} (1 - \delta(g(x), y))$

where y = f(x).

Model complexity

Model complexity
Underfitting

Your model is underfitting the training data when it performs poorly
on the training data.
Causes
▶ Trying to fit a linear model to non-linear data.
▶ Having too little data to build an accurate model.
▶ The model is too simple, with too few features.
Remedies
▶ Add more features during feature selection.
▶ Engineer additional features, within the scope of your problem,
that make sense.
⇒ Learners tend to have low variance but high bias.
⇒ They cannot capture the relationship in the training data →
inaccurate predictions on the training data.

Model complexity
Overfitting

Your model is overfitting the training data when it performs well on
the training data but does not perform well on the evaluation data.
Causes
▶ The model captures the noise of the data.
Remedies
▶ Apply k-fold cross-validation.
▶ Train with more data.
▶ Remove features.
⇒ Learners tend to have high variance and low bias.
⇒ Models tend to be excessively complicated.

Evaluating classification models

▶ Accuracy ≡ predicting the class label.
▶ Speed
▶ Time to build the model (training time).
▶ Time to use the model (classification time).
▶ Robustness ≡ handling noise and missing values.
▶ Scalability ≡ ability to build the classifier efficiently given large
amounts of data.
▶ Interpretability ≡ understanding and insight provided by the
model.

Scoring metrics

▶ Confusion matrix
▶ Accuracy
▶ Sensitivity and specificity
▶ Precision and recall
▶ F-measure
▶ ROC curve

Why different scoring metrics?

▶ What is your objective?
▶ What is the target class distribution?
▶ Is the target binomial or multinomial?

Confusion matrix

                 Predicted
                 Class=0   Class=1
Actual  Class=0    TN        FP
        Class=1    FN        TP

A confusion matrix is a table that allows visualization of the
performance of the classifier.
▶ True positive (TP)
▶ True negative (TN)
▶ False positive (FP, or Type I error)
▶ False negative (FN, or Type II error)

Accuracy
Accuracy: probability of classifying a positive OR negative class
event correctly.

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Accuracy} = \frac{\text{\#correct predictions}}{\text{\#predictions}} = \frac{TN + TP}{TN + FN + TP + FP} = \frac{23}{28} \approx 0.82$

Sensitivity and specificity
Sensitivity: are ALL positive class events found by the model?
Specificity: are ALL negative class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.75$

$\text{Specificity} = \frac{TN}{TN + FP} = \frac{20}{20 + 4} \approx 0.83$

Precision and recall
Precision: are ONLY positive class events found by the model?
Recall: are ALL positive class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Precision} = \frac{TP}{TP + FP} = \frac{3}{3 + 4} \approx 0.43$

$\text{Recall} = \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.75$

F-measure

Are ALL and ONLY positive class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.55$
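As a sanity check, the metrics of the last few slides can be recomputed from the confusion matrix (TN=20, FP=4, FN=1, TP=3):

```python
# Recomputing the slide metrics from the confusion matrix entries.
TN, FP, FN, TP = 20, 4, 1, 3

accuracy    = (TN + TP) / (TN + FP + FN + TP)   # 23/28 ~ 0.82
sensitivity = TP / (TP + FN)                    # recall, 3/4 = 0.75
specificity = TN / (TN + FP)                    # 20/24 ~ 0.83
precision   = TP / (TP + FP)                    # 3/7 ~ 0.43
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.55
print(accuracy, sensitivity, specificity, precision, f_measure)
```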

Receiver Operating Characteristic (ROC) curve

The ROC curve shows the trade-off between sensitivity (TPR) and
specificity (1 − FPR).
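An illustrative scikit-learn sketch (the synthetic data and logistic regression model are assumptions) that traces the curve from predicted scores:

```python
# Sketch: ROC curve and AUC from a model's predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "--")   # chance line
plt.xlabel("FPR (1 - specificity)")
plt.ylabel("TPR (sensitivity)")
plt.legend()
plt.show()
```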

Scoring multivariate classification models
Bias and variance

▶ Bias ≡ error associated with differences between f and g.
▶ Variance ≡ error due to the variability of the model.

Bias
▶ Bias is the error of the best classifier in the concept class.
▶ Bias is high if the concept class cannot model the true data
distribution well.

Variance

▶ Variance is the error of the trained classifier with respect to the
best classifier in the concept class.
▶ Variance depends on the training set size.
▶ Variance decreases with more training data and increases with
more complex classifiers.

▶ High bias → both training and test error are high.
▶ High variance → training error is low and test error is high.

Resubstitution

$\hat{\epsilon} = \frac{1}{n} \sum_{i=1}^{n} (1 - \delta(y_i, g(x_i, E)))$

▶ Bias ≡ large, optimistic.
▶ Variance ≡ low.
▶ Overfitting.

Holdout

$\hat{\epsilon} = \frac{1}{n - n_e} \sum_{i=1,\, x_i \in E_p}^{n - n_e} (1 - \delta(y_i, g(x_i, E_e)))$

▶ $E_n = E_e \cup E_p$, $E_e \cap E_p = \emptyset$ (the model is trained on $E_e$ and evaluated on $E_p$).
▶ Random split (1/2 − 1/2, 2/3 − 1/3).
▶ Bias ≡ low; variance ≡ high.
▶ Samples might not be representative ⇒ stratification.

k-fold cross-validation

$\hat{\epsilon} = \frac{1}{k} \sum_{j=1}^{k} \hat{\epsilon}_j$

▶ Split the training set into k folds (sets).
▶ Run k times (see the sketch below):
▶ Train the model on k − 1 folds.
▶ Validate the model on the remaining fold.
▶ Bias ≡ low.
▶ Variance ≡ high.
▶ Very small sample ⇒ leave-one-out cross-validation.
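A sketch of both estimators with scikit-learn (the Iris data and decision tree are stand-ins for the exercise datasets):

```python
# Sketch: holdout split and k-fold cross-validation error estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: 2/3 - 1/3 split, stratified so class proportions are preserved.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                          stratify=y, random_state=0)
print(1 - DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: average the k per-fold estimates.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(1 - scores.mean())   # estimated error
```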

E07 - Classification workflow using hold-out and cross validation

Extensions
▶ KNIME Core.

E08 - Classification of rotogravure printing data

Extensions
▶ KNIME Core.

Question: Given two classifiers, which one performs better?
▶ Compare the estimated errors.
▶ Problem: variance in the estimates.
Question: Do the mean estimates differ significantly?
▶ Significance tests tell us how confident we can be that there
really is a difference.
▶ Student's t-test tells whether the means of two (small) samples are
significantly different (see the sketch below).
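A hedged sketch of this comparison: score two classifiers on the same folds and apply a paired Student's t-test (the data, the two models, and k=10 are assumptions):

```python
# Sketch: paired t-test on per-fold scores of two classifiers
# (same folds for both, so the samples are paired).
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
s1 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
s2 = cross_val_score(GaussianNB(), X, y, cv=10)
t_stat, p_value = ttest_rel(s1, s2)
print(p_value)   # small p -> the mean scores differ significantly
```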

Naive Bayes Classifier (NBC)
Visual intuition

NBC
Visual intuition

Histogram for Antenna Length:

NBC
Visual intuition

Summarize the histograms with two normal distributions:

$f(x \mid \sigma, \mu) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

NBC
Visual intuition

Goal: classify an insect with an antenna length of 3 units.

Question: what is the probability of being a grasshopper or a katydid
given an antenna length of 3 units?

p(grasshopper | x = 3)

p(katydid | x = 3)

NBC
Visual intuition

Goal: classify an insect with an antenna length of 3 units.

Question: what is the probability of being a grasshopper or a katydid
given an antenna length of 3 units?

$p(\text{grasshopper} \mid x = 3) = \frac{10}{10 + 2} \approx 0.833$

$p(\text{katydid} \mid x = 3) = \frac{2}{10 + 2} \approx 0.167$

NBC
Visual intuition

$p(\text{grasshopper} \mid x = 7) = \frac{3}{3 + 9} = 0.250$

$p(\text{katydid} \mid x = 7) = \frac{9}{3 + 9} = 0.750$

NBC
Visual intuition

$p(\text{grasshopper} \mid x = 5) = \frac{6}{6 + 6} = 0.5$

$p(\text{katydid} \mid x = 5) = \frac{6}{6 + 6} = 0.5$

NBC
Bayes’ theorem

▶ Probabilistic classifier based on Bayes' theorem.
▶ Determine the best (most probable) hypothesis from some space
H, given the observed training data D.
▶ Bayes' theorem:

$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$

where h is a hypothesis from H and P(D) ≠ 0.
▶ P(h) is the prior probability of h ≡ it reflects any background
knowledge we have about the chance that h is a correct hypothesis.
▶ P(D) is the prior probability that training data D will be observed.
▶ P(D|h) is the probability of observing data D given some world in
which hypothesis h holds.
▶ P(h|D) is the probability that h holds given the observed training
data D.

NBC
Bayes’ theorem example: cancer at age 65

Question: what is an individual's probability of having cancer (y) at
age (x) 65 → P(y|x)?

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$

▶ P(y) ≡ the prior probability of having cancer is 1% without prior
knowledge about the individual.
▶ If we assume that cancer and age are related and that:
▶ P(x) ≡ the probability of being 65 years old is 0.2%.
▶ P(x|y) ≡ the probability that a person with cancer is 65 years old is
0.5%.

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} = \frac{0.5\% \times 1\%}{0.2\%} = 2.5\%$

NBC
Maximum a posteriori (MAP) hypothesis

Given Bayes' rule

$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$

the most probable hypothesis given the training data is the maximum
a posteriori hypothesis $h_{MAP}$:

$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\, P(h \mid D) = \underset{h \in H}{\operatorname{argmax}}\, \frac{P(D \mid h) P(h)}{P(D)} = \underset{h \in H}{\operatorname{argmax}}\, P(D \mid h) P(h)$

Note that we can drop P(D), as the probability of the data is constant
and independent of the hypothesis.

NBC
Properties

▶ Incremental: with each training example, the prior and the
likelihood can be updated dynamically: flexible and robust to
errors.
▶ Combines prior knowledge and observed data: the prior probability
of a hypothesis is multiplied by the probability of the observed data
given the hypothesis.
▶ Probabilistic hypotheses: outputs not only a classification, but a
probability distribution over all classes (see the sketch below).
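A minimal Gaussian NBC sketch with scikit-learn, echoing the per-class normal summaries used in the visual intuition (the Iris data is an assumption):

```python
# Sketch: Gaussian Naive Bayes - one normal per class and feature.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
nbc = GaussianNB().fit(X_tr, y_tr)
print(nbc.score(X_te, y_te))         # accuracy
print(nbc.predict_proba(X_te[:3]))   # probabilistic output per class
```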

Logistic Regression (LR)
Visual intuition

Linear regression → ?
▶ Y should follow a normal distribution.
▶ Y and X must have a linear relationship.
Map from (−∞, ∞) to [0, 1] ⇒ logistic (inverse logit) transformation.

LR
Logistic transformation

Where p ∈ [0, 1].
Link function: the logit function

$\operatorname{logit}(p) = \log \frac{p}{1 - p} = \beta_0 + \beta_1 \cdot x$

$p = \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} = \frac{1}{1 + e^{-\hat{y}}}$

LR
Estimation of parameters

$p = \frac{1}{1 + e^{-\hat{y}}}$

▶ $p = p(y_i)$ → predicted probability that Y is true for case i.
▶ $\beta_0$ → constant estimated from the data.
▶ $\beta_1$ → β-coefficient estimated from the data.
Some algorithms for parameter estimation (see the sketch below):
▶ Gradient descent
▶ Maximum likelihood estimation (MLE)
▶ Newton's method
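A short sketch with scikit-learn (the synthetic data and its true coefficients β₀ = 0.5, β₁ = 2 are assumptions) showing the estimated parameters on the logit scale:

```python
# Sketch: fit logistic regression and inspect the beta estimates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Labels drawn with P(y=1 | x) = 1 / (1 + exp(-(0.5 + 2x))).
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))).astype(int)

lr = LogisticRegression().fit(x, y)
print(lr.intercept_, lr.coef_)     # estimates of beta_0 and beta_1
print(lr.predict_proba([[0.0]]))   # class probabilities at x = 0
```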

LR
Properties

▶ The model can be extended to more than one predictor.
▶ The model can be extended to multiclass (multinomial) classification problems.
▶ Weighted LR is suitable for imbalanced datasets.

Decision Tree (DT)
Visual intuition

DT
Visual intuition

▶ The tree is built using a top-down strategy.
▶ Tree partitioning is done using a recursive approach.
▶ Partitioning stops if:
▶ All samples belong to the same class.
▶ There are no remaining attributes for partitioning.

DT
Visual intuition

Splitting criterion: Information Gain (see the worked sketch below).
E09 - Compare classification models

Extensions
▶ KNIME Core.

