
Business Analytics

Module 1

Big Data Diploma Program


Universidad Nacional de Asunción

Miguel García Torres


Data Science & Big Data Research Lab
Lenguajes y Sistemas Informáticos, Universidad Pablo de Olavide
Definition
Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some
sense or another) to each other than to those in other groups
(clusters).

Purpose of clustering

Understanding
▶ Understanding data ⇒ clusters are potential classes.
▶ In this context, cluster analysis is the study of techniques for
automatically finding classes.

Utility
▶ Cluster analysis provides an abstraction from individual data objects
to the clusters obtained ⇒ it reduces the size of large data sets.
▶ In this context, cluster analysis is the study of techniques for
finding the most representative cluster prototypes.
Clustering in the real world
Customer segmentation

Cluster validity

▶ How do we evaluate the goodness of the resulting clusters?

▶ Why evaluate clusters?
▶ To avoid finding patterns in noise.
▶ To compare clustering algorithms.
▶ To compare two sets of clusters.
▶ To compare two clusters.
Cluster validity
Clusters found in random data

Cluster validity
Correlation

▶ Two matrices:
▶ Proximity matrix.
▶ Incidence matrix:
▶ One row and one column for each data point.
▶ An entry is 1 if the associated pair of points belongs to the same cluster.
▶ An entry is 0 if the associated pair of points belongs to different clusters.
▶ Compute the correlation between the two matrices (see the sketch below).
▶ High correlation indicates that points that belong to the same cluster
are close to each other.
▶ Not a good measure for some density-based clusters.
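A minimal Python sketch of this check (the toy data and scikit-learn's KMeans are illustrative assumptions, not part of the slides): build both matrices and correlate their upper triangles. Note that with a distance-based proximity matrix, a good clustering yields a strongly negative correlation, since same-cluster pairs have small distances.

```python
# Sketch: correlation between a proximity (distance) matrix and the
# incidence matrix induced by cluster labels.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

proximity = squareform(pdist(X))                    # pairwise distances
incidence = (labels[:, None] == labels[None, :]).astype(float)

# Correlate the upper triangles (the diagonal and duplicates are uninformative).
iu = np.triu_indices_from(proximity, k=1)
corr = np.corrcoef(proximity[iu], incidence[iu])[0, 1]
print(corr)   # strongly negative: same-cluster pairs have small distances
```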

Cluster validity
Correlation

Correlation of incidence and proximity matrices for the K-means
clusterings of the following two data sets.
Cluster validity
Similarity matrix and Correlation

Order the similarity matrix with respect to cluster labels and inspect
visually.
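A possible way to do this in Python (the toy blobs and the distance-to-similarity transform are assumptions): sort rows and columns by cluster label and show the matrix as an image; a good clustering produces a block-diagonal pattern.

```python
# Sketch: reorder a similarity matrix by cluster label and inspect it.
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 2, 4)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)                 # group points by cluster
dist = squareform(pdist(X))
similarity = 1 - dist / dist.max()         # simple similarity transform
plt.imshow(similarity[np.ix_(order, order)], cmap="viridis")
plt.colorbar()
plt.show()                                 # blocks on the diagonal = good clusters
```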

Cluster validity
Similarity matrix and Correlation

Clusters in random data.

E01 - Similarity matrix

Extensions
▶ KNIME Core.
▶ KNIME JavaScript Views.
▶ KNIME Interactive R Statistics Integration.
▶ R
▶ ggplot, Rserve
Variance
Low variance filter

Variance
Variance is the expectation of the squared deviation of a random
variable from its mean.
Variance is a measure of dispersion: it measures how far a set of
numbers is spread out from its average value:

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

with

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
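As a sketch, a low-variance filter can be written directly from this formula (the threshold and toy data are arbitrary; scikit-learn's VarianceThreshold offers the same behavior):

```python
# Sketch: drop features whose variance falls below a threshold.
import numpy as np

def low_variance_filter(X: np.ndarray, threshold: float) -> np.ndarray:
    variances = X.var(axis=0)   # population variance: (1/n) * sum((x - mu)^2)
    return X[:, variances > threshold]

X = np.array([[1.0, 0.0, 5.0],
              [2.0, 0.0, 5.1],
              [3.0, 0.0, 4.9]])
print(low_variance_filter(X, 0.01))   # the near-constant columns are removed
```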

Correlation
Highly correlated feature filter: Pearson

▶ Measures the linear relationship between features.
▶ For numeric features.
▶ No normalization necessary.
▶ Output value: [−1, 1].

$C(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
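A small illustrative implementation of this coefficient (the toy features are assumptions); in a highly-correlated-feature filter, a pair with |C| near 1 is a candidate for dropping one of the two features.

```python
# Sketch: Pearson's C(X, Y) written out from the formula above.
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # nearly a linear copy of a
print(pearson(a, b))                          # close to 1 -> drop one feature
```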

Correlation
Highly correlated feature filter: χ²

▶ Tests independence between two features.
▶ High values suggest dependence.
▶ For categorical features.

$\chi^2 = \sum_{ij} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
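A hedged sketch using SciPy's chi2_contingency on an assumed contingency table of two categorical features:

```python
# Sketch: chi-squared independence test between two categorical features.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],    # rows: categories of feature A
                     [10, 30]])   # cols: categories of feature B
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2 / small p -> the features look dependent
```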

Feature construction
PCA
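Since this slide relies on a figure, here is a minimal, assumed sketch of PCA as feature construction with scikit-learn (the KNIME PCA node used in E03 plays the same role):

```python
# Sketch: project the data onto its top principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # the constructed features
print(pca.explained_variance_ratio_)     # variance captured per component
```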

E02 - Low variance filter

Extensions
▶ KNIME Core.

E03 - PCA

Extensions
▶ KNIME Core.

K-means

▶ Prototype-based clustering technique.
▶ One-level partitioning of the data objects.
▶ K-means defines a prototype in terms of a centroid.
▶ A centroid is the mean of a cluster of points.
▶ A centroid almost never corresponds to an actual data point.
▶ K-means is usually applied to objects in a continuous
d-dimensional space.

K-means

Pseudocode
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until centroids do not change.
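A compact NumPy rendering of this pseudocode, as a sketch rather than the exact implementation behind the KNIME node (random initialization and the stopping test are assumptions):

```python
# Sketch: Lloyd's algorithm, following the numbered steps above.
import numpy as np

def kmeans(X: np.ndarray, k: int, seed: int = 0, max_iter: int = 100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its cluster
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # step 5
            break
        centroids = new_centroids
    return labels, centroids
```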

K-means and different types of clusters

Strengths and weaknesses

Strengths
▶ Simple and efficient.
▶ There exist K-means variants that are more efficient and less
susceptible to initialization problems.

Weaknesses
▶ Not suitable for all types of data.
▶ It cannot handle non-globular clusters or clusters of different
sizes and densities.
▶ It has trouble if the data contains outliers.

Agglomerative Hierarchical Clustering

▶ Agglomerative clustering:
▶ First merge very similar instances.
▶ Incrementally build larger clusters out of smaller clusters.
▶ Algorithm:
▶ Start with individual points as clusters.
▶ Repeat:
▶ Pick the two closest clusters.
▶ Merge them into a new cluster.
▶ Stop when there is only one cluster left.
▶ Produces a family of clusterings represented by a dendrogram.

Agglomerative Hierarchical Clustering

▶ Start with clusters of individual points and a distance matrix.

Agglomerative Hierarchical Clustering

▶ After some merging steps, we have some clusters.

Agglomerative Hierarchical Clustering
▶ Merge the two closest clusters (C2 and C5) and update the
distance matrix.

Agglomerative Hierarchical Clustering

▶ How to update the distance matrix?

Defining proximity between clusters

▶ Single link: the intercluster distance is equal to the minimum
distance between any two instances, one from each cluster.
▶ Complete link: the intercluster distance is equal to the furthest
distance between any two instances, one from each cluster.
▶ Group average: the intercluster distance is the average of the
distances between all instances in the two clusters (see the sketch below).
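A brief sketch of the three definitions via SciPy's linkage function, whose "single", "complete", and "average" methods match the bullets above (the toy data is an assumption):

```python
# Sketch: the three proximity definitions in hierarchical clustering.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)   # (n-1) merge steps with their distances
    print(method, Z[-1, 2])         # distance at the final merge
dendrogram(linkage(X, method="average"))
plt.show()
```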

Time and space complexity

▶ n ≡ number of points.
▶ d ≡ number of features.

Space complexity: O(n²)
▶ It requires storing the distance matrix.

Time complexity: O(n³)
▶ There are n steps, and at each step the n × n distance matrix
must be updated and searched.
▶ Complexity can be reduced to O(n² log(n)) if the algorithm uses
an appropriate data structure.

⇒ Not suitable for large data sets.

E04 - K-means

Extensions
▶ KNIME Core.
▶ KNIME Data Generation.

E05 - Hierarchical clustering

Extensions
▶ KNIME Core.
▶ KNIME Data Generation.

What is classification?

Definition
Given a collection of records E, classification is the task of learning a
target function g that maps each feature set x to one of the
predefined class labels y (g : X → Y ).

Goal
Classify new unlabeled records.

Workflow

E06 - Classification workflow

Extensions
▶ KNIME Core.

Empirical error

Empirical error

Empirical error $\hat{\epsilon}(g, f)$:

$\hat{\epsilon}(g, f) = \frac{1}{n} \sum_{x \in S} (1 - \delta(g(x), y))$

where y = f(x).

Model complexity

Model complexity
Underfitting

Your model is underfitting the training data when it performs poorly
on the training data.
Causes
▶ Trying to fit a linear model to non-linear data.
▶ Having too little data to build an accurate model.
▶ The model is too simple, with too few features.
Remedies
▶ Add more features during feature selection.
▶ Engineer additional features, within the scope of your problem,
that make sense.
⇒ Learners tend to have low variance but high bias.
⇒ They cannot capture the relationship in the training data →
inaccurate predictions on the training data.

Model complexity
Overfitting

Your model is overfitting the training data when it performs well on
the training data but does not perform well on the evaluation data.
Causes
▶ The model captures the noise of the data.
Remedies
▶ Apply k-fold cross-validation.
▶ Train with more data.
▶ Remove features.
⇒ Learners tend to have high variance and low bias.
⇒ Models tend to be excessively complicated.

Evaluating classification models

▶ Accuracy ≡ predicting the class label.
▶ Speed
▶ Time to build the model (training time).
▶ Time to use the model (classification time).
▶ Robustness ≡ handling noise and missing values.
▶ Scalability ≡ ability to build the classifier efficiently given large
amounts of data.
▶ Interpretability ≡ understanding and insight provided by the
model.

Scoring metrics

▶ Confusion matrix
▶ Accuracy
▶ Sensitivity and specificity
▶ Precision and recall
▶ F-measure
▶ ROC curve

Why different scoring metrics?

▶ What is your objective?
▶ What is the target class distribution?
▶ Is the target binomial or multinomial?

Confusion matrix

                 Predicted
                 Class=0   Class=1
Actual  Class=0    TN        FP
        Class=1    FN        TP

A confusion matrix is a table that allows visualization of the
performance of the classifier.
▶ True positive (TP)
▶ True negative (TN)
▶ False positive (FP, or Type I error)
▶ False negative (FN, or Type II error)

Accuracy
Accuracy: probability of classifying a positive OR negative class
event correctly.

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Accuracy} = \frac{\text{\#correct predictions}}{\text{\#predictions}} = \frac{TN + TP}{TN + FN + TP + FP} = \frac{23}{28} \approx 0.82$

Sensitivity and specificity
Sensitivity: are ALL positive class events found by the model?
Specificity: are ALL negative class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.75$

$\text{Specificity} = \frac{TN}{TN + FP} = \frac{20}{20 + 4} \approx 0.83$

Precision and recall
Precision: are ONLY positive class events found by the model?
Recall: are ALL positive class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$\text{Precision} = \frac{TP}{TP + FP} = \frac{3}{3 + 4} \approx 0.43$

$\text{Recall} = \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.75$

F-measure

Are ALL and ONLY positive class events found by the model?

                 Predicted
                 Healthy   Disease
Actual  Healthy    20         4
        Disease     1         3

$F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.55$
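As a sanity check, the metrics of the last few slides can be recomputed from the confusion matrix (TN=20, FP=4, FN=1, TP=3):

```python
# Recomputing the slide metrics from the confusion matrix entries.
TN, FP, FN, TP = 20, 4, 1, 3

accuracy    = (TN + TP) / (TN + FP + FN + TP)   # 23/28 ~ 0.82
sensitivity = TP / (TP + FN)                    # recall, 3/4 = 0.75
specificity = TN / (TN + FP)                    # 20/24 ~ 0.83
precision   = TP / (TP + FP)                    # 3/7 ~ 0.43
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.55
print(accuracy, sensitivity, specificity, precision, f_measure)
```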

Receiver Operating Characteristic (ROC) curve

The ROC curve shows the trade-off between sensitivity (TPR) and
specificity (1 − FPR).
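An illustrative scikit-learn sketch (the synthetic data and logistic regression model are assumptions) that traces the curve from predicted scores:

```python
# Sketch: ROC curve and AUC from a model's predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "--")   # chance line
plt.xlabel("FPR (1 - specificity)")
plt.ylabel("TPR (sensitivity)")
plt.legend()
plt.show()
```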

Scoring multivariate classification models
Bias and variance

▶ Bias ≡ error associated with differences between f and g.
▶ Variance ≡ error due to the variability of the model.

Bias
▶ Bias is the error of the best classifier in the concept class.
▶ Bias is high if the concept class cannot model the true data
distribution well.

Variance

▶ Variance is the error of the trained classifier with respect to the
best classifier in the concept class.
▶ Variance depends on the training set size.
▶ Variance decreases with more training data and increases with
more complex classifiers.

▶ High bias → both training and test error are high.
▶ High variance → training error is low and test error is high.

Resubstitution

$\hat{\epsilon} = \frac{1}{n} \sum_{i=1}^{n} (1 - \delta(y_i, g(x_i, E)))$

▶ Bias ≡ large, optimistic.
▶ Variance ≡ low.
▶ Overfitting.

Holdout

$\hat{\epsilon} = \frac{1}{n - n_e} \sum_{i=1,\, x_i \in E_p}^{n - n_e} (1 - \delta(y_i, g(x_i, E_e)))$

▶ $E_n = E_e \cup E_p$, $E_e \cap E_p = \emptyset$ (the model is trained on $E_e$ and evaluated on $E_p$).
▶ Random split (1/2 − 1/2, 2/3 − 1/3).
▶ Bias ≡ low; variance ≡ high.
▶ Samples might not be representative ⇒ stratification.

k-fold cross-validation

$\hat{\epsilon} = \frac{1}{k} \sum_{j=1}^{k} \hat{\epsilon}_j$

▶ Split the training set into k folds (sets).
▶ Run k times (see the sketch below):
▶ Train the model on k − 1 folds.
▶ Validate the model on the remaining fold.
▶ Bias ≡ low.
▶ Variance ≡ high.
▶ Very small sample ⇒ leave-one-out cross-validation.
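A sketch of both estimators with scikit-learn (the Iris data and decision tree are stand-ins for the exercise datasets):

```python
# Sketch: holdout split and k-fold cross-validation error estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: 2/3 - 1/3 split, stratified so class proportions are preserved.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                          stratify=y, random_state=0)
print(1 - DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: average the k per-fold estimates.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(1 - scores.mean())   # estimated error
```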

E07 - Classification workflow using hold-out and cross validation

Extensions
▶ KNIME Core.

E08 - Classification of rotogravure printing data

Extensions
▶ KNIME Core.

Question: Given two classifiers, which one performs better?
▶ Compare the estimated errors.
▶ Problem: variance in the estimates.
Question: Do the mean estimates differ significantly?
▶ Significance tests tell us how confident we can be that there
really is a difference.
▶ Student's t-test tells whether the means of two (small) samples are
significantly different (see the sketch below).
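A hedged sketch of this comparison: score two classifiers on the same folds and apply a paired Student's t-test (the data, the two models, and k=10 are assumptions):

```python
# Sketch: paired t-test on per-fold scores of two classifiers
# (same folds for both, so the samples are paired).
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
s1 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
s2 = cross_val_score(GaussianNB(), X, y, cv=10)
t_stat, p_value = ttest_rel(s1, s2)
print(p_value)   # small p -> the mean scores differ significantly
```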

Naive Bayes Classifier (NBC)
Visual intuition

NBC
Visual intuition

Histogram for Antenna Length:

NBC
Visual intuition

Summarize the histograms with two normal distributions:

$f(x \mid \sigma, \mu) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

NBC
Visual intuition

Goal: classify an insect with an antenna length of 3 units.

Question: what is the probability of being a grasshopper or a katydid
given an antenna length of 3 units?

p(grasshopper | x = 3)

p(katydid | x = 3)

NBC
Visual intuition

Goal: classify an insect with an antenna length of 3 units.

Question: what is the probability of being a grasshopper or a katydid
given an antenna length of 3 units?

$p(\text{grasshopper} \mid x = 3) = \frac{10}{10 + 2} \approx 0.833$

$p(\text{katydid} \mid x = 3) = \frac{2}{10 + 2} \approx 0.167$

NBC
Visual intuition

$p(\text{grasshopper} \mid x = 7) = \frac{3}{3 + 9} = 0.250$

$p(\text{katydid} \mid x = 7) = \frac{9}{3 + 9} = 0.750$

NBC
Visual intuition

$p(\text{grasshopper} \mid x = 5) = \frac{6}{6 + 6} = 0.5$

$p(\text{katydid} \mid x = 5) = \frac{6}{6 + 6} = 0.5$

NBC
Bayes’ theorem

▶ Probabilistic classifier based on Bayes' theorem.
▶ Determine the best (most probable) hypothesis from some space
H, given the observed training data D.
▶ Bayes' theorem:

$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$

where h is a hypothesis from H and P(D) ≠ 0.
▶ P(h) is the prior probability of h ≡ it reflects any background
knowledge we have about the chance that h is a correct hypothesis.
▶ P(D) is the prior probability that training data D will be observed.
▶ P(D|h) is the probability of observing data D given some world in
which hypothesis h holds.
▶ P(h|D) is the probability that h holds given the observed training
data D.

NBC
Bayes’ theorem example: cancer at age 65

Question: what is an individual's probability of having cancer (y) at
age (x) 65 → P(y|x)?

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$

▶ P(y) ≡ the prior probability of having cancer is 1% without prior
knowledge about the individual.
▶ If we assume that cancer and age are related and that:
▶ P(x) ≡ the probability of being 65 years old is 0.2%.
▶ P(x|y) ≡ the probability that a person with cancer is 65 years old is
0.5%.

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} = \frac{0.5\% \times 1\%}{0.2\%} = 2.5\%$

NBC
Maximum a posteriori (MAP) hypothesis

Given Bayes' rule

$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$

the most probable hypothesis given the training data is the maximum
a posteriori hypothesis $h_{MAP}$:

$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\, P(h \mid D) = \underset{h \in H}{\operatorname{argmax}}\, \frac{P(D \mid h) P(h)}{P(D)} = \underset{h \in H}{\operatorname{argmax}}\, P(D \mid h) P(h)$

Note that we can drop P(D), as the probability of the data is constant
and independent of the hypothesis.

NBC
Properties

▶ Incremental: with each training example, the prior and the
likelihood can be updated dynamically: flexible and robust to
errors.
▶ Combines prior knowledge and observed data: the prior probability
of a hypothesis is multiplied by the probability of the observed data
given the hypothesis.
▶ Probabilistic hypotheses: outputs not only a classification, but a
probability distribution over all classes (see the sketch below).
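A minimal Gaussian NBC sketch with scikit-learn, echoing the per-class normal summaries used in the visual intuition (the Iris data is an assumption):

```python
# Sketch: Gaussian Naive Bayes - one normal per class and feature.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
nbc = GaussianNB().fit(X_tr, y_tr)
print(nbc.score(X_te, y_te))         # accuracy
print(nbc.predict_proba(X_te[:3]))   # probabilistic output per class
```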

Logistic Regression (LR)
Visual intuition

Linear regression → ?
▶ Y should follow a normal distribution.
▶ Y and X must have a linear relationship.
Map from (−∞, ∞) to [0, 1] ⇒ logistic (inverse logit) transformation.

LR
Logistic transformation

Where p ∈ [0, 1].
Link function: the logit function

$\operatorname{logit}(p) = \log \frac{p}{1 - p} = \beta_0 + \beta_1 \cdot x$

$p = \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} = \frac{1}{1 + e^{-\hat{y}}}$

LR
Estimation of parameters

$p = \frac{1}{1 + e^{-\hat{y}}}$

▶ $p = p(y_i)$ → predicted probability that Y is true for case i.
▶ $\beta_0$ → constant estimated from the data.
▶ $\beta_1$ → β-coefficient estimated from the data.
Some algorithms for parameter estimation (see the sketch below):
▶ Gradient descent
▶ Maximum likelihood estimation (MLE)
▶ Newton's method
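A short sketch with scikit-learn (the synthetic data and its true coefficients β₀ = 0.5, β₁ = 2 are assumptions) showing the estimated parameters on the logit scale:

```python
# Sketch: fit logistic regression and inspect the beta estimates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Labels drawn with P(y=1 | x) = 1 / (1 + exp(-(0.5 + 2x))).
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))).astype(int)

lr = LogisticRegression().fit(x, y)
print(lr.intercept_, lr.coef_)     # estimates of beta_0 and beta_1
print(lr.predict_proba([[0.0]]))   # class probabilities at x = 0
```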

LR
Properties

▶ The model can be extended to more than one predictor.
▶ The model can be extended to multiclass (multinomial) classification problems.
▶ Weighted LR is suitable for imbalanced datasets.

Decision Tree (DT)
Visual intuition

DT
Visual intuition

▶ The tree is built using a top-down strategy.
▶ Tree partitioning is done using a recursive approach.
▶ Partitioning stops if:
▶ All samples belong to the same class.
▶ There are no remaining attributes for partitioning.

DT
Visual intuition

Splitting criterion: Information Gain (see the worked sketch below).
E09 - Compare classification models

Extensions
▶ KNIME Core.

