CS8091 BDA Unit 2
K – means:
Use Cases:
Once the clusters are identified, labels can be
applied to each cluster to classify each group based
on its characteristics.
To discover hidden structures in the data, possibly as an introduction to more focused analysis or decision processes.
Clustering: K – means
Diagnostics:
The heuristic using the Within Sum of Squares (WSS) can provide at least several possible values of k to consider.
When the number of attributes is relatively small, a
common approach to further refine the choice of k is
to plot the data to determine how distinct the
identified clusters are from each other.
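A minimal R sketch of the WSS heuristic (the data frame name df and the range of k values are assumptions for illustration): compute the total WSS for several values of k and look for the "elbow" where further increases in k yield only marginal improvement.

# Minimal sketch of the WSS (elbow) heuristic for choosing k.
# df is a hypothetical numeric data frame of the attributes to cluster.
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within Sum of Squares (WSS)")
# Candidate k values sit at the "elbow", after which WSS flattens out.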
In general, the following questions should be considered:
Are the clusters well separated from each other?
Do any of the clusters have only a few points?
Do any of the centroids appear to be too close to each other?
Object Attributes:
It is important to understand what attributes will be known at the time a new object is assigned to a cluster.
For example, information on existing customers’ satisfaction or purchase frequency may be available, but such information may not be available for potential customers.
Whenever possible, it is best to reduce the number of attributes, because too many attributes can minimize the impact of the most important variables.
Another option to reduce the number of attributes is to combine several attributes into one measure, for example combining a customer’s debt and income into a single debt-to-income ratio, as in the sketch below.
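A minimal R sketch of combining two attributes into one (the data frame and column names are made up for illustration):

# Hypothetical customer data: combine debt and income into one ratio attribute.
customers <- data.frame(income = c(40000, 85000, 120000),
                        debt   = c(10000, 30000, 20000))
customers$debt_to_income <- customers$debt / customers$income
# Clustering can then use the single combined measure instead of the two raw attributes.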
Clustering: K – means
Units of Measure:
The algorithm will identify different clusters
depending on the choice of the units of measure.
For example, consider clustering patients based on age in years and height in centimeters.
But if the height were rescaled from centimeters to meters by dividing by 100, the resulting clusters would be slightly different.
Clustering: K – means
Rescaling:
Attributes that are expressed in dollars are common in
clustering analyses and can differ in magnitude from
the other attributes.
For example, if personal income is expressed in dollars and age in years, the income attribute, often in the tens of thousands, will dominate the distance calculation.
Although some adjustments could be made by expressing the income in thousands of dollars (for example, 10 for $10,000), such choices are somewhat arbitrary; a more systematic approach is to divide each attribute by its standard deviation, as in the sketch below.
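A minimal R sketch of rescaling before clustering (the data frame and column names are made up for illustration):

# Hypothetical data: income in dollars, age in years.
people <- data.frame(income = c(30000, 55000, 90000, 120000),
                     age    = c(25, 40, 52, 61))
# scale() centers each column and divides it by its standard deviation,
# so no single attribute dominates the Euclidean distance.
scaled <- scale(people)
set.seed(1)
km <- kmeans(scaled, centers = 2, nstart = 10)
km$cluster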
Clustering: K – means
Additional Considerations:
K-means uses the Euclidean distance function to assign the points to the closest centroids.
Other possible choices include the cosine similarity and the Manhattan distance functions.
The cosine similarity function is often chosen to
compare two documents based on the frequency of
each word that appears in each of the documents.
The Manhattan distance, d1, between p and q is expressed as
d1(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|,
the sum of the absolute differences of the coordinates.
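A minimal R sketch of both alternative functions (the vectors are made-up word-frequency counts for illustration):

# Two hypothetical word-frequency vectors representing two documents.
p <- c(3, 0, 1, 2)
q <- c(1, 1, 0, 2)
# Cosine similarity: the cosine of the angle between the two vectors.
cosine_sim <- sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))
# Manhattan distance: the sum of the absolute coordinate differences.
manhattan <- sum(abs(p - q))
# Equivalent built-in: dist(rbind(p, q), method = "manhattan")
cosine_sim; manhattan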
Clustering
Additional Algorithms:
Partitioning around Medoids: a medoid is a
representative object in a set of objects.
In clustering, the medoids are the objects in each cluster that minimize the sum of the distances from the medoid to the other objects in the cluster.
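A minimal R sketch, assuming the cluster package, whose pam() function implements Partitioning Around Medoids (the points are made up for illustration):

library(cluster)   # provides pam()
# Hypothetical 2-D points forming two groups.
pts <- data.frame(x = c(1.0, 1.5, 1.2, 8.0, 8.5, 9.0),
                  y = c(2.0, 1.8, 2.2, 8.0, 8.2, 9.0))
fit <- pam(pts, k = 2)   # partition around 2 medoids
fit$medoids       # the representative object of each cluster
fit$clustering    # cluster assignment of each point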
Density-based clustering:
the clusters are identified by the concentration of
points.
In R, the dbscan() function can be used, as in the sketch below.
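A minimal R sketch, assuming the dbscan package (the fpc package also provides a dbscan() function); the points are generated for illustration:

library(dbscan)
set.seed(7)
# Hypothetical 2-D points: two dense blobs plus one far-away outlier.
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
             c(10, 10))
db <- dbscan(pts, eps = 0.5, minPts = 5)
db$cluster   # cluster labels per point; 0 marks noise, such as the outlier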
Classification
Classification trees:
apply to output variables that are categorical—
often binary—in nature.
Example: yes or no, purchase or not purchase.
Classification - Decision Trees
Regression trees:
apply to output variables that are numeric or
continuous.
Example: the price of a house, or a patient’s
length of stay in a hospital.
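A minimal R sketch of both tree types, assuming the rpart package (the tiny data set is made up for illustration):

library(rpart)
d <- data.frame(age      = c(22, 35, 47, 51, 63, 29),
                income   = c(20, 45, 80, 60, 90, 30),
                purchase = factor(c("no", "no", "yes", "yes", "yes", "no")))
# Classification tree: categorical (here binary) output variable.
class_tree <- rpart(purchase ~ age + income, data = d,
                    method = "class", minsplit = 2)
# Regression tree: numeric/continuous output variable.
reg_tree <- rpart(income ~ age, data = d, method = "anova", minsplit = 2)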
Classification - Decision Trees
Example – Dataset:
Age   Competition   Type       Profit
Old   Yes           Software   Down
Old   No            Software   Down
Old   No            Hardware   Down
Mid   Yes           Software   Down
Mid   Yes           Hardware   Down
Mid   No            Hardware   Up
Mid   No            Software   Up
New   Yes           Software   Up
New   No            Hardware   Up
New   No            Software   Up
To find:
P and N (the counts of Up and Down records)
Entropy of the data set
Information Gain of each attribute
The attribute with the highest Gain, which becomes the root node
Solution:
P = 5 (the number of records with the positive outcome, here Profit = Up); N = 5 (Profit = Down).
Entropy(S) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1.
Gain(Age) = 1 − [(3/10)·0 + (4/10)·1 + (3/10)·0] = 0.6, since Old records are all Down, New records are all Up, and Mid records are evenly split.
Gain(Competition) ≈ 0.124 and Gain(Type) = 0, so Age has the highest gain and becomes the root node.
Algorithms – ID3:
ID3(A, P, T)
if T is empty, return ∅
if all records in T have the same value for P, return a single node with that value
if A is empty, return a single node with the most frequent value of P in T
otherwise, compute the information gain of each attribute in A relative to T;
pick the attribute D with the largest gain, partition T by the values of D,
and return a tree with root D whose branches connect to ID3(A − {D}, P, Ti) for each partition Ti
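A minimal R sketch that reproduces these entropy and gain calculations on the example data set (column names follow the table above):

# The example data set.
d <- data.frame(
  Age         = c("Old","Old","Old","Mid","Mid","Mid","Mid","New","New","New"),
  Competition = c("Yes","No","No","Yes","Yes","No","No","Yes","No","No"),
  Type        = c("Software","Software","Hardware","Software","Hardware",
                  "Hardware","Software","Software","Hardware","Software"),
  Profit      = c("Down","Down","Down","Down","Down","Up","Up","Up","Up","Up"))

entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}
gain <- function(attr, labels) {
  entropy(labels) - sum(sapply(split(labels, attr), function(s)
    length(s) / length(labels) * entropy(s)))
}
sapply(d[c("Age", "Competition", "Type")], gain, labels = d$Profit)
# Age = 0.6, Competition ≈ 0.124, Type = 0  ->  Age is chosen as the root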
Algorithms – C4.5:
The ID3 algorithm may construct a deep and complex tree, which can cause overfitting.
C4.5 applies a bottom-up technique called pruning to simplify the tree by removing the least visited nodes and branches.
Classification - Decision Trees
Algorithms – CART:
CART (Classification And Regression Trees) handles both categorical and continuous output variables; unlike ID3 and C4.5, it uses the Gini index rather than entropy to measure impurity.
Example – Income: a continuous attribute can be split at a threshold, for example
Low income: income < $10,000
Applications:
Spam filtering – to distinguish spam e-mail from legitimate e-mail
Fraud detection – for example, identifying fraudulent auto insurance claims
Naïve Bayes
Bayes’ Theorem:
The conditional probability of event C occurring, given that event A has already occurred, is denoted as P(C|A), which can be found using the formula
P(C|A) = P(A ∩ C) / P(A)
With some minor algebra and substitution of the conditional probability, Bayes’ theorem is obtained:
P(C|A) = P(A|C) P(C) / P(A)
P(C|A) – Posterior probability: the degree to which we believe a given model accurately describes the situation, given the available data and all of our prior information; the probability of event C after the evidence A is observed.
P(C) – Prior probability: the degree to which we believe the model accurately describes reality based on all of our prior information; the probability of event C before the evidence A is observed.
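A quick numeric illustration in R (all probabilities below are made up for the sketch):

# Hypothetical spam example of Bayes' theorem: P(C|A) = P(A|C) * P(C) / P(A).
p_C            <- 0.2    # prior: P(spam)
p_A_given_C    <- 0.7    # likelihood: P(word appears | spam)
p_A_given_notC <- 0.1    # P(word appears | not spam)
# Total probability: P(A) = P(A|C) P(C) + P(A|not C) P(not C).
p_A <- p_A_given_C * p_C + p_A_given_notC * (1 - p_C)
p_C_given_A <- p_A_given_C * p_C / p_A   # posterior: P(spam | word appears)
p_C_given_A   # 0.14 / 0.22 ≈ 0.636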
Advantages:
Fast to predict the class of a record in a test data set
Disadvantages:
Known to be a bad estimator: the predicted class is often reliable, but the estimated class probabilities should not be taken too literally
Applications:
Credit Scoring
Medical
Text classification
Naïve Bayes
Smoothing:
If one of the attribute values does not appear with one of the class labels within the training set, the corresponding conditional probability is zero, which forces the entire posterior P(C|A) to zero regardless of the other attributes.
Laplace smoothing (or add-one smoothing) is a technique that pretends to see every outcome once more than it actually appears in the training set.
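A minimal R sketch, assuming the e1071 package, whose naiveBayes() function exposes Laplace smoothing through its laplace argument (the data reuse the decision-tree example above):

library(e1071)
d <- data.frame(
  Age         = c("Old","Old","Old","Mid","Mid","Mid","Mid","New","New","New"),
  Competition = c("Yes","No","No","Yes","Yes","No","No","Yes","No","No"),
  Profit      = factor(c("Down","Down","Down","Down","Down","Up","Up","Up","Up","Up")))
# laplace = 1 adds one pseudo-count to every attribute value/class combination,
# so no conditional probability is ever exactly zero.
model <- naiveBayes(Profit ~ ., data = d, laplace = 1)
predict(model, data.frame(Age = "Old", Competition = "No"))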
Naïve Bayes
Diagnostics of Classifiers:
A confusion matrix is a specific table layout that allows
visualization of the performance of a classifier.
                   Predicted Positive     Predicted Negative
Actual Positive    True Positives (TP)    False Negatives (FN)
Actual Negative    False Positives (FP)   True Negatives (TN)
Accuracy Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
False Negative Rate (FNR) = FN / (TP + FN)
Precision = TP / (TP + FP)
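A minimal R sketch computing these metrics from hypothetical confusion-matrix counts:

# Hypothetical confusion-matrix counts for illustration.
TP <- 90; FN <- 10; FP <- 25; TN <- 75
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)   # true positive rate (recall)
fpr       <- FP / (FP + TN)   # false positive rate
fnr       <- FN / (TP + FN)   # false negative rate
precision <- TP / (TP + FP)
c(accuracy = accuracy, TPR = tpr, FPR = fpr, FNR = fnr, precision = precision)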