
UNIT – 4

CLASSIFICATION: BASIC CONCEPTS


Classification and Prediction:

 Classification and prediction are two forms of data analysis that can be
used to extract models describing important data classes or to predict
future data trends.
 Classification predicts categorical (discrete, unordered) labels, whereas
prediction models continuous-valued functions.
 For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the
expenditures of potential customers on computer equipment given their
income and occupation.
 A predictor is constructed that predicts a continuous-valued function, or
ordered value, as opposed to a categorical label.
 Regression analysis is a statistical methodology that is most often used
for numeric prediction.
 Many classification and prediction methods have been proposed by
researchers in machine learning, pattern recognition, and statistics.
 Most algorithms are memory resident, typically assuming a small data
size. Recent data mining research has built on such work, developing
scalable classification and prediction techniques capable of handling
large disk-resident data.

Issues Regarding Classification and Prediction:


1. Preparing the Data for Classification and Prediction:
The following preprocessing steps may be applied to the data to help
improve the accuracy, efficiency, and scalability of the classification or
prediction process.
(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise
(by applying smoothing techniques) and the treatment of missing values (e.g.,
by replacing a missing value with the most commonly occurring value for that
attribute, or with the most probable value based on statistics).
Although most classification algorithms have some mechanisms for
handling noisy or missing data, this step can help reduce confusion during
learning.
(ii) Relevance analysis:
 Many of the attributes in the data may be redundant.
 Correlation analysis can be used to identify whether any two given
attributes are statistically related.
 For example, a strong correlation between attributes A1 and A2 would
suggest that one of the two could be removed from further analysis (see the
sketch after this list).
 A database may also contain irrelevant attributes. Attribute subset
selection can be used in these cases to find a reduced set of attributes
such that the resulting probability distribution of the data classes is as
close as possible to the original distribution obtained using all attributes.
 Hence, relevance analysis, in the form of correlation analysis and
attribute subset selection, can be used to detect attributes that do not
contribute to the classification or prediction task.
 Such analysis can help improve classification efficiency and scalability.
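As an illustration of correlation analysis, the following sketch flags pairs of numeric
attributes whose absolute correlation exceeds a threshold. The synthetic attributes, the
0.9 cutoff, and the use of NumPy are assumptions made purely for this example.

import numpy as np

# Illustrative data: two income-related attributes and one unrelated attribute.
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, size=200)
data = np.column_stack([
    income,                                   # A1: income
    income * 0.95 + rng.normal(0, 500, 200),  # A2: strongly correlated with A1
    rng.normal(30, 5, 200),                   # A3: unrelated attribute
])

corr = np.corrcoef(data, rowvar=False)        # pairwise correlation matrix
redundant = [(i, j) for i in range(corr.shape[0])
             for j in range(i + 1, corr.shape[1])
             if abs(corr[i, j]) > 0.9]        # candidate pairs; one of each could be dropped
print(redundant)                              # here A1 and A2 turn out to be redundant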
(iii) Data Transformation And Reduction
 The data may be transformed by normalization, particularly when neural
networks or methods involving distance measurements are used in the
learning step. Normalization involves scaling all values for a given
attribute so that they fall within a small specified range, such as -1 to +1
or 0 to 1 (see the sketch after this list).
 The data can also be transformed by generalizing it to higher-level concepts.
Concept hierarchies may be used for this purpose. This is particularly useful
for continuous valued attributes.
 For example, numeric values for the attribute income can be generalized
to discrete ranges, such as low, medium, and high. Similarly, categorical
attributes, like street, can be generalized to higher-level concepts, like
city.
 Data can also be reduced by applying many other methods, ranging
from wavelet transformation and principal components analysis to
discretization techniques, such as binning, histogram analysis, and
clustering.
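The following is a minimal sketch of the min-max normalization described above; the
income values and target ranges are illustrative assumptions.

import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale a 1-D attribute into [new_min, new_max] (min-max normalization).
    # Assumes the attribute is not constant (max > min).
    v = np.asarray(values, dtype=float)
    old_min, old_max = v.min(), v.max()
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

income = [12_000, 35_000, 58_000, 98_000]      # illustrative income values
print(min_max_normalize(income))               # scaled to the range 0 to 1
print(min_max_normalize(income, -1.0, 1.0))    # scaled to the range -1 to +1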

Comparing Classification and Prediction Methods:

Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class
label information).
The accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new or previously unseen data.
Speed:
This refers to the computational costs involved in generating and using
the given classifier or predictor.
Robustness:
This is the ability of the classifier or predictor to make correct
predictions given noisy data or data with missing values.
Scalability:
This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
Interpretability:
This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult
to assess.
Classification by Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled
training tuples.
A decision tree is a flowchart-like tree structure, where
 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

 The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and is therefore appropriate for
exploratory knowledge discovery.
 Decision trees can handle high dimensional data.
 Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
 The learning and classification steps of decision tree induction are simple
and fast.
 In general, decision tree classifiers have good accuracy.
 Decision tree induction algorithms have been used for classification in
many application areas, such as medicine, manufacturing and production,
financial analysis, astronomy, and molecular biology.
Algorithm For Decision Tree Induction:
The algorithm is called with three parameters:
 Data partition
 Attribute list
 Attribute selection method
 The parameter attribute list is a list of attributes describing the tuples.
 Attribute selection method specifies a heuristic procedure for selecting the
attribute that "best" discriminates the given tuples according to class.
 The tree starts as a single node, N, representing the training tuples in D.
 If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class.
 All of the terminating conditions are explained at the end of the algorithm.
 Otherwise, the algorithm calls Attribute selection method to determine the
splitting criterion.
 The splitting criterion tells us which attribute to test at node N by
determining the "best" way to separate or partition the tuples in D into
individual classes.
 There are three possible scenarios. Let A be the splitting attribute. A has v
distinct values, {a1, a2, … ,av}, based on the training data.
1. A is discrete-valued:
 In this case, the outcomes of the test at node N correspond directly
to the known values of A.
 A branch is created for each known value, aj, of A and labeled with that
value.
 A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
 In this case, the test at node N has two possible outcomes, corresponding to
the conditions A <= split point and A > split point, respectively,
 where split point is the split-point returned by Attribute selection
method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:
 The test at node N is of the form "A ∈ SA?".
 SA is the splitting subset for A, returned by Attribute selection method as
part of the splitting criterion. It is a subset of the known values of A.
The three scenarios are typically illustrated as: (a) A is discrete-valued; (b) A is
continuous-valued; (c) A is discrete-valued and a binary tree must be produced.
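As a rough, non-authoritative illustration of decision tree induction, the sketch below
trains a small tree with scikit-learn (assumed to be installed). The toy loan-application
data, the feature encoding, and the choice of the entropy criterion are assumptions made
purely for this example.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [income (0=low, 1=medium, 2=high), has_job (0/1)] - invented data.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["risky", "risky", "risky", "safe", "safe", "safe"]     # class labels

clf = DecisionTreeClassifier(criterion="entropy")           # information-gain style splits
clf.fit(X, y)

print(export_text(clf, feature_names=["income", "has_job"]))  # the learned tree as text
print(clf.predict([[2, 1]]))                                   # classify a new applicant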
Bayesian Classification:
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
 Let X be a data tuple. In Bayesian terms, X is considered "evidence,"
and it is described by measurements made on a set of n attributes.
 Let H be some hypothesis, such as that the data tuple X belongs to a
specified class C. For classification problems, we want to determine
P(H|X), the probability that the hypothesis H holds given the "evidence"
or observed data tuple X.
 P(H|X) is the posterior probability, or a posteriori probability, of H
conditioned on X.
 Bayes’ theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X). A small numeric illustration is given below.
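This tiny numeric example only shows the arithmetic of Bayes’ theorem; all of the
probabilities are invented for the sketch.

# P(H|X) = P(X|H) * P(H) / P(X), with invented numbers.
p_h = 0.10           # P(H): prior probability of the hypothesis
p_x_given_h = 0.60   # P(X|H): likelihood of the evidence when H holds
p_x = 0.20           # P(X): overall probability of the evidence

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.3, the posterior probability of H given X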
Naïve Bayesian Classification:
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As
usual, each tuple is represented by an n-dimensional attribute vector, X =
(x1, x2, …,xn), depicting n measurements made on the tuple from n
attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the

classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the
class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j != i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is
called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be


maximized. If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, that is, P(C1) =
P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely
computationally expensive to compute P(X|Ci). In order to reduce
computation in evaluating P(X|Ci), the naive assumption of class
conditional independence is made. This presumes that the values of the
attributes are conditionally independent of one another, given the class
label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci)
from the training tuples. For each attribute, we look at whether the
attribute is categorical or continuous-valued. For instance, to
compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D
having the value xk for Ak, divided by |Ci,D|, the number of tuples of class
Ci in D.

 If Ak is continuous-valued, then we need to do a bit more work, but the
calculation is pretty straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian
distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ),

so that P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and
standard deviation of attribute Ak for the tuples of class Ci.
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated
for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if
and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 <= j <= m, j != i

(a small illustrative sketch follows).
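This minimal Gaussian naïve Bayes sketch handles continuous attributes only and uses
invented toy data; it is not a production implementation (a library such as scikit-learn
would normally be used instead).

import numpy as np

# Toy training data: [age, income] with invented values and class labels.
X = np.array([[25, 30_000], [30, 40_000], [45, 80_000], [50, 90_000]], dtype=float)
y = np.array(["no", "no", "yes", "yes"])

def gaussian(x, mu, sigma):
    # Gaussian density g(x, mu, sigma) used for continuous attributes.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def predict(x_new):
    best_class, best_score = None, -1.0
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                          # P(Ci)
        mu, sigma = Xc.mean(axis=0), Xc.std(axis=0)       # per-attribute Gaussian parameters
        likelihood = np.prod(gaussian(x_new, mu, sigma))  # P(X|Ci) under class conditional independence
        score = likelihood * prior                        # P(X|Ci) * P(Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(np.array([28.0, 35_000.0])))                # predicted class label for the new tuple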

Cluster:
Cluster is a group of objects that belong to the same class. In other words,
similar objects are grouped in one cluster and dissimilar objects are grouped in
another cluster.
What is clustering?
Clustering is the process of grouping a set of abstract objects into classes of
similar objects.
Points to remember
A cluster of data objects can be treated as one group. While doing cluster
analysis, we first partition the set of data into groups based on data similarity and
then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Applications
 Data summarization
 Customer segmentation
 Social network analysis
 Relationship to other data mining problems
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with
large databases.
 Ability to deal with different kinds of attributes − Algorithms should be
capable of handling any kind of data, such as interval-based (numerical),
categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. It should not
be bounded to distance measures that tend to find only spherical clusters of
small size.
 High dimensionality − The clustering algorithm should be able to handle
not only low-dimensional data but also high-dimensional data.
 Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead to
poor quality clusters.
 Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. It
means that it will classify the data into k groups, which satisfy the following
requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −
 For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
 Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to another.
Partitioning-based clustering can be done in two ways.
1) The k-means algorithm, where each cluster is represented by the mean
value of the objects in the cluster.
2) The k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.
K-Means Algorithm
 Given k, the k-means algorithm is implemented in 4 steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition.
The centroid is the center (mean point) of the cluster.
 Assign each object to the cluster with the nearest seed point.
 Go back to Step 2; stop when no more reassignments occur.

The algorithm typically uses the square-error criterion:

E = Σ_{i=1..k} Σ_{x ∈ Ci} |x − mi|²

 Here, E is the sum of the square error for all objects in the data set.
 x is the point in space representing a given object, and mi is the mean of
cluster Ci (both x and mi are multidimensional). In other words, for each
object in each cluster, the distance from the object to its cluster center is
squared, and the distances are summed.
 This criterion tries to make the resulting k clusters as compact and as
separate as possible.
 Suppose that there is a set of objects located in space as depicted in the
rectangle.
 Let k = 3; i.e., the user would like to cluster the objects into three clusters.
 According to the algorithm, we arbitrarily choose three objects as the three
initial cluster centers, where cluster centers are marked by a "+".
 Each object is assigned to the cluster whose center it is nearest to.
 Such a distribution forms cluster silhouettes encircled by dotted curves (a
minimal NumPy sketch of these steps is given below).
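This sketch assumes invented 2-D points and k = 3, initializes with k arbitrarily chosen
objects as centers (a common variant of step 1), and does not handle empty clusters.

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # initial centers (step 1 variant)
    for _ in range(n_iter):
        # Step 3: assign each object to the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute the mean point of each cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # step 4: stop when nothing changes
            break
        centers = new_centers
    return labels, centers

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
labels, centers = k_means(X, k=3)
print(labels)    # cluster index of each object
print(centers)   # final cluster means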
Advantages Of K-Means
 Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
Disadvantages Of K-Means
▪ Applicable only when a mean is defined; hence it cannot be applied directly to categorical data.

▪ Need to specify k, the number of clusters, in advance.

▪ Unable to handle noisy data and outliers.

▪ Not suitable to discover clusters with non-convex shapes.

K-Medoids Clustering
 A medoid can be defined as the object of a cluster whose average
dissimilarity to all the objects in the cluster is minimal, i.e., it is the most
centrally located point in the cluster.
 K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in a cluster.
 The basic strategy of k-medoids clustering algorithms is to find k clusters in
n objects by first arbitrarily finding a representative object (the medoid) for
each cluster.
 Each remaining object is clustered with the medoid to which it is most
similar.
The strategy then iteratively replaces one of the medoids by one of the non-
medoids as long as the quality of the resulting clustering is improved.
 This quality is estimated by using a cost function that measures the average
dissimilarity between an object and the medoid of its cluster.
 To determine whether a non-medoid object "Orandom" is a good
replacement for a current medoid "Oj", the following four cases are
examined for each of the non-medoid objects "P" (a small cost-comparison
sketch follows the four cases).
Case 1: "P" currently belongs to medoid "Oj". If "Oj" is replaced by
"Orandom" as a medoid and "P" is closest to one of the other medoids "Oi"
(i ≠ j), then "P" is reassigned to "Oi".
Case 2: "P" currently belongs to medoid "Oj". If "Oj" is replaced by
"Orandom" as a medoid and "P" is closest to "Orandom", then "P" is reassigned
to "Orandom".
Case 3: "P" currently belongs to a medoid "Oi" other than "Oj". If "Oj" is
replaced by "Orandom" as a medoid and "P" is still closest to "Oi", then the
assignment does not change.
Case 4: "P" currently belongs to a medoid "Oi" other than "Oj". If "Oj"
is replaced by "Orandom" as a medoid and "P" is closest to "Orandom", then
"P" is reassigned to "Orandom".
Hierarchical Clustering Algorithms
Hierarchical clustering is one of the most popular and easiest to understand
clustering techniques.
The clustering in this example has been created by a manual volunteer effort,
but it nevertheless provides a good understanding of the multi-granularity
insights that may be obtained with such an approach. A small portion of the
hierarchical organization is illustrated in Figure 6.6.
This clustering technique is divided into two types:
 Bottom-up (agglomerative) methods:
 Top-down (divisive) methods:

Figure 6.6: Multi-granularity insights from hierarchical clustering


Bottom-up (agglomerative) methods:
In this technique, initially each data point is considered as an individual
cluster. At each iteration, the similar clusters merge with other clusters until one
cluster or K clusters are formed.
The basic agglomerative algorithm is straightforward.
 Compute the proximity matrix
 Let each data point be a cluster
 Repeat: Merge the two closest clusters and update the proximity matrix
 Until only a single cluster remains
The key operation is the computation of the proximity of two clusters.
To understand this better, let's look at a pictorial representation of the agglomerative
hierarchical clustering technique. Let's say we have six data points
{A, B, C, D, E, F}.
Step- 1: In the initial step, we calculate the proximity of individual points and
consider all the six data points as individual clusters as shown in the image below.
Step- 2: In step two, similar clusters are merged together to form a single
cluster. Suppose B, C and D, E are the similar pairs that are merged in step
two. We are now left with four clusters: A, BC, DE, and F.

Step- 3: We again calculate the proximity of new clusters and merge the similar
clusters to form new clusters A, BC, DEF.
Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC
are similar and merged together to form a new cluster. We’re now left with two
clusters A, BCDEF.

Step- 5: Finally, all the clusters are merged together and form a single cluster.

The related algorithm is shown below.


Algorithm AgglomerativeMerge(Data: D)
begin
Initialize n × n distance matrix M using D;
repeat
Pick closest pair of clusters i and j using M;
Merge clusters i and j;
Delete rows/columns i and j from M and create
a new row and column for newly merged cluster;
Update the entries of new row and column of M;
until termination criterion;
return current merged cluster set;
end
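For comparison with the pseudocode above, here is a short sketch using SciPy's
hierarchical clustering utilities (assumed to be installed); the six 2-D points standing
in for A to F are invented.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [2, 1], [2.2, 1.1], [5, 5], [5.1, 5.2], [8, 8]], dtype=float)
labels = list("ABCDEF")

Z = linkage(points, method="single")       # the sequence of merges (the tree structure)
print(Z)                                   # each row: merged clusters, merge distance, size

clusters = fcluster(Z, t=2, criterion="maxclust")   # cut the merge tree into 2 clusters
print(dict(zip(labels, clusters)))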
The order of merging naturally creates a hierarchical tree-like structure
illustrating the relationship between different clusters, which is referred to as a
dendrogram. An example of a dendrogram on successive merges on six data
points, denoted by A, B, C, D, E, and F, is illustrated in the accompanying figure.

Before any clustering is performed, it is required to determine the proximity
matrix containing the distance between each pair of points, using a distance
function. Then, the matrix is updated to record the distance between each pair of
clusters. The following three methods differ in how the distance between two
clusters is measured.
4.1.1 Group-Based Statistics
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two closest points.
Complete linkage
In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the length
of the arrow between their two furthest points.
Average linkage
In average linkage hierarchical clustering, the distance between two clusters is
defined as the average distance from each point in one cluster to every point in
the other cluster. For example, the distance between clusters “r” and “s” to the left
is equal to the average length of the arrows connecting the points of one cluster to
the points of the other.
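The three linkage definitions can be contrasted on two small point sets, as in the NumPy
sketch below; the coordinates of clusters "r" and "s" are invented.

import numpy as np

r = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)   # points of cluster "r"
s = np.array([[4, 4], [5, 5]], dtype=float)           # points of cluster "s"

pairwise = np.linalg.norm(r[:, None] - s[None, :], axis=2)   # all point-to-point distances

print(pairwise.min())    # single linkage: shortest distance between the clusters
print(pairwise.max())    # complete linkage: longest distance between the clusters
print(pairwise.mean())   # average linkage: average of all pairwise distances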
4.2 Top-down (divisive) methods
Since the divisive hierarchical clustering method is not used much in
practice, only a brief description of the technique is given here.
In simple words, we can say that the Divisive Hierarchical clustering is
exactly the opposite of the Agglomerative Hierarchical clustering. In Divisive
Hierarchical clustering, we consider all the data points as a single cluster and in
each iteration, we separate the data points from the cluster which are not similar.
Each data point that is separated is considered an individual cluster. In the
end, we are left with n clusters. As we are dividing a single cluster into n
clusters, the method is named divisive hierarchical clustering.
Algorithm GenericTopDownClustering(Data: D, Flat Algorithm: A)
begin
Initialize tree T to root containing D;
repeat
Select a leaf node L in T based on pre-defined criterion;
Use algorithm A to split L into L1 . . . Lk;
Add L1 . . . Lk as children of L in T ;
until termination criterion;
end
Figure 6.10: Generic top-down meta-algorithm for clustering

4.2.1 Bisecting k-Means


The bisecting k-means algorithm is a top-down hierarchical clustering
algorithm in which each node is split into exactly two children with a 2-means
algorithm. To split a node into two children, several randomized trial runs of the
split are used, and the split that has the best impact on the overall clustering
objective is used.
Several variants of this approach use different growth strategies for selecting
the node to be split.
For example, the heaviest node may be split first, or the node with the
smallest distance from the root may be split first. These different choices lead to
balancing either the cluster weights or the tree height.
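The sketch below shows one way the "heaviest node first" growth strategy might look,
using scikit-learn's 2-means for each split (assumed available); the synthetic data, the
number of splits, and the flat list of leaf clusters are all assumptions made for
illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # three synthetic groups of points
               rng.normal(8, 1, (30, 2)),
               rng.normal(16, 1, (20, 2))])

clusters = [X]                                  # start with a single root cluster
for _ in range(2):                              # grow the hierarchy by two splits
    heaviest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    node = clusters.pop(heaviest)               # split the heaviest leaf first
    # n_init=10 performs several randomized trial splits and keeps the best one.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(node)
    clusters += [node[labels == 0], node[labels == 1]]

print([len(c) for c in clusters])               # sizes of the resulting leaf clusters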
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the radius of a given
cluster has to contain at least a minimum number of points.
Grid-based Method
In this method, the object space is quantized into a finite number of cells that
form a grid structure, and clustering operations are performed on the grid cells.
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of
data for a given model. This method locates the clusters by clustering the density
function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of
clusters based on standard statistics, taking outlier or noise into account. It
therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or the
properties of desired clustering results. Constraints provide us with an interactive
way of communication with the clustering process. Constraints can be specified by
the user or the application requirement.
