Chp-10 (Topic Not in Book) Types of Data in Cluster Analysis
Before turning to types of data, we recall two final requirements of clustering in data mining:

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the types and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

With these requirements in mind, our study of cluster analysis proceeds as follows. First, we study types of data and how they can influence clustering methods. Second, we present a general categorization of clustering methods and then study each in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for clustering high-dimensional data, and constraint-based clustering. We also examine clustering in the context of outlier analysis.

7.2 Types of Data in Cluster Analysis

In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures:

Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables):

    x11   ...   x1f   ...   x1p
    ...   ...   ...   ...   ...
    xi1   ...   xif   ...   xip                                  (7.1)
    ...   ...   ...   ...   ...
    xn1   ...   xnf   ...   xnp

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:
    0
    d(2,1)   0
    d(3,1)   d(3,2)   0                                          (7.2)
    ...      ...      ...
    d(n,1)   d(n,2)   ...   ...   0
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i), and d(i, i) = 0, we have the matrix in (7.2). Measures of dissimilarity are discussed throughout this section.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
In this section, we discuss how object dissimilarity can be computed for objects described by interval-scaled variables; by binary variables; by categorical, ordinal, and ratio-scaled variables; or combinations of these variable types. Nonmetric similarity between complex objects (such as documents) is also described. The dissimilarity data can later be used to compute clusters of objects.
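As a minimal sketch of the data-matrix-to-dissimilarity-matrix transformation mentioned above (the function name and the choice of Euclidean distance are illustrative assumptions, not mandated by the text):

```python
import math

def data_to_dissimilarity(data):
    """Transform an n-by-p data matrix (two-mode) into an n-by-n
    dissimilarity matrix (one-mode) using Euclidean distance."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            dist = math.sqrt(sum((data[i][f] - data[j][f]) ** 2
                                 for f in range(len(data[i]))))
            d[i][j] = d[j][i] = dist  # d(i,j) = d(j,i), and d(i,i) = 0
    return d

# Three objects described by two interval-scaled variables
matrix = data_to_dissimilarity([[1.0, 2.0], [3.0, 5.0], [1.0, 2.0]])
```

Note that the result is symmetric with a zero diagonal, so in practice only the lower triangle of (7.2) needs to be stored.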
Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user. Methods of standardization are also discussed in Chapter 2 under normalization techniques for data preprocessing.
After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

    d(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )        (7.5)

where i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn) are two n-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as

    d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xin - xjn|.                     (7.6)

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i, j) >= 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) <= d(i, k) + d(k, j): Going directly from object i to object j in space is no more than making a detour over any other object k (the triangular inequality).
[Figure: Euclidean and Manhattan distances between two points, i = (1, 2) and j = (3, 5). Euclidean distance = (2^2 + 3^2)^(1/2) = 3.61; Manhattan distance = 2 + 3 = 5.]
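The two distances from the figure can be checked with a short script (a sketch; the helper names are ours):

```python
import math

def euclidean(i, j):
    # Equation (7.5): square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    # Equation (7.6): sum of the absolute differences
    return sum(abs(a - b) for a, b in zip(i, j))

i, j = (1, 2), (3, 5)
e = euclidean(i, j)  # (2^2 + 3^2)^(1/2) = 3.61 (rounded)
m = manhattan(i, j)  # 2 + 3 = 5
```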
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). For objects i and j described by p binary variables, let q be the number of variables that equal 1 for both objects, r the number that equal 1 for object i but 0 for object j, s the number that equal 0 for object i but 1 for object j, and t the number that equal 0 for both, as summarized in the following contingency table:

                       object j
                     1       0       sum
    object i   1     q       r       q + r
               0     s       t       s + t
             sum   q + s   r + t       p

The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and is thus ignored in the computation:

    d(i, j) = (r + s) / (q + r + s)                                              (7.10)
Complementarily, we can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

    sim(i, j) = q / (q + r + s) = 1 - d(i, j)                                    (7.11)

The coefficient sim(i, j) is also known as the Jaccard coefficient.
Example 7.2 Dissimilarity between binary variables. Suppose that a patient record table (Table 7.2) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to Equation (7.10), the distance between each pair of the three patients, Jack, Mary, and Jim, is

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
Table 7.2 A relational table where patients are described by binary attributes.

    name    gender   fever   cough   test-1   test-2   test-3   test-4
    Jack      M        Y       N       P        N        N        N
    Mary      F        Y       N       P        N        P        N
    Jim       M        Y       P       N        N        N        N
392 Chapter 7 Cluster Analysis
These measurements suggest that Mary and Jim are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease.
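Equation (7.10) for this example can be sketched in a few lines (the 0/1 encoding of the asymmetric attributes follows the example; the function name is ours):

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity, Equation (7.10):
    negative matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# fever, cough, test-1 .. test-4, with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

d_jack_mary = asym_binary_dissim(jack, mary)  # 1/3 = 0.33 (rounded)
```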
7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
"Howcan we compute the dissimilarity between objects described by categorical, ordin-l
and ratio-scaled variables?"
Categorical Variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are used just for data handling and do not represent any specific ordering.
"How isdissimilarity computed between objects described by categorical variables"
The dissimilarity between two objects i and j can be computed based on thè ratio of
mismatches:
where mis the number of matches (i.e.,the number of variables for which iand j are
in the same state), and p is the total number of variables. Weights can be assigned to
increase the effect of m or to assign greater weight to the matches in variables having a
larger number of states.
Example 7.3 Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. (We will use test-2 and test-3 in later examples.) Let's compute the dissimilarity matrix (7.2), that is,

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    d(4,1)   d(4,2)   d(4,3)   0

Since here we have one categorical variable, test-1, we set p = 1 in Equation (7.12) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus, we get

    0
    1   0
    1   1   0
    0   1   1   0
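As an illustrative sketch of Equation (7.12) applied to a single categorical variable (the state labels below are placeholders, since Table 7.3 itself is not reproduced in this excerpt; only objects 1 and 4 share a state, matching the matrix above):

```python
def categorical_dissim(x, y):
    """Equation (7.12): ratio of mismatches, d(i,j) = (p - m) / p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

# One categorical variable (test-1) per object; states are placeholders
test1 = [["A"], ["B"], ["C"], ["A"]]
d41 = categorical_dissim(test1[3], test1[0])  # objects 4 and 1 match
d21 = categorical_dissim(test1[1], test1[0])  # objects 2 and 1 differ
```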
Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential, but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has Mf states. These ordered states define the ranking 1, ..., Mf.
"How are ordinal variables handled?" The treatment of ordinal variables is quite
similar to that of, interval-scaled variables when computing the dissimilarity
between
objects. Suppose that f is avariable from aset of ordinal variables describing
Example 7.4 Dissimilarity between ordinal variables. Suppose that we have the sample data of Table 7.3, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available. There are three states for test-2, namely fair, good, and excellent, that is, Mf = 3. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Equation (7.5)), which results in the following dissimilarity matrix:

    0
    1.0   0
    0.5   0.5   0
    0     1.0   0.5   0
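The three steps for test-2 can be sketched as follows (a minimal illustration; the helper names are ours):

```python
def ordinal_to_interval(ranks, M):
    """Step 2: map rank r in {1..M} onto [0.0, 1.0] via z = (r - 1)/(M - 1)."""
    return [(r - 1) / (M - 1) for r in ranks]

ranks = [3, 1, 2, 3]                 # step 1: values replaced by their ranks
z = ordinal_to_interval(ranks, M=3)  # step 2: [1.0, 0.0, 0.5, 1.0]

# step 3: treat z as interval-scaled; in one dimension the Euclidean
# distance of Equation (7.5) reduces to the absolute difference
d21 = abs(z[1] - z[0])
d41 = abs(z[3] - z[0])
```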
Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants and t typically represents time. There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects:

Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

Apply logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif). The yif values can be treated as interval-valued, as described in Section 7.2.1. Notice that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the variable's definition and the application.

Treat xif as continuous ordinal data and treat their ranks as interval-valued.

The latter two methods are the most effective, although the choice of method used may depend on the given application.
Example 7.5 Dissimilarity between ratio-scaled variables. This time, we have the sample data of Table 7.3, except that only the object-identifier and the ratio-scaled variable, test-3, are available. Let's try a logarithmic transformation. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively. Using the Euclidean distance (Equation (7.5)) on the transformed values, we obtain the following dissimilarity matrix:

    0
    1.31   0
    0.44   0.87   0
    0.43   1.74   0.87   0
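Starting from the transformed values given above, the matrix entries are just absolute differences (a sketch; in one dimension the Euclidean distance reduces to |yi - yj|):

```python
y = [2.65, 1.34, 2.21, 3.08]  # log-transformed test-3 values for objects 1-4

# Lower-triangular dissimilarity matrix over the transformed values
d = [[round(abs(y[i] - y[j]), 2) for j in range(i)] for i in range(len(y))]
```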
7.2.4 Variables of Mixed Types
Sections 7.2.1 to 7.2.3 discussed how to compute the dissimilarity between objects described by variables of the same type, where these types may be either interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal, or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. In general, a database can contain all of the six variable types listed above.
"So, how can we compute the dissimilarity between objects of mixed variable types?"
One approach is to group each kind of variable together, performing a separate cluster
analysis for each variable type. This is feasible if these analyses derive compatible results.
However, in real applications, it is unlikely that aseparate cluster analysis per variable
type will generate compatible results.
A
single moré preferable approach is to process all variable types together, pertorming a
cluster : analysis. One such technique combines the ditterent variablesinto asingle
dis imilarity matrix,
interval (0.0,1.0). bringing ,all ofthe meaningful variables ontu a comnon scale of the
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

    d(i, j) = ( sum over f = 1..p of delta_ij(f) * d_ij(f) )
              / ( sum over f = 1..p of delta_ij(f) )                             (7.15)

where the indicator delta_ij(f) = 0 if either (1) xif or xjf is missing (i.e., there is no measurement of variable f for object i or object j), or (2) xif = xjf = 0 and variable f is asymmetric binary; otherwise, delta_ij(f) = 1. The contribution of variable f to the dissimilarity between i and j, that is, d_ij(f), is computed dependent on its type:

If f is interval-based: d_ij(f) = |xif - xjf| / (max_h xhf - min_h xhf), where h runs over all nonmissing objects for variable f.

If f is binary or categorical: d_ij(f) = 0 if xif = xjf; otherwise d_ij(f) = 1.

If f is ordinal: compute the ranks rif and zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled.

If f is ratio-scaled: either perform logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute rif and zif, and then treat zif as interval-scaled.
The above steps are identical to what we have already seen for each of the individual variable types. The only difference is for interval-based variables, where here we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.
Example 7.6 Dissimilarity between variables of mixed type. Let's compute a dissimilarity matrix for the objects of Table 7.3. Now we will consider all of the variables, which are of different types. In Examples 7.3 to 7.5, we worked out the dissimilarity matrices for each of the individual variables. The procedures we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2 later when we compute Equation (7.15). First, however, we need to complete some work for test-3 (which is ratio-scaled). We have already applied a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let max_h xhf = 3.08 and min_h xhf = 1.34. We then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 - 1.34) = 1.74. This results in the following dissimilarity matrix for test-3:

    0
    0.75   0
    0.25   0.50   0
    0.25   1.00   0.50   0
We can now use the dissimilarity matrices for the three variables in our computation of Equation (7.15). The indicator delta_ij(f) = 1 for each of the three variables, f. Averaging the per-variable contributions gives, for example, d(3, 1) = (1 + 0.50 + 0.25)/3 = 0.58. The resulting dissimilarity matrix for the data described by the three variables of mixed types is:

    0
    0.92   0
    0.58   0.67   0
    0.08   1.00   0.67   0

If we go back and look at Table 7.3, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar.
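Combining the per-variable dissimilarity matrices by Equation (7.15) can be sketched as follows (every indicator delta equals 1 here; the matrix values are taken from Examples 7.3 to 7.5, and the helper name is ours):

```python
# Lower-triangular dissimilarity matrices for test-1 (categorical),
# test-2 (ordinal), and the normalized test-3 (ratio-scaled)
test1 = [[], [1.0], [1.0, 1.0], [0.0, 1.0, 1.0]]
test2 = [[], [1.0], [0.5, 0.5], [0.0, 1.0, 0.5]]
test3 = [[], [0.75], [0.25, 0.50], [0.25, 1.00, 0.50]]

def combine(*mats):
    """Equation (7.15) with every indicator delta = 1:
    average the per-variable contributions."""
    n = len(mats[0])
    return [[round(sum(m[i][j] for m in mats) / len(mats), 2)
             for j in range(i)] for i in range(n)]

d = combine(test1, test2, test3)
```

Entry d[3][0], the dissimilarity between objects 4 and 1, is the lowest off-diagonal value, matching the discussion above.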
7.2.5 Vector Objects

In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure as follows:

    s(x, y) = (x^t . y) / (||x|| ||y||)                                          (7.16)

where x^t is a transposition of vector x, ||x|| is the Euclidean norm of vector x,(1) ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y. This value is invariant to rotation and dilation, but it is not invariant to translation and general linear transformation.

(1) The Euclidean norm of vector x = (x1, x2, ..., xp) is defined as sqrt(x1^2 + x2^2 + ... + xp^2). Conceptually, it is the length of the vector.
Example 7.7 Nonmetric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and y is

    s(x, y) = (0 + 1 + 0 + 0) / (sqrt(2) * sqrt(2)) = 0.5
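Equation (7.16) for this example can be sketched as (the function name is ours):

```python
import math

def cosine(x, y):
    """Equation (7.16): cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

s = cosine((1, 1, 0, 0), (0, 1, 1, 0))  # 1 / (sqrt(2) * sqrt(2)) = 0.5
```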
A simple variation of Equation (7.16) for the above scenario is

    s(x, y) = (x^t . y) / (x^t . x + y^t . y - x^t . y)                          (7.17)

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biology taxonomy.
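On the vectors of Example 7.7, Equation (7.17) gives the share of attributes common to x and y out of those possessed by either (a sketch; the function name is ours):

```python
def tanimoto(x, y):
    """Equation (7.17): shared attributes over attributes possessed
    by either vector (the Tanimoto coefficient)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# Same vectors as Example 7.7: one shared attribute, three possessed in all
t = tanimoto((1, 1, 0, 0), (0, 1, 1, 0))  # 1 / (2 + 2 - 1) = 1/3
```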
Notice that there are many ways to select a particular similarity (or distance) function or normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will heavily depend on the given application. One should bear this in mind and refine the selection of such measures to ensure that the clusters generated are meaningful and useful for the application at hand.