Chapter 2
Classification of Attributes
Attributes are classified along two dimensions:
Qualitative vs. quantitative
Discrete vs. Continuous
Qualitative: Mathematical and statistical operations on the values of qualitative attributes are not
meaningful. For example, it makes no sense to subtract one customer_ID from another.
There are three types of qualitative attribute:
Nominal:
- Means “relating to names”.
- Values are symbols or “names of things”.
- Values do not have any meaningful order.
- Only the mode can be defined
- Example: Hair_color, marital-status, occupation, customer_ID
- Names can be represented by numbers, e.g. Hair_color = {1, 0, 3, 2, 4, 5}, but such numbers are only codes and are not meant to be used quantitatively.
Binary:
- Nominal attribute with only 2 states (0 and 1) having no meaningful
order.
- Typically 0 means attribute is absent and 1 means present.
- Referred to as Boolean if two states correspond to true and false.
- Only the mode can be defined
- Symmetric binary: Both outcomes are equally important, that is, there
is no preference on which outcome should be coded as 0 or 1. For
example, gender with the states male and female.
- Asymmetric binary: The outcomes are not equally important.
For example, medical_test. By convention, we code the rarer outcome as 1 (HIV
positive) and the other as 0 (HIV negative).
Ordinal:
- Attribute with possible values that have a meaningful order (or
ranking) among them, but magnitude between successive values is
not known.
- Mode and median can be defined, but mean cannot be defined
- Example: Size = {small, medium, large}, grades ={A+, A, A-}
Quantitative (numeric):
Ratio-scaled:
- Numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
- Example: years_of_experience, number_of_words, length, weight,
height
Discrete Attribute:
Has a finite or countably infinite set of values.
Binary attributes are a special case of discrete attributes.
Example: the attributes hair_color, smoker, medical_test, size.
Countably infinite means that the set of possible values is unbounded but can still be put in
one-to-one correspondence with the natural numbers, for example customer_ID.
Continuous Attribute:
Has real numbers as attribute values.
The terms numeric attribute and continuous attribute are often used
interchangeably in the literature.
Example: temperature, height, or weight.
Example:
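Consider the following data, shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, and 110. Find the mean, weighted mean, median, mode, and midrange.
Solution:
a. Mean: The most common and effective measure of the “center” of a data set is the mean. With N = 12 values,
x’ = (30 + 36 + … + 110)/12 = 696/12 = 58
b. Weighted Mean: If each value xi is associated with a weight wi, then the
weighted mean is
x’ = (w1x1 + w2x2 + … + wNxN) / (w1 + w2 + … + wN)
c. Median: The middle value of the ordered data. Here N = 12 is even, so the median is the average of the two middle values:
= [52 + 56]/2
= 54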
d. Mode: The value that occurs most frequently in the data set is called the mode. Data sets with
one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. For the example
above, the modes are
= 52 and 70 (bimodal)
e. Midrange: The average of the largest and smallest values in the data set. The midrange
of the data set is
= [smallest + largest]/2
= [30 + 110]/2
= 70
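The measures of central tendency above can be checked with a short Python sketch. It uses only the standard library `statistics` module; the variable names are illustrative and not part of the original notes.

```python
from statistics import mean, median, multimode

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

data_mean = mean(data)                        # sum of values divided by N
data_median = median(data)                    # average of the two middle values, since N is even
data_modes = multimode(data)                  # every value tied for the highest frequency
data_midrange = (min(data) + max(data)) / 2   # average of smallest and largest

print("mean:", data_mean)
print("median:", data_median)
print("modes:", data_modes)
print("midrange:", data_midrange)
```

`multimode` is used instead of `mode` because this data set is bimodal and both modes should be reported.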
Example:
Consider the following data, shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, and 110. Find the range, interquartile range, median, and five-number summary. Also
show the boxplot. Are there any outliers?
Also find the variance and standard deviation.
Solution:
a. Range: The range of the set is the difference between the largest and smallest values.
This is
= 110 – 30
= 80
b. Interquartile range: The distance between the first and third quartiles is called the
interquartile range and is defined as
IQR = Q3 – Q1
Here, the three quartile points split the sorted data into four equal parts:
30, 36, 47 | 50, 52, 52 | 56, 60, 63 | 70, 70, 110
Taking the third and ninth values as the first and third quartiles,
Q1 = 47
Q3 = 63
IQR = 63 – 47 = 16
e. Boxplot: A popular way of visualizing a data distribution is the boxplot, which maps the
five-number summary. The ends of the box are at the first and third quartiles, i.e., the length
of the box is the IQR. The median is marked by a line within the box.
f. Outliers: Extreme low or high observations are called outliers if they lie more than 1.5 × IQR beyond the quartiles:
value < Q1 - (1.5 * IQR) = 47 - 24 = 23
value > Q3 + (1.5 * IQR) = 63 + 24 = 87
Here 110 > 87, so 110 is an outlier.
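The five-number summary and the outlier rule can be sketched in Python as follows. The quartile convention here is an assumption chosen to match the worked values (Q1 is the N/4-th sorted value and Q3 the 3N/4-th); statistical libraries often interpolate between values and may report slightly different quartiles.

```python
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
n = len(data)

# Assumption: quartiles are read directly off the sorted list
# (3rd and 9th of 12 values), with no interpolation.
q1 = data[n // 4 - 1]
q3 = data[3 * n // 4 - 1]

mid = n // 2
median = (data[mid - 1] + data[mid]) / 2 if n % 2 == 0 else data[mid]

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # values below this are outliers
high_fence = q3 + 1.5 * iqr   # values above this are outliers
outliers = [x for x in data if x < low_fence or x > high_fence]

five_number = (data[0], q1, median, q3, data[-1])
print("five-number summary:", five_number)
print("IQR:", iqr, "fences:", (low_fence, high_fence))
print("outliers:", outliers)
```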
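g. Variance and standard deviation: With mean x’ = 58 and N = 12, the variance of the data is
σ² = [(30 – 58)² + (36 – 58)² + … + (110 – 58)²]/12 = 4550/12
= 379.17
And the standard deviation is σ = √379.17 = 19.47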
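As a quick check, the population variance and standard deviation (dividing by N rather than N − 1) can be computed with the standard library. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import pvariance, pstdev

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

variance = pvariance(data)   # population variance: mean squared deviation from the mean
std_dev = pstdev(data)       # population standard deviation: square root of the variance

print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

The sample versions (`variance` and `stdev` in the same module) divide by N − 1 and would give larger values.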
Example:
Find the approximate median by interpolation for the grouped data shown below.
Solution: Here, the sum of the frequencies is N = 3194, so the midpoint frequency is N/2 = 1597.
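The grouped-data table is not reproduced here, so the following is only a sketch of median interpolation under the usual assumption: locate the class in which the cumulative frequency first reaches N/2, then interpolate linearly inside it. The function name and the sample classes are hypothetical and are not the data from the exercise.

```python
def grouped_median(classes):
    """Approximate the median of grouped data by linear interpolation.

    classes: list of (lower_bound, upper_bound, frequency), sorted by lower_bound.
    """
    total = sum(freq for _, _, freq in classes)
    if total == 0:
        raise ValueError("classes must contain at least one observation")
    half = total / 2
    cumulative = 0  # total frequency of all classes below the current one
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            # Median class found: interpolate within its width.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq


# Hypothetical grouped data, for illustration only.
example_classes = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
approx_median = grouped_median(example_classes)
print(approx_median)
```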
Data matrix: This refers to the object-by-attribute structure. Suppose that we have n objects
described by p attributes. The structure stores the n data objects in the form of a
relational table, or n-by-p matrix (n objects × p attributes).
Example 2: Consider the following nominal attributes. Show which objects are similar
and which are dissimilar.
Solution – 2: Encode the nominal attributes as asymmetric binary attributes by creating a new binary attribute for
each of the M nominal states (for an object with a given state value, the binary attribute
representing that state is set to 1, while the remaining ones are set to 0).
For the map_color attribute, let the value Yellow be 1 and the others (Red, Green) be 0, and for
the code attribute, let the value Code A be 1 and the others (Code B, Code C) be 0.
Now, the formula for asymmetric binary dissimilarity is
d(i, j) = (r + s) / (q + r + s)
where q is the number of attributes that equal 1 for both objects i and j, r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both. The negative matches t are ignored.
Example:
Find which objects are similar in the following dataset (use only the asymmetric attributes).
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N
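Coding Y and P as 1 and N as 0, and ignoring the symmetric attribute Gender, the asymmetric binary dissimilarities are
d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75
Mary and Jim are the most dissimilar, so they are unlikely to have a similar disease.
Jack and Mary are the most similar, so they are the most likely to have a similar
disease.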
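The comparison of the three patients can be reproduced with a small Python sketch of the asymmetric binary dissimilarity d = (r + s)/(q + r + s). The encoding (Y and P as 1, N as 0) and the choice to leave out Gender as a symmetric attribute are assumptions stated here; the function and variable names are illustrative.

```python
from itertools import combinations

# Fever, Cough, Test-1 .. Test-4. Y (yes) and P (positive) are both coded as 1.
records = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "Y", "N", "N", "N", "N"],
}
encoded = {name: [1 if v in ("Y", "P") else 0 for v in values]
           for name, values in records.items()}


def asymmetric_binary_dissimilarity(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1 in both objects
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    if q + r + s == 0:
        return 0.0  # assumption: objects with no positive values count as identical
    return (r + s) / (q + r + s)  # negative matches t are ignored


dissimilarity = {
    (i, j): asymmetric_binary_dissimilarity(encoded[i], encoded[j])
    for i, j in combinations(encoded, 2)
}
for pair, d in dissimilarity.items():
    print(pair, round(d, 2))
```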
Example: Given the dataset in the following table, find the dissimilarity among the data objects.
Step 1: Replace each value of test-2 by its rank. The states are ordered and ranked as
fair – good – excellent
1 - 2 - 3
So, the ranks of the four objects are 3, 1, 2, and 3, respectively.
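The remaining steps are not shown here, so the sketch below assumes the usual treatment of ordinal attributes: map each rank r onto [0, 1] with z = (r − 1)/(M − 1), where M is the number of ordered states, and then take the absolute difference between objects. The variable names are illustrative.

```python
ranks = [3, 1, 2, 3]   # ranks of the four objects from Step 1
num_states = 3         # M: fair, good, excellent

# Assumed Step 2: normalize each rank onto the interval [0, 1].
normalized = [(r - 1) / (num_states - 1) for r in ranks]

# Assumed Step 3: dissimilarity is the distance between normalized ranks.
n = len(normalized)
dissimilarity = [[abs(normalized[i] - normalized[j]) for j in range(n)]
                 for i in range(n)]

print(normalized)
for row in dissimilarity:
    print(row)
```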
The Minkowski distance between two objects is
d(i, j) = (|xi1 – xj1|^h + |xi2 – xj2|^h + … + |xip – xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and
h is the order, with h ≥ 1 (the distance so defined is also called the Lh norm). The properties
of the Minkowski distance are:
Non-negativity: d(i, j) > 0, if i ≠ j
Identity of indiscernibles: d(i, i) = 0
Symmetry: d(i, j) = d(j, i)
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is known as a metric.
Note: The supremum distance (h → ∞) always takes the maximum of the per-attribute differences, e.g. max(|dx|, |dy|) for two attributes.
Solution: Here, the attributes are numeric, so the dissimilarity matrices using the Manhattan (L1), Euclidean (L2), and supremum (L∞) distances are:
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
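The three matrices can be reproduced with one Minkowski function. The point coordinates below are an assumption: the original data table is not reproduced here, and these values were chosen only because they are consistent with all three matrices. The function name is illustrative.

```python
import math
from itertools import combinations


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# Assumed coordinates, consistent with the matrices above.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for label, h in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for i, j in combinations(points, 2):
        print(f"  d({i}, {j}) = {minkowski(points[i], points[j], h):.2f}")
```

Passing `math.inf` as the order is handled as a special case because the general formula cannot be evaluated numerically at h = ∞.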
Example: Consider the following data. Find which objects are the most similar and the most dissimilar.
Solution: We get,
Dissimilarity matrix for the test-1 attribute:
Dissimilarity matrix for the test-3 attribute: let max_h x_h = 64 and min_h x_h = 22. Using the normalized difference
d(i, j) = |xi – xj| / (max_h x_h – min_h x_h) = |xi – xj| / 42
We see that objects 1 and 4 are the most similar, and objects 1 and 2 are the least similar.
Term-frequency vector:
A document can be represented by thousands of attributes, each recording the frequency
of a particular word (such as a keyword) or phrase in the document. Thus each document
is represented by what is called a term-frequency vector. Term-frequency vectors are
typically long and sparse (they have many 0 values).
The traditional distance measures do not work well for such sparse numeric data. For example,
two term-frequency vectors may have many 0 values in common, meaning that the
corresponding documents do not share many words, but this does not make them similar.
So, we need a measure that ignores zero-matches, such as cosine similarity:
sim(x, y) = (x · y) / (||x|| ||y||)
Cosine similarity does not obey all the properties of a metric, so it is a nonmetric measure.
Solution: Here,
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
x · y = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||x|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||y|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
sim(x, y)
= cos(x, y)
= 25 / (6.481 × 4.12)
= 0.94
So, the documents are quite similar.
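The same calculation can be sketched in Python. The function name is illustrative, and raising an error for an all-zero vector is a design choice made here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))       # x · y
    norm_x = math.sqrt(sum(a * a for a in x))    # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))    # ||y||
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(x, y)
print(round(similarity, 2))
```

Positions where both vectors are 0 contribute nothing to the dot product or to either norm, which is why shared absences do not inflate the similarity.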
Example:
Find the similarity among the objects using the Tanimoto coefficient for the following dataset: