Chapter 2 (Getting to know your Data)

 Data Objects and attribute types


A data set is made up of data objects, each of which represents an entity (a tuple). The characteristics or features of each data object are described by attributes. In a data set stored as a database table, the rows are data objects and the columns are attributes.
A set of attributes used to describe a given object is called an attribute vector. The distribution of data involving one attribute is called a univariate distribution, involving two attributes a bivariate distribution, and so on.

 Classification of Attributes.
Attributes are classified along two dimensions:
 Qualitative vs. quantitative
 Discrete vs. Continuous
Qualitative: Mathematical/statistical operations on values of qualitative attributes are not meaningful. For example, it makes no sense to subtract one customer_ID from another. There are three types:
 Nominal:
- Means “relating to names”.
- Values are symbols or “names of things”.
- Values do not have any meaningful order.
- Only the mode can be defined
- Example: Hair_color, marital-status, occupation, customer_ID
- It is possible to represent such names by numbers, e.g., Hair_color = {0, 1, 2, 3, 4, 5}, but the numbers carry no meaningful order.

 Binary:
- A nominal attribute with only two states (0 and 1) that have no meaningful order.
- Typically, 0 means the attribute is absent and 1 means it is present.
- Referred to as Boolean if the two states correspond to true and false.
- Only the mode can be defined
- Symmetric binary: Both outcomes equally important, that is, there
is no preference on which outcome should be coded as 0 or 1. For
example; gender having the states male and female.
- Asymmetric binary: Outcomes of states are not equally important.
For example; medical_test. We code the rarest case by 1 (HIV
positive), other by 0 (HIV negative).

 Ordinal:
- Attribute with possible values that have a meaningful order (or
ranking) among them, but magnitude between successive values is
not known.
- Mode and median can be defined, but mean cannot be defined
- Example: Size = {small, medium, large}, grades ={A+, A, A-}

Quantitative/Numeric Attributes: These attributes are ordered and measurable, represented as integer or real values. There are two types:
 Interval-scaled:
- Measured on a scale of equal-size units
- Difference, mean, median, Mode can be defined
- Example: temperature in Celsius or Fahrenheit (values can be compared and subtracted from one another, but there is no true zero point)

 Ratio-scaled
- Numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
- Example: years_of_experience, number_of_words, length, weight,
height

Discrete Attribute:
 Has only a finite or countably infinite set of values.
 Binary attributes are a special case of discrete attributes.
 Example: the attributes hair_color, smoker, medical_test, size.
 Countably infinite means the set of possible values may be infinite, but the values can still be counted one by one, i.e., put in one-to-one correspondence with the natural numbers (e.g., customer_ID).
Continuous Attribute:
 Has real numbers as attribute values.
 The terms numeric attribute and continuous attribute are often used
interchangeably in the literature.
 Example: temperature, height, or weight.

 Basic Statistical Descriptions of Data


The areas of basic statistical description of data are:
 Central tendency measurement: mean, mode, median, midrange
 Data dispersion measurement: Range, Quartiles, Interquartile Range, Five-
number summary, boxplots, outliers, Variance, Standard Deviation
 Graphical representation: quantile plots, histograms, scatter plots

 Symmetric vs. Skewed Data:


In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode all coincide at the same center value. But the data in most real applications are not symmetric.
 They may instead be positively skewed, where the mode occurs at a value smaller than the median;
 or negatively skewed, where the mode occurs at a value greater than the median.
 Quantiles:
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
 The 2-quantile is the single data point dividing the lower and upper halves of the data distribution; it corresponds to the median.
 The 4-quantiles are the three data points that split the data distribution into four equal parts, each representing one-fourth of the data. They are more commonly referred to as quartiles.
 The 1st quartile (Q1) is the 25th percentile; it cuts off the lowest 25% of the data. The 3rd quartile (Q3) is the 75th percentile. The 2nd quartile (Q2) is the 50th percentile, which is also the median.
 The 100-quantiles are more commonly referred to as percentiles.

 Example:
Consider the following data in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110. Find the mean, weighted mean, median, mode, and midrange.
Solution:
a. Mean: The most common and effective numeric measure of the “center” of data is the mean:

x̄ = (x1 + x2 + … + xn)/n
= (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110)/12
= 696/12
= 58

b. Weighted Mean: If each value xi of a set is associated with a weight wi, then the weighted mean is

x̄ = (w1x1 + w2x2 + … + wnxn)/(w1 + w2 + … + wn)
“Problem and solution: As the mean is sensitive to extreme (outlier) values, we sometimes use the trimmed mean, which drops the top and bottom t% of the data before averaging.”
c. Median: For skewed (asymmetric) data, a better measure of the “center” of the data is the median:
= the middle value if there is an odd number of values, otherwise the average of the two middle values
= [6th value + 7th value]/2
= [52 + 56]/2
= 54
“Problem and solution: It is expensive to compute the exact median when the number of observations is large. A solution is to calculate an approximate median by interpolation (for grouped data).”

d. Mode: The value that occurs most frequently in the data set is called the mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. For the example above, the mode
= 52 and 70 (bimodal)

e. Midrange: The average of the largest and smallest values in the data set. The midrange here is:
= [smallest + largest]/2
= [30 + 110]/2
= 70
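The measures above can be checked with a short script; this is a minimal sketch using Python's standard statistics module (statistics.multimode requires Python 3.8+).

```python
from statistics import mean, median, multimode

# Data from the example above (already in increasing order)
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))                    # mean: 58
print(median(data))                  # median: 54.0 (average of 52 and 56)
print(multimode(data))               # modes: [52, 70] -> bimodal
print((min(data) + max(data)) / 2)   # midrange: 70.0
```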

 Example:
Consider the following data in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110. Find the range, interquartile range, median, and five-number summary. Also show the boxplot. Are there any outliers? Also find the variance and standard deviation.
Solution:
a. Range: The range of the set is the difference between the largest and smallest values.
This is;
= 110 – 30
= 80

b. Interquartile range: The distance between the first and third quartiles is called the interquartile range, defined as
IQR = Q3 - Q1
Here, the sorted data, with the three quartile points splitting it into four groups, is:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
So, Q1 = 47
Q3 = 63
IQR = 63 – 47 = 16

c. Median: The median is
= the 2nd quartile
= Q2
= 52
(the value at position n/2 = 6; averaging the 6th and 7th values, as in the previous example, gives 54)

d. Five-number summary: The five-number summary of a distribution consists of the minimum, Q1, the median, Q3, and the maximum:
Min = 30
Q1 = 47
Median = Q2 = 52
Q3 = 63
Max = 110

e. Boxplot: A popular way of visualizing a data distribution is the boxplot, which maps the five-number summary: the ends of the box are at the first and third quartiles (so the height of the box is the IQR), and the median is marked by a line within the box.

f. Outliers: Extremely low or high observations are flagged as outliers only if
value < Q1 - (1.5 * IQR), or
value > Q3 + (1.5 * IQR)
Here the fences are 47 - 24 = 23 and 63 + 24 = 87, so 110 is an outlier.
g. Variance and standard deviation: The variance of the data is

σ² = (1/n) Σ (xi - x̄)² = 4550/12 ≈ 379.17

and the standard deviation is σ = √379.17 ≈ 19.47.
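The dispersion measures can be sketched the same way. Note that the quartile convention below (taking the values at positions n/4, n/2, and 3n/4 of the sorted list) matches the worked example; libraries such as NumPy interpolate and may return slightly different quartiles.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(data)  # data must already be sorted

# Quartiles via the simple position convention used in the example
q1, q2, q3 = data[n // 4 - 1], data[n // 2 - 1], data[3 * n // 4 - 1]
iqr = q3 - q1                                   # 63 - 47 = 16

# Outlier fences at 1.5 * IQR beyond the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # 23.0 and 87.0
outliers = [x for x in data if x < low or x > high]   # [110]

var = statistics.pvariance(data)   # population variance, ~379.17
std = statistics.pstdev(data)      # ~19.47
```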

 Example:
Find the approximate median by interpolation for the grouped data shown below.

Solution: Here, the sum of the frequencies is N = 3194. First compute the class midpoints:

Age      Frequency (F)   Midpoint (M)    F*M
1-5      200             (1+5)/2 = 3     600
6-15     450             10.5            4725
16-20    300             18              5400
21-50    1500            35.5            53250
51-80    700             65.5            45850
81-110   44              95.5            4202

Now, the sum of (F*M) = 114027, so the weighted mean = 114027/3194 = 35.7.
To locate the median class, compare N/2 = 3194/2 = 1597 with the running totals of the frequencies: 200, 650, 950, 2450, ... The running total first exceeds 1597 in the class 21-50, so that is the median class. (The weighted mean 35.7 also falls in this range.)
So, L1 (lower boundary of the median class) = 21
width = 30 [the class 21-50 spans 30 values]
sum of frequencies below the median class = 200 + 450 + 300 = 950
frequency of the median class = 1500

Finally, median = L1 + [(N/2 - sum below)/f_median] * width
= 21 + [(3194/2 - 950)/1500] * 30 = 33.94
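The interpolation above generalizes to any grouped table; here is a small sketch. The class bounds and widths are taken from the example table, and the loop finds the class where the cumulative frequency first reaches N/2.

```python
# (lower bound, class width, frequency) for each class in the table
classes = [
    (1, 5, 200), (6, 10, 450), (16, 5, 300),
    (21, 30, 1500), (51, 30, 700), (81, 30, 44),
]

N = sum(f for _, _, f in classes)   # 3194
half = N / 2                        # 1597.0
cum = 0                             # cumulative frequency below current class
for low, width, f in classes:
    if cum + f >= half:             # this is the median class
        median = low + (half - cum) / f * width
        break
    cum += f

print(round(median, 2))             # 33.94
```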


 Data Matrix versus Dissimilarity Matrix:
There are two data structures commonly used in data mining applications.
 Data matrix
 Dissimilarity Matrix

The data matrix is an object-by-attribute structure. Suppose that we have n objects described by p attributes. This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).

The dissimilarity matrix is an object-by-object structure. The entry d(i, j) is the measured dissimilarity or difference between objects i and j (in the table shown: Euclidean distance). In general, d(i, j) is a non-negative number, d(i, i) = 0, and d(i, j) = d(j, i), so the matrix is symmetric and is usually stored as a triangular matrix.

Similarity: sim(i, j) = 1 - d(i, j)

 Proximity Measure for Attributes:


Example 1: Consider the following nominal data set. Show which objects are dissimilar
or similar.
Solution: Dissimilarity is computed as the ratio of mismatches,

d(i, j) = (p - m) / p

where m is the number of matches and p is the total number of attributes describing the objects.

Since we have one attribute, p = 1. The dissimilarity between the objects is:


d(1,2): m = 0, p = 1 => d = 1
d(1,3): m = 0, p = 1 => d = 1
d(1,4): m = 1, p = 1 => d = 0
d(2,3): m = 0, p = 1 => d = 1
d(2,4): m = 0, p = 1 => d = 1
d(3,4): m = 0, p = 1 => d = 1
From this we see that objects 1 and 4 are similar (d = 0); every other pair is dissimilar.
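A minimal sketch of the mismatch-ratio computation; since the original attribute values did not survive extraction, the single-attribute objects below are hypothetical but reproduce the same match pattern (only objects 1 and 4 agree).

```python
def nominal_dissim(obj_i, obj_j):
    """Ratio of mismatches: d(i, j) = (p - m) / p."""
    p = len(obj_i)                                  # total attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matches
    return (p - m) / p

# Hypothetical objects with one nominal attribute each
objs = [("A",), ("B",), ("C",), ("A",)]
print(nominal_dissim(objs[0], objs[3]))   # 0.0 -> objects 1 and 4 similar
print(nominal_dissim(objs[0], objs[1]))   # 1.0 -> dissimilar
```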

Example 2: Consider the following nominal attributes. Show which objects are dissimilar
or similar.

Solution - 1: The dissimilarity between the objects (here p = 2):


d(1,2): m = 0, p = 2 => d = 1
d(1,3): m = 1, p = 2 => d = 0.5
d(1,4): m = 1, p = 2 => d = 0.5
... … … … …

Solution - 2: Treat the nominal attributes as asymmetric binary attributes by creating a new binary attribute for each of the M nominal states (for an object with a given state value, the binary attribute representing that state is set to 1, while the rest are set to 0).

For the map_color attribute, let the value Yellow be 1 and the others (Red, Green) be 0, and for the code attribute, let the value Code A be 1 and the others (Code B, Code C) be 0.
Now, the asymmetric binary dissimilarity is

d(i, j) = (r + s) / (q + r + s)

and the distance measure based on symmetric binary attributes is

d(i, j) = (r + s) / (q + r + s + t)

where q is the number of attributes equal to 1 for both objects, r the number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1 for j, and t the number equal to 0 for both.

So, using the asymmetric binary dissimilarity, we get:

d(1,2): q = 0, r = 0, s = 2
d(1,2) = (0 + 2)/(0 + 0 + 2) = 1
d(1,3): q = 1, r = 0, s = 1
d(1,3) = (0 + 1)/(1 + 0 + 1) = 0.5
… … … …

Example:
Find which objects are similar of the following dataset (use only asymmetric attribute).
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

Solution: Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be 1 and the value N be 0, and base the dissimilarity measure only on the asymmetric attributes.
Here, for Jack and Mary,
q = 1 + 1 = 2, r = 0, s = 1 (t = 1 + 1 + 1 = 3, but t is ignored for asymmetric attributes)
d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 1/3 = 0.33
d(Jack, Jim) = 0.67
d(Mary, Jim) = 0.75

Mary and Jim are the most dissimilar, so they are the least likely to have the same disease; Jack and Mary are the most similar, so they are the most likely to have the same disease.
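The asymmetric binary computation can be sketched as follows, encoding Y and P as 1 and N as 0 and dropping the symmetric gender attribute:

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity: d = (r + s) / (q + r + s); t is ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))   # both 1
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))   # 1 in x, 0 in y
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))   # 0 in x, 1 in y
    return (r + s) / (q + r + s)

# (fever, cough, test-1, test-2, test-3, test-4), Y/P -> 1, N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_dissim(jack, mary), 2))   # 0.33
print(round(asym_binary_dissim(jack, jim), 2))    # 0.67
print(asym_binary_dissim(mary, jim))              # 0.75
```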
Example: Given the dataset in the following table, find the dissimilarity among the data objects.

Solution: There are 3 states for the test-2 attribute, i.e., Mf = 3.

Step 1: Replace each value of test-2 by its rank. The states in their meaningful order are
fair - good - excellent
with ranks 1 - 2 - 3.
So the ranks for the four objects are 3, 1, 2, 3 respectively.

Step 2: Normalize each rank r onto [0.0, 1.0] by mapping it with the equation

z = (r - 1) / (Mf - 1)

For r = 1 => (1 - 1)/(3 - 1) = 0
For r = 2 => (2 - 1)/(3 - 1) = 0.5
For r = 3 => (3 - 1)/(3 - 1) = 1

Finally, the four objects map to 1, 0, 0.5, 1 respectively.

Step 3: Using the Euclidean distance, the entries of the dissimilarity matrix are

d(1,1) = sqrt[(1 - 1)^2] = 0
d(1,2) = sqrt[(1 - 0)^2] = 1
d(1,3) = sqrt[(1 - 0.5)^2] = 0.5
d(1,4) = sqrt[(1 - 1)^2] = 0
d(2,3) = sqrt[(0 - 0.5)^2] = 0.5
d(2,4) = sqrt[(0 - 1)^2] = 1
d(3,4) = sqrt[(0.5 - 1)^2] = 0.5

Therefore objects 1 and 2 are most dissimilar, as are objects 2 and 4.
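Steps 1-3 above can be sketched as:

```python
# Map ordinal values to ranks, normalize to [0, 1], then measure distance.
order = ["fair", "good", "excellent"]            # meaningful order of states
rank = {v: i + 1 for i, v in enumerate(order)}   # fair=1, good=2, excellent=3
Mf = len(order)                                  # number of states, 3

values = ["excellent", "fair", "good", "excellent"]   # the four objects
z = [(rank[v] - 1) / (Mf - 1) for v in values]        # [1.0, 0.0, 0.5, 1.0]

# Euclidean distance in one dimension is just the absolute difference
d_12 = abs(z[0] - z[1])   # 1.0
d_13 = abs(z[0] - z[2])   # 0.5
```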


 Distance on Numeric Data: Minkowski Distance:
Minkowski distance: a popular distance measure,

d(i, j) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + … + |xip - xjp|^h )^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects and h is the order, h ≥ 1 (the distance so defined is also called the Lh norm). The properties of the Minkowski distance are:
 Non-negativity: d(i, j) > 0 if i ≠ j
 Identity of indiscernibles: d(i, i) = 0
 Symmetry: d(i, j) = d(j, i)
 Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is known as a metric.

There are some special cases of the Minkowski distance (shown in the figure):

 Manhattan (city block, L1 norm) distance: h = 1.
 Euclidean (L2 norm) distance: h = 2.
 Supremum (Lmax, or L-infinity norm) distance: h -> infinity; it takes the maximum difference over all attributes.

There is another case of the Minkowski distance: the weighted Minkowski distance, in which each attribute is assigned a weight according to its importance.


 Example: Find the dissimilarity matrices for the following points.

point attribute 1 attribute 2


x1 1 2
x2 3 5
x3 2 0
x4 4 5

Solution: Here, the attributes are numeric. The dissimilarity matrix using the Manhattan distance is:

L1 x1 x2 x3 x4
x1 0
x2 5 0
x3 3 6 0
x4 6 1 7 0

The dissimilarity matrix using the Euclidean distance is:

L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

The dissimilarity matrix using the supremum distance is:

L∞ x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
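A sketch of the three distances for the points above:

```python
def minkowski(a, b, h):
    """Minkowski (Lh) distance between two numeric vectors."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    """Supremum (L-infinity) distance: the largest per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

print(minkowski(x1, x2, 1))             # Manhattan: 5.0
print(round(minkowski(x1, x2, 2), 2))   # Euclidean: 3.61
print(supremum(x1, x2))                 # supremum: 3
```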
 Example: Consider the following, Find which objects are most similar and dissimilar.

Solution: we get,
 Dissimilarity matrix for test-1 attribute:

 Dissimilarity matrix for test-2 attribute:

 Dissimilarity matrix for the test-3 attribute: let maxh xh = 64 and minh xh = 22, and normalize each difference by (maxh xh - minh xh) = 42. Using this,
 Finally, the combined dissimilarity matrix is found from the average of the per-attribute dissimilarity matrices;
We see, objects 1 and 4 are the most similar, and objects 1 and 2 are the least similar.
 Term-frequency vector:
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Each document is thus represented by what is called a term-frequency vector. Term-frequency vectors are typically long and sparse (they have many 0 values).

Traditional distance measures do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this alone does not make them similar. So, we need a measure that ignores zero-matches: cosine similarity,

sim(x, y) = cos(x, y) = (x · y) / (||x|| ||y||)

Cosine similarity does not obey all the properties of a metric, so it is a nonmetric measure.

 Example: Find the cosine similarity between documents 1 and 2.

Solution: Here,

x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
xy = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||x|| =
= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||y||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
Sim(x,y)
= cos(x, y)
=

= 0.94.
So, the documents are quite similar.
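The cosine computation for the two term-frequency vectors, as a sketch:

```python
import math

def cosine_sim(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||); zero-zero matches contribute nothing."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)   # term frequencies, document 1
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)   # term frequencies, document 2

print(round(cosine_sim(x, y), 2))    # 0.94
```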
 Example:
Find the similarity among the objects using Tanimoto distance for following dataset:

Solution: The Tanimoto coefficient is a variant of cosine similarity:

sim(x, y) = (x · y) / (x · x + y · y - x · y)
sim(Jack, Mary) = 5/(6 + 6 - 5) = 0.71
sim(Jack, Jim) = 4/(6 + 6 - 4) = 0.50
sim(Mary, Jim) = 3/(6 + 6 - 3) = 0.33
Jack and Mary are the most similar, so they are the most likely to have the same disease.
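A sketch of the Tanimoto coefficient; the original data table did not survive extraction, so the binary vectors below are hypothetical (for 0/1 vectors the Tanimoto coefficient reduces to the Jaccard coefficient).

```python
def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return dot / (xx + yy - dot)

# Hypothetical 0/1 vectors
a = (1, 1, 1, 0, 0)
b = (1, 1, 0, 1, 0)
print(tanimoto(a, b))   # 2 / (3 + 3 - 2) = 0.5
```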
