Week 2 - 3getting To Know Your Data

CIS 517 :Data Mining and
Warehousing
Week2 & 3: Getting to Know Your

Data
(Chapter 2)
Instructor : Dr. Gomathi Krishna

Email : gkrishna@iau.edu.sa 1
Lecture Outline
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
2
What is Data Sets?
 Data sets are made up of data Attributes
objects
 A data object represent an Tid Refund Marital Taxable
entity. Status Income Cheat
 An attribute is a property or 1 Yes Single 125K No

characteristic of an object 2 No Married 100K No
 Examples: eye color of a 3 No Single 70K No
person, temperature, etc. 4 Yes Married 120K No
 Attribute is also known as 5 No Divorced 95K Yes
dimensions , feature , or Objects
6 No Married 60K No
variable. 7 Yes Divorced 220K No
 A collection of attributes 8 No Single 85K Yes
describe an object 9 No Married 75K No
 Object is also known as 10 No Single 90K Yes
samples , examples , data 10
points ,or instance

Attributes
 Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled
4
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades
5
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in C˚or F˚, calendar dates
 Ratio
 e.g. speed , counts( years_of_experience).
6
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a
collection of documents
 Sometimes, represented as integer variables
 Note: Binary attributes are a special case of discrete
attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and
represented using a finite number of digits

 Continuous attributes are typically represented as
floating-point variables
7
Types of data sets
 Record
 Data Matrix
 Document Data
 Transaction Data
 Graph
 World Wide Web
 Molecular Structures
 Ordered
 Spatial Data
 Temporal Data
 Sequential Data
 Genetic Sequence Data
Record Data
 Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid Refund Marital Taxable
Status Income Cheat
1 Yes Single 125K No

2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Data Matrix
 If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
 Such data set can be represented by an m by n matrix,

where there are m rows, one for each object, and n
columns, one for each attribute
Projection Projection Distance Load Thickness
of x Load of y load
10.23 5.27 15.22 2.7 1.2

12.65 6.25 16.22 2.2 1.1
Document Data
 Each document becomes a `term' vector,
 each term is a component (attribute) of the
vector,
 the value of each component is the number of
times the corresponding term occurs in the

document.
timeout
season
coach
game
score
team
ball
lost
pla
wi
n
y
Document 1 3 0 5 0 2 6 0 2 0 2
Document 2 0 7 0 2 1 0 0 3 0 0
Document 3 0 1 0 0 1 2 2 0 3 0
Transaction Data
 A special type of record data, where
 each record (transaction) involves a set of
items.
 For example, consider a grocery store. The set
of products purchased by a customer during

one shopping trip constitute a transaction,
while the individual products that were
purchased are
TID
the items.
Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
 Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
2 <a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
5 1 <a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
2 <a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
5
Ordered Data
 Sequences of transactions
Items/Events
An element of
the sequence
Data Quality
 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?
 Examples of data quality problems:

 Noise and outliers
 missing values
 duplicate data
Noise
 Noise refers to modification of original values
 Examples: distortion of a person’s voice when
talking on a poor phone and “snow” on

television screen
Two Sine Waves Two Sine Waves + Noise

Outliers
 Outliers are data objects with characteristics that
are considerably different than most of the other
data objects in the data set
Missing Values
 Reasons for missing values
 Information is not collected
(e.g., people decline to give their age and

weight)
 Attributes may not be applicable to all cases
(e.g., annual income is not applicable to

children)
 Handling missing values

 Eliminate Data Objects
 Estimate Missing Values
 Ignore the Missing Value During Analysis

Duplicate Data
 Data set may include data objects that are
duplicates, or almost duplicates of one another
 Examples:
 Same person with multiple email addresses
 Data cleaning
 Process of dealing with duplicate data issues
Basic Statistical Descriptions of Data
 Basic statistical descriptions :
can be used to identify properties of the data and
highlight which data values should be treated as noise or
outliers.
 Measures of central tendency :
which measure the location of the middle or center of a

data distribution.
 Dispersion of the data:
 How are the data spread out?
The most common data dispersion measures are the
range, quartiles, the five-number summary and boxplots;

and the variance and standard deviation of the data.
20
Basic Statistical Descriptions of Data
 Use many graphic displays of basic statistical
descriptions to visually inspect our data .
 Most statistical or graphical data presentation software
packages include bar charts, pie charts, and line graphs.
 Other popular displays of data summaries and
distributions include quantile plots, quantile–quantile
plots, histograms, and scatter plots.
21
Measuring the Central Tendency
 Mean :
Average value.
 Median:
Middle value.
 Mode:
Most common value .
 Midrange :
The average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate functions, max() and
min().
22
Symmetric vs. Skewed Data
 Median, mean and mode of
symmetric, positively and negatively
skewed data
positively skewed negatively skewed
5/5/23 Data Mining: Concepts and Techniques 23

Symmetric vs. Skewed Data
 Median, mean and mode of
symmetric, positively and negatively
skewed data
symmetric

Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
25
Measuring the Dispersion of Data
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers,
and plot outliers individual
 Variance: (algebraic, scalable computation)
 Standard deviation s (or σ) is the square root of variance s2 (or σ2)

 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
26
Boxplot Analysis
27
Boxplot Analysis
28
Visualization of Data Dispersion: 3-D Boxplots

Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary

 Histogram
 Quantile plot
 Quantile-quantile (q-q) plot
 Scatter plot
30
Histogram Analysis
31
Histograms Often Tell More than Boxplots
 The two histograms

shown in the left may
have the same boxplot
representation
 The same values for:
min, Q1, median, Q3,
max
 But they have rather
different data
distributions
32
Quantile Plot
 Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Data Mining: Concepts and Techniques 33

Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there is a shift in going from one distribution to
another?
 Example shows unit price of items sold at Branch 1 vs. Branch
2 for each quantile. Unit prices of items sold at Branch 1 tend
to be lower than those at Branch 2.
34
Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
35
Positively and Negatively Correlated Data
 The left half fragment is positively

correlated
 The right half is negative correlated
36
Uncorrelated Data
37
Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto
graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships
among data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:

 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
38
Geometric Projection Visualization Techniques
 Visualization of geometric transformations and projections

of the data
 Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: Help users find meaningful
projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates
39
Ribbons with Twists Based on Vorticity
Direct Data Visualization

Scatterplot Matrices
Used by ermission of M. Ward, Worcester Polytechnic Institute
Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots]
41
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
news articles
visualized as
a landscape
 Visualization of the data as perspective landscape

 The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the data
42
Parallel Coordinates
 n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
 The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
 Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
43
Parallel Coordinates of a Data Set
44
Icon-Based Visualization Techniques
 Visualization of the data values as features of icons

 Typicvisualization methods
 al Chernoff Faces
 Stick Figures
 General techniques
 Shape coding: Use shape to represent certain
information encoding
 Color icons: Use color icons to encode more
information
 Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
45
Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics--head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening): Each assigned one of 10
possible values, generated using Mathematica (S. Dickson)
 REFERENCE: Gonick, L. and Smith, W.

The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
 Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
46
Stick Figure
A census data
figure showing
age, income,
used by permission of G. Grinstein, University of Massachusettes at Lowell
gender,
education, etc.
A 5-piece stick
figure (1 body
and 4 limbs w.
different
angle/length)

Hierarchical Visualization Techniques
 Visualization of the data using a hierarchical

partitioning into subspaces
 Methods
 Dimensional Stacking
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube
48
Dimensional Stacking
 Partitioning of the n-dimensional attribute space in 2-D

subspaces, which are ‘stacked’ into each other
 Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
 Adequate for data with ordinal attributes of low cardinality
 But, difficult to display more than nine dimensions
 Important to map dimensions appropriately
49
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
50
Worlds-within-Worlds
 Assign the function and two most important parameters to innermost
world
 Fix all other parameters at constant values - draw other (1 or 2 or 3
dimensional worlds choosing these as the axes)
 Software that uses this paradigm
 N–vision: Dynamic
interaction through data
glove and stereo displays,
including rotation, scaling
(inner) and translation
(inner/outer)
 Auto Visual: Static
interaction by means of
queries
51
Tree-Map
 Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute
values
 The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
Schneiderman@UMD: Tree-Map of a File System Schneiderman@UMD: Tree-Map to support

large data sets of a million items
52
InfoCube
 A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
 The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
53
Three-D Cone Trees
 3D cone tree visualization technique works
well for up to a thousand nodes or so
 First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
 Cannot avoid overlaps when projected to
2D
 G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
 Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
54
Visualizing Complex Data and Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags
 The importance of tag is
represented by font
size/color
 Besides text data, there are
also methods to visualize
relationships, such as
visualizing social networks
Newsmap: Google News Stories in 2005

Measuring Data Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects
are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity 56
Data Matrix and Dissimilarity Matrix
 Data matrix
 n data points with p  x11 ... x1f ... x1p 
 
dimensions  ... ... ... ... ... 
 Two modes x ... xif ... xip 
 i1 
 ... ... ... ... ... 
x ... xnf ... xnp 
 n1 
 Dissimilarity matrix
 n data points, but  0 
 d(2,1) 0 
registers only the  
distance  d(3,1) d ( 3,2) 0 
 
 A triangular matrix  : : : 
 Single mode
d ( n,1) d ( n,2) ... ... 0
57
Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., red, yellow, blue,

green (generalization of a binary attribute)
 Method 1: Simple matching
 m: # of matches, p: total # of variables
d (i, j)  p 
p
m
 Method 2: Use a large number of binary attributes
 creating a new binary attribute for each of the
M nominal states
58
Proximity Measure for Binary Attributes
Object j
 A contingency table for binary data
Object i
 Distance measure for symmetric

binary variables:
 Distance measure for asymmetric
binary variables:
 Jaccard coefficient (similarity
measure for asymmetric binary
variables):
 Note: Jaccard coefficient is the same as “coherence”:
59
Dissimilarity between Binary Variables
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 Gender is a symmetric attribute

 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
60
Jaccard’s Coefficient for non binary
61
Standardizing Numeric Data
 Z-score:
x
z   
 X: raw score to be standardized, μ: mean of the population, σ: standard
deviation
 the distance between the raw score and the population mean in units of
the standard deviation
 negative when the raw score is below the mean, “+” when above
 An alternative way: Calculate the mean absolute deviation
s f  1n (| x1 f  m f |  | x2 f  m f | ... | xnf  m f |)
where
m f  1n (x1 f  x2 f  ...  xnf )
.
x m
if f
zif  sf
 standardized measure (z-score):
 Using mean absolute deviation is more robust than using standard deviation
62
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
x2 x4
point attribute1 attribute2
4 x1 1 2
x2 3 5
x3 2 0
x4 4 5
2 x1
Dissimilarity Matrix
(with Euclidean Distance)
x3
0 4 x1 x2 x3 x4
2
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
63
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
64
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
 E.g., the Hamming distance: the number of bits that are different
between two binary vectors
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1 i2 j2 ip jp
 h = 2: (L2 norm) Euclidean distance

d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 )
i1 j1 i2 j2 ip jp
 h  . “supremum” (Lmax norm, L norm) distance.

 This is the maximum difference between any component (attribute)
of the vectors
65
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
x2 x4
L2 x1 x2 x3 x4
4 x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
2 x1
Supremum
L x1 x2 x3 x4
x1 0
x2 3 0
x3 x3 2 5 0
0 2 4 x4 3 1 5 0
66
Videos
Box-Plot Case Study:
https://www.youtube.com/watch?v=CoVf1jLxgj4
Calculating Standard Deviation:

https://www.youtube.com/watch?v=qqOyy_NjflU
Z-Score Case
https://www.youtube.com/watch?v=2JjaWQZChqs
Different Distance Measurement

https://www.youtube.com/watch?v=_EEcjn0Uirw
Backword z-score measurement:

https://www.youtube.com/watch?v=jcEvIXQRel0
Jaccard Coefficient:
https://www.youtube.com/watch?v=Vbdki_gnnYM
67

Week 2 - 3getting To Know Your Data

Uploaded by

Copyright:

Available Formats

You might also like

Week 2 - 3getting To Know Your Data

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Week 2 - 3getting To Know Your Data

Uploaded by

Copyright:

Available Formats

CIS 517 :Data Mining and

Week2 & 3: Getting to Know Your

Instructor : Dr. Gomathi Krishna

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 An attribute is a property or 1 Yes Single 125K No

points ,or instance

 E.g., zip codes, profession, or the set of words in a

 Note: Binary attributes are a special case of discrete

 E.g., temperature, height, or weight

 Practically, real values can only be measured and

represented using a finite number of digits

1 Yes Single 125K No

 Such data set can be represented by an m by n matrix,

10.23 5.27 15.22 2.7 1.2

times the corresponding term occurs in the

of products purchased by a customer during

 Examples of data quality problems:

talking on a poor phone and “snow” on

Two Sine Waves Two Sine Waves + Noise

(e.g., people decline to give their age and

(e.g., annual income is not applicable to

 Handling missing values

 Estimate Missing Values

 Ignore the Missing Value During Analysis

which measure the location of the middle or center of a

 How are the data spread out?

The most common data dispersion measures are the

range, quartiles, the five-number summary and boxplots;

positively skewed negatively skewed

5/5/23 Data Mining: Concepts and Techniques 23

5/5/23 Data Mining: Concepts and Techniques 24

 Variance: (algebraic, scalable computation)

 Standard deviation s (or σ) is the square root of variance s2 (or σ2)

5/5/23 Data Mining: Concepts and Techniques 29

 Boxplot: graphic display of five-number summary

 The two histograms

Data Mining: Concepts and Techniques 33

 The left half fragment is positively

 Search for patterns, trends, structure, irregularities, relationships

 Categorization of visualization methods:

 Geometric projection visualization techniques

 Icon-based visualization techniques

 Hierarchical visualization techniques

 Visualizing complex data and relations

 Visualization of geometric transformations and projections

Data Mining: Concepts and Techniques 40

Used by ermission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots]

 Visualization of the data as perspective landscape

 Visualization of the data values as features of icons

 REFERENCE: Gonick, L. and Smith, W.

Data Mining: Concepts and Techniques 47

 Visualization of the data using a hierarchical

 Partitioning of the n-dimensional attribute space in 2-D

Schneiderman@UMD: Tree-Map of a File System Schneiderman@UMD: Tree-Map to support