Professional Documents
Culture Documents
Understanding Data: Adapted From Book Slides of Jiawei Han, Micheline Kamber, and Jian Pei
Understanding Data: Adapted From Book Slides of Jiawei Han, Micheline Kamber, and Jian Pei
Understanding Data: Adapted From Book Slides of Jiawei Han, Micheline Kamber, and Jian Pei
* Adapted from book slides of Jiawei Han, Micheline Kamber, and Jian Pei
1
Understanding Data
2
Data Representation
3
Understanding Data
4
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix,
timeout
season
coach
game
score
team
ball
lost
pla
crosstabs
wi
n
y
Document data: text documents: term-
frequency vector
Document 1 3 0 5 0 2 6 0 2 0 2
Transaction data
Graph and network Document 2 0 7 0 2 1 0 0 3 0 0
World Wide Web
Document 3 0 1 0 0 1 2 2 0 3 0
Social or information networks
Molecular Structures
Ordered TID Items
Video data: sequence of images
1 Bread, Coke, Milk
Temporal data: time-series
Sequential Data: transaction sequences 2 Beer, Bread
Genetic sequence data 3 Beer, Coke, Diaper, Milk
Spatial, image and multimedia: 4 Beer, Bread, Diaper, Milk
Spatial data: maps 5 Coke, Diaper, Milk
Image data:
Video data:
5
Data Objects
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
7
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., Covid
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
8
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in C˚or F˚, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
e.g., temperature in Kelvin, length, counts,
monetary quantities
9
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
collection of documents
Sometimes, represented as integer variables
attributes
Continuous Attribute
Has real numbers as attribute values
floating-point variables
10
Understanding Data
11
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
12
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population): 1 n
x xi x
Note: n is sample size and N is population size. n i 1 N
n
Weighted arithmetic mean:
Trimmed mean: chopping extreme values
w x i i
x i 1
n
Median:
w i
Middle value if odd number of values, or average of the i 1
n / 2 ( freq )l
Mode median L1 ( ) width
freq median
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
mean mode 3 (mean median)
13
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, symmetric
positively and negatively skewed data
N
i 1
( xi
2
)
N
xi 2
i 1
2
15
Boxplot Analysis
Five-number summary of a distribution
Minimum, Q1, Median, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the
box
Whiskers: two lines outside the box extended
to Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
16
Graphic Displays of Basic Statistical Descriptions
18
Histograms Often Tell More than Boxplots
19
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For a data x data sorted in increasing order, f
i i
indicates that approximately 100 fi% of the data are
below or equal to the value xi
21
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
22
Positively and Negatively Correlated Data
23
Uncorrelated Data
24
In-class activity 1
A financial advisor offers two products to Mr. I to invest his
capital. Assuming an investment of Rs. 1M in 2011, the
yearly returns from the two products (In Rs.) have been
listed in file S3_InvestmentReturns. Please go through the
file and suggest which of the products is better! You may
assume that the pattern and magnitude of returns would
remain similar in future.
25
In-class activity – Crime rate and
police
Commissionerate (Police) of a metro city in India wants to study the
association between the number of police personnel and crime rate in
an area. Ravi Shankar, a newly recruited officer, with an interest and
background and statistics has been assigned the task to oversee this
study. He collects data for the average number of police personnel
deployed within the jurisdiction of each police station in the city, and
number of serious crimes reported during the last 12 months. He
shares data with you (file S3_CrimeRateAndPolicePersonnel) and
requests you to analyze and answer the following question: