TTDS Lecture 2


TOOLS & TECHNIQUES

FOR DATA SCIENCE


LECTURE 2
Data, Pre-processing and Post-processing

Prepared by Khalid Mahboob


Types of Data Sets
 Record
    Relational records
    Data matrix, e.g., numerical matrix, crosstabs
    Document data: text documents represented as term-frequency vectors
    Transaction data
 Graph and network
    World Wide Web
    Social or information networks
    Molecular structures
 Ordered
    Video data: sequence of images
    Temporal data: time-series
    Sequential data: transaction sequences
    Genetic sequence data
 Spatial, image and multimedia
    Spatial data: maps
    Image data
    Video data

Example document-term matrix (term counts per document):

           team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3      0     5     0      2     6    0     2        0       2
Document 2    0      7     0     2      1     0    0     3        0       0
Document 3    0      1     0     0      1     2    2     0        3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Data Collection
Pipeline: Data Collection -> Data Preprocessing -> Data Mining -> Post-processing -> Result

 Today there is an abundance of data online


 Facebook, Twitter, Wikipedia, Web, etc…
 We can extract interesting information from this data, but first we need to collect it
 Customized crawlers, use of public APIs
 Additional cleaning/processing to parse out the useful parts
 Respect crawling etiquette (robots.txt, rate limits); see the sketch below
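As an illustration only (not part of the original slides), here is a minimal Python sketch of polite collection from a public JSON API; the URL and field names are hypothetical placeholders.

```python
import time
import requests  # assumes the requests package is installed

API_URL = "https://api.example.org/posts"  # hypothetical endpoint

def collect(pages=3, delay=1.0):
    """Fetch a few pages of JSON records, pausing between requests
    to respect rate limits (basic crawling etiquette)."""
    records = []
    for page in range(1, pages + 1):
        resp = requests.get(API_URL, params={"page": page}, timeout=10)
        resp.raise_for_status()
        # keep only the fields we actually need (cleaning/parsing step)
        for item in resp.json():
            records.append({"id": item.get("id"), "text": item.get("text")})
        time.sleep(delay)  # be polite: do not hammer the server
    return records

if __name__ == "__main__":
    data = collect()
    print(f"collected {len(data)} records")
```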
Data Quality

 Examples of data quality problems:
    Noise and outliers
    Missing values
    Duplicate data

Example records (Taxable Income of 10000K in Tid 5: a mistake or a millionaire?
NULL in Tid 6 and 7: missing values; the two Tid 9 rows: inconsistent duplicate entries):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes
6    No      NULL            60K             No
7    Yes     Divorced        220K            NULL
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No
Data Quality

 Data in the real world is dirty


 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 noisy: containing errors or outliers
 inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 Data warehouse needs consistent integration of quality data
 Required for both OLAP and Data Mining!
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily the data can be understood?
Why can Data be Incomplete?

 Attributes of interest are not available (e.g., customer


information for sales transaction data)
 Data were not considered important at the time of
transactions, so they were not recorded!
 Data not recorded because of misunderstanding or
malfunctions
 Data may have been recorded and later deleted!
 Missing/unknown values for some data
Why can Data be Noisy/Inconsistent?

 Faulty instruments for data collection


 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come at a
faster rate than they can be processed)
 Inconsistencies in naming conventions or data codes
(e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)
 Duplicate tuples (e.g., the same record received twice) should
also be removed
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Data Cleaning

 Data cleaning tasks


 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data


How to Handle Missing Data?

 Ignore the tuple: usually done when class label is missing (assuming the
tasks in classification)—not effective when the percentage of missing
values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class to fill
in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based
How to Handle Missing Data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

Fill missing values using aggregate functions (e.g., average) or probabilistic
estimates of the global value distribution:
E.g., put the average income here, or put the most probable income given
that the person is 39 years old
E.g., put the most frequent religion here
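A minimal pandas sketch of these fill-in strategies on the small table above; the column names come from the example, but the choice of mean fill, mode fill, and per-group mean is an illustration, not a prescribed implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 39, 45],
    "Income": [24200, np.nan, 45390],
    "Religion": ["Muslim", "Christian", np.nan],
    "Gender": ["M", "F", "F"],
})

# Numeric attribute: fill with the attribute mean
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Nominal attribute: fill with the most frequent value (mode)
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

# Smarter variant (per-class mean): fill income with the mean within each
# group, e.g., grouped by Gender (illustrative choice of class attribute)
# df["Income"] = df.groupby("Gender")["Income"].transform(lambda s: s.fillna(s.mean()))

print(df)
```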
Inconsistent Data

 Inconsistent data are handled by:


 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct them. E.g.,
the routine may check global constraints (e.g., age > 10) or functional
dependencies

 Other inconsistencies (e.g., between names of the same attribute) can be


corrected during the data integration process
Data Integration

 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different (e.g., J.D. Smith and John Smith may refer
to the same person)
 possible reasons: different representations, different scales, e.g.,
metric vs. British units (inches vs. cm)
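As a hedged illustration of schema integration and unit-conflict resolution (the column names and the inch-to-centimetre conversion are assumptions, not from the slides), a small pandas sketch:

```python
import pandas as pd

# Source A uses "cust_id" and height in centimetres;
# source B uses "cust_no" and height in inches (hypothetical schemas).
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_no": [2, 3], "height_in": [71.0, 65.0]})

# Schema integration: map both sources onto one attribute name
b = b.rename(columns={"cust_no": "cust_id"})

# Resolve the value conflict by converting to a common scale (inches -> cm)
b["height_cm"] = b["height_in"] * 2.54
b = b.drop(columns=["height_in"])

# Combine into a coherent store, dropping exact duplicate entities
merged = pd.concat([a, b]).drop_duplicates(subset="cust_id")
print(merged)
```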
Handling Redundant Data in Data Integration

 Redundant data occur often when integrating multiple
databases
 The same attribute may have different names in different databases
 One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
 Redundant data may be able to be detected by correlation
analysis and covariance analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Nominal Data)
 Χ2 (chi-square) test
χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)       200 (360)        450
Not like science fiction    50 (210)     1000 (840)       1050
Sum (col.)                  300           1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories):

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

 It shows that like_science_fiction and play_chess are


correlated in the group
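A short Python check of the same contingency table, assuming scipy is available; correction=False reproduces the plain chi-square sum used above, and the reported expected counts match the numbers in parentheses.

```python
import numpy as np
from scipy.stats import chi2_contingency  # assumes scipy is installed

# Observed counts: rows = {like sci-fi, not like}, cols = {play chess, not play}
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(chi2)      # ~507.93, matching the hand calculation
print(expected)  # [[ 90. 360.] [210. 840.]]
print(p)         # tiny p-value: the two attributes are strongly associated
```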
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product


moment coefficient)

i 1 (ai  A)(bi  B) 
n n
(ai bi )  n AB
rA, B   i 1
(n  1) A B (n  1) A B

where n is the number of tuples, A and


B are the respective means
of A and B, σA and σB are the respective standard deviation of A and
B, and Σ(aibi) is the sum of the AB cross-product.

 If r(A,B) > 0, A and B are positively correlated (A's values
increase as B's do). The higher the value, the stronger the correlation.
 r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.
Correlation (viewed as linear relationship)

 Correlation measures the linear relationship between objects


 To compute correlation, we standardize data objects, A and B, and then take
their dot product

a 'k  (ak  mean( A)) / std ( A)

b'k  (bk  mean( B )) / std ( B )

correlatio n( A, B )  A' B '
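A small numpy sketch of this standardize-then-dot-product view of correlation (the sample arrays are made up for illustration); note that with the sample standard deviation (n − 1) the dot product is divided by n − 1 to give the Pearson coefficient, which is cross-checked against np.corrcoef.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # illustrative values
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each object (ddof=1 -> sample standard deviation, n - 1)
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

# Correlation as the dot product of the standardized objects, scaled by n - 1
r = np.dot(A_std, B_std) / (len(A) - 1)

print(r)                        # manual Pearson correlation
print(np.corrcoef(A, B)[0, 1])  # same value from numpy's built-in
```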


Covariance (Numeric Data)
 Covariance is similar to correlation:

Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

Correlation coefficient: r(A, B) = Cov(A, B) / (σA σB)

 where n is the number of tuples, Ā and B̄ are the respective mean or expected
values of A and B, and σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example

 It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄

 Suppose two stocks A and B have the following values in one week: (2, 5),
(3, 8), (5, 10), (4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
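A one-line check of the same numbers with numpy; bias=True uses the population formula (dividing by n), which is what the slide's E(A·B) − Ā·B̄ calculation does.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # stock A prices over one week
B = np.array([5, 8, 10, 11, 14])  # stock B prices over the same week

# bias=True divides by n, matching the slide's E(AB) - E(A)E(B) computation
cov_AB = np.cov(A, B, bias=True)[0, 1]
print(cov_AB)  # 4.0 -> positive, so A and B tend to rise together
```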


Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Reduction Strategies

 Warehouse may store terabytes of data: Complex data


analysis/mining may take a very long time to run on the
complete data set
 Data reduction
 Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Normalization
 Min-max normalization: to [new_minA, new_maxA]

v′ = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

v′ = (v − μA) / σA

 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

 Normalization by decimal scaling:

v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
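A minimal numpy sketch of the three normalizations applied to the income example; the values array is illustrative, only 73,600 and the μ, σ come from the slide.

```python
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])  # illustrative values

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())

# Z-score normalization with the slide's mean and standard deviation
mu, sigma = 54000.0, 16000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax[2])       # 0.716 for 73,600
print(zscore[2])       # 1.225 for 73,600
print(decimal_scaled)  # all values scaled into (-1, 1)
```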
Data Cube Aggregation

 The lowest level of a data cube


 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
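A tiny pandas groupby sketch of the idea behind cube aggregation: roll detailed records up to coarser levels and answer queries from the smaller aggregated table. The sales data is fabricated for illustration.

```python
import pandas as pd

# Fabricated detail records at the lowest level of the cube
sales = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C2"],
    "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":   [100, 150, 80, 40, 120],
})

# Aggregate to a higher level: total amount per customer per quarter
per_customer_quarter = sales.groupby(["customer", "quarter"], as_index=False)["amount"].sum()

# Further aggregation: per quarter only (an even smaller representation)
per_quarter = per_customer_quarter.groupby("quarter", as_index=False)["amount"].sum()

print(per_customer_quarter)
print(per_quarter)
```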
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):


 Select a minimum set of features such that the probability
distribution of different classes given the values for those features
is as close as possible to the original distribution given the values
of all features
 reduce # of attributes in the patterns, easier to understand
 Heuristic methods (due to exponential # of choices):
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
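A hedged sketch of step-wise forward selection using scikit-learn's SequentialFeatureSelector; the dataset and estimator choices are assumptions for illustration, not from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Step-wise forward selection: greedily add the attribute that most improves
# cross-validated accuracy, until 2 attributes are selected
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",   # use "backward" for step-wise backward elimination
    cv=3,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes
```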
Numerosity Reduction: Reduce the volume of data

 Parametric methods
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
 Log-linear models: obtain the value at a point in m-D space as the
product of appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
Discretization
 Three types of attributes:
 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:
 divide the range of a continuous attribute into intervals

 why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

 Supervised vs. unsupervised

 Split (top-down) vs. merge (bottom-up)

 Prepare for further analysis, e.g., classification


How to Handle Noisy Data?
Smoothing techniques
 Binning method
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 computer detects suspicious values, which are then checked by
humans
 Regression
 smooth by fitting the data into regression functions
 Use Concept hierarchies
 use concept hierarchies, e.g., price value -> “expensive”
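As an illustration of the binning smoothing method (the price values are a commonly used toy example, not from these slides): sort the data, split it into equi-depth bins, and replace each value by its bin mean.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # toy sorted data

n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)  # equi-depth partitioning

# Smoothing by bin means: every value in a bin is replaced by the bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```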
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling – good handling of skewed data
Simple Discretization Methods: Binning

Example: customer ages (histograms show the number of values per bin)

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
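A compact pandas sketch of the two partitionings on made-up customer ages; the ages are randomly generated for illustration, only the idea of age bins comes from the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 80, size=200))  # illustrative customer ages

# Equal-width binning: 8 intervals of identical width W = (max - min) / 8
equal_width = pd.cut(ages, bins=8)

# Equal-depth (frequency) binning: 8 intervals with roughly 25 values each
equal_depth = pd.qcut(ages, q=8)

print(equal_width.value_counts().sort_index())  # counts vary per bin
print(equal_depth.value_counts().sort_index())  # counts roughly equal
```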
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)

 Supervised: Given class labels, e.g., cancerous vs. benign

 Using entropy to determine split point (discretization point)

 Top-down, recursive split

 Correlation analysis (e.g., Chi-merge: χ2-based discretization)

 Supervised: use class information

 Bottom-up merge: find the best neighboring intervals (those having


similar distributions of classes, i.e., low χ2 values) to merge

 Merge performed recursively, until a predefined stopping condition


Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country             15 distinct values
province_or_state   365 distinct values
city                3,567 distinct values
street              674,339 distinct values
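A minimal pandas sketch of this heuristic (the sample records are fabricated for illustration): count distinct values per attribute and order the hierarchy from fewest distinct values (top) to most (bottom).

```python
import pandas as pd

# Fabricated location records for illustration
df = pd.DataFrame({
    "street":  ["12 Oak St", "3 Elm Ave", "7 Pine Rd", "9 Main St", "5 King St", "6 King St"],
    "city":    ["Urbana", "Champaign", "Chicago", "New York", "Toronto", "Toronto"],
    "state":   ["Illinois", "Illinois", "Illinois", "New York", "Ontario", "Ontario"],
    "country": ["USA", "USA", "USA", "USA", "Canada", "Canada"],
})

# Fewer distinct values -> higher level in the generated concept hierarchy
levels = df.nunique().sort_values()
print(levels)  # country (2) < state (3) < city (5) < street (6), top to bottom
```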


Post-processing

 Visualization
 The human eye is a powerful analytical tool
 If we visualize the data properly, we can discover patterns
 Visualization is the way to present the data so that patterns can be seen
 E.g., histograms and plots are a form of visualization
 There are multiple techniques (a field on its own)
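A short matplotlib sketch of the kind of visual post-processing mentioned above (a histogram and a scatter plot); the random data is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=500)                        # illustrative attribute
y = 0.8 * x + rng.normal(scale=0.5, size=500)   # correlated attribute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)    # histogram: distribution of one attribute
ax1.set_title("Histogram")
ax2.scatter(x, y, s=5)  # scatter plot: relationship between two attributes
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()
```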
Figure: scatter plot array of the Iris attributes
Figure: contour plot example, SST (sea surface temperature), Dec 1998, in degrees Celsius
