
CP610 Data Analysis

- Data Preprocessing & Visualization


Data Preprocessing
• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary

Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean/median/mode
• the attribute mean/median/mode for all samples belonging
to the same class: smarter
• the most probable value: inference-based, such as a Bayesian
formula or decision tree (see the sketch below)
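A minimal pandas sketch of these fill-in strategies; the DataFrame, column names, and values are hypothetical.

import pandas as pd

# Hypothetical data: "occupation" and "salary" contain missing values.
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000, 61000, None, 48000],
    "class": ["A", "A", "B", "B"],
})

# Global constant: treat missing occupation as its own category.
df["occupation"] = df["occupation"].fillna("unknown")

# Attribute mean: fill missing salary with the overall mean.
df["salary_overall"] = df["salary"].fillna(df["salary"].mean())

# Class-conditional mean: fill with the mean of the same class (smarter).
df["salary_by_class"] = df["salary"].fillna(
    df.groupby("class")["salary"].transform("mean"))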

How to Handle Noisy Data?
• Binning
• first sort data and partition into bins
• then one can smooth by bin means, bin medians,
bin boundaries, etc. (see the sketch below)
• Regression
• smooth by fitting the data to regression functions
• Clustering
• detect and remove outliers
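A minimal NumPy sketch of smoothing by bin means and by bin boundaries, using equal-frequency bins; the price values and bin count are hypothetical.

import numpy as np

# Hypothetical sorted values, partitioned into three equal-frequency bins.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = np.array_split(np.sort(prices), 3)

# Smooth by bin means: every value in a bin is replaced by the bin mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
# -> [9, 9, 9, 22, 22, 22, 29, 29, 29]

# Smooth by bin boundaries: each value snaps to the nearest bin edge.
by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins])
# -> [4, 4, 15, 21, 21, 24, 25, 25, 34]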

Data Preprocessing
• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#


• Integrate metadata from different sources

• Handling Redundancy in Data Integration


• Object identification: The same attribute or object may have different
names in different databases
• Derivable data: One attribute may be a “derived” attribute in another
table
• Redundant attributes may be detected by correlation analysis

Data Preprocessing
• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary

Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• …
• Numerosity reduction (some simply call it: Data Reduction)
• Histograms, clustering, sampling
• Regression and Log-Linear Models
• …
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data analysis
• Allow easier visualization
• Dimensionality reduction techniques
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
• …
Geometric interpretation of PCA
• Which variable is the principal variable?
• Max variance

• First PC is direction of maximum variance from
origin
• Subsequent PCs are orthogonal to 1st PC and
describe maximum residual variance

PCA transformation
• Coefficients of the linear combination that
transforms the observations onto the PCs are
formed by the eigenvectors of the covariance matrix
• Covariance matrix (3 dimensions):
  Σ = [ var(x)   cov(x,y)  cov(x,z)
        cov(y,x) var(y)    cov(y,z)
        cov(z,x) cov(z,y)  var(z) ]

PCA Algorithm
• Input: Data Matrix
• Step 1: Normalize data matrix
• Step 2: Get Covariance Matrix
• Step 3: Calculate eigenvectors and eigenvalues of
the covariance matrix
• Step 4: Sort the eigenvectors: take the
eigenvalues λ₁, λ₂, …, λp and sort them from largest
to smallest
• Step 5: Choose the first k eigenvectors and calculate
the new features

• Project the standardized points into the new feature
space (see the NumPy sketch below)
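A minimal NumPy sketch of the five steps; the data matrix X is hypothetical.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # hypothetical n-samples x p-features

# Step 1: standardize each column (zero mean, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (p x p).
C = np.cov(Z, rowvar=False)

# Steps 3-4: eigendecomposition, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the first k eigenvectors and project the standardized points.
k = 2
X_pca = Z @ eigvecs[:, :k]                 # new k-dimensional features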

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in
one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA

• Task: Supervised (Classification/Regression)
• Iterative process
• Subset generation
• Subset selection
• Termination condition

• Search strategies
• Forward selection
• Backward elimination
(see the greedy forward-selection sketch below)
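A greedy forward-selection sketch; the `score` helper is hypothetical (e.g., cross-validated accuracy of a model trained on the candidate feature subset), and `score([])` is assumed to return a baseline value.

def forward_select(all_features, score, max_features):
    """Greedy forward search: grow the subset while the score improves."""
    selected = []
    while len(selected) < max_features:          # termination condition
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        # Subset generation + selection: try adding each remaining feature.
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                                # no further improvement
        selected.append(best)
    return selected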
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative,
smaller forms of data representation

• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

• Parametric methods
• Assume the data fits some model and store the model
parameters instead of the data (details on a later slide)

Sampling
• Choose a representative subset of the data
• Simple random sampling
• There is an equal probability of
selecting any particular item
• With vs. without replacement
• Stratified sampling:
• Partition the data set, and draw
samples from each partition
(proportionally, i.e.,
approximately the same
percentage of the data)
• Used in conjunction with skewed
data
(see the sampling sketch below)
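A minimal pandas sketch of simple random sampling (with and without replacement) and proportional stratified sampling; the DataFrame and "stratum" column are hypothetical.

import pandas as pd

df = pd.DataFrame({"stratum": ["A"] * 90 + ["B"] * 10,
                   "value": range(100)})

# Simple random sampling: every row equally likely to be chosen.
srs_wo = df.sample(n=20, replace=False, random_state=0)   # without replacement
srs_w = df.sample(n=20, replace=True, random_state=0)     # with replacement

# Stratified sampling: draw the same fraction from each partition,
# so skewed strata stay proportionally represented.
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0))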
Histogram Analysis
• Divide data into buckets and store the
average (sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth): equal bucket count
(see the bucketing sketch below)
[Figure: equal-width histogram over values from 10,000 to 100,000]
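A minimal pandas sketch of the two partitioning rules; the salary values are hypothetical.

import pandas as pd

salaries = pd.Series([12, 15, 22, 28, 31, 45, 52, 58, 70, 95]) * 1000

equal_width = pd.cut(salaries, bins=4)   # equal bucket range
equal_freq = pd.qcut(salaries, q=4)      # equal bucket count (depth)

# Store one average per bucket, as a histogram summary would.
print(salaries.groupby(equal_width, observed=True).mean())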
Parametric methods
• E.g. regression: assume the data fits some model and estimate the
model parameters – a compression of the data (see the sketch below)
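A minimal NumPy sketch: many noisy points are compressed to the two fitted parameters of a linear model. The data are hypothetical.

import numpy as np

x = np.arange(50, dtype=float)
y = 3.0 * x + 7.0 + np.random.default_rng(1).normal(scale=2.0, size=50)

# The whole (x, y) data set is reduced to two parameters (slope, intercept).
w, b = np.polyfit(x, y, deg=1)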
Data Preprocessing
• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary

Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
• Methods
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization

Normalization
• Min-max normalization: to [new_min_A, new_max_A]
  v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v - μ_A) / σ_A
• Ex. From the data, μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
• Ex. j = 5; 73,600 / 10^5 = 0.736
(a worked sketch of all three follows)
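A worked sketch of the three normalizations on the income example above.

v = 73_600.0

# Min-max to [0, 1]:
mn, mx = 12_000.0, 98_000.0
v_minmax = (v - mn) / (mx - mn) * (1.0 - 0.0) + 0.0     # 0.716

# Z-score:
mu, sigma = 54_000.0, 16_000.0
v_z = (v - mu) / sigma                                  # 1.225

# Decimal scaling: j = 5 is the smallest integer with max(|v'|) < 1.
v_dec = v / 10 ** 5                                     # 0.736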
Discretization
• Discretization: Divide the range of a continuous attribute into intervals.
Prepare for further analysis, e.g., classification, association rule mining…

• Interval labels can then be used to replace actual data values


• Reduce data size by discretization
• Discretization can be performed recursively on an attribute
• Split (top-down) vs. merge (bottom-up)
• Supervised vs. unsupervised approaches
• Equal Width Binning, Equal Frequency Binning, Clustering-Based Discretization,
Density-Based Discretization, Decision Tree-Based Discretization, …

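A minimal pandas sketch of replacing values with interval labels via equal-width and equal-frequency binning; the ages and labels are hypothetical.

import pandas as pd

ages = pd.Series([3, 7, 18, 22, 25, 31, 40, 44, 58, 63, 71, 85])

# Equal-width intervals, replaced by interval labels.
width_labels = pd.cut(ages, bins=4,
                      labels=["child", "young", "middle", "senior"])

# Equal-frequency intervals (quartiles).
freq_labels = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])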
Summary about data preparation
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data transformation and data discretization

Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics

• Coordinate systems and axes


• Color Scales
• Visualization for different types of data
• Amount
• Distribution
•…
Visualizing data: Mapping data
onto aesthetics
• Commonly used aesthetics in data visualization:
• Position
• Shape
• Size
• Color
• Line width
• Line type


Which one(s) can only represent discrete data?


Scales map data values onto
aesthetics
• The mapping between data values and aesthetics
values is created via scales.

• A scale must be one-to-one

• Data visualization is part art and part science.

Examples of ugly, bad, and wrong figures


Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics

• Coordinate systems and axes


• Color Scales
• Visualization for different types of data
• Amount
• Distribution
•…
Coordinate systems and axes

• Position scales -
determine where in a
graphic different data
values are located.
• 2-dimension visualizations
- two numbers are
required to uniquely
specify a point, and
therefore we need two
position scales.
• 3-d: 3 position scales

• Cartesian coordinates
• The most widely used coordinate system
• The axes run orthogonally to each other
• Data values are placed at even spacing along both axes
Example: two axes representing two
different units

Example: Same unit & change in unit

Note:
• Use equal grids for same unit
• Cartesian coordinate systems are invariant under linear
transformations
• What if we want to visualize highly skewed data?

• Nonlinear axes
• Even spacing in data units corresponds to uneven spacing in the
visualization

Example: log scale

• curved axes
• polar coordinate
• Pole
• Radius
• Polar angle

• geospatial data

Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics

• Coordinate systems and axes


• Color Scales
• Visualization for different types of data
• Amount
• Distribution
•…
Color Scales
• There are three fundamental use cases for color in
data visualizations:
• (i) distinguish groups of data from each other - discrete
• (ii) represent numerical data values - continuous
• (iii) highlight

Color as a tool to distinguish
• We frequently use color to distinguish discrete
items or groups that do not have an intrinsic order,
such as different countries on a map or different
manufacturers of a certain product.
• In this case, we use a qualitative color scale. Such a
scale contains a finite set of specific colors that are
chosen to look clearly distinct from each other
while also being equivalent to each other.
• The second condition requires that no one color
should stand out relative to the others.

Color to represent numerical
values
• Color can also be used to represent data values, such as
income, temperature, or speed – continuous
• In this case, we use a sequential color scale. Such a
scale contains a sequence of colors that clearly indicate
• (i) which values are larger or smaller than which other ones
and
• (ii) how distant two specific values are from each other.
• Sequential scales can be based on a single hue (e.g.,
from dark blue to light blue) or on multiple hues (e.g.,
from dark red to light yellow).

Color as a tool to highlight
• There may be specific categories or values in the
dataset that carry key information about the story
we want to tell, and we can strengthen the story by
emphasizing the relevant figure elements to the
reader.
• This effect can be achieved with accent color scales,
which are color scales that contain both a set of
subdued colors and a matching set of stronger,
darker, and/or more saturated colors.

Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics

• Coordinate systems and axes


• Color Scales
• Visualization for different types of data
• Amount
• Distribution
•…
Visualizing amounts
• We have a set of categories (e.g., brands of cars,
cities, or sports) and a quantitative value for each
category.
• The standard visualization in this scenario is the bar
chart. Variations include grouped and stacked bars.
• Alternatives to the bar plot are the dot plot and
the heatmap.

Bar Plot/Chart
• Commonly visualized with vertical bars.

• A bar plot/chart
• presents categorical data
• with rectangular bars
• the bars’ heights or lengths are proportional to the
values that they represent.
• One axis of the chart shows the specific categories being
compared
• the other axis represents a measured value.

• The bars can be plotted vertically or horizontally (see the sketch below).

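A minimal matplotlib sketch of vertical vs. horizontal bars; the categories and values are hypothetical.

import matplotlib.pyplot as plt

brands = ["Ford", "Toyota", "BMW", "Kia"]
sales = [240, 310, 150, 120]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(brands, sales)        # heights proportional to the values
ax2.barh(brands, sales)       # same data, horizontal bars
plt.show()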
• Regardless of whether we place bars vertically or
horizontally, we need to pay attention to the order
in which the bars are arranged.

• Whenever there is a natural ordering (i.e., when
our categorical variable is an ordered factor) we
should retain that ordering in the visualization.

Grouped bars
• When we are interested in two categorical variables
at the same time, we can visualize this dataset with
a grouped bar plot.
• we first draw a group of bars at each position along
the x axis, determined by one categorical variable
• then we draw bars within each group according to the
other categorical variable

Stacked Bars
• Instead of drawing groups of bars side-by-side, it is
sometimes preferable to stack bars on top of each
other.
• Stacking is useful when the sum of the amounts
represented by the individual stacked bars is in itself
a meaningful amount.
• Stacked bar charts are designed to help you
simultaneously compare totals and notice sharp
changes at the item level that are likely to have the
most influence on movements in category totals
(see the sketch below).

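A minimal matplotlib sketch of grouped vs. stacked bars over two hypothetical categorical variables (city and year).

import numpy as np
import matplotlib.pyplot as plt

cities = ["Waterloo", "Toronto", "Ottawa"]
y2023 = np.array([20, 45, 30])
y2024 = np.array([25, 50, 28])
x = np.arange(len(cities))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(x - 0.2, y2023, width=0.4, label="2023")   # grouped: bars side by side
ax1.bar(x + 0.2, y2024, width=0.4, label="2024")
ax1.set_xticks(x)
ax1.set_xticklabels(cities)
ax2.bar(cities, y2023, label="2023")               # stacked: bars on top of
ax2.bar(cities, y2024, bottom=y2023, label="2024") # each other; total is meaningful
for ax in (ax1, ax2):
    ax.legend()
plt.show()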
Dot plots and heatmaps
• Bars are not the only option for visualizing
amounts.
• One important limitation of bars is that they need
to start at zero, so that the bar length is
proportional to the amount shown.
• In this case, we can indicate amounts by placing
dots at the appropriate locations along
the x or y axis.

Heatmap
• As an alternative to mapping data values onto
positions via bars or dots, we can map data values
onto colors. Such a figure is called a heatmap.
• Heat maps make it easy to visualize complex data
and understand it at a glance.

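A minimal matplotlib sketch of mapping values onto colors; the matrix of values is hypothetical.

import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(2).random((5, 8))   # hypothetical rows x columns

fig, ax = plt.subplots()
im = ax.imshow(values, cmap="viridis")             # sequential color scale
fig.colorbar(im, ax=ax, label="value")
plt.show()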
Figure: Internet adoption over time, for select countries. Color represents the percent
of internet users for the respective country and year. Countries were ordered by
percent internet users in 2016. Data source: World Bank

Figure: Internet adoption over time, for select countries. Countries were ordered by
the year in which their internet usage first exceeded 20%. Data source: World Bank

Figure: A click map of user clicks on web vs. mobile app
Figure: Stock price over time for four
major tech companies. The stock price
for each company has been normalized
to equal 100 in June 2012.
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics

• Coordinate systems and axes


• Color Scales
• Visualization for different types of data
• Amount
• Distribution
•…
Visualizing distributions
• Visualizing distributions: Histograms and density
plots
• Visualizing a single distribution
• Visualizing multiple distributions at the same time
• Visualizing distributions: Empirical cumulative
distribution functions and q-q plots
• Empirical cumulative distribution functions
• Highly skewed distributions
• Quantile–quantile plots

Visualizing distributions:
Histograms and density plots
• How is a particular variable distributed in a
dataset?

• E.g. There were approximately 1300 passengers on the Titanic and we


have reported ages for 756 of them. We might want to know how many
passengers of what ages there were on the Titanic, i.e., how many
children, young adults, middle-aged people, seniors, and so on. We call
the relative proportions of different ages among the passengers the age
distribution of the passengers.

• The age distribution among the passengers by
grouping all passengers into bins with comparable
ages and then counting the number of passengers
in each bin

Histogram
• A histogram displays the shape and spread of
continuous sample data.

Figure: bin widths of (a) one year; (b) three years; (c) five years; (d) fifteen years.
Density plot
• Visualize the underlying probability distribution of
the data by drawing an appropriate continuous
curve
• Probability Density
• A random variable x has a probability distribution f(x).
• The relationship between the outcomes of a random
variable and its probability is referred to as the
probability density, or simply the “density.”

• Note that we have two requirements on f(x):
f(x) ≥ 0 for all x, and ∫ f(x) dx = 1 (the total area under the curve is one)

• Density estimation
• All we have access to is a sample of observations
• We must assume a probability distribution

• Kernel Density Estimation


• A nonparametric method for using a dataset to estimate
probabilities for new points (see the sketch below)
• Kernel Density Estimation [McLachlan, 1992, Silverman, 1998]

Figure: The height of the curve is scaled such that the area under the curve equals
one. The density estimate was performed with a Gaussian kernel and a
bandwidth of 2.
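A minimal SciPy sketch of a Gaussian-kernel KDE with bandwidth ≈ 2, as in the figure above; the age sample is hypothetical.

import numpy as np
from scipy.stats import gaussian_kde

ages = np.random.default_rng(3).uniform(0, 80, size=756)  # hypothetical sample

# gaussian_kde takes a factor that multiplies the sample std, so divide
# the target bandwidth (2) by the std to get a kernel std of about 2.
kde = gaussian_kde(ages, bw_method=2 / ages.std(ddof=1))

grid = np.linspace(0, 80, 200)
density = kde(grid)   # curve scaled so the area under it equals one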
Figure: (a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2;
(c) Gaussian kernel, bandwidth = 5; (d) rectangular kernel, bandwidth = 2.
• Be careful with the tails

• Histogram and density plots
• both are highly intuitive and visually appealing
• both share the limitation that the resulting figure
depends to a substantial degree on parameters the user
has to choose, such as the bin width for histograms and
the bandwidth for density plots.
• both have to be considered as an interpretation of the
data rather than a direct visualization of the data itself.

Visualizing distributions: Empirical
cumulative distribution functions
and q-q plots
• Aggregate methods that highlight properties of the
distribution rather than the individual data points
• Require no arbitrary parameter choices
• Show all of the data at once
• A little less intuitive

Empirical cumulative
distribution function (ECDF)
• An ECDF is an estimator of the Cumulative
Distribution Function.
• If you have a set of samples (X1 < X2 < … < Xn) from
an observed random variable, then the ECDF is
  Fn(x) = (number of samples Xi ≤ x) / n
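A minimal NumPy sketch of the ECDF; the exam scores are hypothetical.

import numpy as np

scores = np.random.default_rng(4).integers(0, 101, size=50)  # hypothetical exam

xs = np.sort(scores)                          # ascending rank order
ecdf = np.arange(1, len(xs) + 1) / len(xs)    # normalized ranks

# ecdf[i] estimates P(X <= xs[i]); plotting xs against ecdf gives the
# (normalized) ECDF curve from the slides.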
• Assume our hypothetical class has 50 students, and
the students just completed an exam on which they
could score between 0 and 100 points.
• How can we best visualize the class performance,
for example to determine appropriate grade
boundaries?

• A different way of thinking about this visualization
is the following:
• We can rank all students by the number of points they
obtained, in ascending order (so the student with the
fewest points receives the lowest rank and the student
with the most points the highest)
• Then plot the rank versus the actual points obtained.

• ECDF (not normalized)

• ECDF (normalized)

Highly skewed distributions
• Many empirical
datasets display highly
skewed distributions,
in particular with
heavy tails to the right,
and these distributions
can be challenging to
visualize.

Log Transformation

Quantile–quantile plots
• Quantile–quantile (q-q) plots are a useful
visualization when we want to determine to what
extent the observed data points do or do not follow
a given distribution.
• q-q plots are also based on ranking the data and
visualizing the relationship between ranks and
actual values
• The ranks are used to predict where a given data
point should fall if the data were distributed
according to a specified reference distribution.

Example:
• Assume the data values have a mean of 10 and a
standard deviation of 3
• Assuming a normal distribution, we would expect
• a data point ranked at the 50th percentile to lie at
position 10 (the mean)
• a data point at the 84th percentile to lie at position 13
(one standard deviation above the mean)
•…

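A minimal SciPy sketch of a normal q-q comparison matching the example above (mean 10, standard deviation 3); the data sample is hypothetical.

import numpy as np
from scipy import stats

data = np.random.default_rng(5).normal(10, 3, size=200)  # hypothetical sample

xs = np.sort(data)                                       # observed quantiles
ranks = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)      # percentile per rank
expected = stats.norm.ppf(ranks, loc=10, scale=3)        # predicted positions

# Points near the line y = x indicate the data follow N(10, 3^2); e.g. the
# 84th percentile is predicted at about 10 + 3 = 13.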
Amounts

Distributions

Proportions

Relationships

Geospatial data

Uncertainty
• Reference for visualization
• Data Visualization with Python: Create an impact with
meaningful data insights using interactive and engaging
visuals. By Mario Dobler and Tim Großmann. (ISBN-13: 978-1789956467)
• Interactive Visualization: Insight through Inquiry. By Bill
Ferster and Ben Shneiderman. (ISBN-13: 978-0262018159)
• Fundamentals of Data Visualization. By Claus O. Wilke.
(ISBN-13: 978-1492031086)
