
Data Mining and Analysis
Data & Data Pre-processing for Data Mining (Part 2)
Dr Daqing Chen
Outline
• k-means clustering for detecting outliers
• Normalisation: min-max (range), z-score (standard score)
• Change of data type: categorical to numeric and numeric to categorical
• Data reduction: dimensionality reduction and correlation analysis
  – Pearson's correlation
  – Spearman's correlation
  – PCA
  – RBM, t-SNE, etc.


How to Use k-means Clustering for Detecting Outliers
• k-means clustering can be used for
  – Grouping samples based on similarities (e.g., Euclidean distance, one of the most widely used measures) and
  – Identifying anomalies (outliers) based on dissimilarities
• Usually, a small-sized cluster corresponds to outliers

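A minimal sketch of this idea in Python, assuming scikit-learn and NumPy are available; the cluster count k = 5 and the 5% size threshold are illustrative choices, not from the slides:

```python
# A minimal sketch (not from the slides): use k-means cluster sizes to flag
# outlier candidates. k=5 and the 5% size threshold are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))                 # toy data; replace with your samples
X = np.vstack([X, [[5.0, 5.0]]])         # one obvious anomaly far from the rest

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)          # number of samples in each cluster

small = np.flatnonzero(sizes < 0.05 * len(X))   # clusters holding < 5% of the data
outliers = np.isin(km.labels_, small)           # boolean mask over all samples
print(X[outliers])                              # likely just the injected anomaly
```

Any sample falling in a flagged cluster is only a candidate outlier; it should still be inspected before removal.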


Data Transformation: Normalisation
• Min-max normalisation
  – An original value $v$ of a variable $A$ is transformed to a new normalised value $v'$ as
    $v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
  – $[\min_A, \max_A]$: the value range of the original variable; usually this is known in a given data set
  – $[\text{new\_min}_A, \text{new\_max}_A]$: the value range of the new normalised variable, determined by the analyst; typically $[0, 1]$ or $[-1, 1]$
  – Also known as range normalisation


Data Transformation: Normalisation
• Min-max normalisation: an example
  – Consider variable Annual_Salary ranging from £10,000 to £50,000, to be normalised to [0, 1]
  – An original value of £20,000 is transformed to $(20{,}000 - 10{,}000)/(50{,}000 - 10{,}000) = 0.25$
  – Likewise, £10,000 → 0, £30,000 → 0.50, and £50,000 → 1
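A minimal sketch of the formula above in NumPy, using the slide's Annual_Salary range mapped to [0, 1]:

```python
# A minimal sketch of min-max (range) normalisation with NumPy.
import numpy as np

salary = np.array([10_000, 20_000, 30_000, 50_000], dtype=float)
v_min, v_max = salary.min(), salary.max()        # 10,000 and 50,000
new_min, new_max = 0.0, 1.0                      # target range chosen by the analyst

normalised = (salary - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalised)                                # [0.   0.25 0.5  1.  ]
```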


Data Transformation: Normalisation
• z-score normalisation
  – An original value $v$ of a variable $A$ is transformed to a new normalised value $v'$ as
    $v' = \frac{v - \mu_A}{\sigma_A}$
  – $\mu_A$ and $\sigma_A$: the mean (average) and the standard deviation of variable $A$
  – Transforms all variables to a common scale with a mean of zero and a standard deviation of one
  – Also known as standard score


Data Transformation: Normalisation
• z-score normalisation: an example

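A minimal sketch of the z-score transform with NumPy (not from the slides):

```python
# A minimal sketch of z-score (standard score) normalisation with NumPy.
import numpy as np

x = np.array([10_000, 20_000, 30_000, 50_000], dtype=float)
z = (x - x.mean()) / x.std()       # x.std() is the population standard deviation
print(z.mean().round(10), z.std()) # ~0.0 and 1.0 after the transform
```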


Data Transformation: Change of Data Type
• Different data mining algorithms may require different types of data
• To transform a numeric value to a categorical value: divide the entire value range of a numeric variable into a certain number of intervals (bins) as appropriate, e.g., variable Average_Price ranging from £100K to £700K can be divided into 3 equal-width intervals (bins)

(Number line: Average_Price from £100K to £700K, split at £300K and £500K into Low Price, Median Price, and High Price.)
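A minimal sketch of this binning with pandas.cut (an assumption; the slides name no tool). The bin edges follow the Average_Price example, with values in £K:

```python
# A minimal sketch: numeric-to-categorical conversion via equal-width bins.
import pandas as pd

price = pd.Series([200, 580, 150, 450])
bins = pd.cut(price, bins=[100, 300, 500, 700],
              labels=["Low_Price", "Median_Price", "High_Price"])
print(bins.tolist())   # ['Low_Price', 'High_Price', 'Low_Price', 'Median_Price']
```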
Data Transformation: Change of Data Type
• To transform a distinct value of a categorical variable: encode each value as a unit vector, also known as one-hot encoding or orthogonal encoding, e.g., variable Gender: F = (1, 0), M = (0, 1); or as a single binary value: F = 0, M = 1

Original Data Table          One-hot encoded                Binary encoded
ID   Gender                  ID   Gender=F  Gender=M        ID   Gender
001  F                       001  1         0          or   001  0
002  M                       002  0         1               002  1
003  M                       003  0         1               003  1
……   …                       ……   ……        ……              ……   …
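A minimal sketch of both encodings with pandas (get_dummies is one of several equivalent encoders; the choice is an assumption):

```python
# A minimal sketch of one-hot and binary encoding of Gender with pandas.
import pandas as pd

df = pd.DataFrame({"ID": ["001", "002", "003"], "Gender": ["F", "M", "M"]})
onehot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)  # Gender_F, Gender_M
binary = df["Gender"].map({"F": 0, "M": 1})                        # single binary column
print(df[["ID"]].join(onehot))
print(binary.tolist())                                             # [0, 1, 1]
```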
Data Transformation: Change of Data Type
• Encode each distinct value of a categorical variable as a unit vector: a further example, with Average_Price binned into Low Price [£100K, £300K), Median Price [£300K, £500K), and High Price [£500K, £700K]
  – Low Price = (1, 0, 0), Median Price = (0, 1, 0), High Price = (0, 0, 1); or
  – Low Price = (0, 0), Median Price = (1, 0), High Price = (0, 1)


Data Transformation: Change of Data Type
• How the original data table is changed

Original Data Table
ID   Average_Price (£K)
001  200
002  580
003  150
004  450
……   …

One-hot encoded
ID   Low_Price   Median_Price   High_Price
001  1           0              0
002  0           0              1
003  1           0              0
004  0           1              0
……   ……          ……             ……

or binary encoded
ID   Price_Value_1   Price_Value_2
001  0               0
002  0               1
003  0               0
004  1               0
……   ……              ……




Data Reduction
• A data set may be too big to work with
• Data reduction
  – Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  – Remove unimportant/irrelevant attributes that contribute nothing to the analysis
  – Feature selection and creation
  – Histograms
  – Clustering
  – Sampling
Data Reduction: Dimensionality Reduction
• Remove/ignore any unimportant/irrelevant attributes
• Feature selection (i.e., attribute subset selection) and feature creation (i.e., attribute aggregation and combination):
  – Select or find a minimum set of attributes (features) that is sufficient for a given data mining task
  – Many techniques available:
    • Pearson's correlation analysis (linear correlation)
    • Spearman's correlation (correlation as a monotonic function)
    • Principal component analysis (PCA)
    • Restricted Boltzmann Machine (RBM) and Autoencoder (a popular deep learning model; G. E. Hinton and R. R. Salakhutdinov, http://www.cs.toronto.edu/~hinton/absps/science.pdf)
    • t-SNE (t-Distributed Stochastic Neighbour Embedding), etc.


Data Reduction: Dimensionality Reduction
• Pearson's and Spearman's correlation analysis: consider using only those variables that have a strong correlation with a particular variable of interest

(Figure: example scatter plots illustrating Pearson and Spearman correlations.)
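A minimal sketch contrasting the two coefficients with SciPy (an assumption; any statistics package would do). Pearson measures linear correlation, Spearman monotonic correlation:

```python
# A minimal sketch: Pearson vs Spearman on a monotonic, non-linear relationship.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                        # monotonic but non-linear in x

r, _ = pearsonr(x, y)             # < 1: the relationship is not perfectly linear
rho, _ = spearmanr(x, y)          # = 1: the relationship is perfectly monotonic
print(round(r, 3), round(rho, 3))
```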


What is Principal Component Analysis (PCA)?
• PCA is an orthogonal transformation which transforms a set of possibly correlated variables into a set of linearly uncorrelated variables, known as principal components

(Figure: two x–y scatter plots. Left: choose the direction along which the data has maximum variance: PC1. Right: choose a second direction, perpendicular to PC1, along which the remaining variance is maximum: PC2.)
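A minimal sketch of PCA with scikit-learn on standardised toy data (the library choice is an assumption; the slides name no tool):

```python
# A minimal sketch: PCA on two correlated variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])  # correlated pair

Z = StandardScaler().fit_transform(X)     # PCA is sensitive to variable scale
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)                 # coordinates in the PC space
print(pca.explained_variance_ratio_)      # most of the variance lies along PC1
```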
PCA: An Example

(Figure: the same data shown with coordinates in the original space and coordinates in the PCA space.)


Distribution of Data in Original Sample Space



Distribution of Data in PC Space



How Many Principal Components to Use?
• The significance of a principal component (also known as an eigenvector) is indicated by its corresponding eigenvalue, which measures the variance of the data along the direction of that eigenvector
• Typically, retain the smallest number of components whose cumulative variance reaches a chosen threshold, e.g., 90% or 99%

(Figure: cumulative variance plot with the 90% and 99% thresholds marked.)
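A minimal sketch of this selection rule with scikit-learn; the 90% threshold and the toy data are illustrative:

```python
# A minimal sketch: keep the smallest k whose cumulative explained variance >= 90%.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((200, 10))       # toy 10-dimensional data
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)    # cumulative variance ratios
k = int(np.searchsorted(cumvar, 0.90)) + 1           # first index reaching 90%, plus 1
print(k, cumvar.round(3))
```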


t-SNE (t-distributed Stochastic Neighbour Embedding)
• A very effective approach for dimensionality reduction, by Laurens van der Maaten and Geoffrey Hinton: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
• Maps (embeds) data from an original space to a two- or three-dimensional space
• How to use t-SNE effectively: https://distill.pub/2016/misread-tsne/
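A minimal sketch of a 2-D embedding with scikit-learn's TSNE (an assumption; the reference implementation is van der Maaten's). Perplexity matters a great deal; see the distill.pub article linked above:

```python
# A minimal sketch: embed 50-dimensional toy data into 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).random((300, 50))   # toy high-dimensional data
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)                                 # (300, 2)
```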
t-SNE for Data Reduction: An Example
Lip-reading:
• 10 speakers, each reading the 26 letters of the English alphabet aloud 3 times
• Videos are converted to image sequences (frames) at 25 frames per second, and each image has 80×60 = 4800 pixels (dimensions)


Dimensionality Reduction for Government Data with Multiple Measures

(Figure: the original measures compared with their low-dimensional representations produced by RBM and PCA.)
Supervised t-SNE for Dimensionality Reduction: MNIST Data Set (28×28 Pixels)



Supervised t-SNE for Dimensionality Reduction:
Chest X-Ray (100×100 Pixels)



Data Reduction: Histograms
• A popular data reduction technique
• Divide data into buckets (bins) and store only the frequency (count) or average for each bucket



Data Reduction: Histograms
• Equal-width: in an equal-width histogram, the width of each bucket range is uniform (such as a width of £10 for each bucket)
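A minimal sketch of an equal-width histogram with NumPy, keeping only the per-bucket counts in place of the raw values:

```python
# A minimal sketch: ten £10-wide buckets over toy prices; store counts only.
import numpy as np

prices = np.random.default_rng(0).uniform(0, 100, size=1_000)  # toy prices in £
counts, edges = np.histogram(prices, bins=10, range=(0, 100))  # fixed £10 widths
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"£{lo:5.1f}–£{hi:5.1f}: {c}")
```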


Data Reduction: Clustering
• Partition the data set into clusters, and store only the cluster representations
• Can be very effective if the data is clustered, but not if the data is "skewed"
• Also known as Vector Quantisation
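A minimal sketch of vector quantisation with scikit-learn's KMeans (an assumption; k = 50 is an illustrative codebook size): the k centroids plus each sample's cluster index stand in for the full data set.

```python
# A minimal sketch: represent 10,000 rows by 50 cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((10_000, 3))      # a large toy data set
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

codebook = km.cluster_centers_     # 50 representatives replace 10,000 rows
labels = km.labels_                # each sample mapped to its nearest centroid
print(codebook.shape)              # (50, 3)
```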


Data Reduction: Sampling
• Choose a representative subset of the data
• The key principle for effective sampling:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same property (of interest) as the original set of data
• To be discussed later when considering predictive modelling
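A minimal sketch of simple random sampling with pandas; the 10% fraction is an illustrative choice:

```python
# A minimal sketch: draw a 10% simple random sample of the rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).random((10_000, 4)),
                  columns=["a", "b", "c", "d"])
sample = df.sample(frac=0.10, random_state=0)   # drawn without replacement
print(len(sample))                              # 1000
```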


Summary
• Data is dirty and complex, and usually cannot be used immediately for analysis
• Data pre-processing involves many different tasks, depending on what data we have and what we want to analyse
• Choosing (an) appropriate approach(es) is important
  – Be flexible
  – Seek consistent results from different approaches/methods
