
Data Mining and Analysis
Data & Data Pre-processing for Data Mining (Part 2)
Dr Daqing Chen
Outline
• k-means clustering for detecting outliers
• Normalisation: min-max (range), z-score (standard score)
• Change of data type: categorical to numeric and numeric to categorical
• Data reduction: dimensionality reduction and correlation analysis
  – Pearson's correlation
  – Spearman's correlation
  – PCA
  – RBM, t-SNE, etc.


How to Use k-means Clustering for Detecting Outliers
• k-means clustering can be used for
  – Grouping samples based on similarities (e.g., Euclidean distance, one of the most widely used measures) and
  – Identifying anomalies (outliers) based on dissimilarities
• Usually, a small-sized cluster corresponds to outliers

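A minimal sketch of this idea in Python, assuming scikit-learn and NumPy are available; the cluster count k = 5 and the 5% size threshold are illustrative choices, not from the slides:

```python
# A minimal sketch (not from the slides): use k-means cluster sizes to flag
# outlier candidates. k=5 and the 5% size threshold are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))                 # toy data; replace with your samples
X = np.vstack([X, [[5.0, 5.0]]])         # one obvious anomaly far from the rest

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)          # number of samples in each cluster

small = np.flatnonzero(sizes < 0.05 * len(X))   # clusters holding < 5% of the data
outliers = np.isin(km.labels_, small)           # boolean mask over all samples
print(X[outliers])                              # likely just the injected anomaly
```

Any sample falling in a flagged cluster is only a candidate outlier; it should still be inspected before removal.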


Data Transformation: Normalisation
• Min-max normalisation
  – An original value $v$ of a variable $A$ is transformed to a new normalised value $v'$ as
    $v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
  – $[\min_A, \max_A]$: the value range of the original variable; usually this is known in a given data set
  – $[\text{new\_min}_A, \text{new\_max}_A]$: the value range of the new normalised variable, determined by the analyst; typically $[0, 1]$ or $[-1, 1]$
  – Also known as range normalisation


Data Transformation: Normalisation
• Min-max normalisation: an example
  – Consider variable Annual_Salary ranging from £10,000 to £50,000, to be normalised to [0, 1]
  – An original value of £20,000 is transformed to $(20{,}000 - 10{,}000)/(50{,}000 - 10{,}000) = 0.25$
  – Likewise, £10,000 → 0, £30,000 → 0.50, and £50,000 → 1
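A minimal sketch of the formula above in NumPy, using the slide's Annual_Salary range mapped to [0, 1]:

```python
# A minimal sketch of min-max (range) normalisation with NumPy.
import numpy as np

salary = np.array([10_000, 20_000, 30_000, 50_000], dtype=float)
v_min, v_max = salary.min(), salary.max()        # 10,000 and 50,000
new_min, new_max = 0.0, 1.0                      # target range chosen by the analyst

normalised = (salary - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalised)                                # [0.   0.25 0.5  1.  ]
```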


Data Transformation: Normalisation
• z-score normalisation
  – An original value $v$ of a variable $A$ is transformed to a new normalised value $v'$ as
    $v' = \frac{v - \mu_A}{\sigma_A}$
  – $\mu_A$ and $\sigma_A$: the mean (average) and the standard deviation of variable $A$
  – Transforms all variables to a common scale with a mean of zero and a standard deviation of one
  – Also known as standard score


Data Transformation: Normalisation
• z-score normalisation: an example

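A minimal sketch of the z-score transform with NumPy (not from the slides):

```python
# A minimal sketch of z-score (standard score) normalisation with NumPy.
import numpy as np

x = np.array([10_000, 20_000, 30_000, 50_000], dtype=float)
z = (x - x.mean()) / x.std()       # x.std() is the population standard deviation
print(z.mean().round(10), z.std()) # ~0.0 and 1.0 after the transform
```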


Data Transformation: Change of Data Type
• Different data mining algorithms may require different types of data
• To transform a numeric value to a categorical value: divide the entire value range of a numeric variable into a certain number of intervals (bins) as appropriate, e.g., variable Average_Price ranging from £100K to £700K can be divided into 3 equal-width intervals (bins)

(Number line: Average_Price from £100K to £700K, split at £300K and £500K into Low Price, Median Price, and High Price.)
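A minimal sketch of this binning with pandas.cut (an assumption; the slides name no tool). The bin edges follow the Average_Price example, with values in £K:

```python
# A minimal sketch: numeric-to-categorical conversion via equal-width bins.
import pandas as pd

price = pd.Series([200, 580, 150, 450])
bins = pd.cut(price, bins=[100, 300, 500, 700],
              labels=["Low_Price", "Median_Price", "High_Price"])
print(bins.tolist())   # ['Low_Price', 'High_Price', 'Low_Price', 'Median_Price']
```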
Data Transformation: Change of Data Type
• To transform a distinct value of a categorical variable: encode each value as a unit vector, also known as one-hot encoding or orthogonal encoding, e.g., variable Gender: F = (1, 0), M = (0, 1); or as a single binary value: F = 0, M = 1

Original Data Table          One-hot encoded                Binary encoded
ID   Gender                  ID   Gender=F  Gender=M        ID   Gender
001  F                       001  1         0          or   001  0
002  M                       002  0         1               002  1
003  M                       003  0         1               003  1
……   …                       ……   ……        ……              ……   …
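A minimal sketch of both encodings with pandas (get_dummies is one of several equivalent encoders; the choice is an assumption):

```python
# A minimal sketch of one-hot and binary encoding of Gender with pandas.
import pandas as pd

df = pd.DataFrame({"ID": ["001", "002", "003"], "Gender": ["F", "M", "M"]})
onehot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)  # Gender_F, Gender_M
binary = df["Gender"].map({"F": 0, "M": 1})                        # single binary column
print(df[["ID"]].join(onehot))
print(binary.tolist())                                             # [0, 1, 1]
```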
Data Transformation: Change of Data Type
• Encode each distinct value of a categorical variable as a unit vector: a further example, with Average_Price binned into Low Price [£100K, £300K), Median Price [£300K, £500K), and High Price [£500K, £700K]
  – Low Price = (1, 0, 0), Median Price = (0, 1, 0), High Price = (0, 0, 1); or
  – Low Price = (0, 0), Median Price = (1, 0), High Price = (0, 1)


Data Transformation: Change of Data Type
• How the original data table is changed

Original Data Table
ID   Average_Price (£K)
001  200
002  580
003  150
004  450
……   …

One-hot encoded
ID   Low_Price   Median_Price   High_Price
001  1           0              0
002  0           0              1
003  1           0              0
004  0           1              0
……   ……          ……             ……

or binary encoded
ID   Price_Value_1   Price_Value_2
001  0               0
002  0               1
003  0               0
004  1               0
……   ……              ……




Data Reduction
• A data set may be too big to work with
• Data reduction
  – Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  – Remove unimportant/irrelevant attributes that contribute nothing to the analysis
  – Feature selection and creation
  – Histograms
  – Clustering
  – Sampling
Data Reduction: Dimensionality Reduction
• Remove/ignore any unimportant/irrelevant attributes
• Feature selection (i.e., attribute subset selection) and feature creation (i.e., attribute aggregation and combination):
  – Select or find a minimum set of attributes (features) that is sufficient for a given data mining task
  – Many techniques available:
    • Pearson's correlation analysis (linear correlation)
    • Spearman's correlation (correlation as a monotonic function)
    • Principal component analysis (PCA)
    • Restricted Boltzmann Machine (RBM) and Autoencoder (a popular deep learning model; G. E. Hinton and R. R. Salakhutdinov, http://www.cs.toronto.edu/~hinton/absps/science.pdf)
    • t-SNE (t-Distributed Stochastic Neighbour Embedding), etc.


Data Reduction: Dimensionality Reduction
• Pearson's and Spearman's correlation analysis: consider using only those variables that have a strong correlation with a particular variable of interest

(Figure: example scatter plots illustrating Pearson and Spearman correlations.)
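A minimal sketch contrasting the two coefficients with SciPy (an assumption; any statistics package would do). Pearson measures linear correlation, Spearman monotonic correlation:

```python
# A minimal sketch: Pearson vs Spearman on a monotonic, non-linear relationship.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                        # monotonic but non-linear in x

r, _ = pearsonr(x, y)             # < 1: the relationship is not perfectly linear
rho, _ = spearmanr(x, y)          # = 1: the relationship is perfectly monotonic
print(round(r, 3), round(rho, 3))
```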


What is Principal Component Analysis (PCA)?
• PCA is an orthogonal transformation which transforms a set of possibly correlated variables into a set of linearly uncorrelated variables, known as principal components

(Figure: two x–y scatter plots. Left: choose the direction along which the data has maximum variance: PC1. Right: choose a second direction, perpendicular to PC1, along which the remaining variance is maximum: PC2.)
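A minimal sketch of PCA with scikit-learn on standardised toy data (the library choice is an assumption; the slides name no tool):

```python
# A minimal sketch: PCA on two correlated variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])  # correlated pair

Z = StandardScaler().fit_transform(X)     # PCA is sensitive to variable scale
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)                 # coordinates in the PC space
print(pca.explained_variance_ratio_)      # most of the variance lies along PC1
```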
PCA: An Example

(Figure: the same data shown with coordinates in the original space and coordinates in the PCA space.)


Distribution of Data in Original Sample Space



Distribution of Data in PC Space



How Many Principal Components to Use?
• The significance of a principal component (also known as an eigenvector) is indicated by its corresponding eigenvalue, which measures the variance of the data along the direction of that eigenvector
• Typically, retain the smallest number of components whose cumulative variance reaches a chosen threshold, e.g., 90% or 99%

(Figure: cumulative variance plot with the 90% and 99% thresholds marked.)
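A minimal sketch of this selection rule with scikit-learn; the 90% threshold and the toy data are illustrative:

```python
# A minimal sketch: keep the smallest k whose cumulative explained variance >= 90%.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((200, 10))       # toy 10-dimensional data
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)    # cumulative variance ratios
k = int(np.searchsorted(cumvar, 0.90)) + 1           # first index reaching 90%, plus 1
print(k, cumvar.round(3))
```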


t-SNE (t-distributed Stochastic Neighbour Embedding)
• A very effective approach for dimensionality reduction, by Laurens van der Maaten and Geoffrey Hinton: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
• Maps (embeds) data from an original space to a two- or three-dimensional space
• How to use t-SNE effectively: https://distill.pub/2016/misread-tsne/
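A minimal sketch of a 2-D embedding with scikit-learn's TSNE (an assumption; the reference implementation is van der Maaten's). Perplexity matters a great deal; see the distill.pub article linked above:

```python
# A minimal sketch: embed 50-dimensional toy data into 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).random((300, 50))   # toy high-dimensional data
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)                                 # (300, 2)
```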
t-SNE for Data Reduction: An Example
Lip-reading:
• 10 speakers, each reading the 26 letters of the English alphabet aloud 3 times
• Videos are converted to image sequences (frames) at 25 frames per second, and each image has 80×60 = 4800 pixels (dimensions)


Dimensionality Reduction for Government Data with Multiple Measures

(Figure: the original measures compared with their low-dimensional representations produced by RBM and PCA.)
Supervised t-SNE for Dimensionality Reduction: MNIST Data Set (28×28 Pixels)



Supervised t-SNE for Dimensionality Reduction:
Chest X-Ray (100×100 Pixels)



Data Reduction: Histograms
• A popular data reduction technique
• Divide data into buckets (bins) and store only the frequency (count) or average for each bucket



Data Reduction: Histograms
• Equal-width: in an equal-width histogram, the width of each bucket range is uniform (such as a width of £10 for each bucket)
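A minimal sketch of an equal-width histogram with NumPy, keeping only the per-bucket counts in place of the raw values:

```python
# A minimal sketch: ten £10-wide buckets over toy prices; store counts only.
import numpy as np

prices = np.random.default_rng(0).uniform(0, 100, size=1_000)  # toy prices in £
counts, edges = np.histogram(prices, bins=10, range=(0, 100))  # fixed £10 widths
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"£{lo:5.1f}–£{hi:5.1f}: {c}")
```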


Data Reduction: Clustering
• Partition the data set into clusters, and store only the cluster representations
• Can be very effective if the data is clustered, but not if the data is "skewed"
• Also known as Vector Quantisation
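A minimal sketch of vector quantisation with scikit-learn's KMeans (an assumption; k = 50 is an illustrative codebook size): the k centroids plus each sample's cluster index stand in for the full data set.

```python
# A minimal sketch: represent 10,000 rows by 50 cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((10_000, 3))      # a large toy data set
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

codebook = km.cluster_centers_     # 50 representatives replace 10,000 rows
labels = km.labels_                # each sample mapped to its nearest centroid
print(codebook.shape)              # (50, 3)
```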


Data Reduction: Sampling
• Choose a representative subset of the data
• The key principle for effective sampling:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same property (of interest) as the original set of data
• To be discussed later when considering predictive modelling
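A minimal sketch of simple random sampling with pandas; the 10% fraction is an illustrative choice:

```python
# A minimal sketch: draw a 10% simple random sample of the rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).random((10_000, 4)),
                  columns=["a", "b", "c", "d"])
sample = df.sample(frac=0.10, random_state=0)   # drawn without replacement
print(len(sample))                              # 1000
```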


Summary
• Data is dirty and complex, and usually cannot be used immediately for analysis
• Data pre-processing involves many different tasks, depending on what data we have and what we want to analyse
• Choosing (an) appropriate approach(es) is important
  – Be flexible
  – Seek consistent results from different approaches/methods
