04 Preprocessing2
Analysis
Dr Daqing Chen
Outline
• k-means clustering for detecting outliers
• Normalisation: min-max (range), z-score (standard score)
• Change data types: Categorical to numeric and numeric
to categorical
• Data reduction: Dimensionality reduction and correlation
analysis
– Pearson’s correlation
– Spearman’s correlation
– PCA
– RBM, t-SNE, etc.
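The two normalisation methods named in the outline can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function names `min_max` and `z_score` are hypothetical, the population standard deviation is an assumed choice, and the three prices are borrowed from the price table later in these slides.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalisation is undefined for constant data")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population rather than sample std
    if sigma == 0:
        raise ValueError("z-score is undefined for constant data")
    return [(v - mu) / sigma for v in values]


prices = [1500, 4500, 5800]
print(min_max(prices))  # smallest price maps to 0, largest to about 1
print(z_score(prices))  # values centred on 0
```

Min-max keeps the shape of the distribution but is sensitive to extreme values, since a single outlier sets the range; z-score is usually preferred when the range is not known in advance.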
Small-sized ( )
Low Price = 0, Median Price = 1, High Price = 0
(0 0 1)

ID     Price
002    5800
003    1500
004    4500
……     …

or

ID     Price_Value_1   Price_Value_2
001    0               0
002    0               1
003    0               0
004    1               0
……     ……              ……
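The conversion from a numeric price to binary indicator columns can be sketched as below. This is only an illustration under stated assumptions: the cut-offs `LOW_MAX` and `MEDIAN_MAX` are hypothetical (the slides do not give them), and the assignment of codes to levels is chosen so that it reproduces the rows for IDs 002 to 004 in the table above.

```python
# Hypothetical cut-offs between price levels; the slides do not state them.
LOW_MAX = 3000
MEDIAN_MAX = 5000

LEVELS = ("Low", "Median", "High")

# Two-column coding with "Low" as the all-zero reference level (an assumption
# that matches the Price_Value_1 / Price_Value_2 rows shown in the table).
TWO_COLUMN_CODES = {"Low": (0, 0), "Median": (1, 0), "High": (0, 1)}


def price_level(price):
    """Numeric to categorical: bin a price into Low / Median / High."""
    if price < LOW_MAX:
        return "Low"
    if price < MEDIAN_MAX:
        return "Median"
    return "High"


def one_hot(level):
    """Categorical to numeric: one indicator per level, exactly one of them set to 1."""
    return tuple(1 if level == name else 0 for name in LEVELS)


def two_column(price):
    """Categorical to numeric using two binary columns instead of three."""
    return TWO_COLUMN_CODES[price_level(price)]


records = {"002": 5800, "003": 1500, "004": 4500}
encoded = {rid: two_column(price) for rid, price in records.items()}
print(encoded)
print(one_hot("Median"))  # Low Price = 0, Median Price = 1, High Price = 0
```

The two-column form drops one indicator because it is implied by the other two, which avoids a redundant, perfectly correlated column.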
• Choose a direction along which the data has maximum variance: PC1
• Choose a 2nd direction, perpendicular to PC1, along which the data has maximum variance: PC2
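For two-dimensional data, the two directions described here can be computed in closed form from the covariance matrix. The sketch below is a hedged illustration only: the function name `pca_2d` is hypothetical, the toy points are made up, and the closed-form angle applies to the 2-D case (higher dimensions would normally use an eigen-decomposition or SVD).

```python
import math


def pca_2d(points):
    """Return (pc1, pc2, var1, var2) for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n

    # Entries of the 2x2 covariance matrix of the centred data.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n

    # Angle of the direction that maximises the projected variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # perpendicular to pc1

    def variance_along(u):
        # Variance of the data projected onto the unit vector u.
        return u[0] ** 2 * sxx + 2 * u[0] * u[1] * sxy + u[1] ** 2 * syy

    return pc1, pc2, variance_along(pc1), variance_along(pc2)


# Made-up points lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
pc1, pc2, var1, var2 = pca_2d(points)
print("PC1:", pc1, "variance:", var1)
print("PC2:", pc2, "variance:", var2)
```

With points this close to a straight line, almost all of the variance falls along PC1, which is why keeping only the first component loses little information.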
07:45:16 PM DMA Lecture 04 16
PCA: An Example
[Figure: panels labelled 99% and 90%]
[Figure: panels labelled Original, RBM, and PCA]
Supervised t-SNE for Dimensionality Reduction: MNIST Data Set (28×28 Pixels)