Data Mining

Data Mining ~ Knowledge Discovery
Data <-> Choose data -> Preprocessing data -> Transforming data
Reason why we need to prepare the data:
 Noisy
 Incomplete
 Inconsistent
Data -> Data warehouse / Data Mining -> Decision
Data cleaning attempts to:
 Fill in missing values

 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data
Example:
ID  Name    QT1  GK  CK  Group
1   Mickey  7    9   9   1
2   Donald  5    4   6   1
3   Pluto   9    ?   5   2
4   Goofy   7    8   6   2
(? = missing value)
Record ~ Row; Field ~ Column. Each record has 4 attributes: Name, QT1, GK, CK
 If one value in a field is missing, fill in the blank with the Mean (Average) of the known values in that field
(ex: Pluto – GK: (9+4+8)/3 = 7)
 If a field divides the data into groups => use only the Mean of the values in the same group
(ex: Pluto – GK: 8, because Goofy is the only other record in Group 2)
 Another way to fill in the blank is to sort the known values in ascending order, then use the middle number (the Median)
(ex: Pluto – GK: 4, 8, 9 -> 8)
Solving the missing data problem:
 Use a global constant to fill in missing values (NULL, N/A, unknown, Absent, etc.) -> The spreadsheet
will automatically skip these values in its calculations
 Use the mean of the attribute to fill in the missing values of that attribute
 Use the attribute mean of all samples belonging to the same class to fill in the missing values
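The three fill-in strategies from the student example (field mean, group mean, median) can be sketched as follows. This is only a minimal illustration: the function names `known` and `fill_missing` and the use of `None` to mark the blank are assumptions, not part of the notes.

```python
from statistics import mean, median

# Student table from the example; None marks Pluto's missing GK score.
rows = [
    {"id": 1, "name": "Mickey", "QT1": 7, "GK": 9, "CK": 9, "group": 1},
    {"id": 2, "name": "Donald", "QT1": 5, "GK": 4, "CK": 6, "group": 1},
    {"id": 3, "name": "Pluto", "QT1": 9, "GK": None, "CK": 5, "group": 2},
    {"id": 4, "name": "Goofy", "QT1": 7, "GK": 8, "CK": 6, "group": 2},
]


def known(rows, field, group=None):
    """Return the non-missing values of `field`, optionally within one group."""
    return [
        r[field]
        for r in rows
        if r[field] is not None and (group is None or r["group"] == group)
    ]


def fill_missing(rows, field, strategy):
    """Return a copy of `rows` with missing `field` values filled in.

    strategy is "mean", "group_mean" or "median" (hypothetical names).
    """
    filled = []
    for r in rows:
        r = dict(r)  # copy so the original table is left untouched
        if r[field] is None:
            if strategy == "mean":
                r[field] = mean(known(rows, field))
            elif strategy == "group_mean":
                r[field] = mean(known(rows, field, r["group"]))
            elif strategy == "median":
                r[field] = median(known(rows, field))
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        filled.append(r)
    return filled


for strategy in ("mean", "group_mean", "median"):
    pluto = fill_missing(rows, "GK", strategy)[2]
    print(strategy, pluto["GK"])  # 7, then 8, then 8
```

The results match the values worked out by hand: 7 for the field mean, 8 for the group mean and 8 for the median.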
Smoothing Noisy Data:
 The purpose is to eliminate noise and “smooth out” the data fluctuations
Ex: Original data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
 Binning: Partition into equidepth bins

o Bin1: 4, 8, 15
o Bin2: 21, 21, 24
o Bin3: 25, 28, 34
 Means: each value in a bin is replaced by the mean value of the bin
o Bin1: 9, 9, 9
o Bin2: 22, 22, 22
o Bin3: 29, 29, 29
 Boundaries: min and max values in each bin are identified (boundaries). Each value in a bin is
replaced with the closest boundary value
o Bin1: 4, 4, 15
o Bin2: 21, 21, 24
o Bin3: 25, 25, 34
 Other methods:
o Clustering: Similar values are organized into groups (clusters). Values falling outside of
clusters may be considered “outliers” and may be candidates for elimination.
o Regression: Fit data to a function. Linear regression finds the best line to fit 2 variables.
Multiple regression can handle multiple variables. The values given by the function are
used instead of the original values.
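Smoothing by regression can be sketched as follows, assuming a simple least-squares line through two variables. The data points and the function names `fit_line` and `smooth_by_regression` are invented for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


def smooth_by_regression(xs, ys):
    """Replace each y with the value the fitted line gives at its x."""
    a, b = fit_line(xs, ys)
    return [a + b * x for x in xs]


# Hypothetical noisy data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

smoothed = smooth_by_regression(xs, ys)
print([round(y, 2) for y in smoothed])  # [2.8, 3.4, 4.0, 4.6, 5.2]
```

The fitted values lie on one straight line, so the fluctuations in the original y values are gone while the overall trend is kept.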
Temperature (stem | leaves):
5 | 8
6 | 5 8 9
7 | 0 1 2 3 5 5
8 | 0 1 3 5
ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5
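After smoothing by bin means (rounded to the nearest integer):
ID  Temperature  Bin
7   64           Bin1
6   64           Bin1
5   64           Bin1
9   70           Bin2
4   70           Bin2
10  70           Bin2
8   73           Bin3
12  73           Bin3
11  73           Bin3
14  79           Bin4
2   79           Bin4
13  79           Bin4
3   84           Bin5
1   84           Bin5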
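The binning procedure can be sketched as follows and checked against both the “price” example and the temperature table. The function names are hypothetical, and two choices are assumptions: means are rounded to the nearest integer, and a value exactly halfway between the two boundaries goes to the lower one.

```python
def equidepth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items.

    The last bin is shorter when len(values) is not a multiple of depth.
    """
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
price_bins = equidepth_bins(prices, 3)
print(smooth_by_means(price_bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(price_bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]

temperatures = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]
temp_bins = equidepth_bins(temperatures, 3)
print(smooth_by_means(temp_bins))
# [[64, 64, 64], [70, 70, 70], [73, 73, 73], [79, 79, 79], [84, 84]]
```

With 14 temperature values and a depth of 3, the fifth bin holds only the two remaining values, as in the table.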
Humidity (stem | leaves):
6 | 5
7 | 0 0 0 5 8
8 | 0 0 0 5
9 | 0 0 5 6
Data Transformation (Normalization): We rescale the data to values ranging from 0 to 1
Ex: 65%  75%  96%
     0    x    1
x = (75-65) / (96-65) ≈ 0.32

Ex: 60%  75%  100%
     0    x    1
x = (75-60) / (100-60) = 0.375
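The two worked examples can be checked with a short sketch. The function name `min_max` and its parameter names are assumptions made for illustration.

```python
def min_max(v, lo, hi):
    """Map v linearly from the range [lo, hi] onto [0, 1]."""
    if hi == lo:
        raise ValueError("min and max must differ")
    return (v - lo) / (hi - lo)


print(round(min_max(75, 65, 96), 2))  # 0.32
print(min_max(75, 60, 100))           # 0.375
```

The minimum always maps to 0 and the maximum to 1, so the same value (75%) gets a different normalized value when the range changes.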
Data Transformation: Normalization (quantitative data)
 Min-Max normalization: linear transformation from v to v’
$$ x_1' = \frac{x_1 - \min x_1}{\max x_1 - \min x_1} $$
(here x_1 is the original value v and x_1' is the normalized value v’)
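 Z-score normalization: normalization of v into v’ based on attribute value mean and standard
deviation
$$ v' = \frac{v - \text{Mean}}{\text{Standard Deviation}} = \frac{v - \mu}{\sigma} $$
$$ \mu = \text{mean} = \frac{\sum v}{n} $$
$$ \sigma = \sqrt{\frac{\sum (v_i - \mu)^2}{n - 1}} $$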
 Normalization by decimal scaling
o Moves the decimal point of v by j positions, where j is the minimum number of
positions such that the absolute maximum falls in [0, 1]
$$ v' = \frac{v}{10^j} $$
Ex: if v in [-56, 9976] and j = 4 -> v’ in [-0.0056, 0.9976]
ID  Gender  Age   Salary
1   0       0.00  0.00
2   1       0.96  0.56
3   1       1.00  1.00
4   0       0.24  0.44
5   1       0.72  0.32
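Z-score normalization and decimal scaling can be sketched as follows. The function names and the input list for the z-score are assumptions. `statistics.stdev` divides by n - 1, as in the formula for σ, and the decimal scaling follows the [0, 1] range stated in the notes.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (v - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide every value by 10**j.

    j is the smallest non-negative integer such that the largest
    absolute value, divided by 10**j, is at most 1.
    """
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j > 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Hypothetical values (not from the notes): mean 4, standard deviation 2.
print(z_scores([2, 4, 6]))           # [-1.0, 0.0, 1.0]

# Range from the decimal scaling example.
print(decimal_scaling([-56, 9976]))  # (4, [-0.0056, 0.9976])
```

After z-score normalization the values are centered on 0. After decimal scaling they keep their sign and relative size, only the decimal point moves.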
Data Transformation: Discretization (qualitative data)

 3 types of attributes:
o Nominal: values from an unordered set (also “categorical” attributes)
o Ordinal: values from an ordered set
o Numeric/Continuous: real numbers (but sometimes also integer values)
When converting qualitative data to quantitative data => never calculate the mean
You can only calculate percentages and present them in charts
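The point about percentages can be sketched as follows: for a nominal attribute we count how often each value occurs instead of averaging. The sample values and the function name `percentages` are invented for illustration.

```python
from collections import Counter


def percentages(values):
    """Return the share of each category, in percent."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical nominal attribute (not from the notes).
colors = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]
print(percentages(colors))  # {'red': 50.0, 'blue': 37.5, 'green': 12.5}
```

These percentages are what a bar chart or pie chart of the attribute would show. A mean of the category codes would have no meaning.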
Data Reduction
 Data is often too large; reducing data can improve performance

 Data reduction consists of reducing the representation of the data set while producing the same
(or almost the same) results
 Data reduction includes:
o Data cube aggregation
o Dimensionality reduction
o Discretization
o Numerosity reduction
 Regression
 Histogram
 Clustering
 Sampling
Regression Analysis
