Data Mining
Data <-> Choose data -> Preprocessing data -> Transforming data
Noisy
Incomplete
Inconsistent
Example:
If one value in a field is missing, fill in the blank with the mean (average) of the whole field
(ex: Pluto – GK: (9+4+8)/3 = 7)
If a field indicates that the data is divided into groups => use only the mean of the values
in that field belonging to the same group (ex: Pluto – GK: 8)
Another way to fill in the blank is to sort the values in ascending order, then use the middle
value, i.e. the median (ex: Pluto – GK: 8)
Solving the missing data problem:
Use a global constant to fill in missing values (NULL, N/A, unknown, Vắng ("absent"), etc.) -> The
sheet will automatically skip the missing values
Use the mean of the attribute's values to fill in the missing values of that attribute
Use the attribute mean for all samples belonging to the same class to fill in the missing values
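The fill-in strategies above (whole-field mean, same-group mean, and median) can be sketched in Python as follows. The table behind the Pluto – GK example is not reproduced in these notes, so the rows, names, and group labels below are hypothetical, chosen only so that the results agree with the worked numbers (7, 8, 8).

```python
from statistics import mean, median

# Hypothetical rows (name, group, score); None marks the missing value.
rows = [
    ("Mercury", "inner", 9),
    ("Venus", "inner", 4),
    ("Neptune", "outer", 8),
    ("Pluto", "outer", None),
]

def known_scores(rows, group=None):
    """Return the non-missing scores, optionally restricted to one group."""
    return [s for _, g, s in rows if s is not None and (group is None or g == group)]

def fill_missing(rows, strategy):
    """Return a copy of rows with each missing score filled by the chosen strategy."""
    filled = []
    for name, group, score in rows:
        if score is None:
            if strategy == "mean":          # mean of the whole field
                score = mean(known_scores(rows))
            elif strategy == "group_mean":  # mean within the same group only
                score = mean(known_scores(rows, group))
            elif strategy == "median":      # middle value after sorting
                score = median(known_scores(rows))
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        filled.append((name, group, score))
    return filled

for strategy in ("mean", "group_mean", "median"):
    print(strategy, fill_missing(rows, strategy)[-1])
```

The group mean only makes sense when the group has at least one known value; a real table would need a fallback for groups that are entirely missing.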
Smoothing by binning: the purpose is to eliminate noise and “smooth out” fluctuations in the data
Ex: Original data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Temperature:
5 | 8
6 | 5 8 9
7 | 0 1 2 3 5 5
8 | 0 1 3 5
ID   Temperature   Bin
7    58            Bin1
6    65            Bin1
5    68            Bin1
9    69            Bin2
4    70            Bin2
10   71            Bin2
8    72            Bin3
12   73            Bin3
11   75            Bin3
14   75            Bin4
2    80            Bin4
13   81            Bin4
3    83            Bin5
1    85            Bin5
ID   Temperature   Bin
7    64            Bin1
6    64            Bin1
5    64            Bin1
9    70            Bin2
4    70            Bin2
10   70            Bin2
8    73            Bin3
12   73            Bin3
11   73            Bin3
14   79            Bin4
2    79            Bin4
13   79            Bin4
3    84            Bin5
1    84            Bin5
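The second temperature table is consistent with smoothing by bin means: the sorted values are split into bins of three, and every value is replaced by its bin's mean rounded to the nearest integer. The sketch below reproduces it under that reading; the function name and the `bin_size` parameter are illustrative choices, not taken from the notes.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into consecutive bins of bin_size items,
    and replace every value with the rounded mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]  # last bin may be shorter
        bin_mean = round(sum(bin_values) / len(bin_values))
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

temperatures = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]
print(smooth_by_bin_means(temperatures, 3))

# The same function applied to the sorted "price" data from earlier.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

With 14 temperature values and bins of three, the last bin holds only two values (83 and 85), which matches Bin5 in the table.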
Humidity:
6 | 5
7 | 0 0 0 5 8
8 | 0 0 0 5
9 | 0 0 5 6
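The digit listings for temperature and humidity read as stem-and-leaf plots: the tens digit is the stem and each units digit is a leaf. A minimal sketch for building such a listing from two-digit integers is shown below; the humidity list is read back from the plot above, and the function name is an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem (tens digit), listing its sorted leaves (units digits)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(groups.items())]

temperatures = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]
humidity = [65, 70, 70, 70, 75, 78, 80, 80, 80, 85, 90, 90, 95, 96]

print("\n".join(stem_and_leaf(temperatures)))
print("\n".join(stem_and_leaf(humidity)))
```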
Data Transformation (Normalization): We rescale the data so that every value x lies in the range 0 to 1
0 ≤ x ≤ 1
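The notes do not name the method used to reach the 0 to 1 range. One common choice is min-max normalization, v' = (v − min) / (max − min), sketched below as an assumption rather than as the method the notes intend. The function name is illustrative.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

temperatures = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]
normalized = min_max_normalize(temperatures)
print([round(x, 2) for x in normalized])
```

Because the result depends on the observed minimum and maximum, a single extreme outlier compresses all other values toward one end of the range.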
Z-score normalization: normalization of v into v’ based on attribute value mean and standard
deviation
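$$v' = \frac{v - \text{Mean}}{\text{Standard Deviation}} = \frac{v - \mu}{\sigma}$$

$$\mu = \text{mean} = \frac{\sum_{i=1}^{n} v_i}{n}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (v_i - \mu)^2}{n - 1}}$$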
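As a small sketch of the z-score formula, the function below uses Python's `statistics.stdev`, which divides by n − 1 like the standard deviation given in these notes. The function name and the sample values are illustrative.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (v - mean) / standard deviation for every v, using the n - 1 standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]

# Mean is 4; squared deviations are 4, 0, 4, so sigma = sqrt(8 / 2) = 2.
print(z_scores([2, 4, 6]))
```

Unlike scaling to the 0 to 1 range, z-scores are not bounded: values below the mean become negative, and the result is undefined when all values are equal (sigma = 0).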
When converting qualitative data to quantitative codes => never compute the mean of the codes
You can only compute percentage shares and present them in charts
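A minimal sketch of the percentage approach for qualitative data: count how often each category occurs and report its share, instead of averaging numeric codes. The category labels below are hypothetical.

```python
from collections import Counter

def percentage_shares(labels):
    """Return the percentage of observations that fall into each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}

# Hypothetical qualitative observations.
outlook = ["sunny", "sunny", "rainy", "overcast"]
print(percentage_shares(outlook))
```

If the same labels were coded as 1, 2, 3, their mean would depend on the arbitrary coding order and carry no meaning, which is why only the shares are reported.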
Data Reduction
Regression Analysis