
Data Mining ~ Knowledge Discovery

Data <-> Choose data -> Preprocessing data -> Transforming data

Reasons why the data needs to be prepared. Raw data is often:

 Noisy
 Incomplete
 Inconsistent

Data -> Data warehouse / Data Mining -> Decision

Data cleaning attempts to:

 Fill in missing values
 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data

Example:

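ID | Name   | QT1 | GK | CK | Group
1  | Mickey | 7   | 9  | 9  | 1
2  | Donald | 5   | 4  | 6  | 1
3  | Pluto  | 9   | -  | 5  | 2
4  | Goofy  | 7   | 8  | 6  | 2

Each row is a record and each column is a field (attribute). The 4 attributes here are Name, QT1, GK and CK. Pluto's GK value (shown as -) is missing.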

 If a value in a field is missing, fill in the blank with the mean (average) of the known values of that field (ex: Pluto – GK: (9+4+8)/3 = 7)
 If a field indicates that the data is divided into groups, use only the mean of the values belonging to the same group (ex: Pluto is in Group 2, whose only known GK is Goofy's 8, so Pluto – GK: 8)
 Another way to fill in the blank is to sort the known values in ascending order and use the middle value, i.e. the median (ex: 4, 8, 9, so Pluto – GK: 8)

Solving the missing data problem:

 Use a global constant to fill in missing values (NULL, N/A, unknown, absent, etc.) -> the spreadsheet will automatically skip the missing values
 Use the mean of the attribute to fill in the missing values of that attribute
 Use the attribute mean of all samples belonging to the same class to fill in the missing values
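
The fill-in strategies above can be sketched in Python on the example table. This is only a minimal illustration: the `rows` layout and the helper names `known_values`, `fill_with_mean`, `fill_with_group_mean` and `fill_with_median` are assumptions made for this sketch, and `None` stands for Pluto's missing GK score.

```python
from statistics import mean, median

# Example table from the notes; None marks Pluto's missing GK score.
rows = [
    {"id": 1, "name": "Mickey", "QT1": 7, "GK": 9, "CK": 9, "group": 1},
    {"id": 2, "name": "Donald", "QT1": 5, "GK": 4, "CK": 6, "group": 1},
    {"id": 3, "name": "Pluto", "QT1": 9, "GK": None, "CK": 5, "group": 2},
    {"id": 4, "name": "Goofy", "QT1": 7, "GK": 8, "CK": 6, "group": 2},
]


def known_values(rows, field, group=None):
    """Collect the non-missing values of a field, optionally within one group."""
    return [
        r[field]
        for r in rows
        if r[field] is not None and (group is None or r["group"] == group)
    ]


def fill_with_mean(rows, field):
    """Fill missing values with the mean of the whole field."""
    fill = mean(known_values(rows, field))
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def fill_with_group_mean(rows, field):
    """Fill missing values with the mean of the record's own group (class)."""
    filled = []
    for r in rows:
        value = r[field]
        if value is None:
            # Raises StatisticsError if the group has no known value at all.
            value = mean(known_values(rows, field, r["group"]))
        filled.append({**r, field: value})
    return filled


def fill_with_median(rows, field):
    """Fill missing values with the median of the whole field."""
    fill = median(known_values(rows, field))
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


print(fill_with_mean(rows, "GK")[2]["GK"])        # 7
print(fill_with_group_mean(rows, "GK")[2]["GK"])  # 8
print(fill_with_median(rows, "GK")[2]["GK"])      # 8
```

Each helper returns a new list and leaves `rows` untouched, so the three strategies can be compared side by side on the same input.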

Smoothing Noisy Data:

 The purpose is to eliminate noise and “smooth out” the data fluctuations

Ex: Original data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

 Binning: Partition into equidepth bins (each bin holds the same number of values, here 3)
o Bin1: 4, 8, 15
o Bin2: 21, 21, 24
o Bin3: 25, 28, 34
 Means: each value in a bin is replaced by the mean value of the bin
o Bin1: 9, 9, 9
o Bin2: 22, 22, 22
o Bin3: 29, 29, 29
 Boundaries: min and max values in each bin are identified (boundaries). Each value in a bin is
replaced with the closest boundary value
o Bin1: 4, 4, 15
o Bin2: 21, 21, 24
o Bin3: 25, 25, 34
 Other methods:
o Clustering: Similar values are organized into groups (clusters). Values falling outside of
clusters may be considered “outliers” and may be candidates for elimination.
o Regression: Fit data to a function. Linear regression finds the best line to fit 2 variables.
Multiple regression can handle multiple variables. The values given by the function are
used instead of the original values.
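
The binning steps above (equidepth partitioning, smoothing by means, smoothing by boundaries) can be sketched as follows. The function names are assumptions made for this illustration. Two choices are also assumptions, because the notes do not state them: bin means are rounded to whole numbers, and a value exactly halfway between the two boundaries goes to the lower one.

```python
def equidepth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The printed results match the Means and Boundaries examples for the “price” data.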

Temperature (stem-and-leaf plot: the stem is the tens digit, each leaf is a units digit):

5 | 8
6 | 5 8 9
7 | 0 1 2 3 5 5
8 | 0 1 3 5

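Temperatures sorted and partitioned into equidepth bins (3 values per bin), then smoothed by bin means:

ID | Temperature | Bin  | Smoothed (bin mean)
7  | 58          | Bin1 | 64
6  | 65          | Bin1 | 64
5  | 68          | Bin1 | 64
9  | 69          | Bin2 | 70
4  | 70          | Bin2 | 70
10 | 71          | Bin2 | 70
8  | 72          | Bin3 | 73
12 | 73          | Bin3 | 73
11 | 75          | Bin3 | 73
14 | 75          | Bin4 | 79
2  | 80          | Bin4 | 79
13 | 81          | Bin4 | 79
3  | 83          | Bin5 | 84
1  | 85          | Bin5 | 84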
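
The smoothed temperatures above are the bin means rounded to the nearest whole number. As a check, the arithmetic for each bin is written out below (the rounding to integers is inferred from the values in the notes):

```latex
\begin{aligned}
\text{Bin1: } \frac{58 + 65 + 68}{3} &= \frac{191}{3} \approx 63.67 \rightarrow 64 \\
\text{Bin2: } \frac{69 + 70 + 71}{3} &= \frac{210}{3} = 70 \\
\text{Bin3: } \frac{72 + 73 + 75}{3} &= \frac{220}{3} \approx 73.33 \rightarrow 73 \\
\text{Bin4: } \frac{75 + 80 + 81}{3} &= \frac{236}{3} \approx 78.67 \rightarrow 79 \\
\text{Bin5: } \frac{83 + 85}{2} &= \frac{168}{2} = 84
\end{aligned}
```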

Humidity (stem-and-leaf plot):

6 | 5
7 | 0 0 0 5 8
8 | 0 0 0 5
9 | 0 0 5 6

Data Transformation (Normalization): rescale the values of an attribute so that they range from 0 to 1

Ex: minimum 65%, value 75%, maximum 96%, mapped to 0, x, 1

x = (75-65) / (96-65) ≈ 0.32

Ex: minimum 60%, value 75%, maximum 100%, mapped to 0, x, 1

x = (75-60) / (100-60) = 0.375
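
The two worked examples can be reproduced with a small helper. This is a sketch only: the name `min_max` and its parameters are assumptions, and the guard against a zero range is an addition the notes do not discuss.

```python
def min_max(value, lowest, highest):
    """Map `value` linearly from [lowest, highest] onto [0, 1]."""
    if highest == lowest:
        # All values identical: the range is zero and the ratio is undefined.
        raise ValueError("min-max normalization needs highest != lowest")
    return (value - lowest) / (highest - lowest)


print(round(min_max(75, 65, 96), 2))  # 0.32
print(min_max(75, 60, 100))           # 0.375
```

The minimum always maps to 0 and the maximum to 1, so the result depends on the extremes of the data: the same 75% becomes 0.32 in one example and 0.375 in the other.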

Data Transformation: Normalization (quantitative data)

 Min-Max normalization: linear transformation from v to v’, written here for an attribute x_1

$$x'_1 = \frac{x_1 - \min x_1}{\max x_1 - \min x_1}$$

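 Z-score normalization: normalization of v into v’ based on the mean and standard deviation of the attribute values

$$v' = \frac{v - \text{Mean}}{\text{Standard Deviation}} = \frac{v - \mu}{\sigma}$$

$$\mu = \text{mean} = \frac{\sum_{i=1}^{n} v_i}{n}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (v_i - \mu)^2}{n - 1}}$$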

 Normalization by decimal scaling
o Moves the decimal point of v by j positions, where j is the minimum number of positions needed so that the absolute maximum falls in [0, 1]

$$v' = \frac{v}{10^j}$$

Ex: if v in [-56, 9976] and j = 4 -> v’ in [-0.0056, 0.9976]
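
Example of a table after normalization (Age and Salary take values between 0 and 1):

ID | Gender | Age  | Salary
1  | 0      | 0.00 | 0.00
2  | 1      | 0.96 | 0.56
3  | 1      | 1.00 | 1.00
4  | 0      | 0.24 | 0.44
5  | 1      | 0.72 | 0.32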
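
Z-score normalization and decimal scaling can be sketched in the same way. The function names are assumptions made for this illustration. `z_scores` uses the sample standard deviation with n - 1 in the denominator, as in the formula in these notes, and `decimal_scaling` assumes the common textbook convention that j is the smallest integer for which the largest absolute value becomes strictly less than 1. The input list `[2, 4, 6]` is a made-up example.

```python
from math import sqrt


def z_scores(values):
    """Z-score normalization using the sample standard deviation (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max|v'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


print(z_scores([2, 4, 6]))           # [-1.0, 0.0, 1.0]
print(decimal_scaling([-56, 9976]))  # (4, [-0.0056, 0.9976])
```

The second call reproduces the decimal scaling example from the notes: the largest absolute value is 9976, so j = 4.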

Data Transformation: Discretization (qualitative data)

 3 types of attributes:
o Nominal: values from an unordered set (also “categorical” attributes)
o Ordinal: values from an ordered set
o Numeric/Continuous: real numbers (but sometimes also integer values)

When qualitative values are converted into numeric codes, never compute their mean.

Such data can only be broken down into percentage shares and shown in charts.
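
A short sketch of the percentage breakdown for a coded qualitative attribute. The helper name `percentage_shares` is an assumption, and the gender codes are the five values 0, 1, 1, 0, 1 from the Gender column of the normalized table in these notes. The codes are labels, so their mean (0.6 here) is not a meaningful gender value, while the shares are.

```python
from collections import Counter


def percentage_shares(labels):
    """Return each category's share of the records, in percent."""
    if not labels:
        raise ValueError("no records to summarize")
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# 0 and 1 are category labels, not amounts.
genders = [0, 1, 1, 0, 1]
print(percentage_shares(genders))  # {0: 40.0, 1: 60.0}
```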
Data Reduction

 Data sets are often too large to process efficiently; reducing the data can improve performance
 Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
 Data reduction includes:
o Data cube aggregation
o Dimensionality reduction
o Discretization
o Numerosity reduction
 Regression
 Histogram
 Clustering
 Sampling
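
Two of the numerosity reduction techniques listed above, histograms and sampling, can be illustrated on the “price” data from the smoothing example. This is a sketch under stated assumptions: the function names, the bucket width of 10, the sample size of 3 and the fixed seed are all choices made for the illustration, not taken from the notes.

```python
import random


def equal_width_histogram(values, width):
    """Count values per bucket [k*width, (k+1)*width); returns {bucket_start: count}."""
    counts = {}
    for v in values:
        start = (v // width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


def simple_random_sample(values, k, seed=0):
    """Draw k values without replacement; the seed keeps the sketch repeatable."""
    return random.Random(seed).sample(values, k)


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_histogram(prices, 10))     # {0: 2, 10: 1, 20: 5, 30: 1}
print(len(simple_random_sample(prices, 3)))  # 3
```

The histogram replaces nine prices with four bucket counts, and the sample keeps three of the nine values; both are smaller representations of the same data.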

Regression Analysis
