Data Preprocessing


Lesson Agenda:
a. Definition of data preprocessing
b. Stages of data preprocessing
c. Data cleaning
d. Data transformation

Data preprocessing:
Data preprocessing means transforming raw data into a useful and efficient format.
Data preprocessing is divided into four stages:
1. Data cleaning,
2. Data integration,
3. Data reduction,
4. Data transformation.

Data cleaning

Data cleaning refers to techniques that 'clean' data by removing outliers, replacing missing
values, smoothing noisy data, and correcting inconsistent data. Many techniques are used to
perform each of these tasks, and the choice of technique depends on the user's preference and
the problem at hand. Below, each task is explained in terms of the techniques used to address it.

Missing values

In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.

1. Removing the training example: You can ignore a training example if its output
label is missing (in a classification problem). This is usually discouraged, as it leads
to loss of data: you are also removing attribute values that could add value to the data
set.
2. Filling in the missing value manually: This approach is time-consuming and not
recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be
replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach,
but not foolproof.

4. Using the most probable value to fill in the missing value: Using algorithms like
regression and decision trees, the missing values can be predicted and replaced.
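The replacement strategies above can be sketched in Python. The toy records and field names below are illustrative, not from the notes, and strategy 4 is approximated by a simple mean rather than a regression or decision-tree model:

```python
from statistics import mean

# Toy dataset: None marks a missing value in the 'salary' attribute.
rows = [
    {"id": 1, "salary": 400},
    {"id": 2, "salary": None},
    {"id": 3, "salary": 310},
]

# Strategy 1: remove the example entirely (loses data).
dropped = [r for r in rows if r["salary"] is not None]

# Strategy 3: replace the missing value with a global constant.
constant_filled = [
    {**r, "salary": "Unknown" if r["salary"] is None else r["salary"]}
    for r in rows
]

# Strategy 4 (simplified): fill in the most probable value, here
# approximated by the mean of the observed values.
observed = [r["salary"] for r in rows if r["salary"] is not None]
mean_filled = [
    {**r, "salary": mean(observed) if r["salary"] is None else r["salary"]}
    for r in rows
]
```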

Noisy data

Noise is defined as a random variance in a measured variable. For numeric values, boxplots
and scatter plots can be used to identify outliers. To deal with these anomalous values, data
smoothing techniques are applied, which are described below.

1. Binning: Binning methods smooth a sorted data set by consulting the values around
each value. The sorted values are first divided into 'bins'. There are various approaches
to binning. Two of them are smoothing by bin means, where each value in a bin is
replaced by the mean of the bin's values, and smoothing by bin medians, where each
value is replaced by the median of the bin's values.
2. Regression: Linear regression and multiple linear regression can be used to smooth the
data, where the values are conformed to a function.
3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal
with them.

Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23 (exact mean 22.75)
- Bin 3: 29, 29, 29, 29 (exact mean 29.25)
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

* Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
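The smoothing variants above can be reproduced with a short Python sketch (the function names are mine, not the notes'):

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into equi-depth (equal-size) bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median (even-sized bins)."""
    return [[(b[len(b) // 2 - 1] + b[len(b) // 2]) / 2] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_medians(bins))   # medians 8.5, 22.5, 28.5
```

Note that the exact bin means are 9, 22.75 and 29.25; the notes round the last two to 23 and 29.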

Data Transformation:
# It is a data preprocessing process.
# It transforms or consolidates the data into alternative forms appropriate for mining.

# It involves the following processes:
a. Smoothing: removing noise from the data (e.g., by binning or regression, as above).
b. Aggregation: summary and aggregation operations are applied to the given set of
attributes to come up with new attributes.
c. Generalization: low-level data are replaced by higher-level concepts.
d. Normalization:
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as
-1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization:
- Decimal Scaling
- Min-Max Normalization
- z-Score Normalization (zero-mean Normalization)

# Decimal Scaling: find the normalized value of 400.

Normalized value of attribute: v' = v / 10^j, where j is the smallest integer such that
max(|v'|) < 1.

Value   Formula       Normalized value after decimal scaling
400     400 / 1000    0.4
310     310 / 1000    0.31
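A sketch of decimal scaling in Python (the function name is mine; it assumes integer-valued inputs with a largest absolute value of at least 1):

```python
def decimal_scale(values):
    """Divide each value by 10^j, where j is the smallest integer
    making every normalized value fall strictly below 1 in magnitude.
    j is taken from the digit count of the largest absolute value."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scale([400, 310]))  # [0.4, 0.31]
```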

- Min-Max Normalization:

Formula: v' = (v − min) / (max − min) × (new_max − new_min) + new_min

Min-max range: 12000 to 98000
New range: 0.0 to 1.0
Value to be normalized: 73600

v' = (73600 − 12000) / (98000 − 12000) = 0.716
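The same computation as a small Python helper (the function name is mine):

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v linearly from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```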
- z-Score Normalization (zero-mean Normalization):

Formula: z = (v − μ) / σ, where μ is the mean and σ is the standard deviation of the attribute.

How do we calculate the z-scores of the following data?

8, 10, 15, 20

Here μ = 13.25 and σ ≈ 4.657 (population standard deviation), so the z-scores are
approximately −1.13, −0.70, 0.38 and 1.45.
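A minimal z-score sketch using the standard library (population standard deviation, as is common in this setting); for 8, 10, 15, 20 it yields roughly −1.13, −0.70, 0.38 and 1.45:

```python
from statistics import mean, pstdev

def z_scores(values):
    """z = (v - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [8, 10, 15, 20]   # mean 13.25, population std ~4.657
print([round(z, 2) for z in z_scores(data)])
```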
# Discretization
Discretization is the process through which we can transform continuous variables,
models or functions into a discrete form.

Approaches to Discretization:
- Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means

- Supervised:
— Decision Trees
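Equal-width binning, the simplest unsupervised approach above, can be sketched as follows (the function name and sample data are mine; it assumes the values are not all equal):

```python
def equal_width(values, k):
    """Split the value range into k equal-width intervals and label
    each value with its bin index (0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into bin k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width([4, 8, 15, 21, 28, 34], 3))  # [0, 0, 1, 1, 2, 2]
```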

Measures of Distance in Data Mining


1. Euclidean Distance

2. Manhattan Distance

3. Minkowski distance

[Figure: distance d between two points (x, y) and (a, b)]

Sample No   X     Y
1           185   72
2           170   56

We have to find the distance of (195,45) from Samples 1 and 2.

1. Euclidean Distance:

Distance (d) = √((x − a)^2 + (y − b)^2)

D1 = √((195 − 185)^2 + (45 − 72)^2) = 28.79

D2 = √((195 − 170)^2 + (45 − 56)^2) = 27.31

2. Manhattan Distance:

Distance (d) = |x − a| + |y − b|

D1 = |195 − 185| + |45 − 72| = 37

D2 = |195 − 170| + |45 − 56| = 36
3. Minkowski distance:

Formula: D = (Σ_i |x_i − y_i|^p)^(1/p)

Let p = 3. The distance to each sample is computed separately:

D1 = (|195 − 185|^3 + |45 − 72|^3)^(1/3) = 27.45

D2 = (|195 − 170|^3 + |45 − 56|^3)^(1/3) = 25.69
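All three distances are instances of the Minkowski metric (p = 2 gives Euclidean, p = 1 gives Manhattan), so one helper covers the worked examples; the function name is mine:

```python
def minkowski(p1, p2, p):
    """Minkowski distance between two points: p=2 -> Euclidean, p=1 -> Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

query, s1, s2 = (195, 45), (185, 72), (170, 56)
print(round(minkowski(query, s1, 2), 2))  # Euclidean D1 = 28.79
print(round(minkowski(query, s1, 1), 2))  # Manhattan D1 = 37.0
print(round(minkowski(query, s1, 3), 2))  # Minkowski (p=3) D1 = 27.45
```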


— K-Means
# What is K in K-means clustering?
K-means is also called a flat clustering algorithm. The number of clusters to be identified
from the data by the algorithm is represented by 'K' in K-means.

# When to use K-means clustering?

The K-means clustering algorithm is used to find groups that have not been explicitly
labeled in the data.

Apply K-means clustering to the following data set for 2 (two) clusters.

Sample no   X     Y
1           185   72
2           170   56
3           168   60
4           179   68
5           182   72
6           188   77

Given K = 2
Initial Centroids
Cluster   X     Y
k1        185   72
k2        170   56

Calculate the Euclidean distance using the given equation:

Distance [(x,y), (a,b)] = √((x − a)^2 + (y − b)^2)

Sample 1 (185,72):

Distance from Cluster 1 (185,72) = √((185 − 185)^2 + (72 − 72)^2) = 0

Distance from Cluster 2 (170,56) = √((185 − 170)^2 + (72 − 56)^2) = 21.93

Sample 2 (170,56):

Distance from Cluster 1 (185,72) = √((185 − 170)^2 + (72 − 56)^2) = 21.93

Distance from Cluster 2 (170,56) = √((170 − 170)^2 + (56 − 56)^2) = 0

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(185,72)    0            21.93          1
(170,56)    21.93        0              2

Calculate the Euclidean distance for the next sample (168,60):

Distance from Cluster 1 (185,72) = √((168 − 185)^2 + (60 − 72)^2) = 20.808

Distance from Cluster 2 (170,56) = √((168 − 170)^2 + (60 − 56)^2) = 4.472

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(168,60)    20.808       4.472          2

Update the cluster centroid:

Cluster   X                    Y
K1        185                  72
K2        (170+168)/2 = 169    (56+60)/2 = 58

Calculate the Euclidean distance for the next sample (179,68):

Distance from Cluster 1 (185,72) = √((179 − 185)^2 + (68 − 72)^2) = 7.211

Distance from Cluster 2 (169,58) = √((179 − 169)^2 + (68 − 58)^2) = 14.142

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(179,68)    7.211        14.142         1

Update the cluster centroid:

Cluster   X                    Y
K1        (185+179)/2 = 182    (72+68)/2 = 70
K2        169                  58

Calculate the Euclidean distance for the next sample (182,72):

Distance from Cluster 1 (182,70) = √((182 − 182)^2 + (72 − 70)^2) = 2

Distance from Cluster 2 (169,58) = √((182 − 169)^2 + (72 − 58)^2) = 19.10

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(182,72)    2            19.10          1

Update the cluster centroid:

Cluster   X                    Y
K1        (182+182)/2 = 182    (70+72)/2 = 71
K2        169                  58

Calculate the Euclidean distance for the next sample (188,77):

Distance from Cluster 1 (182,71) = √((188 − 182)^2 + (77 − 71)^2) = 8.485

Distance from Cluster 2 (169,58) = √((188 − 169)^2 + (77 − 58)^2) = 26.87

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(188,77)    8.485        26.87          1

Update the cluster centroid:

Cluster   X                    Y
K1        (182+188)/2 = 185    (71+77)/2 = 74
K2        169                  58

Final Assignment
Sample no   X     Y    Assignment
1           185   72   1
2           170   56   2
3           168   60   2
4           179   68   1
5           182   72   1
6           188   77   1
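The whole walkthrough can be reproduced in a few lines of Python. Note that the worked example uses a sequential (online) variant: each sample is assigned once, and the winning centroid is then averaged with the new point rather than recomputed as the mean of all its members; a standard k-means would instead iterate until assignments stop changing. A sketch under those assumptions:

```python
import math

points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
centroids = [(185, 72), (170, 56)]      # initial centroids: samples 1 and 2
assignment = []

for p in points:
    # Assign the sample to the nearest centroid (Euclidean distance).
    k = min((0, 1), key=lambda i: math.dist(p, centroids[i]))
    assignment.append(k + 1)
    # Running-average update used in the worked example.
    cx, cy = centroids[k]
    centroids[k] = ((cx + p[0]) / 2, (cy + p[1]) / 2)

print(assignment)   # [1, 2, 2, 1, 1, 1]
print(centroids)    # [(185.0, 74.0), (169.0, 58.0)]
```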

To build a decision tree, we can apply the ID3 algorithm. The following steps must be
followed when applying the ID3 algorithm to build a decision tree.
#KNN
