Data Preprocessing


Lesson Agenda:
a. Definition of data preprocessing
b. Stages of data preprocessing
c. Data cleaning
d. Data transformation

Data preprocessing:
Data preprocessing means transforming raw data into a useful and efficient format.
Data preprocessing is divided into four stages:
1. Data cleaning,
2. Data integration,
3. Data reduction,
4. Data transformation.

Data cleaning

Data cleaning refers to techniques that 'clean' data by removing outliers, replacing missing
values, smoothing noisy data, and correcting inconsistent data. Many techniques are used to
perform each of these tasks, and the choice of technique depends on the user's preference and
the problem at hand. Below, each task is explained in terms of the techniques used to address it.

Missing values

In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.

1. Removing the training example: You can ignore a training example if its output
label is missing (in a classification problem). This is usually discouraged, as it leads
to loss of data: you are also removing attribute values that could add value to the data
set.
2. Filling in the missing value manually: This approach is time-consuming and not
recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be
replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach,
but not foolproof.

4. Using the most probable value to fill in the missing value: Using algorithms like
regression and decision trees, the missing values can be predicted and replaced.
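The replacement strategies above can be sketched in Python. The toy records and field names below are illustrative, not from the notes, and strategy 4 is approximated by a simple mean rather than a regression or decision-tree model:

```python
from statistics import mean

# Toy dataset: None marks a missing value in the 'salary' attribute.
rows = [
    {"id": 1, "salary": 400},
    {"id": 2, "salary": None},
    {"id": 3, "salary": 310},
]

# Strategy 1: remove the example entirely (loses data).
dropped = [r for r in rows if r["salary"] is not None]

# Strategy 3: replace the missing value with a global constant.
constant_filled = [
    {**r, "salary": "Unknown" if r["salary"] is None else r["salary"]}
    for r in rows
]

# Strategy 4 (simplified): fill in the most probable value, here
# approximated by the mean of the observed values.
observed = [r["salary"] for r in rows if r["salary"] is not None]
mean_filled = [
    {**r, "salary": mean(observed) if r["salary"] is None else r["salary"]}
    for r in rows
]
```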

Noisy data

Noise is defined as a random variance in a measured variable. For numeric values, boxplots
and scatter plots can be used to identify outliers. To deal with these anomalous values, data
smoothing techniques are applied, which are described below.

1. Binning: Binning methods smooth a sorted data set by consulting the values around
each value. The sorted values are first divided into 'bins'. There are various approaches
to binning. Two of them are smoothing by bin means, where each value in a bin is
replaced by the mean of the bin's values, and smoothing by bin medians, where each
value is replaced by the median of the bin's values.
2. Regression: Linear regression and multiple linear regression can be used to smooth the
data, where the values are conformed to a function.
3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal
with them.

Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23 (exact mean 22.75)
- Bin 3: 29, 29, 29, 29 (exact mean 29.25)
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

* Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
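The smoothing variants above can be reproduced with a short Python sketch (the function names are mine, not the notes'):

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into equi-depth (equal-size) bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median (even-sized bins)."""
    return [[(b[len(b) // 2 - 1] + b[len(b) // 2]) / 2] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_medians(bins))   # medians 8.5, 22.5, 28.5
```

Note that the exact bin means are 9, 22.75 and 29.25; the notes round the last two to 23 and 29.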

Data Transformation:
# It is a data preprocessing process.
# It transforms or consolidates the data into alternative forms appropriate for mining.

# It involves the following processes:
a. Smoothing: removing noise from the data (e.g., by binning or regression, as above).
b. Aggregation: summary and aggregation operations are applied to the given set of
attributes to come up with new attributes.
c. Generalization: low-level data are replaced by higher-level concepts.
d. Normalization:
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as
-1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization:
- Decimal Scaling
- Min-Max Normalization
- z-Score Normalization (zero-mean Normalization)

# Decimal Scaling: find the normalized value of 400.

Normalized value of attribute: v' = v / 10^j, where j is the smallest integer such that
max(|v'|) < 1.

Value   Formula       Normalized value after decimal scaling
400     400 / 1000    0.4
310     310 / 1000    0.31
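A sketch of decimal scaling in Python (the function name is mine; it assumes integer-valued inputs with a largest absolute value of at least 1):

```python
def decimal_scale(values):
    """Divide each value by 10^j, where j is the smallest integer
    making every normalized value fall strictly below 1 in magnitude.
    j is taken from the digit count of the largest absolute value."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scale([400, 310]))  # [0.4, 0.31]
```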

- Min-Max Normalization:

Formula: v' = (v − min) / (max − min) × (new_max − new_min) + new_min

Min-max range: 12000 to 98000
New range: 0.0 to 1.0
Value to be normalized: 73600

v' = (73600 − 12000) / (98000 − 12000) = 0.716
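The same computation as a small Python helper (the function name is mine):

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v linearly from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```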
- z-Score Normalization (zero-mean Normalization):

Formula: z = (v − μ) / σ, where μ is the mean and σ is the standard deviation of the attribute.

How do we calculate the z-scores of the following data?

8, 10, 15, 20

Here μ = 13.25 and σ ≈ 4.657 (population standard deviation), so the z-scores are
approximately −1.13, −0.70, 0.38 and 1.45.
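A minimal z-score sketch using the standard library (population standard deviation, as is common in this setting); for 8, 10, 15, 20 it yields roughly −1.13, −0.70, 0.38 and 1.45:

```python
from statistics import mean, pstdev

def z_scores(values):
    """z = (v - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [8, 10, 15, 20]   # mean 13.25, population std ~4.657
print([round(z, 2) for z in z_scores(data)])
```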
# Discretization
Discretization is the process through which we can transform continuous variables,
models or functions into a discrete form.

Approaches to Discretization:
- Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means

- Supervised:
— Decision Trees
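Equal-width binning, the simplest unsupervised approach above, can be sketched as follows (the function name and sample data are mine; it assumes the values are not all equal):

```python
def equal_width(values, k):
    """Split the value range into k equal-width intervals and label
    each value with its bin index (0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into bin k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width([4, 8, 15, 21, 28, 34], 3))  # [0, 0, 1, 1, 2, 2]
```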

Measures of Distance in Data Mining


1. Euclidean Distance

2. Manhattan Distance

3. Minkowski distance

[Figure: distance d between two points (x, y) and (a, b)]

Sample No   X     Y
1           185   72
2           170   56

We have to find the distance of (195,45) from Samples 1 and 2.

1. Euclidean Distance:

Distance (d) = √((x − a)^2 + (y − b)^2)

D1 = √((195 − 185)^2 + (45 − 72)^2) = 28.79

D2 = √((195 − 170)^2 + (45 − 56)^2) = 27.31

2. Manhattan Distance:

Distance (d) = |x − a| + |y − b|

D1 = |195 − 185| + |45 − 72| = 37

D2 = |195 − 170| + |45 − 56| = 36
3. Minkowski distance:

Formula: D = (Σ_i |x_i − y_i|^p)^(1/p)

Let p = 3. The distance to each sample is computed separately:

D1 = (|195 − 185|^3 + |45 − 72|^3)^(1/3) = 27.45

D2 = (|195 − 170|^3 + |45 − 56|^3)^(1/3) = 25.69
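All three distances are instances of the Minkowski metric (p = 2 gives Euclidean, p = 1 gives Manhattan), so one helper covers the worked examples; the function name is mine:

```python
def minkowski(p1, p2, p):
    """Minkowski distance between two points: p=2 -> Euclidean, p=1 -> Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

query, s1, s2 = (195, 45), (185, 72), (170, 56)
print(round(minkowski(query, s1, 2), 2))  # Euclidean D1 = 28.79
print(round(minkowski(query, s1, 1), 2))  # Manhattan D1 = 37.0
print(round(minkowski(query, s1, 3), 2))  # Minkowski (p=3) D1 = 27.45
```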


— K-Means
# What is K in K-means clustering?
K-means is also called a flat clustering algorithm. The number of clusters to be identified
from the data by the algorithm is represented by 'K' in K-means.

# When to use K-means clustering?

The K-means clustering algorithm is used to find groups that have not been explicitly
labeled in the data.

Apply K-means clustering to the following data set for 2 (two) clusters.

Sample no   X     Y
1           185   72
2           170   56
3           168   60
4           179   68
5           182   72
6           188   77

Given K = 2
Initial Centroids
Cluster   X     Y
k1        185   72
k2        170   56

Calculate the Euclidean distance using the given equation:

Distance [(x,y), (a,b)] = √((x − a)^2 + (y − b)^2)

Sample 1 (185,72):

Distance from Cluster 1 (185,72) = √((185 − 185)^2 + (72 − 72)^2) = 0

Distance from Cluster 2 (170,56) = √((185 − 170)^2 + (72 − 56)^2) = 21.93

Sample 2 (170,56):

Distance from Cluster 1 (185,72) = √((185 − 170)^2 + (72 − 56)^2) = 21.93

Distance from Cluster 2 (170,56) = √((170 − 170)^2 + (56 − 56)^2) = 0

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(185,72)    0            21.93          1
(170,56)    21.93        0              2

Calculate the Euclidean distance for the next sample (168,60):

Distance from Cluster 1 (185,72) = √((168 − 185)^2 + (60 − 72)^2) = 20.808

Distance from Cluster 2 (170,56) = √((168 − 170)^2 + (60 − 56)^2) = 4.472

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(168,60)    20.808       4.472          2

Update the cluster centroid:

Cluster   X                    Y
K1        185                  72
K2        (170+168)/2 = 169    (56+60)/2 = 58

Calculate the Euclidean distance for the next sample (179,68):

Distance from Cluster 1 (185,72) = √((179 − 185)^2 + (68 − 72)^2) = 7.211

Distance from Cluster 2 (169,58) = √((179 − 169)^2 + (68 − 58)^2) = 14.142

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(179,68)    7.211        14.142         1

Update the cluster centroid:

Cluster   X                    Y
K1        (185+179)/2 = 182    (72+68)/2 = 70
K2        169                  58

Calculate the Euclidean distance for the next sample (182,72):

Distance from Cluster 1 (182,70) = √((182 − 182)^2 + (72 − 70)^2) = 2

Distance from Cluster 2 (169,58) = √((182 − 169)^2 + (72 − 58)^2) = 19.10

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(182,72)    2            19.10          1

Update the cluster centroid:

Cluster   X                    Y
K1        (182+182)/2 = 182    (70+72)/2 = 71
K2        169                  58

Calculate the Euclidean distance for the next sample (188,77):

Distance from Cluster 1 (182,71) = √((188 − 182)^2 + (77 − 71)^2) = 8.485

Distance from Cluster 2 (169,58) = √((188 − 169)^2 + (77 − 58)^2) = 26.87

Dataset     Euclidean Distance          Assignment
            Cluster 1    Cluster 2
(188,77)    8.485        26.87          1

Update the cluster centroid:

Cluster   X                    Y
K1        (182+188)/2 = 185    (71+77)/2 = 74
K2        169                  58

Final Assignment
Sample no   X     Y    Assignment
1           185   72   1
2           170   56   2
3           168   60   2
4           179   68   1
5           182   72   1
6           188   77   1
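The whole walkthrough can be reproduced in a few lines of Python. Note that the worked example uses a sequential (online) variant: each sample is assigned once, and the winning centroid is then averaged with the new point rather than recomputed as the mean of all its members; a standard k-means would instead iterate until assignments stop changing. A sketch under those assumptions:

```python
import math

points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
centroids = [(185, 72), (170, 56)]      # initial centroids: samples 1 and 2
assignment = []

for p in points:
    # Assign the sample to the nearest centroid (Euclidean distance).
    k = min((0, 1), key=lambda i: math.dist(p, centroids[i]))
    assignment.append(k + 1)
    # Running-average update used in the worked example.
    cx, cy = centroids[k]
    centroids[k] = ((cx + p[0]) / 2, (cy + p[1]) / 2)

print(assignment)   # [1, 2, 2, 1, 1, 1]
print(centroids)    # [(185.0, 74.0), (169.0, 58.0)]
```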

To build a decision tree, we can apply the ID3 algorithm. The following steps must be
followed when applying the ID3 algorithm to build a decision tree.
#KNN
