Data Preprocessing
Lesson Agenda:
a. Definition of data preprocessing
b. Stages of data preprocessing
c. Data cleaning
d. Data transformation
Data preprocessing:
Data preprocessing means transforming raw data into a useful and efficient format.
Data preprocessing is divided into four stages:
1. Data cleaning,
2. Data integration,
3. Data reduction,
4. Data transformation.
Data cleaning
Data cleaning refers to techniques used to 'clean' data by removing outliers, replacing missing
values, smoothing noisy data, and correcting inconsistent data. Many techniques exist for
each of these tasks, and the choice depends on the user's preference and the problem at hand.
Below, each task is explained in terms of the techniques used to address it.
Missing values
In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.
1. Removing the training example: You can ignore a training example if its output
label is missing (in a classification problem). This is usually discouraged, as it leads
to loss of data: you also remove attribute values that could add value to the data set.
2. Filling in the missing value manually: This approach is time-consuming and not
recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be
replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach,
but not foolproof.
4. Using the most probable value to fill in the missing value: Using algorithms such as
regression and decision trees, the missing values can be predicted and filled in.
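The four approaches above can be sketched with pandas; this is a minimal illustration, and the column names and values are made up for the example (approach 4 uses a simple column mean to stand in for a predicted value, where a regression or decision-tree model would go further):

```python
import pandas as pd

# Hypothetical data set with missing values (NaN / None)
df = pd.DataFrame({
    "age":   [25, None, 47, 52, None],
    "city":  ["Dhaka", "Sylhet", None, "Khulna", "Dhaka"],
    "label": ["yes", "no", "yes", None, "no"],
})

# 1. Removing the training example: drop rows whose output label is missing
df1 = df.dropna(subset=["label"])

# 3. Using a standard value to replace the missing value
df3 = df.fillna({"city": "Unknown"})

# 4. Using a most probable value: here the column mean is the stand-in
df4 = df.fillna({"age": df["age"].mean()})

print(len(df1))  # 4 rows remain after dropping the row with a missing label
```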
Noisy data
Noise is defined as a random variance in a measured variable. For numeric values, boxplots
and scatter plots can be used to identify outliers. To deal with these anomalous values, data
smoothing techniques are applied, which are described below.
1. Binning: Binning methods smooth a sorted data set by consulting the values around
each one. The sorted values are first distributed into 'bins'. There are various approaches
to binning. Two of them are smoothing by bin means, where each value in a bin is replaced
by the mean of the bin's values, and smoothing by bin medians, where each value in a bin
is replaced by the median of the bin's values.
2. Regression: Linear regression and multiple linear regression can be used to smooth the
data by fitting the values to a function.
3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal
with them.
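Smoothing by bin means (item 1 above) can be sketched in a few lines of plain Python; the price list used here is just a small illustrative example:

```python
# Smoothing by bin means: sort the values, split them into equal-size bins,
# and replace every value in a bin by that bin's mean.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    size = len(data) // n_bins
    out = []
    for i in range(n_bins):
        end = (i + 1) * size if i < n_bins - 1 else len(data)
        bin_vals = data[i * size:end]
        mean = sum(bin_vals) / len(bin_vals)
        out.extend([mean] * len(bin_vals))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Swapping the mean for the median of each bin gives smoothing by bin medians.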
Data Transformation:
# It is a data preprocessing step.
# It transforms or consolidates the data into alternative forms appropriate for mining.
# It involves the following processes:
a. Smoothing: removing noise from the data (using the data cleaning techniques above).
b. Aggregation: summary and aggregation operations are applied to the given set of attributes
to derive new attributes.
c. Generalization: low-level (raw) data are replaced by higher-level concepts, e.g. numeric
ages by categories such as 'young' and 'senior'.
d. Normalization:
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -
1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization –
Decimal Scaling
Min-Max Normalization
z-Score Normalization (zero-mean normalization)
Min-Max Normalization:
v' = (v − min) / (max − min) × (new_max − new_min) + new_min
Example: an income attribute has min = 12,000 and max = 98,000. Mapping v = 73,600 to the
range [0.0, 1.0]:
v' = (73600 − 12000) / (98000 − 12000) = 0.716
z-Score Normalization (zero-mean normalization):
v' = (v − mean) / std_dev
Example: for the values 8, 10, 15, 20, the mean is 13.25 and the (population) standard
deviation is about 4.66, so 8 normalizes to (8 − 13.25) / 4.66 ≈ −1.13.
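Both normalization methods can be sketched directly from the formulas, using the two worked examples above (the z-score version uses the population standard deviation):

```python
# Min-max normalization: rescale v from [vmin, vmax] into [new_min, new_max]
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

# z-score normalization: subtract the mean, divide by the standard deviation
def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

print(round(min_max(73600, 12000, 98000), 3))       # → 0.716
print([round(z, 2) for z in z_score([8, 10, 15, 20])])
# → [-1.13, -0.7, 0.38, 1.45]
```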
# Discretization
Discretization is the process through which we can transform
continuous variables, models or functions into a discrete form.
Approaches to Discretization
Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means
Supervised:
— Decision Trees
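The first two unsupervised approaches can be sketched as follows; the data set is a small illustrative list, and each function returns a bin index (0 to n_bins − 1) per value:

```python
# Equal-width: split the value range [min, max] into n_bins intervals of
# equal width, then map each value to the interval it falls in.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Equal-frequency: sort the values and put (roughly) the same number of
# values into each bin.
def equal_frequency_bins(values, n_bins):
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) // n_bins
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank // per_bin, n_bins - 1)
    return labels

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))      # → [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(data, 3))  # → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Note how the two disagree: equal-width puts only two values in the first bin because the range 4–14 is sparse, while equal-frequency forces three values per bin.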
Distance Measures
For two points a = (x1, y1) and b = (x2, y2), the distance d(a, b) can be measured as:
1. Euclidean Distance: d(a, b) = √((x1 − x2)² + (y1 − y2)²)
2. Manhattan Distance: d(a, b) = |x1 − x2| + |y1 − y2|
3. Minkowski Distance: d(a, b) = (|x1 − x2|^p + |y1 − y2|^p)^(1/p)
Example:
Sample No   X     Y
1           185   72
2           170   56
1. Euclidean Distance: √((185 − 170)² + (72 − 56)²) = √481 = 21.93
2. Manhattan Distance: |185 − 170| + |72 − 56| = 31
3. Minkowski Distance, with p = 3: (15³ + 16³)^(1/3) = 19.55
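The three measures above can be checked with a few lines of Python on the same pair of samples:

```python
import math

# The two samples from the table above
a, b = (185, 72), (170, 56)

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
manhattan = sum(abs(x - y) for x, y in zip(a, b))
p = 3
minkowski = sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(round(euclidean, 2))   # → 21.93
print(manhattan)             # → 31
print(round(minkowski, 2))   # → 19.55
```

Note that Minkowski distance generalizes the other two: p = 1 gives Manhattan and p = 2 gives Euclidean.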
###############################
— K-Means
# What is K in K-means clustering?
K-means is also called a flat clustering algorithm. The number of clusters the algorithm
identifies from the data is represented by the 'K' in K-means.
The K-means clustering algorithm is used to find groups that have not been explicitly
labeled in the data.
# Example: Apply K-means clustering to the following data set for K = 2 (two) clusters.
Sample no X Y
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
Given K=2
Initial Centroid
Cluster X Y
k1 185 72
k2 170 56
Distances of samples 1 and 2 from the initial centroids:
Dataset     Euclidean Distance
            Cluster 1   Cluster 2   Assignment
(185,72)    0           21.93       1
(170,56)    21.93       0           2
Sample 3, (168, 60):
d to k1 (185, 72) = √((185 − 168)² + (72 − 60)²) = √433 = 20.808
d to k2 (170, 56) = √((170 − 168)² + (56 − 60)²) = √20 = 4.472
Dataset     Euclidean Distance
            Cluster 1   Cluster 2   Assignment
(168,60)    20.808      4.472       2
k2 is now updated to the mean of (170, 56) and (168, 60), i.e. (169, 58). Sample 4, (179, 68):
d to k1 (185, 72) = √(6² + 4²) = √52 = 7.211103
d to k2 (169, 58) = √(10² + 10²) = √200 = 14.14214
Dataset     Euclidean Distance
            Cluster 1   Cluster 2   Assignment
(179,68)    7.211103    14.14214    1
k1 is now updated to the mean of (185, 72) and (179, 68), i.e. (182, 70). Sample 5, (182, 72):
d to k1 (182, 70) = √(0² + 2²) = 2
d to k2 (169, 58) = √(13² + 14²) = √365 = 19.10
Dataset     Euclidean Distance
            Cluster 1   Cluster 2   Assignment
(182,72)    2           19.10       1
k1 is now updated to the mean of (185, 72), (179, 68), and (182, 72), i.e. approximately
(182, 71). Sample 6, (188, 77):
d to k1 (182, 71) = √(6² + 6²) = √72 = 8.4852
d to k2 (169, 58) = √(19² + 19²) = √722 = 26.87
Dataset     Euclidean Distance
            Cluster 1   Cluster 2   Assignment
(188,77)    8.4852      26.87       1
Final clusters:
Cluster 1: samples 1, 4, 5, 6
Cluster 2: samples 2, 3
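The worked example above updates a centroid after each single assignment (a sequential variant); the sketch below uses the standard batch update (Lloyd's algorithm), which for this data reaches the same final clusters {1, 4, 5, 6} and {2, 3}:

```python
import math

# Samples from the worked example: (height, weight)
points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
# Initial centroids: samples 1 and 2, as chosen in the notes
centroids = [(185, 72), (170, 56)]

def dist(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

for _ in range(100):
    # Assignment step: each point joins its nearest centroid
    labels = [min(range(2), key=lambda k: dist(p, centroids[k])) for p in points]
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = []
    for k in range(2):
        members = [p for p, l in zip(points, labels) if l == k]
        new_centroids.append((sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members)))
    if new_centroids == centroids:   # converged: assignments stopped changing
        break
    centroids = new_centroids

print(labels)  # → [0, 1, 1, 0, 0, 0]: cluster 1 = {1, 4, 5, 6}, cluster 2 = {2, 3}
```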
To build a decision tree we can apply the ID3 algorithm.
The following steps are followed when the ID3 algorithm is applied to build a decision tree.
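At each node, ID3 picks the attribute with the highest information gain. A minimal sketch of the two quantities it computes, on a made-up toy data set (the 'outlook' attribute and play/no labels are hypothetical illustrations, not from the notes):

```python
import math
from collections import Counter

# Entropy of a list of class labels
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Information gain of splitting `rows` on the attribute at `attr_index`
def info_gain(rows, labels, attr_index):
    gain = entropy(labels)
    n = len(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical toy data: attribute 0 = outlook, label = whether to play
rows   = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))             # → 1.0 (labels split 50/50)
print(info_gain(rows, labels, 0))  # → 1.0 (outlook separates labels perfectly)
```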
#KNN