
Course Code: CSA2007

DATA MINING

1
Module 2: Data Preprocessing

● Data preprocessing is a fundamental step in the data analysis pipeline that involves preparing raw
data for further analysis and modeling. It
encompasses a range of tasks aimed at cleaning,
transforming, and organizing data to make it
suitable for analysis.
● The ultimate goal of data preprocessing is to
improve the quality of the data, enhance its
interpretability, and facilitate the extraction of
meaningful insights.

2
What Is an Attribute?

● An attribute is a data field, representing a characteristic or feature of a data object.
● The nouns attribute, dimension, feature, and
variable are often used interchangeably in the
literature

3
Types of data (Attributes) in data mining

● Nominal Attribute
● Binary Attributes
● Ordinal Attributes
● Numeric Attributes
– Interval Attributes
– Discrete Attributes
– Ratio-Scaled Attributes
– Continuous Data Attributes

4
Types of data in data mining

● Nominal Attribute:
● Nominal means “relating to names.” The values of a
nominal attribute are symbols or names of things. Each
value represents some kind of category, code, or state,
and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order
● Examples include gender (male, female), marital status
(single, married, divorced), or types of fruits (apple,
banana, orange).
● Nominal data are often represented using categorical
variables and are suitable for techniques like frequency
analysis, mode calculation, and association rule mining.

5
Example – Nominal Data

6
● Binary Attributes
● A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where 0 typically means that
the attribute is absent, and 1 means that it is present.
Binary attributes are referred to as Boolean if the two
states correspond to true and false.
● Example
● Binary attributes. Given the attribute smoker describing a
patient object, 1 indicates that the patient smokes, while 0
indicates that the patient does not. Similarly, suppose the
patient undergoes a medical test that has two possible
outcomes. The attribute medical test is binary, where a
value of 1 means the result of the test for the patient is
positive, while 0 means the result is negative.
7
● Ordinal Data:
● Ordinal data also represent categories, but they have a
natural order or ranking.
● While the categories have a relative order, the differences
between them may not be consistent or measurable.
● Examples include survey responses (e.g., "strongly
disagree" to "strongly agree"), education levels (e.g.,
"high school diploma" to "Ph.D."), or socioeconomic
status (e.g., "low-income" to "high-income").
● Techniques such as rank ordering, median calculation,
and non-parametric tests are commonly used for
analyzing ordinal data.

8
example

● Education level:
– Elementary school
– Middle school
– High school diploma
– Bachelor's degree
– Master's degree
– Doctorate degree
● Socioeconomic status:
– Low-income
– Middle-income
– High-income
● Likert scale responses:
– Strongly disagree
– Disagree
– Neither agree nor disagree
– Agree
– Strongly agree
● Ranking in a competition:
– 1st place
– 2nd place
– 3rd place
– ...
● Customer satisfaction ratings:
– Very dissatisfied
– Dissatisfied
– Neutral
– Satisfied
– Very satisfied
● Severity of illness:
– Mild
– Moderate
– Severe

9
Numeric Attributes

● A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

10
Numeric Attributes

● Interval Data:
● Interval data represent numeric values where the
difference between any two values is meaningful
and consistent, but there is no true zero point.
● Interval data allow for meaningful calculations of
differences between values but not ratios.
● Examples include temperature measured in
Celsius or Fahrenheit, calendar dates, or IQ
scores.
● Techniques such as mean calculation, standard
deviation, and correlation analysis can be applied
to interval data.
11
● Temperature:
– 0°C, 10°C, 20°C, 30°C, ...
● Calendar dates:
– January 1, 2024
– February 13, 2024
– March 5, 2024
– ...
● Time:
– 12:00 PM, 1:00 PM, 2:00 PM, ...
● IQ scores:
– 80, 90, 100, 110, ...

12
● Discrete Data:
● Discrete data consist of countable values with
clear boundaries between them.
● Examples include the number of customers, the
number of products sold, or the number of
defects in a manufacturing process.
● Discrete data are typically analyzed using
techniques such as frequency analysis, mode
calculation, and Poisson regression.

13
● Number of children in a family:
– 0, 1, 2, 3, ...
● Number of students in a classroom:
– 20, 25, 30, 35, ...
● Number of items sold:
– 10, 20, 30, 40, ...
● Number of cars in a parking lot:
– 50, 60, 70, 80, ...
● Number of defects in a manufacturing process:
– 0, 1, 2, 3, ...
● Number of goals scored in a soccer match:
– 0, 1, 2, 3, ...
14
● Ratio-Scaled Attributes
● Ratio-scaled attributes, also known as ratio
variables, are a type of quantitative variable in
statistics that possess all the properties of interval
variables but with an added feature: a true zero
point.
● Examples of ratio-scaled attributes include:
● Length: Measurements such as height, width, length, or distance,
where zero indicates the absence of length.
● Weight: Mass measurements in kilograms or pounds, where zero
represents the absence of weight.
● Time: Time measurements in seconds, minutes, hours, etc., where
zero denotes the absence of time.
15
● Continuous Data Attributes:
● Continuous data represent values that can take
on any value within a given range, often
measured with a high degree of precision.
● Examples include height, weight, temperature,
and time.
● Continuous data are analyzed using techniques
such as mean calculation, standard deviation,
regression analysis, and density estimation.

16
● Height:
– 165.2 cm, 172.5 cm, 179.1 cm, ...
● Weight:
– 68.3 kg, 75.6 kg, 82.9 kg, ...
● Temperature:
– 23.7°C, 25.3°C, 28.6°C, ...
● Time:
– 10:15:20, 10:16:35, 10:17:48, ...
● Distance:
– 3.5 meters, 4.2 meters, 5.9 meters, ...
● Speed:
– 60.5 km/h, 70.2 km/h, 80.9 km/h, ...
17
Data Quality

● Data quality is a critical aspect of data mining, as the accuracy, completeness, consistency, and
reliability of the data directly influence the
outcomes and reliability of any analysis or model
built upon it.
● Ensuring high data quality involves various
processes and considerations to address
potential issues that may arise within the dataset.

18
Factors affecting the quality of the data

Measurement and Data Collection Issues


● It is unrealistic to expect that data will be perfect.
There may be problems due to human error,
limitations of measuring devices, or flaws in the
data collection process.
● Values or even entire data objects can be
missing.
● In other cases, there can be fake or duplicate
objects; i.e., multiple data objects that all
correspond to a single “real” object.
19
Measurement and Data Collection Errors

● The term measurement error refers to any problem resulting from the
measurement process.
● A common problem is that the value recorded differs from the true
value to some extent.
● For continuous attributes, the numerical difference of the measured
and true value is called the error.
● The term data collection error refers to errors such as omitting data
objects or attribute values, or inappropriately including a data object.
● Within particular domains, certain types of data errors are
commonplace, and well-developed techniques often exist for
detecting and/or correcting these errors.
● For example, keyboard errors are common when data is entered
manually, and as a result, many data entry programs have
techniques for detecting and, with human intervention, correcting
such errors
20
Noise and Artifacts

● Noise:
● Noise refers to random or irrelevant fluctuations
or disturbances present in data that do not carry
any meaningful information.
● Noise can arise from various sources, including
measurement errors, data collection
inaccuracies, environmental factors, or inherent
variability in the data generation process.

21
Artifacts:

● Artifacts are artificial patterns or distortions introduced into data by technical limitations, processing steps, or unintended biases in data collection or preprocessing procedures.
● Unlike noise, which is typically random or
stochastic, artifacts may exhibit systematic or
structured characteristics that can mislead
analysis and interpretation.

22
Example: Medical Imaging

● Noise:
● Description: In medical imaging, noise can manifest as
random fluctuations in pixel intensity values that do not
represent anatomical structures or pathological findings. It
can be caused by factors such as electronic noise in the
imaging equipment, photon noise due to low radiation
dose, or patient motion during image acquisition.
● Characteristics: Noise appears as graininess or speckle-
like patterns in the image, especially in regions with low
signal intensity. It may obscure fine details and make it
challenging to differentiate between structures of interest
and background noise.

23
● Artifacts:
● Description: Artifacts in medical imaging refer to
unwanted distortions or anomalies introduced into the
image due to technical factors, patient-related factors, or
errors in image acquisition or processing. They can arise
from sources such as equipment malfunction, patient
movement, metal implants, or incorrect imaging
parameters.
● Characteristics: Artifacts appear as structured or
systematic deviations from the true anatomical features in
the image. They may manifest as streaks, shadows,
blurring, geometric distortions, or intensity variations that
are not representative of the underlying anatomy.
24
Noise

Artifacts

25
Outliers

● Outliers are data points that significantly deviate from the rest of the observations in a dataset.
They are often considered anomalies or
exceptions due to their unusual characteristics or
extreme values compared to the majority of the
data.
● Outliers can arise naturally from the inherent
variability in the data generation process or as a
result of errors, noise, or anomalies in data
collection, measurement, or recording
procedures.
26
● Characteristics of outliers:
● Extreme Values: Outliers exhibit extreme or unusual
values that are markedly different from the typical range
of values observed in the dataset.
● Isolation: Outliers are often isolated or distant from the
bulk of the data points, making them stand out visually in
graphical representations such as scatter plots or box
plots.
● Impact: Outliers can have a disproportionate impact on
statistical analyses, models, or visualization techniques,
potentially skewing results or misleading interpretations.

27
Missing values

● Missing values, also known as missing data or missing observations, refer to the absence of values or
information for one or more variables in a dataset.
● Missing values can occur due to various reasons,
including data entry errors, data collection issues,
equipment malfunctions, participant non-response, or
deliberate data suppression.
● Dealing with missing values is a crucial aspect of data
preprocessing in data mining and machine learning, as
they can adversely affect the validity, reliability, and
performance of analytical models and algorithms.

28
example

29
Inconsistent Values

● Inconsistent values in data mining refer to data entries or observations that contradict or violate
the logical constraints, rules, or expectations of
the dataset.
● These inconsistencies can arise due to errors,
discrepancies, or anomalies in data collection,
integration, or processing procedures.
● Identifying and resolving inconsistent values is
crucial for ensuring data quality, integrity, and
reliability in data mining and machine learning
tasks.
30
31
Duplicate Data

● Duplicate data in data mining refers to identical or highly similar records or observations that exist
within a dataset. These duplicates can arise due
to various reasons, including errors in data
collection, data integration from multiple sources,
or inadvertent data replication.
● Dealing with duplicate data is essential for
ensuring the accuracy, efficiency, and reliability of
data mining analyses and modeling tasks.

32
33
Issues Related to Applications

● Timeliness
● Some data starts to age as soon as it has been
collected.
● In particular, if the data provides a snapshot of
some ongoing phenomenon or process, such as
the purchasing behavior of customers or web
browsing patterns, then this snapshot represents
reality for only a limited time. If the data is out of
date, then so are the models and patterns that
are based on it.

34
● Relevance
● The available data must contain the information
necessary for the application.
● Consider the task of building a model that
predicts the accident rate for drivers. If
information about the age and gender of the
driver is omitted, then it is likely that the model
will have limited accuracy unless this information
is indirectly available through other attributes.

35
● A common problem is sampling bias, which
occurs when a sample does not contain different
types of objects in proportion to their actual
occurrence in the population.
● For example, survey data describes only those
who respond to the survey.
● Because the results of a data analysis can reflect
only the data that is present, sampling bias will
typically lead to erroneous results when applied
to the broader population.

36
Data Preprocessing

● Why Data Preprocessing?


● Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
◆e.g., occupation=“ ”
– noisy: containing errors or outliers
◆e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
◆e.g., Age=“42” Birthday=“03/07/1997”
◆e.g., Was rating “1,2,3”, now rating “A, B, C”
◆e.g., discrepancy between duplicate records

37
Why Is Data Dirty?
● Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was
collected and when it is analyzed.
– Human/hardware/software problems
● Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
● Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked
data)
● Duplicate records also need data cleaning 38
Why Is Data Preprocessing Important?

● No quality data, no quality mining results!


– Quality decisions must be based on quality
data
◆e.g., duplicate or missing data may cause
incorrect or even misleading statistics.
– Data warehouse needs consistent
integration of quality data
● Data extraction, cleaning, and transformation
comprise the majority of the work of building a data
warehouse
39
Data Preprocessing Techniques

● Aggregation
● Sampling
● Dimensionality reduction
● Feature subset selection
● Feature creation
● Discretization and binarization
● Variable transformation

40
1. Aggregation

● Aggregation in data preprocessing refers to the process of combining and summarizing data from
multiple sources or at multiple levels of
granularity into a single, more manageable form.
● It is a fundamental step in data mining and
analysis, enabling analysts to gain insights from
large volumes of raw data by reducing complexity
and focusing on important patterns and trends.

41
Example

● Suppose we have a dataset containing raw sales transactions with the following attributes:
● Transaction ID
● Date
● Product ID
● Quantity Sold
● Unit Price
● Customer ID
● Store ID

42
Perform aggregation on this sales data:

● Grouping by Date: We can aggregate the sales data by date to get daily sales totals. For
example, we group all transactions that occurred
on the same date and calculate the sum of
quantities sold and total sales revenue for each
day.
● Grouping by Product: We can also aggregate
the data by product to analyze sales performance
for each product. We group transactions by
product ID and calculate metrics such as total
quantity sold, total revenue, average unit price,
etc., for each product.
43
● Grouping by Store: Similarly, we can aggregate the data by store to
analyze sales performance across different store locations. We group
transactions by store ID and calculate metrics such as total sales,
average transaction value, etc., for each store.
● Aggregating by Customer: Another possible aggregation is by
customer, where we analyze individual customer behavior. We group
transactions by customer ID and calculate metrics such as total
spending, number of purchases, average purchase value, etc., for
each customer.
● Aggregating by Product Category: We can also aggregate the data
by product category to understand sales patterns across different
product categories. We group transactions by product category and
calculate metrics such as total sales, average unit price, etc., for each
category.

44
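● The groupings described in the example above map directly onto pandas. The following is a minimal sketch, not part of the slides: the column names and the four sample rows are assumptions made only for illustration.

import pandas as pd

# A small, made-up sales table with the attributes listed above (hypothetical values).
sales = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "ProductID": ["P1", "P2", "P1", "P3"],
    "QuantitySold": [2, 1, 5, 3],
    "UnitPrice": [10.0, 25.0, 10.0, 8.0],
    "CustomerID": ["C1", "C2", "C1", "C3"],
    "StoreID": ["S1", "S1", "S2", "S2"],
})
sales["Revenue"] = sales["QuantitySold"] * sales["UnitPrice"]

# Grouping by Date: daily totals of quantity sold and revenue.
daily = sales.groupby("Date").agg(TotalQty=("QuantitySold", "sum"),
                                  TotalRevenue=("Revenue", "sum"))

# Grouping by Product: per-product quantity, revenue, and average unit price.
by_product = sales.groupby("ProductID").agg(TotalQty=("QuantitySold", "sum"),
                                            TotalRevenue=("Revenue", "sum"),
                                            AvgUnitPrice=("UnitPrice", "mean"))

# The same pattern applies to StoreID, CustomerID, or a product-category column.
print(daily)
print(by_product)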
45
2. Sampling

● Sampling in data preprocessing involves selecting a subset of data from a larger dataset
for analysis. It is a crucial step in data mining and
analysis, especially when dealing with large
volumes of data where analyzing the entire
dataset may be impractical or time-consuming.
Sampling allows analysts to draw conclusions
about the population from a representative
sample, thereby reducing computational costs
and speeding up analysis without sacrificing the
validity of results.

46
2.1 Sampling without replacement


Sampling without replacement is a method of
selecting a subset of data from a larger dataset
where each selected data point is removed from
consideration for subsequent selections.
● This means that once a data point is included in
the sample, it cannot be selected again.
Sampling without replacement ensures that each
selected sample is unique and does not contain
duplicate entries.

47
Example:

● Define Population: The population consists of all 30 students on the classroom roster.
● Determine Sample Size: You want to select a sample of 10 students
for the survey.
● Sampling without Replacement:
– You start by assigning each student a unique identifier, such as a
number from 1 to 30.
– Using a random selection method (e.g., drawing names from a
hat, using a random number generator), you select the first
student for the sample. Let's say you randomly select student
number 15.
– Once a student is selected, their identifier is removed from
consideration for subsequent selections.
– You continue this process until you have selected 10 students in
total, ensuring that each selected student is unique and not
selected again.
48
● Data Analysis: After selecting the sample of 10 students,
you administer the survey to the selected students and
collect their feedback. You can then analyze the survey
responses to gain insights into various aspects of the
learning experience, such as teaching effectiveness,
classroom environment, and student engagement.
● Generalization: The insights and conclusions drawn from
the survey can be generalized to the entire population of
students in the classroom. By selecting a representative
sample of students, you can make inferences about the
opinions and experiences of the broader student
population.

49
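● A minimal sketch of this procedure using Python's standard library; the roster of 30 identifiers follows the example above, and random.sample draws without replacement, so no student can be selected twice.

import random

students = list(range(1, 31))                   # identifiers 1..30 for the classroom roster
survey_sample = random.sample(students, k=10)   # 10 unique students, no repeats possible
print(sorted(survey_sample))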
2.2 Sampling with replacement

● Sampling with replacement is a method of selecting a subset of data from a larger dataset
where each selected data point is returned to the
dataset after selection.
● This means that each data point in the population
has the opportunity to be selected multiple times
or not at all. Sampling with replacement allows for
the possibility of duplicate entries in the sample.

50
Example

● Define Population: The population consists of all 30 students in the classroom.
● Determine Sample Size: You want to select a sample of 5 students
for the prize draw.
● Sampling with Replacement:
– You start by assigning each student a unique identifier, such as a
number from 1 to 30.
– Using a random selection method (e.g., drawing names from a
hat, using a random number generator), you randomly select the
first student for the prize draw. Let's say you randomly select
student number 7.
– After selecting a student, you record their name and put their
identifier back into the pool of potential selections.
– You repeat this process until you have selected 5 students in
total, allowing for the possibility of selecting the same student
multiple times.
51
● Here's a simplified example of the first few students
selected in the prize draw:
● Student 7
● Student 15
● Student 7 (selected again)
● Student 22
● Student 3
● In this example, each student has the chance to be
selected multiple times due to the sampling with
replacement method. This ensures that each student has
a fair opportunity to win a prize, and the randomness of
the selection process adds an element of excitement to
the prize draw.
52
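● For comparison, a sketch of the prize-draw example: random.choices draws with replacement, so the same identifier may appear more than once. The 1-30 roster is the assumption carried over from the example.

import random

students = list(range(1, 31))               # identifiers 1..30
prize_draw = random.choices(students, k=5)  # 5 draws, duplicates possible
print(prize_draw)                           # e.g. [7, 15, 7, 22, 3]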
3. Dimensionality Reduction

● Dimensionality reduction in preprocessing refers to the process of reducing the number of input
variables or features in a dataset while retaining
the most relevant information.
● This technique is commonly used in machine
learning and data analysis to address the curse
of dimensionality, improve computational
efficiency, alleviate overfitting, and enhance the
performance of predictive models.

53
Explanation of dimensionality reduction in preprocessing

● Curse of Dimensionality: As the number of features or dimensions in a dataset increases, the amount of data
required to effectively cover the feature space grows
exponentially. This phenomenon, known as the curse of
dimensionality, can lead to various issues such as
increased computational complexity, overfitting, and
decreased interpretability of results.
● Objective of Dimensionality Reduction: The primary
goal of dimensionality reduction is to reduce the number
of features while preserving the most important
information present in the dataset. By reducing the
dimensionality of the data, we can simplify the analysis,
improve the performance of machine learning algorithms,
and facilitate visualization and interpretation of results.
54
● Techniques for Dimensionality Reduction:
● Feature Selection: Feature selection involves selecting a
subset of the original features based on their relevance to
the target variable or predictive power. Common methods
include univariate feature selection, recursive feature
elimination, and feature importance ranking.
● Feature Extraction: Feature extraction transforms the
original features into a lower-dimensional space using
mathematical techniques such as principal component
analysis (PCA), linear discriminant analysis (LDA), and t-
distributed stochastic neighbor embedding (t-SNE). These
techniques aim to capture the most significant patterns
and variability in the data while reducing redundancy and
noise.
55
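● As an illustration of feature extraction, a minimal PCA sketch with scikit-learn. The synthetic data and the choice of two components are assumptions for demonstration only, not part of the course material.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 samples, 4 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # make one feature nearly redundant

pca = PCA(n_components=2)          # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)   # shape (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component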
● Benefits of Dimensionality Reduction:
● Improved Model Performance: By reducing the
dimensionality of the data, dimensionality reduction
techniques can help mitigate the risk of overfitting and
improve the generalization performance of machine
learning models.
● Computational Efficiency: Dimensionality reduction
reduces the computational complexity of algorithms,
making them more efficient and scalable to larger
datasets.
● Visualization and Interpretation: Lower-dimensional
representations of the data are easier to visualize and
interpret, enabling better understanding of the underlying
patterns and relationships.
56
4. Feature subset selection

Feature subset selection is a process in machine learning and statistics where a subset of relevant features (variables
or attributes) is chosen from a larger set of available
features.
The objective is to improve the performance of a model by
selecting only those features that contribute most to the
predictive power while discarding irrelevant or redundant
ones.
Feature subset selection is particularly important in
situations where the dataset contains a large number of
features, as using all of them may lead to overfitting,
increased computational complexity, and reduced model
interpretability.
57
Importance of Feature Subset Selection

Curse of Dimensionality: As the number of features increases, the volume of the feature space grows
exponentially. This can lead to increased computational
costs, data sparsity, and a higher likelihood of overfitting.

Improved Model Generalization: Including irrelevant or redundant features in a model can lead to overfitting,
where the model performs well on the training data but
fails to generalize to new, unseen data. Feature subset
selection helps mitigate overfitting by focusing on the
most informative features.

58
Approaches to Feature Subset Selection:

Filter Methods:
● Statistical Measures: These methods use statistical measures (e.g., correlation,
mutual information, chi-squared) to rank or score features based on their individual
relevance to the target variable. Features are then selected or ranked accordingly.
● Variance Thresholding: Features with low variance are considered less informative
and may be removed.
Wrapper Methods:
● Forward Selection: Start with an empty set of features and iteratively add the most
relevant feature until a stopping criterion is met.
● Backward Elimination: Start with all features and iteratively remove the least relevant
feature until a stopping criterion is met.
● Recursive Feature Elimination (RFE): Similar to backward elimination but uses a
model to identify less relevant features at each step.
Embedded Methods:
● Regularization Techniques: Techniques like LASSO (L1 regularization) encourage
sparsity in the model coefficients, effectively performing feature selection during
model training.
● Tree-based Methods: Decision trees and ensemble methods (e.g., Random Forests)
inherently perform feature selection by giving importance scores to features based on
their contribution to the model's performance.
59
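● A small sketch of two of these approaches on a toy dataset: a filter method (SelectKBest with the chi-squared score) and a wrapper method (RFE around a logistic regression). The use of the Iris data and the choice of k = 2 retained features are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score each feature independently and keep the best 2.
filt = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Filter keeps features:", filt.get_support(indices=True))

# Wrapper method: recursively drop the feature the model finds least useful.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE keeps features:", rfe.get_support(indices=True))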
Flowchart of a feature subset selection process

60
Feature subset selection is a search over all possible subsets of
features.
Many different types of search strategies can be used, but the search
strategy should be computationally inexpensive and should find optimal
or near optimal sets of features.
Because the number of subsets can be enormous and it is impractical to
examine them all, some sort of stopping criterion is necessary.
This strategy is usually based on one or more conditions involving the
following: the number of iterations, whether the value of the subset
evaluation measure is optimal or exceeds a certain threshold, whether a
subset of a certain size has been obtained, and whether any
improvement can be achieved by the options available to the search
strategy.

61
Once a subset of features has been selected, the
results of the target data mining algorithm on the
selected subset should be validated. A
straightforward validation approach is to run the
algorithm with the full set of features and compare
the full results to results obtained using the subset
of features. Hopefully, the subset of features will
produce results that are better than or almost as
good as those produced when using all features.

62
5. Feature Creation

Feature creation, also known as feature engineering, is a crucial step in the data
preprocessing journey. It involves crafting new
features from existing data to enhance the
performance of your machine learning models.

63
Popular techniques for feature creation

● Combining Features: Add, subtract, multiply, or divide existing features to create new ones that
capture their combined effect. For
example, combining "income" and "age" might
create a "purchasing power" feature.
● Transforming Features: Apply mathematical
functions like log, square root, or normalization
to scale or modify features, making them easier
for models to learn from.

64
● Deriving Features: Use domain knowledge to
create features based on specific rules or
conditions. For example, derive a "customer
loyalty" feature based on purchase history.
● Interaction Features: Identify interactions
between features, like "product category" and
"discount offered," to capture their combined
influence on behavior.
● Embedding Features: Transform categorical
features (like colors or text) into numerical
representations suitable for models to
understand.
65
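● A brief pandas sketch of a few of these techniques; the columns (income, age, product_category) and the derived feature names are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, 52000, 75000],
    "age": [25, 40, 58],
    "product_category": ["books", "electronics", "books"],
})

# Combining features: a crude "purchasing power" ratio (assumed definition).
df["purchasing_power"] = df["income"] / df["age"]

# Transforming features: log-scale the skewed income column.
df["log_income"] = np.log(df["income"])

# Embedding/encoding a categorical feature as numeric indicator columns.
df = pd.get_dummies(df, columns=["product_category"])
print(df)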
6. Discretization

● Discretization is a data preprocessing technique used to transform continuous variables into discrete or
categorical variables by dividing the range of values into
intervals or bins.
● This process is beneficial for various reasons, including
simplifying the representation of data, handling non-
linear relationships, reducing noise, and improving the
performance of certain machine learning algorithms.
Discretization is commonly used in domains such as
finance, healthcare, and marketing for tasks such as risk
assessment, patient stratification, and customer
segmentation.

66
Types of Discretization Techniques:

● Equal Width Binning:


– In this technique, the range of values is divided into a specified
number of intervals, each with an equal width. For example, if
we have a continuous variable ranging from 0 to 100 and want
to create 5 bins, each bin would cover a range of 20 units (0-20,
21-40, 41-60, 61-80, 81-100).
● Equal Frequency Binning:
– Here, the data is divided into intervals such that each bin
contains approximately the same number of data points. This
ensures that bins have similar frequencies and can be useful for
handling skewed distributions.
● Custom Binning:
– Custom binning involves defining intervals based on domain
knowledge or specific requirements. For example, age groups
(0-18, 19-35, 36-50, 51-65, 66+) or income brackets ($0-$25k,
$25k-$50k, $50k-$100k, $100k+).
67
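● The three binning styles can be sketched with pandas as shown below; the small salary list and the custom age brackets are assumptions made for illustration.

import pandas as pd

salaries = pd.Series([35000, 42000, 48000, 55000, 61000, 70000, 80000])

# Equal width binning: 4 intervals of equal width across the range.
equal_width = pd.cut(salaries, bins=4)

# Equal frequency binning: 4 intervals holding (roughly) the same number of values.
equal_freq = pd.qcut(salaries, q=4)

# Custom binning: intervals chosen from domain knowledge.
ages = pd.Series([5, 17, 23, 40, 67])
age_groups = pd.cut(ages, bins=[0, 18, 35, 50, 65, 120],
                    labels=["0-18", "19-35", "36-50", "51-65", "66+"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(age_groups.tolist())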
Equal Width Binning - Example

● Let's consider a dataset containing information about the salaries of employees in a company. We want to discretize the salary variable into
equal-width bins to categorize employees based on their salary ranges.

68
Steps for Equal Width Binning:

● 1. Data Exploration:
– Explore the salary data to understand its distribution and range. In our example, the salary data ranges from $35,000 to $80,000.
● 2. Choosing Number of Bins:
– Decide on the number of bins to divide the salary range into. For illustration purposes, let's choose 4 bins.
● 3. Calculating Bin Width:
– Calculate the width of each bin by dividing the range of salaries by the number of bins. In this case:
◆ Range of salaries = $80,000 (maximum salary) - $35,000 (minimum salary) = $45,000
◆ Width of each bin = $45,000 / 4 = $11,250
● 4. Defining Bins:
– Based on the bin width, define the intervals for each bin. Starting from the minimum salary ($35,000), we can create the following bins:
◆ Bin 1: $35,000 - $46,249
◆ Bin 2: $46,250 - $57,499
◆ Bin 3: $57,500 - $68,749
◆ Bin 4: $68,750 - $80,000
● 5. Assigning Employees to Bins:
– For each employee, determine which bin their salary falls into based on the defined intervals.
69
70
● Interpretation:
● In this example, we have discretized the salary
variable into four equal-width bins based on the
range of salaries. Employees are categorized
into bins based on their salary ranges. This
discretization can help in analyzing salary
distributions, identifying salary trends, and
making comparisons across different salary
ranges. Additionally, it can be used as a
categorical feature for further analysis or
modeling tasks.

71
72
● In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this
bin is replaced by the value 9.
● Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced
by the bin median.
● In smoothing by bin boundaries, the minimum
and maximum values in a given bin are
identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value.
73
● What is smoothing in data analysis, and why is it used?
Smoothing in data analysis refers to the process of removing noise
or irregularities from a dataset to reveal underlying patterns or
trends. It is used to simplify data while preserving important
features, making it easier to interpret and analyze.
● Explain the concept of bin smoothing. How does it work? Bin
smoothing involves dividing a dataset into consecutive bins and
replacing each bin's values with a representative statistic, such as
the mean, median, or boundaries of the bin. This helps to reduce
noise and variability in the data.
● Describe the difference between smoothing by bin means, bin
medians, and bin boundaries.
– Smoothing by bin means replaces each bin's values with the
mean of those values.
– Smoothing by bin medians replaces each bin's values with the
median of those values.
– Smoothing by bin boundaries replaces each value in a bin with whichever of the bin's minimum or maximum (boundary) value is closest.
74
● Discuss real-world applications of
smoothing techniques across different
domains.
● Smoothing techniques are used in finance for
analyzing stock prices, in healthcare for
processing medical signals, in climate science
for analyzing temperature trends, and in image
processing for noise reduction, among other
applications.

75
Problem 1

● The following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70.
● Use smoothing by bin means, smoothing by bin
medians, and smoothing by bin boundaries to
smooth these data, using a bin depth of 3.

76
● Problem 2
● Give data point: 4, 7, 13, 16, 20, 24, 27, 29, 31,
33, 38, 42.
● Problem 3
● Data for price in dollars: 8, 16, 9, 15, 21, 21,
24, 30, 26, 27, 30, 34 - BIN SIZE=4

● Apply smoothing by bin means, smoothing by
bin medians, and smoothing by bin boundaries

77
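● A sketch that can be used to check answers to Problems 1-3: it sorts the values, partitions them into equal-depth bins, and applies the three smoothing rules described earlier. The helper function is an assumption of mine, not code from the slides; Problem 3's data and bin size are used as the demonstration.

def smooth(values, depth):
    """Equal-depth binning with smoothing by bin means, medians, and boundaries."""
    data = sorted(values)
    bins = [data[i:i + depth] for i in range(0, len(data), depth)]
    by_means, by_medians, by_boundaries = [], [], []
    for b in bins:
        mean = sum(b) / len(b)
        median = b[len(b) // 2] if len(b) % 2 == 1 else (b[len(b)//2 - 1] + b[len(b)//2]) / 2
        by_means.append([round(mean, 2)] * len(b))
        by_medians.append([median] * len(b))
        # Each value is replaced by the closest of the bin's min/max boundary.
        by_boundaries.append([b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b])
    return by_means, by_medians, by_boundaries

# Problem 3 data (prices in dollars), bin size 4.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
means, medians, boundaries = smooth(prices, depth=4)
print("means:     ", means)
print("medians:   ", medians)
print("boundaries:", boundaries)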
7. Binarization

● Binarization is a data preprocessing technique used to transform numerical features into binary
(0 or 1) values based on a threshold. This
technique is particularly useful when dealing
with features that are not inherently binary but
can be effectively converted into binary
representations for certain applications.
Binarization is commonly used in scenarios
where we are interested in binary outcomes or
want to simplify the representation of data.

78
How Binarization Works:

● Threshold Selection:
– The first step in binarization is to choose an appropriate
threshold value. This threshold divides the numerical values into
two categories: values below the threshold are assigned 0, and
values equal to or above the threshold are assigned 1.
● Applying the Threshold:
– For each data point in the numerical feature, compare its value
with the chosen threshold. If the value is less than the
threshold, assign it 0; otherwise, assign it 1.
● Resulting Binary Feature:
– After applying the threshold to all data points, the numerical
feature is transformed into a binary feature, where each data
point is represented as either 0 or 1.

79
Example Scenario:

● Let's consider a dataset containing information about students' exam scores. We want to binarize the exam scores into pass (1) or
fail (0) based on a passing threshold. Here's how we can perform
binarization:

80
● Steps for Binarization:
● Threshold Selection:
– Let's choose a passing threshold of 60. Scores equal to or above 60 will be considered passing, while scores below 60 will be considered failing.
● Applying the Threshold:
– For each exam score in the dataset, compare it with the threshold (60). If the score is greater than or equal to 60, assign it 1 (pass); otherwise, assign it 0 (fail).
81
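● A minimal sketch of this thresholding step; the score values are hypothetical. (scikit-learn's preprocessing.Binarizer offers a similar transform, but note it maps only values strictly greater than the threshold to 1.)

scores = [45, 72, 60, 58, 91]
threshold = 60

# 1 (pass) if the score is >= the threshold, otherwise 0 (fail).
pass_fail = [1 if s >= threshold else 0 for s in scores]
print(pass_fail)   # [0, 1, 1, 0, 1]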
● Interpretation:
● In this example, we binarized the exam scores
into pass (1) or fail (0) based on a passing
threshold of 60. Exam scores equal to or above
60 were assigned a value of 1 (pass), while
scores below 60 were assigned a value of 0
(fail). This binarization allows us to simplify the
representation of exam scores and focus on the
binary outcome of pass or fail.

82
Data Transformation
● A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
● Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
◆ New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
◆ min-max normalization
◆ z-score normalization
◆ normalization by decimal scaling
– Discretization: Concept hierarchy climbing
83
Normalization

84
● Normalize the following group of data:
● 200, 300, 400, 600, 1000

● Apply min-max normalization by setting new min = 0 and new max = 1

85
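● A worked sketch for this exercise using the usual min-max formula, v' = (v - min) / (max - min), rescaled to the range [0, 1].

data = [200, 300, 400, 600, 1000]
lo, hi = min(data), max(data)   # 200 and 1000

# v' = (v - min) / (max - min), giving values in [0, 1]
minmax = [(v - lo) / (hi - lo) for v in data]
print(minmax)   # [0.0, 0.125, 0.25, 0.5, 1.0]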
86
Standard Deviation and Mean

Find SD: {4, 7, 13, 16}

87
To find the standard deviation of a dataset, follow these steps:

● Calculate the mean (average) of the dataset:
– Add up all the values in the dataset.
– Divide the sum by the total number of values in the dataset.
– For the given dataset {4, 7, 13, 16}: Mean = (4 + 7 + 13 + 16) / 4 = 40 / 4 = 10.
● Calculate the difference between each data point and the mean:
– Subtract the mean from each value in the dataset.
– (4 - 10) = -6
– (7 - 10) = -3
– (13 - 10) = 3
– (16 - 10) = 6
● Square each difference:
– Square each result obtained from step 2.
– (-6)² = 36
– (-3)² = 9
– (3)² = 9
– (6)² = 36
● Calculate the mean (average) of the squared differences:
– Add up all the squared differences.
– Divide the sum by the total number of values in the dataset.
– Mean of squared differences = (36 + 9 + 9 + 36) / 4 = 90 / 4 = 22.5.
88
● Take the square root of the mean of squared
differences:
– This gives the standard deviation.
● Standard deviation = √(22.5) ≈ 4.74 (rounded to
two decimal places).
● So, the standard deviation of the dataset {4, 7,
13, 16} is approximately 4.74.

89
● Normalize the following group of data:
● 200, 300, 400, 600, 1000

● Apply z-score normalization

90
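● A worked sketch of z-score normalization, v' = (v - mean) / σ, assuming the population standard deviation (dividing by n); dividing by n - 1 would give slightly different values.

import math

data = [200, 300, 400, 600, 1000]
mean = sum(data) / len(data)                                     # 500.0
std = math.sqrt(sum((v - mean) ** 2 for v in data) / len(data))  # ≈ 282.84

zscores = [round((v - mean) / std, 3) for v in data]
print(round(mean, 2), round(std, 2))
print(zscores)   # approximately [-1.061, -0.707, -0.354, 0.354, 1.768]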
91
● Normalize the following group of data:
● 200, 300, 400, 600, 1000

● Apply normalization by decimal scaling

92
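● A worked sketch of normalization by decimal scaling, v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1; here the maximum absolute value is 1000, so j = 4.

data = [200, 300, 400, 600, 1000]

j = 0
while max(abs(v) for v in data) / (10 ** j) >= 1:   # find smallest j with max|v'| < 1
    j += 1

scaled = [v / 10 ** j for v in data]
print(j)        # 4
print(scaled)   # [0.02, 0.03, 0.04, 0.06, 0.1]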
Similarity and Dissimilarity Measures
● Similarity measures:
● Similarity measures quantify the likeness or resemblance
between two objects, datasets, or entities. These measures are
fundamental in various domains, including machine learning,
data mining, information retrieval, and pattern recognition.
● Dissimilarity measures
● Dissimilarity measures, also known as distance metrics, quantify
the difference or dissimilarity between two objects or data points.
Unlike similarity measures, which indicate how similar two
objects are, dissimilarity measures provide a quantitative
assessment of how different or distant they are from each other.
Dissimilarity measures are widely used in various fields such as
clustering, classification, and anomaly detection
Similarity/Dissimilarity between Data Objects with Multiple Numeric Attributes

● Euclidean Distance
Example:

● Consider an example involving the Euclidean distance to measure the similarity between two data points
representing items in a dataset. Let's say we have a
dataset of houses represented by their features such as
size (in square feet) and number of bedrooms. We want
to measure the similarity between two houses based on
these features using the Euclidean distance.
● Suppose we have two houses:
● House 1:
● Size: 2000 square feet
● Number of bedrooms: 3
● House 2:
● Size: 1800 square feet
● Number of bedrooms: 2
95
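● A sketch of the distance calculation for this example (and the two that follow), using d(x, y) = sqrt(sum of (xi - yi)²). Note that with raw units the 200-square-foot size difference dominates the one-bedroom difference, which is one reason normalization is often applied before computing distances.

import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

house1 = [2000, 3]   # size in square feet, number of bedrooms
house2 = [1800, 2]
print(round(euclidean(house1, house2), 2))   # ≈ 200.0

# The same function reproduces the later examples:
print(round(euclidean([10, 5, 7], [8, 4, 6]), 2))   # ≈ 2.45
print(round(euclidean([20, 10], [25, 15]), 3))      # ≈ 7.071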
96
● In the above example, the Euclidean distance between
House 1 and House 2 is approximately 200.00.
● Since this distance is small relative to the scale of the house sizes, we can interpret it
as indicating a moderate similarity between the two
houses.
● This interpretation suggests that House 1 and House 2
are somewhat similar in terms of their size and number
of bedrooms.
● In summary, the interpretation of Euclidean distance in
measuring similarity involves understanding that smaller
distances imply greater similarity, while larger distances
imply greater dissimilarity between data points or
objects.

97
Example 2:

● Example: Suppose we have two data points representing items in a dataset:
● Data Point 1:
● Feature 1: 10
● Feature 2: 5
● Feature 3: 7
● Data Point 2:
● Feature 1: 8
● Feature 2: 4
● Feature 3: 6

98
99
● In this example, the Euclidean distance between Data
Point 1 and Data Point 2 is approximately 2.45. Since
this distance is relatively small, we can interpret it as
indicating a moderate similarity between the two data
points. This interpretation suggests that Data Point 1 and
Data Point 2 are somewhat similar in terms of their
features.
● In summary, when using Euclidean distance to measure
similarity, smaller distances imply greater similarity, while
larger distances imply greater dissimilarity between data
points.

100
Example for Dissimilarity

● Example: Suppose we have two products, Product A and Product B, with the following
attributes:
● Product A:
● Price: $20
● Size: 10 units
● Product B:
● Price: $25
● Size: 15 units

101
102
● In this example, the Euclidean distance between
Product A and Product B is approximately 7.071.
Since this distance is relatively large, we can
interpret it as indicating a significant dissimilarity
between the two products in terms of their price
and size.
● This illustrates how dissimilarity measures, such
as the Euclidean distance, can be used to
quantify the differences between products based
on their attributes.

103
● The threshold value for distance to determine similarity or
dissimilarity between data points depends on the specific context
and the nature of the data. There is no universal threshold value
that applies to all scenarios. Instead, the threshold is typically
determined based on the objectives of the analysis, the
characteristics of the data, and domain knowledge.
● Example:
● Domain Knowledge: Understanding the domain and the specific
problem at hand can help in determining an appropriate threshold
value. For example, in some applications, a small distance might be
considered similar enough, while in others, a larger distance might
be acceptable.
● Application Requirements: The choice of threshold often depends
on the requirements of the application. For instance, in clustering
algorithms, a threshold is used to decide when to stop merging
clusters, while in anomaly detection, a threshold is used to identify
outliers.
104
Common Properties of a Distance

● A distance function, often referred to as a metric, must satisfy certain properties to be considered
valid. These properties ensure that the distance
measure behaves appropriately and consistently
in various contexts.

105
106
