
BAFPRED MODULE 2

DATA
PREPROCESSING
Prepared by:
ARPEE C. ARRUEJO
Learning
Outcomes
01 Introduce basic concepts in data pre-processing
02 Introduce methods of data pre-processing
03 Introduce data cleaning
04 Introduce data integration

What is
Data
Preprocessing?
Data preprocessing is an important step in data analytics. It aims at
assessing and improving the quality of data for secondary statistical
analysis. With this, the data is better understood and the data analysis is
performed more accurately and efficiently.
What is
Data Preprocessing?
Data in the real world is dirty:

incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., occupation=“ ”

noisy: containing errors or outliers
e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names
e.g., Age=“42” but Birthday=“03/07/1997”
e.g., rating was “1,2,3”, now rating is “A, B, C”
e.g., discrepancy between duplicate records
Why is
Data Dirty?
Incomplete data may come from:
“Not applicable” data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems

Noisy data (incorrect values) may come from:
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission

Inconsistent data may come from:
Different data sources
Functional dependency violation (e.g., modifying some linked data)

Duplicate records also need data cleaning
Why is
Data Preprocessing
Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
A data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprise the majority of
the work of building a data warehouse
Task for Data
PreProcessing
Several distinct steps are involved in preprocessing
data. Here are the general steps taken to pre-process
data:

Data cleaning
o This step deals with missing data, noise, outliers, and
duplicate or incorrect records while minimizing the
introduction of bias into the database.
o Data is cleansed through processes such as filling in
missing values, smoothing the noisy data, or resolving
the inconsistencies in the data.

Data integration
o Extracted raw data can come from heterogeneous sources or
be in separate datasets. This step reorganizes the various raw
datasets into a single dataset that contains all the information
required for the desired statistical analyses.
o Involves integration of multiple databases, data cubes, or
files.
o Data with different representations are put together and
conflicts within the data are resolved.
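For illustration, a minimal pandas sketch of these two steps might look like the following; the file names customers.csv and transactions.csv and the cust_id and age columns are hypothetical, chosen only for the example:

```python
import pandas as pd

# Data cleaning: fill in missing values and drop duplicate records
customers = pd.read_csv("customers.csv")                              # hypothetical file
customers["age"] = customers["age"].fillna(customers["age"].mean())   # fill with the attribute mean
customers = customers.drop_duplicates()

# Data integration: combine a second source into a single, consistent dataset
transactions = pd.read_csv("transactions.csv")                        # hypothetical file
combined = customers.merge(transactions, on="cust_id", how="inner")
print(combined.head())
```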
Several distinct steps are involved in preprocessing
data. Here are the general steps taken to pre-process
data:

Data transformation
o This step translates and/or scales variables stored in a variety
of formats or units in the raw data into formats or units that are
more useful for the statistical methods that the researcher wants
to use.
o Data is normalized, aggregated and generalized.

Data reduction
o After the dataset has been integrated and transformed, this step
removes redundant records and variables, as well as reorganizes the
data in an efficient and “tidy” manner for analysis.
o Pertains to obtaining a reduced representation in volume that
produces the same or similar analytical results.
o This step aims to present a reduced representation of the data in a
data warehouse.
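As a small illustration of these two steps (the branch, amount, and notes columns below are hypothetical), data can be aggregated to a more useful level and redundant records and variables dropped:

```python
import pandas as pd

# Hypothetical raw data: one row per transaction
sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "B"],
    "amount": [120.0, 80.0, 200.0, 200.0, 50.0],
    "notes":  ["", "", "", "", ""],   # an irrelevant variable
})

# Data transformation: aggregate transaction-level amounts to branch level
per_branch = sales.groupby("branch", as_index=False)["amount"].sum()

# Data reduction: remove duplicate records and drop the redundant variable
reduced = sales.drop_duplicates().drop(columns=["notes"])

print(per_branch)
print(reduced)
```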
Pre-Processing
Is sometimes iterative and may involve repeating this
series of steps until the data are satisfactorily
organized for the purpose of statistical analysis.
During preprocessing, one needs to take care not to
accidentally introduce bias by modifying the dataset in
ways that will impact the outcome of statistical
analyses. Similarly, we must avoid reaching
statistically significant results through “trial and error”
analyses on differently pre-processed versions of a
dataset.
Methods of Data Pre-Processing
Data preprocessing consists of a series of steps to transform data
extracted from different data sources into “clean” data prior to
statistical analysis. Data pre-processing includes data cleaning, data
integration, data transformation, and data reduction.
What is
Data Integration?
It is the process of combining data derived from various data sources (such as databases,
flat files, etc.) into a consistent dataset. In data integration, data from the different sources,
as well as the metadata - the data about this data - from the different sources, are integrated to
come up with a single data store. There are a number of issues to consider during data
integration, related mostly to possibly different standards among data sources. These issues
include the entity identification problem, data value conflicts, and redundant data. Careful
integration of the data from multiple sources may help reduce or avoid redundancies and
inconsistencies and improve data mining speed and quality.
Data Integration
(Extract, Transform, Load)
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Data Redundancy
in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different names in different
databases
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue
Redundant attributes can be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
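As a quick sketch of correlation analysis (the column names are hypothetical), values close to 1 or -1 in the correlation matrix flag attributes that are near-duplicates of one another:

```python
import pandas as pd

# Hypothetical integrated dataset where monthly_revenue is derivable from annual_revenue
df = pd.DataFrame({
    "annual_revenue":  [120, 240, 360, 480],
    "monthly_revenue": [10, 20, 30, 40],
    "employees":       [5, 9, 20, 12],
})

# Correlation analysis: a correlation near 1 between two attributes suggests redundancy
print(df.corr())
```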
Four Types of Data
Integration Methodologies
1. Inner Join - creates a new result table by combining column values of
two tables (A and B) based upon the join-predicate.
2. Left Join - returns all the values from an inner join plus all values in
the left table that do not match the right table, including rows with
NULL (empty) values in the link column.
3. Right Join - returns all the values from the right table and matched
values from the left table (NULL in the case of no matching join
predicate).
4. Outer Join - the union of all the left join and right join values.
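In pandas, these four join types correspond to the how argument of DataFrame.merge; the small tables A and B below, linked by a cust_id column, are hypothetical:

```python
import pandas as pd

# Two hypothetical tables, A and B, linked by cust_id
A = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
B = pd.DataFrame({"cust_id": [2, 3, 4], "balance": [500, 250, 900]})

inner = A.merge(B, on="cust_id", how="inner")   # only matching cust_id values
left  = A.merge(B, on="cust_id", how="left")    # all of A, NULLs where B has no match
right = A.merge(B, on="cust_id", how="right")   # all of B, NULLs where A has no match
outer = A.merge(B, on="cust_id", how="outer")   # union of the left and right joins

print(outer)
```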
Data Integration
Types
Learning
Outcomes
01 Introduce data transformation
02 Apply standardization techniques
03 Apply Min-Max Normalization
04 Apply z-score
What is
Data Transformation?
It is a process of transforming data from one format to another. It
aims to transform the data values into a format, scale or unit that
is more suitable for analysis. Data transformation is an
important step in data preprocessing and a prerequisite for doing
predictive analytic solutions.
Here are a few common possible
options for data transformation:

1. Normalization - a way to scale a specific variable to fall
within a small specific range.

2. Encoding and Binning
a) Binning - the process of transforming numerical
variables into categorical counterparts.
b) Encoding - the process of transforming
categorical values to binary or numerical
counterparts, e.g., treat male or female for gender as
1 or 0. Data encoding is needed because some data
mining methodologies, such as Linear Regression,
require all data to be numerical.
1. Normalization
a) min-max normalization - transforming values to a new
scale such that all attributes fall within a standardized
range.
b) Z-score standardization - transforming a numerical
variable to a standard normal distribution.
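A minimal pandas sketch of both rescalings, assuming a hypothetical numeric column named salary:

```python
import pandas as pd

df = pd.DataFrame({"salary": [20000, 35000, 50000, 80000, 120000]})  # hypothetical values

# Min-max normalization: rescale values to the [0, 1] range
df["salary_minmax"] = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())

# Z-score standardization: center on the mean and scale by the standard deviation
df["salary_zscore"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()

print(df)
```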
2. Encoding and
Binning (Binning)
i) Equal-width (distance) partitioning - divides the range
into N intervals of equal size, thus forming a uniform grid.
ii) Equal-depth (frequency) partitioning - divides the range
into N intervals, each containing approximately the same
number of samples.
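As a sketch, pandas provides pd.cut for equal-width binning and pd.qcut for equal-depth binning; the ages series below is hypothetical:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67])  # hypothetical numerical variable

# Equal-width binning: 4 intervals of equal size across the range of values
equal_width = pd.cut(ages, bins=4)

# Equal-depth (frequency) binning: 4 intervals with roughly the same number of samples
equal_depth = pd.qcut(ages, q=4)

print(equal_width.value_counts(), equal_depth.value_counts(), sep="\n")
```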
2. Encoding and
Binning (Encoding)
i) Binary Encoding (Unsupervised)
• Transformation of categorical variables by taking the values 0 or 1
to indicate the absence or presence of each category.
• If the categorical variable has k categories, we would need to create
k binary variables.
ii) Class-based Encoding (Supervised)
• Discrete Class - Replace the categorical variable with just one new
numerical variable and replace each category of the categorical
variable with its corresponding probability of the class variable.
2. Encoding and
Binning (Encoding)
iii) Continuous Class - Replace the categorical
variable with just one new numerical variable and
replace each category of the categorical variable
with its corresponding average of the class
variable.
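For illustration, here is a minimal pandas sketch of binary (one-hot) encoding and continuous-class encoding; the city and price columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Manila", "Cebu", "Manila", "Davao", "Cebu"],  # categorical variable
    "price": [100, 80, 120, 60, 90],                         # continuous class variable
})

# Binary (one-hot) encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Continuous-class encoding: replace each category with the average of the class variable
city_means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_means)

print(one_hot)
print(df)
```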
Learning
Outcomes
01 Introduce data cleaning
02 Data reduction
03 Applying Python
What is
Data
Cleaning?
All data sources potentially include errors and missing values – data cleaning
addresses these anomalies. Data cleaning is the process of altering data in a
given storage resource to make sure that it is accurate and correct. Data
cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data, as well as resolve
redundancy caused by data integration.
Data Cleaning Tasks:

a) Fill in missing values
Solutions for handling missing data:
i. Ignore the tuple
ii. Fill in the missing value manually
iii. Data Imputation
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class

b) Cleaning noisy data
Solutions for cleaning noisy data:
i. Binning - transforming numerical values into categorical components
ii. Clustering - grouping data into the corresponding cluster and using the cluster average to represent a value
iii. Regression - utilizing a simple regression line to estimate a very erratic data set
iv. Combined computer and human inspection - detecting suspicious values and checking them through human intervention

c) Identifying outliers
Solutions for identifying outliers:
i. Box plot
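A minimal pandas sketch of the three imputation options above, assuming hypothetical segment (class) and income columns:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],          # hypothetical class variable
    "income":  [40000, None, 55000, None, 65000],  # attribute with missing values
})

# Data imputation with a global constant
df["income_const"] = df["income"].fillna(0)

# Data imputation with the attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Data imputation with the attribute mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(df.groupby("segment")["income"].transform("mean"))

print(df)
```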
What is
Data Reduction
and
Manipulation?
Data reduction is a process of obtaining a reduced representation of the
data set that is much smaller in volume yet produces the same (or almost
the same) analytical results. The need for data reduction emerged from the
fact that some databases/data warehouses may store terabytes of data, and
complex data analysis/mining may take a very long time to run on the
complete data set.
Data Reduction
Strategies:
a. Sampling - utilizing a smaller representative or sample from the
big data set or population that will generalize to the entire population.
i. Types of Sampling
1. Simple Random Sampling - there is an equal probability of
selecting any particular item.
2. Sampling without replacement - as each item is selected, it
is removed from the population.
3. Sampling with replacement - objects are not removed from
the population as they are selected for the sample.
4. Stratified sampling - split the data into several partitions, then
draw random samples from each partition.
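A minimal pandas sketch of these sampling types (the stratum and value columns are hypothetical; groupby(...).sample requires pandas 1.1 or later):

```python
import pandas as pd

df = pd.DataFrame({
    "stratum": ["A"] * 6 + ["B"] * 4,   # hypothetical partition variable
    "value":   range(10),
})

simple_random = df.sample(n=4, random_state=1)                    # simple random sampling without replacement
with_replacement = df.sample(n=4, replace=True, random_state=1)   # sampling with replacement

# Stratified sampling: draw a random fraction of rows from each partition
stratified = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=1)

print(stratified)
```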
Data Reduction
Strategies:
b. Feature Subset Selection - reduces the dimensionality of data by
eliminating redundant and irrelevant features.
i. Feature Subset Selection Techniques
1. Brute-force approach - try all possible feature subsets as
input to the data mining algorithm.
2. Embedded approaches - feature selection occurs naturally
as part of the data mining algorithm.
3. Filter approaches - features are selected before the data mining
algorithm is run.
4. Wrapper approaches - use the data mining algorithm as a
black box to find the best subset of attributes.
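As a sketch of a simple filter approach (the columns and the 0.95 threshold are hypothetical choices for the example), one feature from any highly correlated pair is dropped before mining:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # redundant with height_cm
    "weight_kg": [50, 90, 60, 100, 70],
})

# Filter approach: drop one feature from each pair whose absolute correlation exceeds 0.95
corr = df.corr().abs()
to_drop = set()
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > 0.95:
            to_drop.add(col_b)

reduced = df.drop(columns=list(to_drop))
print(reduced.columns.tolist())
```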
Data Reduction
Strategies:
c. Feature Creation - creating new attributes that can
capture the important information in a data set much more
efficiently than the original attributes.
i. Feature Creation Methodologies
1. Feature Extraction
2. Mapping Data to New Space
3. Feature Construction
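A small sketch of feature construction (the total_spend and n_purchases columns are hypothetical), where one derived attribute summarizes the information in two originals:

```python
import pandas as pd

df = pd.DataFrame({
    "total_spend":  [1200, 300, 4500],
    "n_purchases":  [12, 3, 30],
})

# Feature construction: a single derived attribute that captures spending behaviour
df["avg_spend_per_purchase"] = df["total_spend"] / df["n_purchases"]
print(df)
```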
Learning
Outcomes
01 Learn basics of Python for data preprocessing
02 Importing Libraries
03 Importing Datasets
04 Interpretation of results
Change
the game !
Think about a topic on how you can apply
predictive analytics and apply the CRISP-DM
framework.

Thank you for listening!

END OF MODULE 2: DATA PREPROCESSING
