Bafpred Module 2 Week 5 6
DATA
PREPROCESSING
Prepared by:
ARPEE C. ARRUEJO
Learning Outcomes
Several distinct steps are involved in preprocessing data. Here are the general steps taken to pre-process data:
Data cleaning
o This step deals with missing data, noise, outliers, and duplicate or incorrect records while minimizing introduction of bias into the database.
o Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.
Data integration
o Extracted raw data can come from heterogeneous sources or be in separate datasets. This step reorganizes the various raw datasets into a single dataset that contains all the information required for the desired statistical analyses.
o Involves integration of multiple databases, data cubes, or files.
o Data with different representations are put together and conflicts within the data are resolved.
Data transformation
o This step translates and/or scales variables stored in a variety of formats or units in the raw data into formats or units that are more useful for the statistical methods that the researcher wants to use.
o Data is normalized, aggregated and generalized.
Data reduction
o After the dataset has been integrated and transformed, this step removes redundant records and variables, as well as reorganizes the data in an efficient and “tidy” manner for analysis.
o Pertains to obtaining reduced representation in volume but produces the same or similar analytical results.
o This step aims to present a reduced representation of the data in a data warehouse.
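The integration, transformation, and reduction steps can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names (`id`, `amount_php`, `region`), and the helper functions are hypothetical and do not come from the module, and min-max scaling is just one possible way to normalize.

```python
# Hypothetical toy records from two separate sources, keyed by customer id.
sales = [
    {"id": 1, "amount_php": 1200.0},
    {"id": 2, "amount_php": 300.0},
    {"id": 2, "amount_php": 300.0},   # duplicate record
    {"id": 3, "amount_php": 750.0},
]
profiles = {1: {"region": "North"}, 2: {"region": "South"}, 3: {"region": "North"}}


def integrate(sales_rows, profile_by_id):
    """Data integration: attach the profile fields to each sales record."""
    return [{**row, **profile_by_id.get(row["id"], {})} for row in sales_rows]


def min_max_scale(rows, column):
    """Data transformation: rescale a numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low or 1.0   # avoid division by zero when all values are equal
    return [{**row, column + "_scaled": (row[column] - low) / span} for row in rows]


def drop_duplicates(rows):
    """Data reduction: keep only the first copy of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


prepared = drop_duplicates(min_max_scale(integrate(sales, profiles), "amount_php"))
for row in prepared:
    print(row)
```

The functions are applied in the same order as the steps listed above: integrate first, then transform, then reduce.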
Pre-Processing
Pre-processing is sometimes iterative and may involve repeating this
series of steps until the data are satisfactorily
organized for the purpose of statistical analysis.
During preprocessing, one needs to take care not to
accidentally introduce bias by modifying the dataset in
ways that will impact the outcome of statistical
analyses. Similarly, we must avoid reaching
statistically significant results through “trial and error”
analyses on differently pre-processed versions of a
dataset.
Methods of Data Pre-Processing
2. Encoding and Binning (Encoding)
i) Binary Encoding (Unsupervised)
• Transformation of categorical variables by taking the values 0 or 1 to indicate the absence or presence of each category.
• If the categorical variable has k categories, we would need to create k binary variables.
ii) Class-based Encoding (Supervised)
• Discrete Class - Replace the categorical variable with just one new numerical variable and replace each category of the categorical variable with its corresponding probability of the class variable.
iii) Continuous Class - Replace the categorical variable with just one new numerical variable and replace each category of the categorical variable with its corresponding average of the class variable.
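The encodings described above can be illustrated with a small plain-Python sketch. The data (`colors`, `bought`, `spend`) and the function names are hypothetical and chosen only for illustration. In this sketch one function covers both class-based cases, because the mean of a 0/1 class variable is the class probability.

```python
from collections import defaultdict

# Hypothetical toy data: one categorical variable and two class variables.
colors = ["red", "blue", "red", "green", "blue", "red"]
bought = [1, 0, 1, 0, 1, 0]                  # discrete (0/1) class
spend = [10.0, 4.0, 14.0, 6.0, 8.0, 12.0]    # continuous class


def binary_encode(values):
    """Create one 0/1 indicator per category (k categories -> k variables)."""
    categories = sorted(set(values))
    return [{cat: int(v == cat) for cat in categories} for v in values]


def class_based_encode(values, target):
    """Replace each category with the mean of the class variable for that category.

    With a 0/1 target the mean is the class probability (discrete class);
    with a numeric target it is the class average (continuous class).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, target):
        totals[v] += t
        counts[v] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    return [means[v] for v in values]


print(binary_encode(colors)[0])            # indicators for the first record
print(class_based_encode(colors, bought))  # probability of bought == 1 per color
print(class_based_encode(colors, spend))   # average spend per color
```

Binary encoding turns the single `colors` variable into three indicator variables, while class-based encoding keeps a single numerical variable in both cases.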
Applying Python
C. What is Data Cleaning?
All data sources potentially include errors and missing values; data cleaning
addresses these anomalies. Data cleaning is the process of altering data in a
given storage resource to make sure that it is accurate and correct. Data
cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data, as well as resolve
redundancy caused by data integration.
Data Cleaning Tasks:
a) Fill in missing values
Solutions for handling missing data:
i. Ignore the tuple
ii. Fill in the missing value manually
iii. Data Imputation
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class
b) Cleaning noisy data
Solutions for cleaning noisy data:
i. Binning - transforming numerical values into categorical components
ii. Clustering - grouping data into corresponding clusters and using the cluster average to represent a value
iii. Regression - utilizing a simple regression line to estimate a very erratic data set
iv. Combined computer and human inspection - detecting suspicious values and checking them through human intervention
c) Identifying outliers
Solutions for identifying outliers:
i. Box plot
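Several of the cleaning tasks above can be sketched with the Python standard library. This is a minimal illustration, not the module's own method. The sample values and function names are hypothetical, equal-width binning is assumed as one simple way to bin, and the box plot is represented by the common 1.5 × IQR rule that defines its whiskers.

```python
from statistics import mean, quantiles

# Hypothetical toy data: None marks a missing value.
ages = [23, None, 31, 27, None, 45]
labels = ["a", "a", "b", "a", "b", "b"]   # class of each sample


def impute_constant(values, constant):
    """Fill missing values with a global constant."""
    return [constant if v is None else v for v in values]


def impute_mean(values):
    """Fill missing values with the attribute mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, classes):
    """Fill missing values with the attribute mean of samples in the same class."""
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: mean(vs) for c, vs in by_class.items()}
    return [fills[c] if v is None else v for v, c in zip(values, classes)]


def equal_width_bins(values, n_bins):
    """Binning: map each numeric value to a bin index from 0 to n_bins - 1."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0   # fall back to 1.0 if all values are equal
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def iqr_outliers(values, k=1.5):
    """Box-plot rule: flag values beyond k * IQR from the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < lower or v > upper]


print(impute_constant(ages, -1))
print(impute_mean(ages))
print(impute_class_mean(ages, labels))
print(equal_width_bins([1, 4, 5, 9, 10], 3))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))
```

Note that `statistics.quantiles` requires Python 3.8 or later, and that its default quartile method may give slightly different cut points than other tools.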
D. What is Data Reduction and Manipulation?
Data reduction is the process of obtaining a reduced representation of the
data set that is much smaller in volume yet produces the same (or almost
the same) analytical results. The need for data reduction emerged from the
fact that some databases or data warehouses may store terabytes of data, and
complex data analysis or mining may take a very long time to run on the
complete data set.
a. Sampling - utilizing a smaller representative sample from the
big data set or population that will generalize the entire population.
i. Types of Sampling
1. Simple Random Sampling - there is an equal probability of
selecting any particular item.
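Simple random sampling can be sketched with Python's built-in `random` module. The population of 1,000 record ids, the sample size of 50, and the seed are hypothetical values chosen for illustration.

```python
import random

# Hypothetical population of 1,000 record ids.
population = list(range(1000))

# A fixed seed is used only so that the sketch is repeatable.
rng = random.Random(42)

# sample() draws without replacement; every item has the same chance of selection.
sample = rng.sample(population, k=50)

print(len(sample), len(set(sample)))
```

Because the draw is without replacement, no record appears twice in the sample.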
3. Feature Construction