Main Topic 2 - Data Preprocessing Part 1


Main Topic 2: Data Pre-Processing (Part 1)
Kristine Grace B. Estilo, MSIT
Course Facilitator, IT 211 - DBMS
Objectives

• Learn the importance of data pre-processing.

• Learn simple data mining techniques that turn raw data gathered from different sources into cleaner information for machine learning.

Why do we need to Pre-Process the Data?

• Much of the raw data contained in databases is unprocessed, incomplete, and noisy. For example, databases may contain:
  • Fields that are obsolete or redundant
  • Missing values
  • Outliers
  • Data in a form not suitable for data mining models
  • Values not consistent with policy or common sense
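Two of the problems listed above, missing values and outliers, can be spotted with a quick check before any cleaning starts. The sketch below is only an illustration: the helper names, the sample values, and the 1.5 × IQR rule for outliers are assumptions and are not taken from the slides.

```python
from statistics import quantiles


def count_missing(values):
    """Count entries that are None or blank strings."""
    return sum(
        1 for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )


def iqr_outliers(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed choice; other outlier rules exist.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in observed if v < low or v > high]


# Hypothetical sample fields, made up for this example.
genders = ["M", "F", None, "", "F"]
incomes = [30_000, 40_000, 45_000, 50_000, 55_000,
           60_000, 75_000, 90_000, 10_000_000]

missing_genders = count_missing(genders)
suspect_incomes = iqr_outliers(incomes)
```

A flagged value is only a candidate for review; whether it is really an error is a decision for the analyst.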
Why do we need to Pre-Process the Data?

• Databases need to undergo preprocessing in the form of data cleaning and data transformation.

• The main objective of data preprocessing is to minimize garbage in, garbage out (GIGO): the less garbage that gets into our chosen model, the less garbage the model gives out.
Data Cleaning


• Zip code: inconsistent formats
• Gender: missing values
• Income: outliers (10,000,000)
• Age: non-numeric values
• Transaction amount: no currency specified

Data Cleaning – Solution:

• Choose the proper zip code format based on the country.
• Supply a value for every missing value.
• The income of 10,000,000.00 is not plausible; treat it as an outlier and review it.
• Convert the age values to numeric and omit the text values.
• Apply the proper currency to the transaction amounts.
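The cleaning steps above can be sketched in a few small functions. This is a minimal illustration, not the course's implementation: the field names, the 5-digit zero-padded zip format, and the 1,000,000 income limit are assumptions chosen for the example.

```python
from typing import Optional


def clean_zip(raw: str, width: int = 5) -> Optional[str]:
    """Zero-pad a numeric zip code to `width` digits (assumed format).

    Returns None when the value is not purely numeric or is too long.
    """
    text = raw.strip()
    if not (text.isascii() and text.isdigit()) or len(text) > width:
        return None
    return text.zfill(width)


def clean_age(raw: str) -> Optional[int]:
    """Convert an age to int; text values become None (omitted)."""
    text = raw.strip()
    return int(text) if text.isascii() and text.isdigit() else None


def is_suspect_income(income: float, upper_limit: float = 1_000_000) -> bool:
    """Flag negative incomes or incomes above an assumed upper limit."""
    return income < 0 or income > upper_limit


# Hypothetical records, made up for this example.
records = [
    {"zip": "10048", "age": "45", "income": 75_000},
    {"zip": "6269", "age": "C", "income": 10_000_000},
]

cleaned = [
    {
        "zip": clean_zip(r["zip"]),
        "age": clean_age(r["age"]),
        "income_suspect": is_suspect_income(r["income"]),
    }
    for r in records
]
```

Flagging the income instead of deleting it keeps the record available for review, which matters because an unusual value is not always a wrong one.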
Handling Missing Data:

• Missing data is a problem that continues to plague data analysis methods.

• We continue to encounter missing values in fields, especially in databases with a large number of fields.

• All things being equal, having more data is almost always better, so missing values should not simply be ignored.
Handling Missing Data – Solution:

• Some common criteria for choosing replacement values for missing data are as follows:
  • Replace the missing values with some constant, specified by the data analyst.
  • Replace the missing values with the field mean (for numeric variables) or the mode (for categorical variables).
  • Replace the missing values with a value generated at random from the observed distribution of the variable.
  • Replace the missing values with imputed values based on the other characteristics of the record.
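The four replacement criteria above can be sketched with the Python standard library. Missing values are represented as None. The function names and sample data are hypothetical, and the last function uses a per-group mean as one simple stand-in for imputation based on other characteristics of the record.

```python
import random
from statistics import mean, mode


def fill_constant(values, constant):
    """Replace missing values with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Replace missing numeric values with the mean of observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_mode(values):
    """Replace missing categorical values with the most common value."""
    observed = [v for v in values if v is not None]
    m = mode(observed)
    return [m if v is None else v for v in values]


def fill_random(values, rng):
    """Replace each missing value with a random draw from observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def fill_group_mean(records, target, by):
    """Impute `target` from the mean of records sharing the same `by` value."""
    groups = {}
    for r in records:
        if r[target] is not None:
            groups.setdefault(r[by], []).append(r[target])
    filled = []
    for r in records:
        value = r[target]
        if value is None and r[by] in groups:
            value = mean(groups[r[by]])
        filled.append({**r, target: value})
    return filled


# Hypothetical sample data, made up for this example.
ages = [20, None, 40]
genders = ["F", None, "F", "M"]
rows = [
    {"gender": "F", "income": 50},
    {"gender": "F", "income": None},
    {"gender": "M", "income": 80},
]

ages_mean = fill_mean(ages)
genders_mode = fill_mode(genders)
ages_random = fill_random(ages, random.Random(0))
rows_imputed = fill_group_mean(rows, target="income", by="gender")
```

Mean and mode replacement are simple but make the field look less variable than it really is; drawing at random from the observed values keeps more of the original spread.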