Data Pre-Processing


Lecture 02

Data Pre-processing
Data Pre-processing Phase

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data.
Data pre-processing is the activity of preparing data to improve its quality so that it can be mined.

The overall process:

Input Data --> Data Preprocessing (understand domain; clean, integrate, transform & select) --> Data Mining --> Post-processing (filter patterns; visualization; pattern interpretation) --> Information


Data Preprocessing
Data Preprocessing Tasks
Data preprocessing consists of the following tasks:
Task 1: Understand the application domain and formulate the task
Task 2: Data Cleaning
Task 3: Data Transformation
Task 4: Data Integration
Task 5: Data Selection
Data Pre-processing Phase

Task 1: Understanding Domain


Understanding Domain
This step includes learning the relevant prior knowledge and the goals of the end user of the discovered knowledge.
Domain knowledge consists of information about the data that is already available, either through some other discovery process or from a domain expert.
 Other tasks include: determining the data mining tool, estimating project cost and completion time, addressing legal issues, and developing a maintenance plan.
Understanding Domain

Examples of items that need to be understood during the pre-processing phase:
1. Data sources
2. Types of data
3. Types of attributes
Sources of Data in the Domain
 There is a need to understand where data will be extracted from.
 Examples of data sources: data marts, data warehouses, and operational data stores.
Elements of input data

1. An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
2. Attribute values are numbers or symbols assigned to an attribute, e.g. Single, Yes, No.
3. Records: a collection of attributes. A record is also known as an instance, example, case, sample, entity, or object.

Example (each column is an attribute, each row a record):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of Attributes
 There are different types of attributes:
1. Integer: positive and negative whole numbers. Examples: ID numbers, zip codes.
2. Real: values represent a quantity along a continuous line; includes integers, fractions, and decimals. Examples: height, weight.
3. Ordinal: values can be ranked or describe order, but the magnitude between successive values is not known. Examples: size in {small, medium, large}, grades, army rankings, height in {tall, medium, short}.
Types of Attributes
4. Interval: values are equidistant from one another. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
5. Nominal: values are assigned to well-defined categories. The name 'nominal' comes from the Latin nomen, meaning 'name'; values are differentiated by a named category. Examples: employees, marital status, set of countries.
Types of Attributes

6. Ratio: values have an absolute zero and cannot go below zero. Values can be compared as multiples of one another, e.g. one person can be twice as tall as another. Ratio data can also be multiplied and divided.
 Examples:
 A person's weight
 The number of pizzas I can eat before fainting
Types of Attributes

7. Discrete: has only a fixed (finite or countably infinite) set of values, usually represented as integer variables. Examples: age in years (not microseconds), the set of words in a collection of documents.
 Note: binary attributes are a special case of discrete attributes.
 Typically, categorical and ordinal attributes are discrete, while interval and ratio attributes are continuous.

8. Continuous: values are measured along a continuous scale that can be divided into fractions. Examples: real values such as temperature, height, and weight.
Typically, interval and ratio attributes are continuous.
Types of Attributes
9. Asymmetric: only the presence of a non-zero value is important.
 The outcomes are not equally important.
 Example 1: an HIV test (positive vs. negative), or registration for a particular course.
 The presence of the HIV virus (HIV positive) is more important than its absence.
 This type of attribute is considered important in association analysis.
Types of Attributes

10. Asymmetric attributes (continued):
Example 2: transaction data for association rule discovery.
 "Bread", "Coke", etc. are in fact (asymmetric) attributes, and only their presence (i.e. value 1 or true) is important.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Types of Attributes

11. Symmetric: all outcomes are equally important.
Example: gender (male or female).
 Typically, both discrete and continuous attributes can be either asymmetric or symmetric.
Types of Data

There are various types of data that can be pre-processed:
1. Spatial data
2. Multimedia data
3. Time-series data
4. Ordered data
5. Graph data
6. Record data
1. Spatial Data

Spatial data (geospatial data) is data about the locations and shapes of geographic features and the relationships between them, usually stored as coordinates, links, and nodes.
Data mining of spatial data may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas at various altitudes, etc.
Example: patterns from mining Japanese earthquakes, 1961-1994.
2. Multimedia data

Multimedia data includes video, images, audio, and text media. It can be stored in object-oriented databases or on a file system.
Data mining of multimedia data may require computer vision, computer graphics, image interpretation, and natural language processing techniques.
(Figures: example image and video.)
3. Time-series data

Time-series data refers to sequences of values that change with time.
Such data requires the study of trends and correlations between the evolution of different variables; e.g., stock exchange data can be mined to uncover trends in investment strategies.
4. Ordered data

Ordered data refers to data with sequences, where the order of the items carries meaning.
(Example database shown as a figure in the original slides.)
5. Graph data

Graph data is data with relationships among objects.
 Example 1: HTML links. Web search engines collect and process Web pages to extract their contents. Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus must also be taken into consideration.
(Figure: a small graph of linked pages.)
6. Record Data

 Record data consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
6. Record Data

Examples of record data:
1. Data matrix
2. Document data
3. Transaction data
6. Record Data

(a) Data Matrix: a table representing a data set with m rows, one for each object, and n columns, one for each attribute.
 This applies when data objects have the same fixed set of numeric attributes, so that objects can be represented as points in a multi-dimensional space, where each dimension represents a distinct attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
6. Record Data

(b) Document Data: a type of record data where each document is represented as a vector of terms; each term is an attribute of the vector, and the value of each attribute is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
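The mapping from documents to term-frequency vectors can be sketched in a few lines of Python (a minimal sketch; the two sample documents and the resulting vocabulary are made-up examples, not the table's data):

```python
from collections import Counter

# Two toy documents as whitespace-separated terms (made-up examples).
docs = ["team play play score game", "coach ball lost lost lost"]

# The shared vocabulary: every distinct term, in a fixed (sorted) order.
vocab = sorted({term for doc in docs for term in doc.split()})

def term_freq_vector(doc, vocab):
    # Value of each attribute = number of times the term occurs in the document.
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in vocab]

vectors = [term_freq_vector(d, vocab) for d in docs]
```

Each row of `vectors` is one record; terms absent from a document simply get a count of 0.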
6. Record Data

(c) Transaction Data: a type of record data where each record represents a transaction involving a set of items.
 For example, consider a grocery store or supermarket. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
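Transaction data can be turned into binary record data, with each item becoming an asymmetric attribute: 1 if the item appears in the basket, 0 otherwise. A minimal sketch, reusing the Bread/Coke/Milk transactions from the asymmetric-attribute example earlier:

```python
# Each transaction is the set of items in one basket.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Fixed attribute order: every distinct item across all transactions.
items = sorted(set().union(*transactions))

# One binary row per transaction: only presence (value 1) is informative.
binary_rows = [[1 if item in t else 0 for item in items] for t in transactions]
```

This is the representation that association-rule algorithms typically work from.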
Data Preprocessing Tasks

Task 2: Data Cleaning

Motivations for data cleaning:
Data cleaning involves handling the following problems in the data:
1. Missing values
2. Noise
3. Inconsistencies
1. Missing Values

 Missing values are values that are not available in the data.
 Data with missing values is called incomplete data.
 Causes of missing values include:
1. Equipment malfunction
2. Data deleted because it was found inconsistent with other recorded data during data integration
3. Data not entered at the time of entry
2. Noisy Data

 Noise is random error, distortion of original values, or outliers in the data.
 Data with noise is called noisy data.
 Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen.
2. Noisy Data

Outliers are examples of noise with characteristics that are considerably different from most of the other data objects in the data set.
3. Inconsistencies

Inconsistencies arise when different data items are represented by the same name in different systems, or when the same data item is represented by different names in different systems.
Data with inconsistencies is called inconsistent data.
Problems with data consistency also exist when data originates from a single application system.

Example:
 An insurance company offers car insurance. A field identifying "auto_type" seems innocent enough, but it turns out that the labels entered into the system ("Merc", "Mercedes", "M-Benz", and "Mrcds") all represent the same manufacturer.
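One common fix is a hand-built mapping from each known variant to a canonical label. A minimal sketch based on the auto_type example above (the mapping itself is something an analyst would have to compile, not something the data provides):

```python
# Map every known spelling variant to one canonical manufacturer name.
canonical = {
    "Merc": "Mercedes",
    "Mercedes": "Mercedes",
    "M-Benz": "Mercedes",
    "Mrcds": "Mercedes",
}

def normalize(label):
    # Leave labels we have no rule for unchanged.
    return canonical.get(label, label)

cleaned = [normalize(v) for v in ["Merc", "Mrcds", "Toyota"]]
```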
Causes of Noise and Inconsistencies

 Faulty instruments for data collection


 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
 Differences in naming conventions or data codes (e.g.,
2/5/2002 could be 2 May 2002 or 5 Feb 2002)
Major Tasks in Data Cleaning

a) Fill in missing values
b) Smooth noisy data
c) Resolve inconsistencies
(a) Filling Missing Values

There are several techniques for filling missing values:
(1) Ignore the tuple: recommended only when few attributes have missing values. Especially poor when the percentage of missing values per attribute varies considerably.
(2) Use a global constant to fill in the missing value: replace all missing attribute values with the same constant, such as the label "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept. Hence, although this method is simple, it is not recommended.
(3) Use the attribute mean to fill in the missing value: for example, suppose that the average income of AllElectronics customers is $28,000; use this value to replace missing income values.
(a) Filling Missing Values

(4) Combined computer and human inspection: use the computer to detect suspicious values, have a human check them, then fill them in manually. This is very tedious.
(5) Use the most probable value to fill in the missing value: this may be determined with inference-based tools such as decision tree induction.
Method 5 is a popular strategy because it uses the most information from the present data to predict missing values, by considering the values of the other attributes when estimating the missing value.
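Technique 3 (fill with the attribute mean) can be sketched in a few lines (a minimal sketch; the income values are made-up, and None marks a missing entry):

```python
from statistics import mean

# Made-up income column; None marks a missing value.
incomes = [24200, None, 45390, None, 30000]

# Mean of the known values is used as the fill constant.
known = [v for v in incomes if v is not None]
fill = mean(known)

filled = [fill if v is None else v for v in incomes]
```

The same pattern works with the median or mode, which are less sensitive to outliers than the mean.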
Exercise

 How do you handle the following missing data?

Age  Income  Team     Gender
23   24,200  Red Sox  M
39   ?       Yankees  F
45   45,390  ?        F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates based on the global value distribution.
E.g., put the average income here, or the most probable income given that the person is 39 years old.
E.g., put the most frequent team here.
Missing Data Exercise 2

Historical bank account totals:

Name         SSN          Address                 Phone #       Date        Acct Total
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2200.12
John W. Doe               Bedford, MA                           7/15/2000   12000.54
John Doe     111-22-3333                                        8/22/2001   2000.33
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/22/2002  15333.22
Jim Smith    222-33-4444  2 Oak St, Boston, MA    222-333-4444              12333.66
Jim Smith    222-33-4444  2 Oak St, Boston, MA    222-333-4444

How should we handle this?
(b) Smoothing Noisy Data

 Smoothing noisy data involves removing errors and outliers.
 There are several techniques:
1. Binning: binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins (local smoothing).
(b) Smoothing Noisy Data

Example of Binning:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
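The bin-means smoothing above can be sketched directly, using equi-depth bins of depth 3 on the same price data:

```python
from statistics import mean

# Sorted prices from the example above.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3  # equi-depth: each bin holds the same number of values

# Split into consecutive bins of `depth` values each.
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Replace every value in a bin by that bin's mean.
smoothed = [[mean(b)] * len(b) for b in bins]
```

This reproduces the slide's result: bin means 9, 22, and 29.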
(b) Smoothing Noisy Data

(2) Clustering: outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters may be considered outliers.
 Outliers are data objects with characteristics considerably different from most of the other data objects in the data set.
(Figure: examples of outliers.)
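A minimal sketch of outlier flagging. Note this substitutes a simpler statistical rule (distance from the mean, measured in standard deviations) for full clustering, and the 2-standard-deviation threshold is an arbitrary assumption:

```python
from statistics import mean, stdev

# Made-up 1-D data: six values near 11-13 and one far-away point.
values = [10, 12, 11, 13, 12, 11, 95]

m, s = mean(values), stdev(values)

# Flag anything more than 2 standard deviations from the mean (assumed cutoff).
outliers = [v for v in values if abs(v - m) > 2 * s]
```

A real clustering approach (e.g. k-means plus distance-to-nearest-centroid) generalizes this idea to multi-dimensional data.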
(b) Smoothing Noisy Data

(3) Combined computer and human inspection: outliers may be identified through a combination of computer and human inspection.

(4) Curve fitting: data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the "best" line to fit two variables, so that one variable, say X, can be used to predict the other, Y:
Y = a0 + a1 * X
Multiple linear regression is an extension of linear regression in which more than two variables are involved:
Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn
Data that does not fit the curve is considered noise.
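A least-squares fit of Y = a0 + a1 * X, with residual-based noise flagging, can be sketched as follows (the data points and the residual threshold of 20 are made-up assumptions):

```python
from statistics import mean

# Made-up data: four points near the line y = 2x, plus one that does not fit.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 60, 8, 10]  # the third point is the misfit

# Ordinary least squares for a0 (intercept) and a1 (slope).
mx, my = mean(xs), mean(ys)
a1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a0 = my - a1 * mx

# Points whose residual exceeds the (assumed) threshold are treated as noise.
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
noisy = [i for i, r in enumerate(residuals) if abs(r) > 20]
```

In practice the threshold would itself be derived from the residual distribution rather than fixed by hand.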
(b) Smoothing Noisy Data

(Figure: example of linear regression, fitting the line y = x + 1 to points of salary (y) against age (x).)
Noisy Data Example

 Historical bank account totals:

Name         SSN          Address                 Phone #       Date        Acct Total
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2200.12
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2233.67
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/22/2002  15333.22
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/23/2003  15333000.00

How should we handle this?


(c) Resolving Inconsistencies

 Inconsistencies in data are resolved using the following techniques:
1. Data transformation techniques
2. Data integration techniques
3. Data selection techniques
Data Preprocessing Tasks

Task 3: Data Transformation


Task 3: Data Transformation

Data transformation involves converting data into another format that is appropriate for mining. This can be done through:
1. Generalization: low-level (raw) data are replaced by higher-level concepts through the use of concept hierarchies. E.g., a categorical attribute like street can be generalized to a higher-level concept like city or country. Similarly, values of a numeric attribute like age may be mapped to higher-level concepts like young, middle-aged, and senior.
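Generalizing a numeric age to the concepts young / middle-aged / senior can be sketched as follows (the cut-off ages of 35 and 60 are assumptions for illustration, not values from the slides):

```python
# Map a raw age to a higher-level concept (assumed cut-offs: 35 and 60).
def age_concept(age):
    if age < 35:
        return "young"
    elif age < 60:
        return "middle aged"
    return "senior"

generalized = [age_concept(a) for a in [23, 39, 45, 71]]
```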
Task 3: Data Transformation

2. Attribute construction (or feature construction): new attributes are constructed from the given set of attributes and added to help the mining process.
3. Data type conversion: change the type of the data.
Task 4: Data Integration

 Data integration combines data from multiple sources into a coherent data source, such as in data warehousing. These sources may include multiple databases or flat files.
 There are a number of issues to consider:
 1. Schema integration: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. E.g., the use of metadata or ontologies can ensure that customer_id in one database and cust_number in another refer to the same entity.
 2. Redundancy: an attribute may be redundant if it can be derived from another table, such as annual revenue. Inconsistencies in attribute naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis.
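Correlation analysis for redundancy detection can be sketched with the Pearson coefficient: a coefficient near 1 (or -1) suggests one attribute can be derived from the other. A minimal sketch, where the monthly/annual revenue columns are made-up and annual is exactly 12 times monthly:

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric attributes.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]  # derivable: annual = 12 * monthly

r = pearson(monthly, annual)  # perfectly correlated -> redundant attribute
```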
Task 4: Data Integration

3. Detection and resolution of data value conflicts (differences in representation or encoding):
For example, a weight attribute may be stored in different metric units. Such semantic heterogeneity of data poses great challenges in data integration.
Data Preprocessing Tasks

Task 5: Data Selection


Task 5: Data Selection

 Data selection (reduction) can reduce the data size by selecting important features and eliminating redundant ones.
 The aim is to improve mining efficiency while maintaining the integrity of the original data, leading to the same (or almost the same) analytical results.
Task 5: Data Selection

Data selection (reduction) strategies:
1. Dimension reduction: irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
2. Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels.
3. Numerosity reduction: the data are replaced by alternative, smaller data representations, such as those produced by sampling methods.
Task 5: Data Selection

Sampling methods
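Simple random sampling, one of the numerosity-reduction strategies above, can be sketched as follows (a minimal sketch; the data set and the 10% sample size are made-up):

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

# Made-up data set of 1000 records.
data = list(range(1000))

# Simple random sampling without replacement: keep 10% of the records.
sample = random.sample(data, 100)
```

Stratified sampling (sampling within groups) is a common refinement when some classes are rare.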