Data Pre-Processing


Lecture 02

Data Pre-processing
Data Pre-processing Phase

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data.
Data pre-processing is the activity of preparing data to improve its quality so that it can be mined.

The overall process:

Input Data --> Data Preprocessing (understand domain; clean, integrate, transform & select) --> Data Mining --> Post-processing (filter patterns; visualization; pattern interpretation) --> Information


Data Preprocessing
Data Preprocessing Tasks
Data preprocessing consists of the following tasks:
Task 1: Understand the application domain and formulate the task
Task 2: Data Cleaning
Task 3: Data Transformation
Task 4: Data Integration
Task 5: Data Selection
Data Pre-processing Phase

Task 1: Understanding Domain


Understanding Domain
This step includes learning the relevant prior knowledge and the goals of the end user of the discovered knowledge.
Domain knowledge consists of information about the data that is already available, either through some other discovery process or from a domain expert.
 Other tasks include: determining the data mining tool, estimating project cost and completion time, addressing legal issues, and developing a maintenance plan.
Understanding Domain

Examples of items that need to be understood during the pre-processing phase:
1. Data sources
2. Types of data
3. Types of attributes
Sources of Data in the Domain
 There is a need to understand where data will be extracted from.
 Examples of data sources: data marts, data warehouses, and operational data stores.
Elements of input data

1. An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
2. Attribute values are numbers or symbols assigned to an attribute, e.g. Single, Yes, No.
3. Records: a collection of attributes. A record is also known as an instance, example, case, sample, entity, or object.

Example (each column is an attribute, each row a record):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of Attributes
 There are different types of attributes:
1. Integer: positive and negative whole numbers. Examples: ID numbers, zip codes.
2. Real: values represent a quantity along a continuous line; includes integers, fractions, and decimals. Examples: height, weight.
3. Ordinal: values can be ranked or describe order, but the magnitude between successive values is not known. Examples: size in {small, medium, large}, grades, army rankings, height in {tall, medium, short}.
Types of Attributes
4. Interval: values are equidistant from one another. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
5. Nominal: values are assigned to well-defined categories. The name 'nominal' comes from the Latin nomen, meaning 'name'; values are differentiated by a named category. Examples: employees, marital status, set of countries.
Types of Attributes

6. Ratio: values have an absolute zero and cannot go below zero. Values can be compared as multiples of one another, e.g. one person can be twice as tall as another. Ratio data can also be multiplied and divided.
 Examples:
 A person's weight
 The number of pizzas I can eat before fainting
Types of Attributes

7. Discrete: has only a fixed (finite or countably infinite) set of values, usually represented as integer variables. Examples: age in years (not microseconds), the set of words in a collection of documents.
 Note: binary attributes are a special case of discrete attributes.
 Typically, categorical and ordinal attributes are discrete, while interval and ratio attributes are continuous.

8. Continuous: values are measured along a continuous scale that can be divided into fractions. Examples: real values such as temperature, height, and weight.
Typically, interval and ratio attributes are continuous.
Types of Attributes
9. Asymmetric: only the presence of a non-zero value is important.
 The outcomes are not equally important.
 Example 1: an HIV test (positive vs. negative), or registration for a particular course.
 The presence of the HIV virus (HIV positive) is more important than its absence.
 This type of attribute is considered important in association analysis.
Types of Attributes

10. Asymmetric attributes (continued):
Example 2: transaction data for association rule discovery.
 "Bread", "Coke", etc. are in fact (asymmetric) attributes, and only their presence (i.e. value 1 or true) is important.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Types of Attributes

11. Symmetric: all outcomes are equally important.
Example: gender (male or female).
 Typically, both discrete and continuous attributes can be either asymmetric or symmetric.
Types of Data

There are various types of data that can be pre-processed:
1. Spatial data
2. Multimedia data
3. Time-series data
4. Ordered data
5. Graph data
6. Record data
1. Spatial Data

Spatial data (geospatial data) is data about the locations and shapes of geographic features and the relationships between them, usually stored as coordinates, links, and nodes.
Data mining of spatial data may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas at various altitudes, etc.
Example: patterns from mining Japanese earthquakes, 1961-1994.
2. Multimedia data

Multimedia data includes video, images, audio, and text media. It can be stored in object-oriented databases or on a file system.
Data mining of multimedia data may require computer vision, computer graphics, image interpretation, and natural language processing techniques.
(Figures: example image and video.)
3. Time-series data

Time-series data refers to sequences of values that change with time.
Such data requires the study of trends and correlations between the evolution of different variables; e.g., stock exchange data can be mined to uncover trends in investment strategies.
4. Ordered data

Ordered data refers to data with sequences, where the order of the items carries meaning.
(Example database shown as a figure in the original slides.)
5. Graph data

Graph data is data with relationships among objects.
 Example 1: HTML links. Web search engines collect and process Web pages to extract their contents. Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus must also be taken into consideration.
(Figure: a small graph of linked pages.)
6. Record Data

 Record data consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
6. Record Data

Examples of record data:
1. Data matrix
2. Document data
3. Transaction data
6. Record Data

(a) Data Matrix: a table representing a data set with m rows, one for each object, and n columns, one for each attribute.
 This applies when data objects have the same fixed set of numeric attributes, so that objects can be represented as points in a multi-dimensional space, where each dimension represents a distinct attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
6. Record Data

(b) Document Data: a type of record data where each document is represented as a vector of terms; each term is an attribute of the vector, and the value of each attribute is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
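The mapping from documents to term-frequency vectors can be sketched in a few lines of Python (a minimal sketch; the two sample documents and the resulting vocabulary are made-up examples, not the table's data):

```python
from collections import Counter

# Two toy documents as whitespace-separated terms (made-up examples).
docs = ["team play play score game", "coach ball lost lost lost"]

# The shared vocabulary: every distinct term, in a fixed (sorted) order.
vocab = sorted({term for doc in docs for term in doc.split()})

def term_freq_vector(doc, vocab):
    # Value of each attribute = number of times the term occurs in the document.
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in vocab]

vectors = [term_freq_vector(d, vocab) for d in docs]
```

Each row of `vectors` is one record; terms absent from a document simply get a count of 0.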
6. Record Data

(c) Transaction Data: a type of record data where each record represents a transaction involving a set of items.
 For example, consider a grocery store or supermarket. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
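Transaction data can be turned into binary record data, with each item becoming an asymmetric attribute: 1 if the item appears in the basket, 0 otherwise. A minimal sketch, reusing the Bread/Coke/Milk transactions from the asymmetric-attribute example earlier:

```python
# Each transaction is the set of items in one basket.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Fixed attribute order: every distinct item across all transactions.
items = sorted(set().union(*transactions))

# One binary row per transaction: only presence (value 1) is informative.
binary_rows = [[1 if item in t else 0 for item in items] for t in transactions]
```

This is the representation that association-rule algorithms typically work from.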
Data Preprocessing Tasks

Task 2: Data Cleaning

Motivations for data cleaning:
Data cleaning involves handling the following problems in the data:
1. Missing values
2. Noise
3. Inconsistencies
1. Missing Values

 Missing values are values that are not available in the data.
 Data with missing values is called incomplete data.
 Causes of missing values include:
1. Equipment malfunction
2. Data deleted because it was found inconsistent with other recorded data during data integration
3. Data not entered at the time of entry
2. Noisy Data

 Noise is random error, distortion of original values, or outliers in the data.
 Data with noise is called noisy data.
 Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen.
2. Noisy Data

Outliers are examples of noise with characteristics that are considerably different from most of the other data objects in the data set.
3. Inconsistencies

Inconsistencies arise when different data items are represented by the same name in different systems, or when the same data item is represented by different names in different systems.
Data with inconsistencies is called inconsistent data.
Problems with data consistency also exist when data originates from a single application system.

Example:
 An insurance company offers car insurance. A field identifying "auto_type" seems innocent enough, but it turns out that the labels entered into the system ("Merc", "Mercedes", "M-Benz", and "Mrcds") all represent the same manufacturer.
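One common fix is a hand-built mapping from each known variant to a canonical label. A minimal sketch based on the auto_type example above (the mapping itself is something an analyst would have to compile, not something the data provides):

```python
# Map every known spelling variant to one canonical manufacturer name.
canonical = {
    "Merc": "Mercedes",
    "Mercedes": "Mercedes",
    "M-Benz": "Mercedes",
    "Mrcds": "Mercedes",
}

def normalize(label):
    # Leave labels we have no rule for unchanged.
    return canonical.get(label, label)

cleaned = [normalize(v) for v in ["Merc", "Mrcds", "Toyota"]]
```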
Causes of Noise and Inconsistencies

 Faulty instruments for data collection


 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
 Differences in naming conventions or data codes (e.g.,
2/5/2002 could be 2 May 2002 or 5 Feb 2002)
Major Tasks in Data Cleaning

a) Fill in missing values
b) Smooth noisy data
c) Resolve inconsistencies
(a) Filling Missing Values

There are several techniques for filling missing values:
(1) Ignore the tuple: recommended only when few attributes have missing values. Especially poor when the percentage of missing values per attribute varies considerably.
(2) Use a global constant to fill in the missing value: replace all missing attribute values with the same constant, such as the label "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept. Hence, although this method is simple, it is not recommended.
(3) Use the attribute mean to fill in the missing value: for example, suppose that the average income of AllElectronics customers is $28,000; use this value to replace missing income values.
(a) Filling Missing Values

(4) Combined computer and human inspection: use the computer to detect suspicious values, have a human check them, then fill them in manually. This is very tedious.
(5) Use the most probable value to fill in the missing value: this may be determined with inference-based tools such as decision tree induction.
Method 5 is a popular strategy because it uses the most information from the present data to predict missing values, by considering the values of the other attributes when estimating the missing value.
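Technique 3 (fill with the attribute mean) can be sketched in a few lines (a minimal sketch; the income values are made-up, and None marks a missing entry):

```python
from statistics import mean

# Made-up income column; None marks a missing value.
incomes = [24200, None, 45390, None, 30000]

# Mean of the known values is used as the fill constant.
known = [v for v in incomes if v is not None]
fill = mean(known)

filled = [fill if v is None else v for v in incomes]
```

The same pattern works with the median or mode, which are less sensitive to outliers than the mean.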
Exercise

 How do you handle the following missing data?

Age  Income  Team     Gender
23   24,200  Red Sox  M
39   ?       Yankees  F
45   45,390  ?        F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates based on the global value distribution.
E.g., put the average income here, or the most probable income given that the person is 39 years old.
E.g., put the most frequent team here.
Missing Data Exercise 2

Historical bank account totals:

Name         SSN          Address                 Phone #       Date        Acct Total
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2200.12
John W. Doe               Bedford, MA                           7/15/2000   12000.54
John Doe     111-22-3333                                        8/22/2001   2000.33
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/22/2002  15333.22
Jim Smith    222-33-4444  2 Oak St, Boston, MA    222-333-4444              12333.66
Jim Smith    222-33-4444  2 Oak St, Boston, MA    222-333-4444

How should we handle this?
(b) Smoothing Noisy Data

 Smoothing noisy data involves removing errors and outliers.
 There are several techniques:
1. Binning: binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins (local smoothing).
(b) Smoothing Noisy Data

Example of Binning:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
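The bin-means smoothing above can be sketched directly, using equi-depth bins of depth 3 on the same price data:

```python
from statistics import mean

# Sorted prices from the example above.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3  # equi-depth: each bin holds the same number of values

# Split into consecutive bins of `depth` values each.
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Replace every value in a bin by that bin's mean.
smoothed = [[mean(b)] * len(b) for b in bins]
```

This reproduces the slide's result: bin means 9, 22, and 29.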
(b) Smoothing Noisy Data

(2) Clustering: outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters may be considered outliers.
 Outliers are data objects with characteristics considerably different from most of the other data objects in the data set.
(Figure: examples of outliers.)
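A minimal sketch of outlier flagging. Note this substitutes a simpler statistical rule (distance from the mean, measured in standard deviations) for full clustering, and the 2-standard-deviation threshold is an arbitrary assumption:

```python
from statistics import mean, stdev

# Made-up 1-D data: six values near 11-13 and one far-away point.
values = [10, 12, 11, 13, 12, 11, 95]

m, s = mean(values), stdev(values)

# Flag anything more than 2 standard deviations from the mean (assumed cutoff).
outliers = [v for v in values if abs(v - m) > 2 * s]
```

A real clustering approach (e.g. k-means plus distance-to-nearest-centroid) generalizes this idea to multi-dimensional data.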
(b) Smoothing Noisy Data

(3) Combined computer and human inspection: outliers may be identified through a combination of computer and human inspection.

(4) Curve fitting: data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the "best" line to fit two variables, so that one variable, say X, can be used to predict the other, Y:
Y = a0 + a1 * X
Multiple linear regression is an extension of linear regression in which more than two variables are involved:
Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn
Data that does not fit the curve is considered noise.
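A least-squares fit of Y = a0 + a1 * X, with residual-based noise flagging, can be sketched as follows (the data points and the residual threshold of 20 are made-up assumptions):

```python
from statistics import mean

# Made-up data: four points near the line y = 2x, plus one that does not fit.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 60, 8, 10]  # the third point is the misfit

# Ordinary least squares for a0 (intercept) and a1 (slope).
mx, my = mean(xs), mean(ys)
a1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a0 = my - a1 * mx

# Points whose residual exceeds the (assumed) threshold are treated as noise.
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
noisy = [i for i, r in enumerate(residuals) if abs(r) > 20]
```

In practice the threshold would itself be derived from the residual distribution rather than fixed by hand.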
(b) Smoothing Noisy Data

(Figure: example of linear regression, fitting the line y = x + 1 to points of salary (y) against age (x).)
Noisy Data Example

 Historical bank account totals:

Name         SSN          Address                 Phone #       Date        Acct Total
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2200.12
John Doe     111-22-3333  1 Main St, Bedford, MA  111-222-3333  2/12/1999   2233.67
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/22/2002  15333.22
James Smith  222-33-4444  2 Oak St, Boston, MA    222-333-4444  12/23/2003  15333000.00

How should we handle this?


(c) Resolving Inconsistencies

 Inconsistencies in data are resolved using the following techniques:
1. Data transformation techniques
2. Data integration techniques
3. Data selection techniques
Data Preprocessing Tasks

Task 3: Data Transformation


Task 3: Data Transformation

Data transformation involves converting data into another format that is appropriate for mining. This can be done through:
1. Generalization: low-level (raw) data are replaced by higher-level concepts through the use of concept hierarchies. E.g., a categorical attribute like street can be generalized to a higher-level concept like city or country. Similarly, values of a numeric attribute like age may be mapped to higher-level concepts like young, middle-aged, and senior.
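Generalizing a numeric age to the concepts young / middle-aged / senior can be sketched as follows (the cut-off ages of 35 and 60 are assumptions for illustration, not values from the slides):

```python
# Map a raw age to a higher-level concept (assumed cut-offs: 35 and 60).
def age_concept(age):
    if age < 35:
        return "young"
    elif age < 60:
        return "middle aged"
    return "senior"

generalized = [age_concept(a) for a in [23, 39, 45, 71]]
```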
Task 3: Data Transformation

2. Attribute construction (or feature construction): new attributes are constructed from the given set of attributes and added to help the mining process.
3. Data type conversion: change the type of the data.
Task 4: Data Integration

 Data integration combines data from multiple sources into a coherent data source, such as in data warehousing. These sources may include multiple databases or flat files.
 There are a number of issues to consider:
 1. Schema integration: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. E.g., the use of metadata or ontologies can ensure that customer_id in one database and cust_number in another refer to the same entity.
 2. Redundancy: an attribute may be redundant if it can be derived from another table, such as annual revenue. Inconsistencies in attribute naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis.
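Correlation analysis for redundancy detection can be sketched with the Pearson coefficient: a coefficient near 1 (or -1) suggests one attribute can be derived from the other. A minimal sketch, where the monthly/annual revenue columns are made-up and annual is exactly 12 times monthly:

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric attributes.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]  # derivable: annual = 12 * monthly

r = pearson(monthly, annual)  # perfectly correlated -> redundant attribute
```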
Task 4: Data Integration

3. Detection and resolution of data value conflicts (differences in representation or encoding):
For example, a weight attribute may be stored in different metric units. Such semantic heterogeneity of data poses great challenges in data integration.
Data Preprocessing Tasks

Task 5: Data Selection


Task 5: Data Selection

 Data selection (reduction) can reduce the data size by selecting important features and eliminating redundant ones.
 The aim is to improve mining efficiency while maintaining the integrity of the original data, leading to the same (or almost the same) analytical results.
Task 5: Data Selection

Data selection (reduction) strategies:
1. Dimension reduction: irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
2. Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels.
3. Numerosity reduction: the data are replaced by alternative, smaller data representations, such as those produced by sampling methods.
Task 5: Data Selection

Sampling methods
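Simple random sampling, one of the numerosity-reduction strategies above, can be sketched as follows (a minimal sketch; the data set and the 10% sample size are made-up):

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

# Made-up data set of 1000 records.
data = list(range(1000))

# Simple random sampling without replacement: keep 10% of the records.
sample = random.sample(data, 100)
```

Stratified sampling (sampling within groups) is a common refinement when some classes are rare.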