Data Mining Transformations

Need for data preprocessing

• Data preprocessing is important in data mining because real-world data is often imperfect, noisy, and incompatible, and therefore requires additional processing before mining.
• The complexity of this process depends on the scope of data collection and the complexity of the required results.
• Data preprocessing is required because real-world data are generally:
• Incomplete: Missing attribute values, missing
certain attributes of importance, or having
only aggregate data
• Noisy: Containing errors or outliers
• Inconsistent: Containing discrepancies in
codes or names
• Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, etc.
• The product of data preprocessing is the final
training set.
• Data preprocessing may affect the way in
which outcomes of the final data processing
can be interpreted.
Steps in Data preprocessing:

1. Data cleaning:
• Data cleaning is also called data cleansing or scrubbing.
• It fills in missing values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies, as sketched below.
• Data cleaning is required because source
systems contain “dirty data” that must be
cleaned.
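
A minimal sketch (not from the source) of these cleaning steps with pandas, assuming a hypothetical table with "age" and "income" columns:

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 200, 31],        # None = missing value, 200 = outlier
    "income": [32000, 41000, None, 58000, 39000],
})

df["age"] = df["age"].fillna(df["age"].median())      # fill in missing values
df["income"] = df["income"].fillna(df["income"].mean())
df = df[df["age"].between(0, 120)]                    # remove implausible outliers
print(df)
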
Steps in Data cleaning:
1.1 Parsing:
• Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
• An example is parsing a full name into its first, middle, and last name.
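
A small illustrative sketch of such parsing, assuming names arrive in the form "First Middle Last" (real parsers use more robust rules and reference data):

def parse_name(full_name):
    parts = full_name.strip().split()
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])     # everything between first and last
    return {"first": first, "middle": middle, "last": last}

print(parse_name("Mary Jane Watson"))
# {'first': 'Mary', 'middle': 'Jane', 'last': 'Watson'}
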
1.2 Correcting:
• Correct parsed individual data components
using sophisticated data algorithms and
secondary data sources.
• Examples include replacing a vanity address and adding a missing zip code.
1.3 Standardizing:
• Standardizing applies conversion routines to transform data into its preferred and consistent format using both standard and custom business rules.
• Examples include adding a pre-name (such as Mr. or Ms.) and replacing a nickname with the formal name.
1.4 Matching:
• Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
• Examples include identifying similar names
and addresses.
1.5 Consolidating:
• Analyzing and identifying relationships
between matched records and
consolidating/merging them into one
representation.
1.6 Data cleansing must deal with many types
of possible errors:
• These include missing data and incorrect data
at one source.
1.7 Data Staging:
• Accumulates data from asynchronous sources.
• At a predefined cutoff time, data in the staging
file is transformed and loaded to the
warehouse.
• There is usually no end user access to the
staging file.
• An operational data store may be used for
data staging.
2. Data integration and Transformation:
• Data integration: Combines data from multiple sources into a coherent data store, e.g., a data warehouse.
• Sources may include multiple databases, data
cubes or data files.
• Issues in data integration:
– Schema integration:
• Integrate metadata from different sources.
• Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust#.
– Detecting and resolving data value conflicts:
• For the same real world entity, attribute values from
different sources are different.
• Possible reasons: different representations, different
scales.
– Redundant data often occur when integrating multiple databases:
• The same attribute may have different names in
different databases.
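
A hedged sketch of these integration issues with pandas: two hypothetical sources name the customer key differently (cust_id vs. cust_no), so one name is chosen before the tables are combined into a single store:

import pandas as pd

a = pd.DataFrame({"cust_id": [1, 2], "city": ["Pune", "Delhi"]})
b = pd.DataFrame({"cust_no": [1, 2], "balance": [500, 800]})

b = b.rename(columns={"cust_no": "cust_id"})   # resolve the naming conflict
merged = a.merge(b, on="cust_id")              # integrate into one coherent table
print(merged)
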

• Data Transformation:
• The transformation process deals with rectifying any inconsistencies (if any).
• One of the most common transformation issues is ‘Attribute Naming Inconsistency’. It is common for a given data element to be referred to by different data names in different databases.
• E.g., Employee Name may be EMP_NAME in one database and ENAME in another.
• Thus one set of data names is picked and used consistently in the data warehouse.
• Once all the data elements have the right names, they must be converted to common formats.
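
A minimal sketch, using the slide's EMP_NAME / ENAME example (table contents are assumed), of picking one data name and converting the values to a common format:

import pandas as pd

db1 = pd.DataFrame({"EMP_NAME": ["alice k", "BOB  M"]})
db2 = pd.DataFrame({"ENAME":    ["Carol P"]})

db1 = db1.rename(columns={"EMP_NAME": "EMPLOYEE_NAME"})   # one consistent data name
db2 = db2.rename(columns={"ENAME": "EMPLOYEE_NAME"})

employees = pd.concat([db1, db2], ignore_index=True)
employees["EMPLOYEE_NAME"] = (employees["EMPLOYEE_NAME"]   # one common format
                              .str.strip()
                              .str.replace(r"\s+", " ", regex=True)
                              .str.title())
print(employees)
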
3. Data Reduction:
• Obtains a reduced representation of the data that is much smaller in volume yet produces the same or similar analytical results.
• Need for data reduction:
– Reducing the number of attributes
– Reducing the number of attribute values
– Reducing the number of tuples
4. Discretization and Concept Hierarchy Generation (or summarization):
• Discretization: Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
• Interval labels can then be used to replace actual data values.
• Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged or senior).
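
A hedged sketch of such a concept hierarchy for age (the interval boundaries below are assumed, not from the source): numeric ages are replaced by the higher-level labels young / middle-aged / senior.

import pandas as pd

ages = pd.Series([22, 25, 47, 52, 71])
labels = pd.cut(ages, bins=[0, 35, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle-aged', 'middle-aged', 'senior']
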
1. Smoothing:
• It is a process that is used to remove noise from the
dataset using some algorithms.
• It allows for highlighting important features present in
the dataset.
• It helps in predicting the patterns.
• When collecting data, it can be manipulated to eliminate or reduce variance or other forms of noise.
• The concept behind data smoothing is that it can identify simple changes to help predict different trends and patterns.
• This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not see otherwise.
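
A minimal smoothing sketch (the series below is made up): a 3-point moving average reduces noise so that the underlying trend is easier to see. Binning or regression could be used instead.

import pandas as pd

noisy = pd.Series([10, 14, 9, 15, 11, 16, 12])
smoothed = noisy.rolling(window=3, center=True).mean()   # 3-point moving average
print(smoothed.tolist())   # NaN at the ends, smoothed values in between
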
2. Aggregation:
• Data collection or aggregation is the method of storing
and presenting data in a summary format.
• The data may be obtained from multiple data sources
to integrate these data sources into a data analysis
description.
• This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and
quality of the data used.
• Gathering accurate data of high quality and a large
enough quantity is necessary to produce relevant
results.

• The collection of data is useful for everything from decisions concerning financing or the business strategy of the product to pricing, operations, and marketing strategies.
• For example, sales data may be aggregated to compute monthly and annual total amounts.
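
A hedged sketch of this example (the sales figures are assumed): daily sales aggregated into monthly and annual totals with pandas.

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-03-02"]),
    "amount": [1200, 800, 1500, 900],
})

monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual  = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)   # totals per month
print(annual)    # totals per year
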
3. Discretization:
• It is a process of transforming continuous data into a set of small intervals.
• Most data mining activities in the real world involve continuous attributes.
• Yet many of the existing data mining frameworks are unable to handle these attributes.
• Also, even if a data mining task can manage a continuous attribute, it can significantly improve its efficiency by replacing the continuous attribute with its discrete values.
• For example, age values may be grouped into intervals (1-10, 11-20, ...) or into categories (young, middle-aged, senior).
4. Attribute Construction:
• New attributes are constructed from the given set of attributes and added to assist the mining process. This simplifies the original data and makes the mining more efficient.
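
A small sketch of attribute construction (the column names below are assumed): a new attribute is derived from existing ones to help the mining step.

import pandas as pd

orders = pd.DataFrame({"quantity": [3, 5, 2], "unit_price": [100, 40, 250]})
orders["total_price"] = orders["quantity"] * orders["unit_price"]   # constructed attribute
print(orders)
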
5. Generalization:
• It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old).
• Similarly, categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for normalization are:
• Min-Max Normalization: This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and that [new_min_A, new_max_A] is the new range.
• We have the formula:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

• Where v is the value you want to plot in the new range.
• v’ is the new value you get after normalizing the old
value.
• Solved example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs. 100,000. We want to plot the profit in the range [0, 1]. Using min-max normalization, the value of Rs. 20,000 for attribute profit can be plotted to:

v' = ((20,000 - 10,000) / (100,000 - 10,000)) * (1 - 0) + 0 = 10,000 / 90,000 ≈ 0.111
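
A minimal check of this worked example in Python, assuming the min-max formula shown above and the [0, 1] target range:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(min_max(20_000, 10_000, 100_000))   # 0.111...
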
• Z-Score Normalization: In z-score normalization (or zero-mean normalization) the values of an attribute A are normalized based on the mean of A and its standard deviation.
• A value, v, of attribute A is normalized to v' by computing:

v' = (v - mean_A) / std_A
• For example:
Let the mean of attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P can be transformed to:

v' = (85,000 - 60,000) / 10,000 = 2.5
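
A minimal check of this z-score example in Python, assuming the mean and standard deviation given above:

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(z_score(85_000, 60_000, 10_000))   # 2.5
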
• Decimal Scaling:
It normalizes the values of an attribute by changing the position of their decimal points.
The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A.
A value, v, of attribute A is normalized to v' by computing:

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1.
For example:
Suppose the values of an attribute P vary from -99 to 99. The maximum absolute value of P is 99.
To normalize the values we divide each number by 100 (i.e., j = 2, the number of digits in the largest value), so that the values come out to be 0.99, 0.98, 0.97 and so on.
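
A hedged sketch of decimal scaling in Python (the input values are assumed): j is chosen as the smallest integer such that every scaled value has absolute value below 1.

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:        # smallest j with max(|v'|) < 1
        j += 1
    return [v / 10**j for v in values], j

scaled, j = decimal_scale([-99, 97, 98, 99])
print(j, scaled)   # 2 [-0.99, 0.97, 0.98, 0.99]
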
