DATA PREPROCESSING Note
OLAP OPERATIONS
1. Roll-up
2. Drill-down
3. Slice
4. Dice
5. Pivot (Rotate)
1. Roll-up:
The roll-up operation performs aggregation on a data cube, in either of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
The figure shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for the dimension location. This hierarchy was defined as the total order "street < city < state < country." The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country, so the data are grouped into countries rather than cities. When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
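As a rough sketch (not part of the original notes), the city-to-country roll-up can be imitated with a pandas group-by on a small, hypothetical sales table:

import pandas as pd

# Hypothetical sales data held at the city level of the location hierarchy
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "amount":  [605, 825, 440, 1560],
})

# Roll-up: climb the hierarchy from city to country by aggregating
rollup = sales.groupby("country")["amount"].sum()
print(rollup)  # Canada 1430, USA 2000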
2. Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized in either of the following ways:
Stepping down a concept hierarchy for a dimension
Introducing additional dimensions
The figure shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as "day < month < quarter < year." Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. When drill-down is performed by introducing additional dimensions, one or more dimensions are added to the data cube.
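A minimal sketch of the same idea (hypothetical figures, assuming pandas):

import pandas as pd

# Hypothetical sales recorded at the month level of the time hierarchy
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "amount":  [100, 150, 200, 120, 180, 160],
})

# Coarse, quarter-level view
print(sales.groupby("quarter")["amount"].sum())

# Drill-down: descend from quarter to the more detailed month level
print(sales.groupby(["quarter", "month"], sort=False)["amount"].sum())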
3. Slice:
The slice operation performs a selection on one particular dimension of the given cube, providing a new subcube. The figure shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1".
4. Dice:
The dice operation defines a subcube by performing a selection on two or more dimensions, providing a new subcube. The figure shows a dice operation on the central cube based on the following selection criteria, which involve three dimensions:
(Location = “Toronto” or “Vancouver”)
(Time = “Q1” or “Q2”)
(Item = “home entertainment” or “computer”).
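Both selections can be sketched with boolean filters on a flat, hypothetical cube table (assuming pandas):

import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q3"],
    "location": ["Toronto", "Vancouver", "Toronto", "Chicago"],
    "item":     ["computer", "home entertainment", "computer", "phone"],
    "sales":    [605, 825, 680, 812],
})

# Slice: select on a single dimension (time = "Q1")
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on three dimensions at once, as in the criteria above
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["home entertainment", "computer"])]
print(dice)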
5. Pivot:
The pivot operation, also called rotation, is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. The figure shows a pivot operation where the item and location axes in a 2-D slice are rotated.
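In a table-oriented sketch (hypothetical data, assuming pandas), rotating the item and location axes is just a transpose of the 2-D view:

import pandas as pd

slice_2d = pd.DataFrame({
    "item":     ["computer", "computer", "phone", "phone"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "sales":    [605, 825, 440, 512],
})

# 2-D view: item on the rows, location on the columns
view = slice_2d.pivot(index="item", columns="location", values="sales")

# Pivot (rotate): swap the item and location axes
print(view.T)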
DATA MINING FUNCTIONALITIES
Association analysis
Given a set of records, each of which contains some number of items from a given collection, association analysis produces dependency rules that predict the occurrence of other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Milk
4     Beer, Bread, Egg, Milk
5     Coke, Milk, Egg
Rules Discovered:
{Milk} → {Coke}
{Bread, Milk} → {Coke}
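The support and confidence of these rules can be checked directly against the five transactions; a small Python sketch (the helper names are ours, not from the notes):

# Transactions from the table above, keyed by TID
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Milk"},
    4: {"Beer", "Bread", "Egg", "Milk"},
    5: {"Coke", "Milk", "Egg"},
}

def support_count(itemset):
    # Number of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions.values())

def rule_stats(lhs, rhs):
    both = support_count(lhs | rhs)
    return both / len(transactions), both / support_count(lhs)

print(rule_stats({"Milk"}, {"Coke"}))           # support 0.6, confidence 0.75
print(rule_stats({"Bread", "Milk"}, {"Coke"}))  # support 0.2, confidence 0.5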
Classification
It is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of future prediction.
The model is used to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (data objects whose class label is known).
The derived model may be represented in various forms, such as classification rules (IF-THEN), decision trees, mathematical formulae, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules.
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
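For example, a minimal decision-tree classifier sketch, assuming scikit-learn is available (the attributes and training data are invented for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, income] with known class labels
X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000]]
y = ["no", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The tree's tests and leaves convert directly into IF-THEN rules
print(export_text(tree, feature_names=["age", "income"]))

# Predict the class of an object whose class label is unknown
print(tree.predict([[30, 50000]]))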
Prediction
Prediction is used to predict unknown or missing numerical data values.
Prediction may refer to both:
*Numeric prediction
*Class label prediction
Regression analysis is a statistical methodology often used for numeric prediction.
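A small numeric-prediction sketch via regression (hypothetical data, assuming scikit-learn):

from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience -> salary
X = [[1], [2], [3], [4], [5]]
y = [30000, 35000, 41000, 46000, 50000]

model = LinearRegression().fit(X, y)

# Predict an unknown numeric value by regression analysis
print(model.predict([[6]]))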
Outlier analysis
A database may contain data objects that do not comply with the general behaviour or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
In some applications, however, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
Example: uncovering fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to the regular charges incurred by the same account.
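A toy sketch of the credit-card case using a simple statistical test (the charge amounts and the three-standard-deviation threshold are our own illustrative choices):

import statistics

# Hypothetical regular charges for one account, plus a new purchase
regular = [120, 95, 130, 110, 105, 90, 125]
new_charge = 5000

mean = statistics.mean(regular)
sd = statistics.stdev(regular)

# Flag the purchase if it lies far outside the account's usual behaviour
if abs(new_charge - mean) > 3 * sd:
    print("possible fraud:", new_charge)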
DATA MINING TASK PRIMITIVES
1. Task-relevant data to be mined:
This specifies the portions of the database or the set of data in which the user is interested.
2. Kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background knowledge:
This knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a
popular form of background knowledge, which allow data to be mined at multiple levels of
abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in the figure.
4. Interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate
the discovered patterns. Different kinds of knowledge may have different interestingness
measures. For example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-specified thresholds
are considered uninteresting.
1. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
*Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are distributed
into a number of “buckets,” or bins.
Binning is done by:
1. Dividing the sorted values into (equal-frequency) partitions, or bins.
2. Smoothing by bin means.
3. Smoothing by bin boundaries.
Example:
Sorted data : 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
1. Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
2. Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
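The same smoothing steps can be reproduced in a short Python sketch (plain lists, no libraries assumed):

# Equal-frequency binning of the sorted data above into 3 bins
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
size = len(data) // 3
bins = [data[i:i + size] for i in range(0, len(data), size)]

# Smoothing by bin means: each value becomes its bin's mean
print([[round(sum(b) / len(b))] * len(b) for b in bins])
# [[9, 9, 9], [22, 22, 22], [29, 29, 29]]

# Smoothing by bin boundaries: each value snaps to the nearest boundary
print([[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
       for b in bins])
# [[4, 4, 15], [21, 21, 24], [25, 25, 34]]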
*Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
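A least-squares line fit used as a smoother, sketched with NumPy (the observations are hypothetical):

import numpy as np

# Hypothetical noisy observations of attributes x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit the "best" (least-squares) line y = w*x + b
w, b = np.polyfit(x, y, deg=1)

# Smooth the data by replacing y with the fitted values on the line
print(w * x + b)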
*Clustering:
Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters”. Values that fall outside of the set of clusters may be
considered outliers.
2. Data Integration
Data integration combines data from multiple sources into a coherent data store, as in data warehousing.
These sources may include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration:
Entity identification problem:
This problem occurs when equivalent attributes derived from multiple sources have different names, so the system cannot tell which ones to match. For example, Customer_id in one database and Customer_no in another may refer to the same attribute, which can create ambiguity in the integrated store. Metadata can be used to help avoid this problem.
Redundancy:
Redundancy is another important issue. Information repeated across multiple sources can create confusion in the data and introduce redundancy. Some redundancies can be detected by correlation analysis. For numeric attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient.
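For instance, a quick check with NumPy's built-in Pearson correlation (hypothetical attribute values):

import numpy as np

# Hypothetical values of two numeric attributes A and B
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.1, 2.0, 2.9, 4.2, 5.0])

# Correlation coefficient r(A, B); a value near +1 or -1 suggests
# one attribute may be redundant given the other
print(np.corrcoef(A, B)[0, 1])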
3. Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
Data transformation can involve the following:
Smoothing: works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation: summary or aggregation operations are applied to the data.
Generalization: low-level data are replaced by higher-level concepts through the use of concept hierarchies.
Normalization: the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbour classification and clustering. There are many methods for data normalization, including the following.
min-max normalization:
Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing:
v' = ((v − minA) / (maxA − minA)) * (new_maxA − new_minA) + new_minA
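As a check, a direct translation into Python (the income bounds and the value 73,600 are illustrative assumptions, not from the notes):

# Min-max normalization of v from [min_a, max_a] into [new_min, new_max]
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# E.g., an income of 73,600 with min 12,000 and max 98,000 maps to ~0.716
print(min_max(73600, 12000, 98000))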
z-score normalization:
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing:
v' = (v − meanA) / std_devA
where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: suppose the mean and standard deviation of the values for an attribute income are 54,000 and 16,000, respectively. A value of income is then transformed by subtracting 54,000 and dividing by 16,000.
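The same transformation in a two-line sketch (the sample value 73,600 is our own illustration):

# Z-score normalization with mean 54,000 and SD 16,000 from the example
mean, sd = 54000, 16000

# A hypothetical income of 73,600 maps to (73600 - 54000) / 16000 = 1.225
print((73600 - mean) / sd)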
4. Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume. Strategies for data reduction include the following:
1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation