# Understanding DM Issues
These sources may include multiple databases, data cubes, or flat files. One of the
best-known implementations of data integration is building an enterprise data warehouse,
which enables a business to perform analyses based on the integrated data.
Data Preprocessing Forms
There are two major approaches to data integration – the “tight coupling
approach” and the “loose coupling approach”.
(i) Tight Coupling: Here, a data warehouse is treated as an information retrieval
component.
• In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
(ii) Loose Coupling: Here, an interface is provided that takes the query from the user,
transforms it into a form the source database can understand, and then sends the query
directly to the source databases to obtain the result.
• The data remains only in the actual source databases.
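The tight-coupling ETL steps above can be sketched in a few lines of Python. The sources, field names, and mapping below are illustrative assumptions, with in-memory lists standing in for source databases:

```python
# Minimal ETL sketch for tight coupling (hypothetical sources and fields).

# Extraction: pull rows from two source "databases" (in-memory stand-ins).
crm_rows = [{"customer_id": 1, "name": "Ada"}]
billing_rows = [{"cust_no": 1, "amount": 120.0}]

def transform(row, mapping):
    """Rename source fields to the warehouse schema."""
    return {mapping.get(k, k): v for k, v in row.items()}

# Transformation: unify both schemas on the key "customer_id".
unified = [transform(r, {"cust_no": "customer_id"}) for r in billing_rows]
unified += [transform(r, {}) for r in crm_rows]

# Loading: merge the transformed rows into one physical location.
warehouse = {}
for row in unified:
    warehouse.setdefault(row["customer_id"], {}).update(row)

print(warehouse[1])
```

Under loose coupling, by contrast, no such `warehouse` structure would exist; the query itself would be rewritten and forwarded to each source.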
• Issues in Data Integration: There are a number of issues to consider during data integration:
schema integration, redundancy, and detection and resolution of data value conflicts.
These are explained briefly below.
1. Schema Integration:
• Integrate metadata from different sources.
• Real-world entities from multiple sources must be matched; this is referred to as the
entity identification problem.
• For example, how can the data analyst or the computer be sure that customer_id in one
database and customer_number in another refer to the same attribute?
2. Redundancy:
• An attribute may be redundant if it can be derived from another attribute or set of
attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the “same”
attribute in another.
C. Data Transformation
Data transformation is the process of converting data or information from one format to
another, usually from the format of a source system into the format required by a new
destination system.
The process usually involves converting documents, but data conversions sometimes
involve converting a program from one computer language to another so that the program
can run on a different platform.
In data mining, data are transformed from one format into another that is more
appropriate for mining.
Data Transformation Strategies include:
1 Smoothing: Smoothing is the process of removing noise from the data.
2 Aggregation: Aggregation applies summary or aggregation operations to the data.
3 Generalization: In generalization, low-level data are replaced with high-level data by
climbing concept hierarchies.
4 Normalization: Normalization scales attribute data so that they fall within a small
specified range, such as 0.0 to 1.0.
5 Attribute Construction: In attribute construction, new attributes are constructed from the
given set of attributes.
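As a concrete instance of strategy 4, min-max normalization can be sketched as follows (the income figures are made up):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 98000]  # hypothetical attribute values
print(min_max(incomes))  # smallest maps to 0.0, largest to 1.0
```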
Data Transformation involves two key phases:
1. Data Mapping: The assignment of elements from the source system to the
destination system, capturing all transformations that occur. This becomes more
complicated when there are complex transformations, such as many-to-one or
one-to-many transformation rules.
2. Code Generation: The creation of the actual transformation program. The resulting data
map specification is used to create an executable program to run on computer
systems.
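A data-map specification with a many-to-one rule can be sketched like this (the field names and mapping are hypothetical):

```python
# Hypothetical data map: tuples of source fields -> one destination field.
# ("first_name", "last_name") -> "full_name" is a many-to-one rule.
data_map = {
    ("first_name", "last_name"): "full_name",
    ("dob",): "date_of_birth",
}

def apply_map(record, data_map):
    """Generated transformation: build each destination field from its sources."""
    return {dst: " ".join(str(record[f]) for f in srcs)
            for srcs, dst in data_map.items()}

row = {"first_name": "Ada", "last_name": "Lovelace", "dob": "1815-12-10"}
print(apply_map(row, data_map))
```

`apply_map` plays the role of the code-generation output: an executable program derived from the map specification.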
Commonly used transformational languages:
• Perl: A high-level procedural and object-oriented language capable of powerful
operations
• AWK: One of the oldest languages and a popular text transformation language
• XSLT: An XML data transformation language
• TXL: A prototyping language mostly used for source code transformation
• Template Languages and Processors: These specialize in data-to-document
transformation
D. Data Reduction
A database or data warehouse may store terabytes of data, so performing data analysis
and mining on such huge amounts of data may take a very long time.
Data reduction is the process of reducing the amount of capacity required to store data.
It can increase storage efficiency and reduce costs. Storage vendors often describe
storage capacity in terms of raw capacity and effective capacity, the latter referring
to data after reduction.
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume but still contains the critical information.
Data Reduction Strategies include:
1 Data Cube Aggregation: Aggregation operations are applied to the data in the
construction of a data cube.
Data cubes store multidimensional aggregated information. Each cell holds an aggregate
data value corresponding to a data point in multidimensional space.
Data cubes provide fast access to precomputed, summarised data, benefiting on-line
analytical processing as well as data mining.
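One cuboid of such a cube can be sketched with a plain dictionary (the sales records are made up):

```python
from collections import defaultdict

# Hypothetical fact table: two dimensions (region, year) and a sales measure.
sales = [
    ("east", 2023, 100), ("east", 2023, 50),
    ("east", 2024, 70), ("west", 2023, 30),
]

# Aggregation: precompute total sales per (region, year) cell of the cube.
cube = defaultdict(int)
for region, year, amount in sales:
    cube[(region, year)] += amount

print(cube[("east", 2023)])  # fast access to the precomputed aggregate
```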
2 Dimensionality Reduction: In dimensionality reduction, redundant attributes are detected
and removed, which reduces the data set size.
Dimensionality reduction is the process of reducing the number of random variables
under consideration by obtaining a set of principal variables. It can be divided into
feature selection and feature extraction.
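A toy feature-extraction sketch: two correlated attributes are projected onto a single principal variable. The data points are made up, and the closed-form eigenvector formula used here applies only to the 2-D case:

```python
import math

# Two correlated attributes per record; replace them with one principal variable.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]

# Centre the data.
mx = sum(x for x, _ in points) / len(points)
my = sum(y for _, y in points) / len(points)
centred = [(x - mx, y - my) for x, y in points]

# Entries of the 2x2 scatter matrix.
sxx = sum(x * x for x, _ in centred)
syy = sum(y * y for _, y in centred)
sxy = sum(x * y for x, y in centred)

# Direction of the leading eigenvector (closed form for the 2x2 case).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
ux, uy = math.cos(theta), math.sin(theta)

# One extracted feature per record replaces the two original attributes.
scores = [x * ux + y * uy for x, y in centred]
print([round(s, 2) for s in scores])
```

For higher-dimensional data this is ordinarily done with a library PCA routine rather than by hand.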
3 Data Compression: Encoding mechanisms are used to reduce the data set size.
Data compression is a reduction in the number of bits needed to represent data.
Compressing data can save storage capacity, speed up file transfer, and decrease costs
for storage hardware and network bandwidth.
4 Numerosity Reduction: In numerosity reduction, the data are replaced or estimated by
alternative, smaller forms of data representation, such as clusters or parametric models.
This is a technique of choosing smaller forms of data representation to reduce the volume
of the data.
5 Discretisation and concept hierarchy generation: Raw data values for attributes are replaced
by ranges or higher conceptual levels. This involves automatic generation of concept hierarchies
from numerical data.
Discretisation is the process of putting values into buckets so that there is a limited number of
possible states. The buckets themselves are treated as ordered, discrete values.
Discretisation transforms quantitative data into qualitative data; quantitative data are
commonly involved in data mining applications.
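Equal-width binning is one simple way to perform this bucketing; the ages, bin count, and bucket labels below are made up:

```python
def discretize(values, labels=("young", "middle", "senior")):
    """Equal-width binning: map each value to one of len(labels) buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    # Clamp so the maximum value falls in the last bucket.
    return [labels[min(int((v - lo) / width), len(labels) - 1)] for v in values]

ages = [3, 17, 25, 33, 47, 61, 80]  # quantitative attribute
print(discretize(ages))  # qualitative, ordered bucket labels
```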
6 Attribute Subset Selection: This is a technique used for data reduction in the data
mining process. It removes irrelevant or redundant attributes, which can be identified by
correlation analysis.
Data reduction reduces the size of the data so that it can be used for analysis more
efficiently. A data set may have a large number of attributes, but some of those
attributes can be irrelevant or redundant.