
# Understanding DM Issues

Name: Muchake Brian


Faculty: Science
Department: Computer Science & Information Systems
Tel: 0701178573
Email: bmuchake@gmail.com, bmuchake@umu.ac.ug

Do not Keep Company With Worthless People


Psalms 26:11
Data Mining Issues
 Though data mining is very powerful, it faces many challenges during its implementation.
 Data mining is not an easy task: the algorithms used can become very complex, and data is not always available in one place.
 The challenges can relate to performance, to the data itself, and to the methods and techniques used. The data mining process succeeds when these challenges or issues are identified correctly and resolved properly.
 At a bird's-eye view, the diagram below gives a summarized categorization of data mining issues.
Data Mining Issues [Cont’d]
1. Mining Methodology and User Interaction Issues
Mining different kinds of knowledge in databases − Different users may be interested
in different kinds of knowledge. It is therefore necessary for data mining to cover a
broad range of knowledge discovery tasks.
• Different users want different kinds of knowledge presented in different ways, so it is
difficult for a single system to cover the vast range of data and tasks that can meet
every client's requirements.
Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
Data Mining Issues [Cont’d]
• Interactive mining allows users to focus the search for patterns from different angles.
The data mining process should be interactive because it is difficult to know in advance
what can be discovered within a database.
 Incorporation of background knowledge − Background knowledge can be used to guide
the discovery process and to express the discovered patterns.
• Background knowledge allows discovered patterns to be expressed not only in concise
terms but at multiple levels of abstraction.
Data Mining Issues [Cont’d]
 Data mining query languages and ad hoc data mining − A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
• Relational query languages (such as SQL) allow users to pose ad hoc queries for
data retrieval; in the same spirit, a data mining query language should match well with
the query language of the data warehouse.
 Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Data Mining Issues [Cont’d]
 Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. Without data
cleaning methods, the accuracy of the discovered patterns will be poor.
• In a large database, many attribute values will be incorrect. This may be due to
human error or to instrument failure. Data cleaning methods, and data analysis
methods that can tolerate noise, are used to handle noisy data.
 Pattern evaluation − The patterns discovered may turn out to be uninteresting because
they represent common knowledge or lack novelty, so measures of pattern
interestingness are needed.
Data Mining Issues [Cont’d]
 Overfitting: Overfitting occurs when a model fits its training samples so closely that it does
not generalize to future data. When a data mining algorithm searches for the best parameters
for a specific model using a set of samples, it may overfit the data, resulting in poor
generalization. Cross-validation, regularization and other statistical methods can be applied to
overcome the problem (see the sketch after this list).
 Outliers: Data entries that do not fit nicely into the derived model. If the model is
developed to include these outliers, it may not behave well for data that are
not outliers.
 Security and social issues: Security is an important issue with any data collection that is
shared and intended for strategic decision-making. When data is collected for customer
profiling, understanding user behavior, or correlating personal data with other information,
serious concerns about privacy and data security arise.
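Since cross-validation is the standard remedy named above, here is a minimal sketch of using it to expose overfitting, assuming scikit-learn is available; the dataset is synthetic and the depth limit stands in for regularization.

```python
# Compare an unconstrained tree (prone to overfitting) with a depth-limited
# one (a simple form of regularization) under 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, model in [("deep tree", DecisionTreeClassifier(random_state=0)),
                    ("shallow tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)  # held-out folds expose poor generalization
    print(name, "mean CV accuracy:", round(scores.mean(), 3))
```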
Data Mining Issues [Cont’d]
2. Performance Issues
Efficiency and scalability of data mining algorithms − To effectively extract information
from the huge amounts of data in databases, data mining algorithms must be efficient
and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of some data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel, and the results
from the partitions are then merged. Incremental algorithms incorporate database updates
without mining the entire data again from scratch. A sketch of the partition-then-merge idea follows.
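A minimal sketch, assuming only Python's standard library; simple frequency counting stands in for a real mining step.

```python
# Partition the data, mine each partition in a separate process, then merge.
from collections import Counter
from multiprocessing import Pool

def mine_partition(partition):
    # Mine one partition independently; counting items stands in for mining.
    return Counter(partition)

if __name__ == "__main__":
    data = ["milk", "bread", "milk", "beer", "bread", "milk", "eggs"] * 1000
    partitions = [data[i::4] for i in range(4)]         # divide into 4 partitions
    with Pool(processes=4) as pool:
        partial = pool.map(mine_partition, partitions)  # process in parallel
    merged = sum(partial, Counter())                    # merge partition results
    print(merged.most_common(3))
```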
Data Mining Issues [Cont’d]
3. Diverse Data Types Issues
 Handling of relational and complex types of data − A database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Many kinds of data are stored in databases and data warehouses, and no single system
can mine them all, so different data mining systems should be constructed for different
kinds of data.
Data Mining Issues [Cont’d]
 Mining information from heterogeneous databases and global information systems −
Data is available at different data sources on a LAN or WAN. These data sources
may be structured, semi-structured or unstructured, so mining knowledge from them
adds challenges to data mining.
• Because data is fetched from different sources over a Local Area Network (LAN) or
Wide Area Network (WAN), discovering knowledge from these differently structured
sources is a great challenge for data mining.
Data Preprocessing
 Data preprocessing deals with:
1. Incompleteness (e.g. missing attribute values). Consider a form capturing National ID
details where a respondent does not fill in the NIN attribute.
2. Noise or incorrectness (e.g. deviation from the actual value). Imagine a respondent who
fills the NIN attribute with his names; in general this covers wrong values.
3. Inconsistency (e.g. discrepancies in codes). Imagine a respondent who selects Female
as the sex (gender) on Form A and Male on Form B. This is inconsistent.
Data Preprocessing Forms
A. Data Cleaning
 Data cleaning in data mining is the process of detecting and removing corrupt or
inaccurate records from a record set, table or database.
 Some data cleaning methods:
1. You can ignore the tuple. This is usually done when the class label is missing. The
method is not very effective unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective on small data sets
with few missing values.
3. You can replace all missing attribute values with a global constant, such as a label like
“Unknown” or minus infinity.
Data Preprocessing Forms [Cont’d]
4. You can use the attribute mean to fill in the missing value. For example, if the customers'
average income is 25,000, you can use this value to replace a missing value for
income.
5. You can use the most probable value to fill in the missing value.
A sketch of methods 1, 3 and 4 follows.
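A minimal sketch of methods 1, 3 and 4 above, assuming pandas; the column names and values are hypothetical.

```python
# Three ways of handling missing values in a small, synthetic data set.
import pandas as pd

df = pd.DataFrame({"income": [25000.0, None, 30000.0, None, 20000.0],
                   "class":  ["yes", "no", None, "yes", "no"]})

ignored  = df.dropna(subset=["class"])                 # 1. ignore tuples missing the class label
constant = df.fillna({"class": "Unknown"})             # 3. fill with a global constant
by_mean  = df.fillna({"income": df["income"].mean()})  # 4. fill with the attribute mean
print(by_mean)
```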
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due
to faulty data collection instruments, data entry problems or technology limitations.
How to Handle Noisy Data?
Data Preprocessing Forms [Cont’d]
Binning: Binning, or discretization, is the process of transforming numerical variables into
categorical counterparts. An example is to bin values for Age into categories such as 20-39,
40-59, and 60-79. Numerical variables are usually discretized in modeling methods
based on frequency tables (e.g., decision trees).
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. For
example: Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Clustering: Detect and remove outliers. Cluster the data and use properties of the clusters to
represent the instances constituting those clusters.
Regression: Smooth the data by fitting it to a regression function, then replace the observed
values with the fitted ones, as in the sketch below.
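A minimal sketch of smoothing by regression, assuming numpy; the noisy series is synthetic.

```python
# Fit a straight line to noisy data and replace each value with its fitted value.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 2.0 * x + rng.normal(scale=1.5, size=10)  # roughly linear, but noisy

slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
y_smoothed = slope * x + intercept            # smoothed (fitted) values
print(np.round(y_smoothed, 2))
```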
Data Preprocessing Forms
Binning Methods
 Equal-width (distance) partitioning:
Data Preprocessing Forms
Binning Methods
 Equal-depth (frequency) partitioning
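A minimal sketch contrasting the two partitioning rules on the Price values from the earlier slide, assuming pandas.

```python
# Equal-width bins span equal value ranges; equal-depth bins hold
# roughly equal numbers of values.
import pandas as pd

price = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(price, bins=3)  # partition the value range into 3 equal widths
equal_depth = pd.qcut(price, q=3)    # partition into 3 bins of roughly equal frequency

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```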
Data Preprocessing Forms
Removing Outliers and Artifacts
Data Preprocessing Forms
Correcting Inconsistent Data
Data Preprocessing Forms
Removing Duplicate Data
Data Preprocessing Forms
B. Integration
Data Integration is a data preprocessing technique that combines data from
multiple heterogeneous data sources into a coherent data store and provides a unified
view of the data. These sources may include multiple data cubes, databases or flat
files.
The data integration approach is formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mapping between queries over the source and global schemas.
Data Preprocessing Forms

 One of the best-known implementations of data integration is building an enterprise's
data warehouse.
 A data warehouse enables a business to perform analyses based on the data it holds.
Data Preprocessing Forms
 There are two major approaches to data integration: the “tight coupling
approach” and the “loose coupling approach”.
(i) Tight Coupling: Here, a data warehouse is treated as an information retrieval
component.
• In this coupling, data is combined from different sources into a single physical location
through the process of ETL − Extraction, Transformation and Loading.
(ii) Loose Coupling: Here, an interface is provided that takes the query from the user,
transforms it in a way the source databases can understand, and then sends the query
directly to the source databases to obtain the result.
• The data remains only in the actual source databases, as the sketch below illustrates.
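A minimal, hypothetical sketch of loose coupling: a mediator rewrites the user's query for each source and forwards it, so nothing is copied into a warehouse. The schema names and generated SQL here are invented for illustration.

```python
# Each source keeps its own table and column names; the mediator translates
# global-schema names into source-specific queries on demand.
SOURCE_SCHEMAS = {
    "crm_db":   {"customer": "clients",   "name": "full_name"},
    "sales_db": {"customer": "customers", "name": "cust_name"},
}

def rewrite(table, column, source):
    # Translate global-schema names into the source's own names.
    m = SOURCE_SCHEMAS[source]
    return f"SELECT {m[column]} FROM {m[table]}"

for source in SOURCE_SCHEMAS:
    print(source, "->", rewrite("customer", "name", source))
```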
Data Preprocessing Forms
• Issues in Data Integration: There are a number of issues to consider during data integration:
schema integration, redundancy, and detection and resolution of data value conflicts.
These are explained briefly below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching the real-world entities from multiple sources is referred to as the entity
identification problem.
• For example, how can the data analyst or the computer be sure that customer id in one
database and customer number in another refer to the same attribute? The sketch below
shows one way to reconcile them.
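A minimal sketch of resolving such a naming conflict before merging, assuming pandas; the table and column names are hypothetical.

```python
# cust_id in one database and cust_number in another refer to the same
# attribute, so one is renamed before the tables are joined.
import pandas as pd

db_a = pd.DataFrame({"cust_id": [1, 2], "city": ["Kampala", "Entebbe"]})
db_b = pd.DataFrame({"cust_number": [1, 2], "balance": [500, 800]})

db_b = db_b.rename(columns={"cust_number": "cust_id"})  # resolve the conflict
merged = db_a.merge(db_b, on="cust_id")                 # join on the shared key
print(merged)
```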
Data Preprocessing Forms
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis (see the sketch after this list).
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the “same”
attribute in another.
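A minimal sketch of detecting a redundant attribute by correlation analysis, assuming pandas; the attribute names and values are synthetic.

```python
# salary_usd is derivable from salary_ugx, so the two correlate perfectly
# and one of them can be flagged as redundant.
import pandas as pd

df = pd.DataFrame({"salary_ugx": [1.0, 2.0, 3.0, 4.0],  # millions of shillings
                   "salary_usd": [270.0, 540.0, 810.0, 1080.0],
                   "age":        [23, 35, 41, 29]})

print(df.corr())  # salary_usd vs salary_ugx ~ 1.0, flagging the redundancy
```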
Data Preprocessing Forms
C. Data Transformation
Data transformation is the process of converting data or information from one format to
another, usually from the format of a source system into the required format of a new
destination system.
The usual process involves converting documents, but data conversions sometimes
involve the conversion of a program from one computer language to another so that the
program can run on a different platform.
In the data transformation step, data are transformed from one format into another format
that is more appropriate for data mining.
Data Transformation Strategies include:-
Data Preprocessing Forms
1 Smoothing: Smoothing is the process of removing noise from the data.
2 Aggregation: Aggregation is a process where summary or aggregation operations are
applied to the data.
3 Generalization: In generalization, low-level data are replaced with high-level data by
climbing concept hierarchies.
4 Normalization: Normalization scales attribute data so that they fall within a small specified
range, such as 0.0 to 1.0 (see the sketch after this list).
5 Attribute Construction: In attribute construction, new attributes are constructed from the
given set of attributes.
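A minimal sketch of min-max normalization to the 0.0-1.0 range (strategy 4), using only the standard library; the values are synthetic.

```python
# Rescale each value to (v - min) / (max - min), so results lie in [0.0, 1.0].
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```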
Data Preprocessing Forms
 Data Transformation involves two key phases:
1. Data Mapping: The assignment of elements from the source base or system to the
destination, capturing all transformations that occur. This becomes more complicated
when there are complex transformations, such as many-to-one or one-to-many
transformation rules.
2. Code Generation: The creation of the actual transformation program. The resulting data
map specification is used to create an executable program to run on computer
systems. A hypothetical sketch of both phases follows.
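A minimal, hypothetical sketch of the two phases: a data map (source field to destination field, plus a conversion) is turned into an executable transformation. All field names and conversions are invented for illustration.

```python
# Data mapping phase: declare how source fields map to destination fields.
DATA_MAP = {
    "fname": ("first_name",    str.strip),                      # trim whitespace
    "dob":   ("date_of_birth", lambda v: v.replace("/", "-")),  # normalize separators
}

# Code generation phase, simplified: the map drives an executable transform.
def transform(record):
    return {dest: convert(record[src])
            for src, (dest, convert) in DATA_MAP.items()}

print(transform({"fname": "  Alice ", "dob": "2001/05/09"}))
```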
Data Preprocessing Forms
 Commonly used transformational languages:
• Perl: A high-level procedural and object-oriented language capable of powerful
operations
• AWK: One of the oldest and most popular text transformation languages
• XSLT: An XML data transformation language
• TXL: A prototyping language mostly used for source code transformation
• Template Languages and Processors: These specialize in data-to-document
transformation
Data Preprocessing Forms
D. Data Reduction
A database or data warehouse may store terabytes of data, so it may take very long to
perform data analysis and mining on such huge amounts of data.
Data reduction is the process of reducing the amount of capacity required to store data.
Data reduction can increase storage efficiency and reduce costs. Storage vendors often
describe storage capacity in terms of raw capacity and effective capacity, which
refers to the data after reduction.
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume but still contains the critical information.
Data Reduction Strategies include:-
Data Preprocessing Forms
1 Data Cube Aggregation: Aggregation operations are applied to the data in the
construction of a data cube.
 Data cubes store multidimensional aggregated information. Each cell holds an
aggregate data value corresponding to a data point in multidimensional space.
 Data cubes provide fast access to precomputed, summarised data, benefiting on-line
analytical processing as well as data mining.
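A minimal sketch of cube-style aggregation, assuming pandas: detail records are rolled up so that each (year, branch) cell holds one aggregate value. The data are synthetic.

```python
# Roll detail sales records up into a small 2-D "cube" of sums.
import pandas as pd

sales = pd.DataFrame({"year":   [2022, 2022, 2023, 2023, 2023],
                      "branch": ["A", "B", "A", "A", "B"],
                      "amount": [100, 150, 120, 80, 200]})

cube = sales.pivot_table(values="amount", index="year",
                         columns="branch", aggfunc="sum")
print(cube)  # each cell is a precomputed aggregate, as in a data cube
```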
Data Preprocessing Forms
2 Dimensionality Reduction: In dimensionality reduction, redundant attributes are detected
and removed, which reduces the data set size.
Dimensionality reduction is the process of reducing the number of random variables
under consideration by obtaining a set of principal variables. It can be divided into
feature selection and feature extraction (see the sketch after this list).
3 Data Compression: Encoding mechanisms are used to reduce the data set size.
Data compression is a reduction in the number of bits needed to represent data.
Compressing data can save storage capacity, speed up file transfer, and decrease costs
for storage hardware and network bandwidth.
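A minimal sketch of feature extraction with Principal Component Analysis (PCA), assuming scikit-learn; the data are synthetic.

```python
# Project 10 attributes down to 2 principal variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 records, 10 attributes

pca = PCA(n_components=2)        # keep the 2 strongest directions of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (100, 2): same records, far fewer attributes
```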
Data Preprocessing Forms
4 Numerosity Reduction: In numerosity reduction, the data are replaced or estimated by
alternative, smaller forms of data representation, e.g. clusters or parametric models.
This is a technique of choosing smaller forms of data representation to reduce the volume of data.
5 Discretisation and concept hierarchy generation: Raw data values for attributes are replaced
by ranges or higher conceptual levels. This can involve the automatic generation of concept
hierarchies from numerical data.
Discretisation is the process of putting values into buckets so that there are a limited number of
possible states. The buckets themselves are treated as ordered and discrete values.
Discretisation transforms quantitative data into qualitative data; quantitative data are commonly
involved in data mining applications.
Data Preprocessing Forms
6. Attribute Subset Selection: This technique is used for data reduction in the data
mining process. It involves removing irrelevant or redundant attributes, for example by
correlation analysis, as in the sketch below.
Data reduction reduces the size of the data so that it can be used for analysis more
efficiently. A data set may have a large number of attributes, but some of those
attributes can be irrelevant or redundant.
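A minimal sketch of attribute subset selection by correlation analysis, assuming pandas; the attribute names, values and threshold are hypothetical.

```python
# Drop any attribute that is almost perfectly correlated with an earlier one.
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 160, 170, 180],
                   "height_in": [59.1, 63.0, 66.9, 70.9],  # redundant with height_cm
                   "weight_kg": [50, 80, 55, 90]})

corr = df.corr().abs()
drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if b not in drop and corr.loc[a, b] > 0.95:  # redundancy threshold (hypothetical)
            drop.add(b)

print(df.drop(columns=list(drop)).columns.tolist())  # ['height_cm', 'weight_kg']
```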
