
UNIT-1

DATA MINING: INTRODUCTION - FUNCTIONALITIES - CLASSIFICATION OF
DATA MINING SYSTEMS - ISSUES - DATA PREPROCESSING - DATA CLEANING - DATA
INTEGRATION AND TRANSFORMATION - DATA REDUCTION

DATA-MINING-INTRODUCTION
There is a huge amount of data available in the information industry. This data is of no use
until it is converted into useful information, so it is necessary to analyze this huge amount of
data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation,
Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over,
we would be able to use this information in many applications such as Fraud Detection,
Market Analysis, Production Control, Science Exploration, etc.

What is Data Mining?


Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data.

Functionalities of Data Mining


Data mining functionalities are used to specify the kinds of patterns, trends, or correlations
to be found in data mining tasks.

In general, data mining tasks can be divided into 2 categories:

1. Descriptive Data Mining:

It describes what is happening within the data without any prior hypothesis; the
common features of the data set are highlighted.
For example: count, average, etc.
2. Predictive Data Mining:
It predicts the values of attributes that are unknown or missing. Based on
previously observed data, the model estimates the characteristics that are absent.
For example: judging from the findings of a patient’s medical examinations whether
he is suffering from a particular disease (a minimal sketch follows this list).
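As a rough illustration of predictive mining, the sketch below trains a small decision tree on hypothetical medical-examination records and predicts the unknown disease label for a new patient; the feature names and values are invented.

```python
# A minimal predictive data mining sketch: the examination records,
# feature names, and values below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each row: [body_temperature_C, systolic_blood_pressure]; label 1 = disease, 0 = healthy
X_train = [[36.6, 120], [39.1, 150], [38.8, 145], [36.9, 118]]
y_train = [0, 1, 1, 0]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Estimate the absent characteristic (the diagnosis) for a new patient
new_patient = [[38.9, 148]]
print(model.predict(new_patient))  # -> [1], i.e. likely suffering from the disease
```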

Data Mining Functionality:

1. Class/Concept Descriptions:

Data can be associated with classes or concepts. It is often useful to describe individual
classes and concepts in summarized, concise, and yet precise terms.
Such descriptions of a class or a concept are called class/concept descriptions.
● Data Characterization:
This refers to summarizing the general characteristics or features of the class
under study. For example, to study the characteristics of a software product whose
sales increased by 15% two years ago, one can collect the data related to such
products by running SQL queries (a small sketch is given after this list).

● Data Discrimination:
It compares the general features of the class under study with those of one or more
contrasting classes. The output of this process can be represented in many forms,
e.g., bar charts, curves, and pie charts.
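As a rough sketch of data characterization, the snippet below summarizes the general features of a target class of products using pandas; the column names, values, and the 15% filter are assumptions for illustration (the notes mention SQL queries, and a GROUP BY/aggregate query would do the same job).

```python
# A minimal data characterization sketch; the DataFrame columns
# (product, sales_growth, price, units_sold) and values are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "product":      ["A", "B", "C", "D"],
    "sales_growth": [0.18, 0.04, 0.22, 0.15],  # fractional sales growth
    "price":        [49.0, 19.0, 99.0, 59.0],
    "units_sold":   [1200, 300, 800, 950],
})

# Target class: products whose sales increased by at least 15%
target_class = sales[sales["sales_growth"] >= 0.15]

# Characterization: summarize the general features of that class
print(target_class[["price", "units_sold"]].agg(["count", "mean"]))
```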

2. Mining Frequent Patterns, Associations, and Correlations:

Frequent patterns are simply the things that are found to occur most often in the
data.
There are different kinds of frequent patterns that can be observed in a dataset.
● Frequent item set:
This refers to a set of items that are often seen together, e.g., milk and
sugar (a small sketch follows this list).
● Frequent Subsequence:
This refers to a series of events that often occurs in a regular order, such as purchasing a
phone followed by a back cover.
● Frequent Substructure:
It refers to different kinds of data structures, such as trees and graphs, that
may be combined with an item set or subsequence.
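A minimal sketch of finding frequent item sets is given below, assuming a handful of invented transactions and an invented minimum support count; it simply counts how often each pair of items appears together.

```python
# A minimal frequent item set sketch: the transactions and the
# minimum support count are hypothetical.
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]
min_support_count = 2  # a pair must appear in at least 2 transactions

pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items()
                  if count >= min_support_count}
print(frequent_pairs)  # -> {('milk', 'sugar'): 3}
```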

3. Association Analysis:

The process involves uncovering relationships between data items and deriving the
rules of the association. It is a way of discovering relationships among various
items. For example, it can be used to determine which items are frequently
purchased together (see the sketch below).
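Continuing the illustrative transactions above, the sketch below derives the support and confidence of a hypothetical association rule milk => sugar; the two measures are the standard ones, but the specific numbers are invented.

```python
# A minimal association analysis sketch: support and confidence of the
# hypothetical rule {milk} -> {sugar} over the same invented transactions.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]

total = len(transactions)
with_milk = sum(1 for t in transactions if "milk" in t)
with_both = sum(1 for t in transactions if {"milk", "sugar"} <= t)

support = with_both / total         # fraction of all transactions containing both items
confidence = with_both / with_milk  # of the transactions with milk, fraction that also have sugar
print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.75, confidence=1.00
```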

4. Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly
pairs of attributes are related to each other. For example, taller people tend to
weigh more (a small sketch follows).
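A small correlation analysis sketch follows, using invented height and weight values; the Pearson coefficient ranges from -1 to +1, and values near +1 indicate a strong positive relationship between the two attributes.

```python
# A minimal correlation analysis sketch; the height and weight values are invented.
from scipy.stats import pearsonr

height_cm = [150, 160, 165, 172, 180, 188]
weight_kg = [52, 58, 63, 70, 78, 85]

r, p_value = pearsonr(height_cm, weight_kg)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # r close to +1: strongly related
```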

Classification of Data Mining System


 The information or knowledge extracted in this way can be used in any of the following
applications −

● Market Analysis
● Fraud Detection
● Customer Retention
● Production Control
● Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains −

● Market Analysis and Management


● Corporate Analysis & Risk Management
● Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astrology, and Internet Web Surf-Aid.

Market Analysis and Management


Listed below are the various fields of market where data mining is used −
● Customer Profiling − Data mining helps determine what kind of people buy what
kind of products.
● Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
● Cross Market Analysis − Data mining performs association/correlation analysis
between product sales.
● Target Marketing − Data mining helps to find clusters of model customers who
share the same characteristics, such as interests, spending habits, income, etc.
● Determining Customer Purchasing Patterns − Data mining helps in determining
customer purchasing patterns.
● Providing Summary Information − Data mining provides various
multidimensional summary reports.

Corporate Analysis and Risk Management


Data mining is used in the following fields of the Corporate Sector −
● Finance Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
● Resource Planning − It involves summarizing and comparing the resources and
spending.
● Competition − It involves monitoring competitors and market directions.

Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to
detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the
duration of the call, the time of the day or week, etc. It also analyzes patterns that deviate
from expected norms.
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is
not always available at one place. It needs to be integrated from various heterogeneous
data sources. These factors also create some issues. Here we will discuss
the major issues regarding −

● Mining Methodology and User Interaction


● Performance Issues
● Diverse Data Types Issues
These major issues are described below.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
● Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
● Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search
for patterns, providing and refining data mining requests based on the returned
results.
● Incorporation of background knowledge − Background knowledge can be used
to guide the discovery process and to express the discovered patterns, not only in
concise terms but at multiple levels of abstraction.
● Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
● Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
● Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. Without
data cleaning methods, the accuracy of the discovered patterns will be poor.
● Pattern evaluation − The patterns discovered may be uninteresting if they
represent common knowledge or lack novelty, so measures of pattern interestingness are needed.

Performance Issues
There can be performance-related issues such as follows −
● Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
● Parallel, distributed, and incremental mining algorithms − Factors such as the
huge size of databases, the wide distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions, which are processed
in parallel; the results from the partitions are then merged. Incremental
algorithms incorporate database updates without mining the data again from
scratch.

Diverse Data Types Issues


● Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
● Mining information from heterogeneous databases and global information
systems − The data is available from different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.

Data Preprocessing in Data Mining


Preprocessing in Data Mining:

Data preprocessing is a data mining technique that is used to transform raw data into a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done. It involves handling missing data, noisy data, etc.
● (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in
various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value (see the sketch after this list).
● (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in
the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided
into segments of equal size, and then various methods are performed to complete
the task. Each segment is handled separately: one can replace all data in a
segment by its mean, or boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters; values that fall outside the clusters
may be treated as outliers.
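A minimal data cleaning sketch covering both of the above is shown below, assuming a small hypothetical numeric attribute: missing values are filled with the attribute mean, and the values are then smoothed by equal-size bins using bin means.

```python
# A minimal data cleaning sketch: mean imputation for missing values,
# then smoothing by bin means. The attribute values and bin size are hypothetical.
import numpy as np
import pandas as pd

prices = pd.Series([4, 8, np.nan, 15, 21, 21, 24, 25, np.nan, 34, 8, 9])

# (a) Missing data: fill missing values with the attribute mean
prices = prices.fillna(prices.mean())

# (b) Noisy data: smoothing by bin means on sorted data (equal-size bins of 3)
sorted_vals = np.sort(prices.to_numpy())
bin_size = 3
smoothed = sorted_vals.copy()
for start in range(0, len(sorted_vals), bin_size):
    segment = sorted_vals[start:start + bin_size]
    smoothed[start:start + bin_size] = segment.mean()  # replace each bin by its mean

print(smoothed)
```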
2. Data Integration:
Data integration merges data from several heterogeneous sources to attain meaningful
data. The sources may involve several databases, multiple files, or data cubes. The integrated data
must be free from inconsistencies, discrepancies, redundancies, and disparities.

Data integration is important as it provides a unified view of the scattered data; in addition, it
maintains the accuracy of the data. This helps the data mining program in mining useful
information, which in turn helps executives and managers take strategic decisions for
the betterment of the enterprise.

a. Entity Identification Problem

As the data is unified from heterogeneous sources, how can we ‘match the
real-world entities from the data’? For example, we have customer data from two different data
sources. An entity from one data source has customer_id and the entity from the other data
source has customer_number. Now how does the data analyst or the system understand that
these two attributes refer to the same real-world entity?

Here, schema integration can be achieved using the metadata of each attribute. The metadata
of an attribute includes its name, what it means in the particular scenario, its data
type, the range of values it can accept, and the rules the attribute follows for null
values, blanks, or zeros. Analyzing this metadata information will prevent errors in schema
integration.

Structural integration can be achieved by ensuring that the functional dependency of an
attribute in the source system and its referential constraints match the functional dependency
and referential constraints of the same attribute in the target system.

This can be understood with the help of an example: suppose in one system the discount
is applied to an entire order, but in another system the discount is applied to every single
item in the order. This difference must be caught before the data from these two sources is
integrated into the target system. A small sketch of resolving the attribute-name mismatch is given below.
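A rough sketch of resolving the customer_id / customer_number mismatch described above follows; the records and column names are hypothetical, and in practice the decision that the two attributes match would come from analyzing their metadata.

```python
# A minimal entity identification sketch: two hypothetical sources name the
# same real-world key differently (customer_id vs. customer_number).
# After metadata analysis tells us they match, one is renamed and the
# sources are merged on the common key.
import pandas as pd

source_a = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Delhi"]})
source_b = pd.DataFrame({"customer_number": [1, 2], "orders": [5, 3]})

unified = source_a.merge(
    source_b.rename(columns={"customer_number": "customer_id"}),
    on="customer_id",
)
print(unified)
```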

b. Redundancy and Correlation Analysis

Redundancy is one of the big issues during data integration. Redundant data is
unimportant data or data that is no longer needed. It can also arise from
attributes that could be derived from another attribute in the data set.

For example, if one data set has the customer age and the other data set has the
customer's date of birth then age would be a redundant attribute as it could be
derived using the date of birth.

Inconsistencies in the attributes also raise the level of redundancy. Redundancy
can be discovered using correlation analysis: the attributes are
analyzed to detect their interdependence on each other, thereby detecting the
correlation between them (a small sketch follows).
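A small sketch of discovering a redundant attribute through correlation analysis is shown below; the columns and the 0.95 threshold are assumptions, with age deliberately derivable from year_of_birth so the two attributes are almost perfectly correlated.

```python
# A minimal redundancy detection sketch: the columns and the 0.95
# threshold are hypothetical. 'age' can be derived from 'year_of_birth',
# so the pair shows up as highly correlated (and one may be dropped).
import pandas as pd

customers = pd.DataFrame({
    "year_of_birth": [1980, 1992, 1975, 2000],
    "age":           [45, 33, 50, 25],
    "income":        [60000, 72000, 35000, 50000],
})

corr = customers.corr().abs()
threshold = 0.95
for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] >= threshold:
            print(f"{a} and {b} are highly correlated -> one may be redundant")
```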
c. Tuple Duplication

Along with redundancies, data integration also has to deal with duplicate tuples.
Duplicate tuples may appear in the resultant data if a denormalized table has
been used as a source for data integration (see the sketch below).
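As a small sketch with invented rows, duplicate tuples left over after integration can be detected and removed like this:

```python
# A minimal tuple duplication sketch: the integrated rows below are hypothetical.
import pandas as pd

integrated = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city":        ["Pune", "Delhi", "Delhi", "Chennai"],
})

deduplicated = integrated.drop_duplicates()
print(deduplicated)  # the repeated (2, Delhi) tuple is kept only once
```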

d. Data Conflict Detection and Resolution

Data conflict means that the data merged from different sources do not match.
For example, the attribute values may differ across data sets. The difference may arise
because the values are represented differently in the different data sets; for
instance, the price of a hotel room may be represented in different currencies in different
cities. This kind of issue is detected and resolved during data integration, as sketched below.
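A tiny sketch of resolving such a representation conflict is given below; the room prices, currencies, and conversion rates are invented, and the fix is simply to convert every price to one common currency before merging.

```python
# A minimal data conflict resolution sketch: the prices, currencies,
# and exchange rates below are hypothetical.
room_prices = [
    {"hotel": "A", "price": 120.0,  "currency": "USD"},
    {"hotel": "B", "price": 9500.0, "currency": "INR"},
    {"hotel": "C", "price": 110.0,  "currency": "EUR"},
]
to_usd = {"USD": 1.0, "INR": 0.012, "EUR": 1.08}  # assumed conversion rates

# Resolve the conflict by converting every price to one common representation (USD)
for row in room_prices:
    row["price_usd"] = round(row["price"] * to_usd[row["currency"]], 2)

print(room_prices)
```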

These are the main issues that a data analyst or system has to deal
with during data integration.

3. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining
process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or
0.0 to 1.0); see the sketch after this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of numeric attributes with interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.
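A short sketch of two of these transformations is shown below, using invented income values: min-max normalization scales values into the 0.0 to 1.0 range, and discretization replaces the raw values with interval labels (the bin edges and labels are assumptions).

```python
# A minimal data transformation sketch: min-max normalization to [0, 1]
# and discretization into interval labels. The values, bin edges, and
# labels are hypothetical.
import pandas as pd

income = pd.Series([12000, 35000, 58000, 73000, 98000])

# 1. Normalization (min-max): v' = (v - min) / (max - min), so v' is in [0.0, 1.0]
normalized = (income - income.min()) / (income.max() - income.min())

# 3. Discretization: replace raw numeric values by interval/conceptual levels
levels = pd.cut(income, bins=[0, 30000, 70000, float("inf")],
                labels=["low", "medium", "high"])

print(normalized.tolist())
print(levels.tolist())
```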
4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data, and analysis
becomes harder when working with such huge volumes. In order to get rid of this problem,
we use data reduction techniques. They aim to increase storage efficiency and reduce data
storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing
attribute selection, one can use the significance level and p-value of the
attribute: attributes having a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables us to store a model of the data instead of the whole data, for example:
regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If
the original data can be retrieved after reconstruction from the compressed data, such
reduction is called lossless reduction; otherwise, it is called lossy reduction. Two effective
methods of dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis); a PCA sketch follows.
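A brief dimensionality reduction sketch with PCA is given below, assuming a small randomly generated numeric data set; four original attributes are projected onto two principal components.

```python
# A minimal dimensionality reduction sketch using PCA; the data matrix
# (6 records, 4 attributes) is randomly generated for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))        # 6 records, 4 attributes

pca = PCA(n_components=2)             # keep only 2 principal components
reduced = pca.fit_transform(data)

print(reduced.shape)                  # (6, 2): the reduced representation
print(pca.explained_variance_ratio_)  # variance captured by each component
```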
