1. Introduction to Data Mining
KDD (Knowledge Discovery in Databases) is a field of computer science that deals with the extraction of previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with the mapping of low-level data into other forms that are more compact, abstract and useful. This is achieved by creating short reports, modeling the process that generates the data and developing predictive models that can predict future cases. Due to the exponential growth of data, especially in areas such as business, KDD has become a very important process for converting this large wealth of data into business intelligence, as manual extraction of patterns has become practically impossible over the past few decades. For example, it is currently used for applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval and, above all, marketing. KDD is typically used to answer questions such as: which products are likely to yield high profit at Wal-Mart next year? The process has several steps. It starts with developing an understanding of the application domain and the goal, and then creating a target dataset. This is followed by cleaning, preprocessing, reduction and projection of the data.
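The early KDD steps named above (selection, cleaning, and projection into a target dataset) can be sketched in a few lines of Python. The record fields and cleaning rules here are illustrative assumptions, not part of any standard:

```python
# Minimal sketch of the early KDD steps: cleaning raw records and
# projecting low-level strings into a compact, typed target dataset.
# Field names ("customer", "age", "spend") are made-up examples.

raw_records = [
    {"customer": "a1", "age": "34", "spend": "120.5"},
    {"customer": "a2", "age": None, "spend": "80.0"},   # missing value
    {"customer": "a1", "age": "34", "spend": "120.5"},  # duplicate
]

def clean(records):
    """Drop records with missing fields and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def project(records):
    """Map low-level strings into typed, compact tuples."""
    return [(r["customer"], int(r["age"]), float(r["spend"])) for r in records]

target_dataset = project(clean(raw_records))
print(target_dataset)  # [('a1', 34, 120.5)]
```

Real KDD pipelines perform the same steps at scale, but the shape is the same: filter, deduplicate, and convert raw data into a form a mining algorithm can consume.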
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm in service of the overall goal of the KDD process.
The significant components of a data mining system are the data sources, the database or data warehouse server, the data mining engine, the pattern evaluation module, the graphical user interface, and the knowledge base.
Data Sources: The actual sources of data can be databases, data warehouses, or the World Wide Web (WWW). Sometimes data may even reside in plain text files or spreadsheets. The World Wide Web, or the Internet, is another big source of data.
Database or Data Warehouse Server: The database or data warehouse server contains the actual data that is ready to be processed. The server is responsible for retrieving the data relevant to the user's data mining request.
Data Mining Engine: The data mining engine is the core component of a data mining system. It consists of a number of modules used to perform data mining tasks such as association, classification, characterization, clustering and prediction.
Pattern Evaluation Module: This module is mainly responsible for measuring how interesting a pattern is, typically using a threshold value. It interacts with the data mining engine to focus the search towards interesting patterns.
GUI: The graphical user interface handles communication between the user and the data mining system. It helps the user use the system easily and efficiently, without having to know the real complexity of the process. When the user specifies a query, this module passes it to the data mining system and displays the result in an easily understandable manner.
Knowledge Base: The knowledge base holds domain knowledge that is used to guide the search and to evaluate how interesting the resulting patterns are.
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. Data mining tasks can be classified into two categories: descriptive and
predictive.
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive mining tasks perform inference on the current data in order to make predictions.
There are different data mining functionalities.
Association Analysis: Association analysis discovers rules linking items that frequently occur together. Example: if a person is buying popcorn in the theatre, then there is a 60% chance that he will also buy a cold drink. In this way, a prediction can be made about the shopping behavior of the consumer.
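A figure like the 60% in the popcorn example is the confidence of an association rule: the share of transactions containing the first item that also contain the second. A minimal sketch, using made-up transaction data chosen so the rule comes out at 60%:

```python
# Sketch of how a confidence figure such as "popcorn => cold drink, 60%"
# is derived from transaction data. The transactions are invented.

transactions = [
    {"popcorn", "cold drink"},
    {"popcorn", "cold drink", "nachos"},
    {"popcorn", "cold drink"},
    {"popcorn", "nachos"},
    {"popcorn"},
    {"cold drink"},
]

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = count(A and B) / count(A)."""
    has_a = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / len(has_a)

conf = confidence({"popcorn"}, {"cold drink"}, transactions)
print(f"confidence = {conf:.0%}")  # 3 of 5 popcorn buyers -> 60%
```

Real association mining (e.g. the Apriori algorithm) adds a support threshold so that only rules backed by enough transactions are reported.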
Outlier Analysis: If some data cannot be grouped into any class, we use the outlier analysis technique. Outlier analysis helps to assess data quality; in most cases an outlier indicates a data abnormality. The more outliers a data set contains, the lower its quality: you cannot identify patterns or derive conclusions from data sets with a large number of outliers. The outlier analysis process helps to check whether the data can still be analyzed after some clean-up. Nevertheless, it is still important to keep track of unusual data and activities, so that anomalies, and any business impact they may have, can be detected in advance.
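One common way to flag outliers, sketched below on illustrative values, is the 1.5 × IQR rule: anything far outside the interquartile range is treated as an anomaly. This is just one of several techniques the paragraph above could refer to:

```python
import statistics

# Flag outliers with the common 1.5 * IQR rule.
# The data values are illustrative.

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious anomaly

q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]
```

After identifying such points, an analyst decides whether they are data-entry errors to clean up or genuinely unusual events worth tracking.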
Evolution Analysis: Evolution analysis refers to the study of data sets that may have gone through a phase of transformation or change. Evolution analysis models capture evolutionary trends in data, which further contributes to data characterization, classification, or discrimination and clustering for multivariate time series.
There are different types of data on which data mining is performed:
Data warehouses: A data warehouse is built by combining data from several heterogeneous sources according to a set of rules, enabling users to perform analytical reporting, standardized and/or ad hoc queries, and decision making. Data warehousing requires data cleaning, data integration and storage of information. To support historical analysis, a data warehouse typically preserves several months or years of data. The data in a data warehouse is usually loaded from multiple data sources by an extract, transform, and load (ETL) process. Modern data warehouses shift towards an extract, load, transform (ELT) architecture, in which all or much of the transformation is carried out on the database that hosts the data warehouse. It is important to remember that a very significant part of a data warehouse design initiative is to describe the ETL (Extraction, Transformation, and Loading) method. ETL activities are the backbone of the data warehouse.
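The ETL steps described above can be sketched end to end in a few lines. The CSV source, column names, and table schema below are all made-up examples, with an in-memory SQLite database standing in for the warehouse:

```python
import csv
import io
import sqlite3

# Illustrative ETL sketch: extract rows from a CSV source, transform
# them (filtering and type conversion), and load them into a
# warehouse table. Data and schema are invented for the example.

source = io.StringIO("region,sales\nnorth,100\nsouth,\neast,250\n")

# Extract: read raw rows from the source
rows = list(csv.DictReader(source))

# Transform: drop rows with missing sales, convert strings to integers
clean_rows = [(r["region"], int(r["sales"])) for r in rows if r["sales"]]

# Load: insert into an in-memory "warehouse" and query it
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

In an ELT architecture the transform step would instead run as SQL inside the warehouse database after the raw rows are loaded.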
Every database transaction has a defined starting point, followed by steps that change the data inside the database. In the end, the database either commits the changes to make them permanent or rolls them back to the starting point so that the transaction can be tried again.
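The commit-or-rollback behavior just described can be demonstrated with Python's built-in sqlite3 module; the accounts table and overdraft rule are invented for the example:

```python
import sqlite3

# Sketch of commit vs. rollback semantics using sqlite3.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 100)")
db.commit()  # the starting point is now permanent

try:
    # A transaction that would overdraw the account
    db.execute("UPDATE accounts SET balance = balance - 150 "
               "WHERE name = 'alice'")
    (balance,) = db.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
    if balance < 0:
        raise ValueError("overdraft")  # abort the transaction
    db.commit()
except ValueError:
    db.rollback()  # undo everything back to the starting point

(balance,) = db.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
print(balance)  # 100: the failed transaction left no trace
```

Because the update was rolled back, the database looks exactly as it did at the committed starting point, and the transaction could safely be retried.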
Measurable benefits have been achieved from data mining in many application areas. So, let's discuss different applications of Data Mining:
Market Basket Analysis: Market basket analysis is a technique that carefully studies the purchases a customer makes in a supermarket. It identifies the patterns of items that customers frequently purchase together. This analysis can help companies promote deals, offers and sales, and data mining techniques help to achieve this analysis task.
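The counting step at the heart of market basket analysis, finding which item pairs co-occur most often across baskets, can be sketched with made-up baskets:

```python
from collections import Counter
from itertools import combinations

# Count co-occurring item pairs across shopping baskets.
# The baskets are invented for illustration.

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk"],
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'milk'), 3)]
```

Production systems use algorithms such as Apriori or FP-Growth to avoid enumerating every pair over millions of baskets, but the underlying idea is this co-occurrence count.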
Education: For analyzing the education sector, data mining uses Educational Data Mining (EDM) methods. These methods generate patterns that can be used both by learners and educators. Using EDM, we can perform educational tasks such as predicting student performance or recommending learning material.
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians, and to figure out which marketing activities will have the best effect in the coming months. In the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior.
Fraud Detection: Billions of dollars have been lost to fraud. Traditional methods of fraud detection are time-consuming and complex. Data mining helps by providing meaningful patterns and turning data into information; any information that is valid and useful is knowledge. An ideal fraud detection system should also protect the information of all users. A supervised method starts with a collection of sample records that are classified as fraudulent or non-fraudulent. A model is built from this data, and the algorithm then identifies whether a new record is fraudulent or not.
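The supervised approach just described can be sketched with one of the simplest possible classifiers, a nearest-neighbour model. The features (transaction amount, transactions in the last hour) and the labelled records are invented for illustration; real systems use richer features and far more data:

```python
# Hedged sketch of supervised fraud detection: labelled sample records
# train a 1-nearest-neighbour model that flags new records.
# All features, values, and labels below are made up.

# Each record: (transaction amount, transactions in last hour), label
training = [
    ((20.0, 1), "legit"),
    ((35.0, 2), "legit"),
    ((900.0, 15), "fraud"),
    ((700.0, 12), "fraud"),
]

def classify(record, training):
    """Label a record with the class of its nearest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, label = min(training, key=lambda pair: dist(pair[0], record))
    return label

print(classify((850.0, 14), training))  # fraud
print(classify((25.0, 1), training))    # legit
```

The pattern is the one the text describes: a model is fitted to classified sample records, and unseen records are then judged against it.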
Energy Industry: Big Data is available in the energy sector nowadays, which points to the need for appropriate data mining techniques. Decision tree models and support vector machines are among the most popular approaches in the industry, providing feasible solutions for decision-making and management. Additionally, data mining can achieve productivity gains by predicting power outputs and the clearing price of electricity.
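To make the decision tree idea concrete, here is a minimal one-split "stump" learned from invented (wind speed, power output) observations; a full decision tree repeats this split search recursively:

```python
# Illustrative decision-tree-style rule for predicting power output
# from wind speed: a single best-split stump. Data values are made up.

data = [(3.0, 10.0), (4.0, 12.0), (5.0, 11.0),
        (10.0, 40.0), (11.0, 42.0), (12.0, 44.0)]

def fit_stump(data):
    """Pick the threshold minimising squared error of the two leaf means."""
    best = None
    xs = sorted(x for x, _ in data)
    for threshold in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

predict = fit_stump(data)
print(predict(4.5), predict(11.5))  # low-wind vs high-wind leaf means
```

The stump finds the split between the low-wind and high-wind regimes on its own; industrial models layer many such splits (and techniques like SVMs) on far richer inputs.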
Retail Industry: The organized retail sector holds sizable quantities of data covering sales, purchase history, delivery of goods, consumption, and customer service. These databases have become even larger with the arrival of e-commerce marketplaces.