
Data Mining

Unit-1 Introduction to Data mining


1.1 Concepts

 Data mining is the process of analyzing data to discover hidden patterns from
different perspectives in order to turn that data into useful and often actionable
information.
 Data mining is also known as data discovery or knowledge discovery. It is critical
in business intelligence for establishing data-driven decisions.
 In general terms, “Mining” is the process of extraction of some valuable material
from the earth e.g. coal mining, diamond mining, etc. In the context of computer
science, “Data Mining” can be referred to as knowledge mining from data,
knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. It is basically the process carried out for the extraction of useful
information from a bulk of data or data warehouses. One can see that the term
itself is a little confusing. In the case of coal or diamond mining, the result of the
extraction process is coal or diamond. But in the case of Data Mining, the result
of the extraction process is not data itself; instead, the results are the patterns
and knowledge that we gain at the end of the extraction process. In that sense,
we can think of Data Mining as a step in the process of Knowledge Discovery or
Knowledge Extraction.
 Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in
Databases” in 1989. However, the term ‘data mining’ became more popular in
the business and press communities. Currently, Data Mining and Knowledge
Discovery are used interchangeably.

1.2 KDD vs Data Mining

KDD is a field of computer science, which deals with extraction of previously unknown
and interesting information from raw data. KDD is the whole process of trying to make
sense of data by developing appropriate methods or techniques. This process deals with
mapping low-level data into other forms that are more compact, abstract, and
useful. This is achieved by creating short reports, modeling the process of generating
data and developing predictive models that can predict future cases. Due to the
exponential growth of data, especially in areas such as business, KDD has become a
very important process for converting this large wealth of data into business intelligence,
as manual extraction of patterns has become practically impossible in the past few
decades. For example, it is currently being used for various applications such as social
network analysis, fraud detection, science, investment, manufacturing,
telecommunications, data cleaning, sports, information retrieval and largely for
marketing. KDD is typically used to answer questions such as: which products are likely
to yield high profit next year at Wal-Mart? This process has several
steps. It starts with developing an understanding of the application domain and the goal
and then creating a target dataset. This is followed by cleaning, preprocessing,
reduction and projection of data.
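
As a rough illustration only, the sketch below walks through these KDD steps in Python, assuming a hypothetical pandas DataFrame called sales with invented columns region, units, and price.

```python
# Minimal sketch of the KDD steps on a hypothetical DataFrame `sales`
# (all column names and values are illustrative).
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "east", None],
    "units":  [10, 4, 7, None, 3],
    "price":  [2.5, 3.0, 2.5, 4.0, 3.5],
})

# Selection: create a target data set (only the columns of interest).
target = sales[["region", "units", "price"]]

# Cleaning / preprocessing: drop incomplete records.
clean = target.dropna()

# Transformation / reduction: project onto a derived, more compact feature.
clean = clean.assign(revenue=clean["units"] * clean["price"])

# Data mining: identify a simple pattern (revenue per region).
pattern = clean.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Interpretation / evaluation: report the top region.
print(pattern.head(1))
```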

Although the two terms KDD and Data Mining are often used interchangeably, they
refer to two related yet slightly different concepts. KDD is the overall process of
extracting knowledge from data while Data Mining is a step inside the KDD process,
which deals with identifying patterns in data. In other words, Data Mining is only the
application of a specific algorithm based on the overall goal of the KDD process.

1.3 Data Mining System Architecture

The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user interface,
and knowledge base.
 Data Sources: The actual sources of data include databases, data warehouses,
and the World Wide Web (WWW). Sometimes, data may reside even in plain text
files or spreadsheets. The World Wide Web, or the Internet, is another big source
of data.

 Database or Data Warehouse Server: The database server contains the actual
data that is ready to be processed. The server is responsible for retrieving the
relevant data based on the user's data mining request.

 Data Mining Engine: The data mining engine is the core component of a data
mining system. It consists of a number of modules used to perform data mining
tasks, including association, classification, characterization, clustering,
prediction, etc.

 Pattern Evaluation Module: This module is mainly responsible for measuring the
interestingness of discovered patterns, typically by applying a threshold value. It
interacts with the data mining engine to focus the search towards interesting
patterns.

 GUI: The graphical user interface enables communication between the user and
the data mining system. It helps the user use the system easily and efficiently
without having to know the real complexity of the process. When the user
specifies a query, this module interacts with the data mining system and displays
the result in an easily understandable manner.

 Knowledge Base: The knowledge base is useful throughout the data mining
process. It is used to guide the search for the result patterns, and it may contain
user beliefs and data from user experience that are useful in the mining process.
The data mining engine may take inputs from the knowledge base to make the
results more accurate and reliable, and the pattern evaluation module interacts
with the knowledge base on a regular basis to get inputs and to update it.

1.4 Data Mining Functionalities:

Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. Data mining tasks can be classified into two categories: descriptive and
predictive.
 Descriptive mining tasks characterize the general properties of the data in the
database.
 Predictive mining tasks perform inference on the current data in order to make
predictions.
The different data mining functionalities are described below.

 Classification: Classification is the technique of categorizing elements in a
collection based on their predefined properties. In classification, a model is built
from instances whose class is already known, called training data, and this model
can then classify new instances whose class is unknown. Classification uses
methods such as if-then rules, decision trees, neural networks, or a set of
classification rules, and the resulting model can be applied to classify future data.
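
As an illustration (not part of the original material), the following sketch builds a simple decision-tree classifier with scikit-learn; the training instances and feature meanings are invented.

```python
# Minimal classification sketch with a decision tree (scikit-learn);
# the training data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Training data: each row is [age, income]; labels are known classes.
X_train = [[25, 30000], [47, 52000], [35, 41000], [52, 60000], [23, 28000]]
y_train = ["no", "yes", "no", "yes", "no"]          # e.g. "buys product"

model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)                          # learn if-then style rules

# Classify a new instance whose class label is unknown.
print(model.predict([[40, 48000]]))
```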

 Association Analysis: Association Analysis is also called Market Basket
Analysis. It is a very popular data mining methodology widely used in sales.
Association analysis helps to find relations between items that frequently occur
together. It is made up of sets of items and rules that describe how these items
are grouped within the transactions. Association rules are used to predict the
presence of an item in the database based on the presence of another item
identified as important.

Example: If a person buys popcorn at the theatre, then there is a 60% chance
that they will also buy a cold drink. In this way, a prediction can be made about
the shopping behavior of the consumer.
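
A minimal sketch of how the support and confidence behind such a rule could be computed over a toy set of transactions (the transactions are invented for illustration):

```python
# Toy support/confidence computation for the rule {popcorn} -> {cold drink};
# the five transactions are invented for illustration.
transactions = [
    {"popcorn", "cold drink"},
    {"popcorn", "cold drink", "nachos"},
    {"popcorn"},
    {"cold drink"},
    {"popcorn", "cold drink"},
]

n = len(transactions)
popcorn = sum(1 for t in transactions if "popcorn" in t)
both = sum(1 for t in transactions if {"popcorn", "cold drink"} <= t)

support = both / n              # fraction of all transactions containing both items
confidence = both / popcorn     # P(cold drink | popcorn)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```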

 Cluster Analysis: The cluster analysis process is similar to classification. In
cluster analysis, similar types of data are grouped together; the only difference is
that the class label is unknown. Clustering algorithms divide the data based on
similarities, so that the data within a group are more similar to each other than to
the data in other groups. Cluster analysis is used in machine learning, deep
learning, image processing, pattern recognition, NLP, etc.
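
For illustration, a minimal k-means clustering sketch with scikit-learn, using invented two-dimensional points; note that no class labels are supplied.

```python
# Minimal clustering sketch with k-means (scikit-learn);
# the 2-D points below are invented for illustration.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one dense region
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another dense region

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # cluster assignment for each point (no class labels given)
```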

 Data Characterization: Data characterization involves summarizing the general
features of the data, which can result in specific rules that define a target class.
To characterize the data without much user intervention or interaction, an
attribute-oriented induction technique is used, and the resulting characterization
can be visualized in the form of different types of graphs, charts, or tables.
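
As a loose illustration of summarizing the general features of a target class (a simplification of attribute-oriented induction), assuming an invented student data set:

```python
# Minimal data characterization sketch: summarize general features of a target
# class ("graduate" students) with pandas; the data is invented for illustration.
import pandas as pd

students = pd.DataFrame({
    "level":   ["graduate", "graduate", "undergrad", "graduate", "undergrad"],
    "gpa":     [3.6, 3.8, 2.9, 3.5, 3.1],
    "credits": [30, 42, 18, 36, 24],
})

target_class = students[students["level"] == "graduate"]
print(target_class[["gpa", "credits"]].describe())   # summarized general features
```
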
 Data Discrimination: Data discrimination compares the general features of a
target class of data with the general features of one or more contrasting classes.
This data mining functionality helps to separate distinct data sets based on
differences in their attribute values.

 Prediction: Prediction is among the most popular data mining functionalities; it
determines missing or unknown values in a data set. Linear regression models
based on previous data are used to make numeric predictions, which help
businesses forecast the outcome of a given event, positive or negative. There are
two types of predictions:

 Numeric Predictions – Predict a missing or unknown numeric value in a data
set.
 Class Predictions – Predict the class label using a previously built class
model.
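
A minimal numeric-prediction sketch using linear regression in scikit-learn; the historical spend and sales figures are invented for illustration.

```python
# Minimal numeric prediction sketch with linear regression (scikit-learn);
# the historical data below is invented for illustration.
from sklearn.linear_model import LinearRegression

# Previous data: advertising spend (feature) vs. observed sales (target).
X_hist = [[10], [20], [30], [40], [50]]
y_hist = [25, 45, 62, 84, 105]

model = LinearRegression().fit(X_hist, y_hist)

# Predict the unknown sales value for a new spend level.
print(model.predict([[60]]))
```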

 Outlier Analysis: If some data cannot be grouped into any class, we use the
outlier analysis technique. Outlier analysis helps to assess data quality, since an
outlier usually indicates a data abnormality: the more outliers a data set contains,
the lower its quality. It is hard to identify patterns or derive conclusions from data
sets with a large number of outliers, so outlier analysis helps check whether the
data can be used for analysis after some clean-up. Nevertheless, it is still
important to keep track of unusual data and activities so that anomalies can be
detected early and any business impact can be anticipated in advance.
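
For illustration, a simple z-score rule is one possible way to flag outliers; the measurement values below are invented.

```python
# Minimal outlier analysis sketch using a z-score rule; the measurements
# below are invented for illustration.
from statistics import mean, stdev

values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]   # 25.0 looks abnormal

mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) / sigma > 2]   # > 2 std deviations
print(outliers)
```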

 Evolution Analysis: Evolution Analysis refers to the study of data sets that may
have been through a phase of transformation or change. The evolution analysis
models capture evolutionary trends in data, which further contributes to data
characterization, classification, or discrimination and clustering for multivariate
time series.

1.5 Kinds of Data on which Data Mining is Performed

There are different types of data on which data mining is performed:

 Relational Databases: A relational database is a collection of records that are
linked to one another using a set of pre-defined constraints. These records are
arranged in rows and columns in the form of tables, and tables are used to store
data about the items to be described in the database.
In relational databases, the database structure can be defined using a physical
and a logical schema. The logical schema describes the tables and how they are
related to one another, while the physical schema describes how the data is
actually stored.
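
A minimal sketch of two related tables using Python's built-in sqlite3 module; the schema and records are invented for illustration.

```python
# Minimal relational database sketch with sqlite3 (rows/columns in tables
# linked through a foreign-key constraint); all names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER "
            "REFERENCES customers(id), amount REAL)")
con.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
con.execute("INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 45.0)")

# Records in the two tables are linked through the customer_id constraint.
rows = con.execute("""SELECT c.name, SUM(o.amount)
                      FROM customers c JOIN orders o ON o.customer_id = c.id
                      GROUP BY c.name""").fetchall()
print(rows)
```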

 Data warehouses: A data warehouse is built by combining data from several
heterogeneous sources according to a defined set of rules, enabling analytical
reporting, standardized and/or ad hoc queries, and decision making. Data
warehousing requires data cleaning, data integration, and data storage. To
support historical analysis, a data warehouse typically preserves several months
or years of data. The data in a data warehouse is usually loaded from multiple
data sources by an extract, transform, and load (ETL) process. Modern data
warehouses shift towards an extract, load, transform (ELT) architecture, in which
all or much of the data transformation is carried out on the database that hosts
the data warehouse. It is important to remember that describing the ETL
(Extraction, Transformation, and Loading) method is a very significant part of a
data warehouse design initiative; ETL activities are the backbone of the data
warehouse.
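
A minimal ETL sketch in Python, assuming two invented heterogeneous sources and an in-memory SQLite table standing in for the warehouse.

```python
# Minimal ETL sketch: extract from two heterogeneous sources, transform to a
# common shape, and load into a warehouse table; all names are illustrative.
import sqlite3

def extract():
    shop_a = [{"sku": "A1", "qty": 3, "price": 2.0}]            # e.g. a CSV export
    shop_b = [{"item": "A1", "units": 5, "unit_price": 2.1}]    # e.g. an API feed
    return shop_a, shop_b

def transform(shop_a, shop_b):
    rows = [(r["sku"], r["qty"] * r["price"]) for r in shop_a]
    rows += [(r["item"], r["units"] * r["unit_price"]) for r in shop_b]
    return rows                                  # unified (sku, revenue) records

def load(rows):
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE fact_sales (sku TEXT, revenue REAL)")
    wh.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    return wh

warehouse = load(transform(*extract()))
print(warehouse.execute(
    "SELECT sku, SUM(revenue) FROM fact_sales GROUP BY sku").fetchall())
```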

 Transactional Databases: A transaction is, in technical terms, a sequence of
actions that is treated as a single unit of work. A transaction is considered
complete only if all the activities that are part of it finish successfully; if any of
them fails, the whole transaction is treated as a failure and all its actions must be
rolled back or undone.

Every database transaction has a given starting point, followed by steps that
change the data inside the database. In the end, the database either commits the
changes to make them permanent or rolls them back to the starting point, after
which the transaction can be tried again.
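
A minimal commit/rollback sketch using Python's sqlite3 module; the accounts schema and balances are invented for illustration.

```python
# Minimal transaction sketch with sqlite3: either every step commits together,
# or everything rolls back to the starting point; the schema is illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
con.commit()

try:
    con.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    con.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    con.commit()                 # both steps succeed: make the changes permanent
except sqlite3.Error:
    con.rollback()               # any failure: undo every step of the transaction

print(con.execute("SELECT * FROM accounts").fetchall())
```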

 Database Management System: A DBMS is an application for database
development and management. It offers a structured way for users to create,
retrieve, update, and manage data. A person who uses a DBMS to communicate
with the database need not be concerned about how and where the data is
processed; the DBMS takes care of it.
A DBMS stores data in a structured manner and records information that has
some significance. As an example, to create a student database we would add
attributes such as student ID, student name, student address, student mobile
number, and student email, with every student having the same record type. The
DBMS provides the end user with reliable, consistent access to this data.

 Advanced Database Systems: Specialized database management systems
target a new range of databases such as NoSQL and NewSQL. New
developments in data storage have been driven by application demands, such
as support for predictive analytics, research, and data processing, which are also
supported by advanced database management systems. Advanced data
management has always been at the center of effective database and
information systems. It deals with a wealth of different data models and covers
the foundations of structuring, sorting, storing, and querying data according to
these models.

1.6 Applications of Data Mining

There are many measurable benefits that have been achieved in different application
areas from data mining. So, let’s discuss different applications of Data Mining:

 Scientific Analysis: Scientific simulations generate bulks of data every day. This
includes data collected from nuclear laboratories, data about human psychology,
etc. Data mining techniques are capable of analyzing these data. Today we can
capture and store new data faster than we can analyze the data already
accumulated.

 Business Transactions: Every transaction in the business industry is typically
recorded for perpetuity. Such transactions are usually time-related and can be
inter-business deals or intra-business operations. The effective and timely use of
this data for competitive decision-making is definitely one of the most important
problems to solve for businesses that struggle to survive in a highly competitive
world. Data mining helps to analyze these business transactions and identify
marketing approaches and support decision-making.

 Market Basket Analysis: Market Basket Analysis is a technique that involves the
careful study of purchases made by a customer in a supermarket. It identifies
patterns of items frequently purchased together by customers. This analysis can
help companies promote deals, offers, and sales, and data mining techniques
help to achieve this analysis task.

 Education: For analyzing the education sector, data mining uses the Educational
Data Mining (EDM) method. This method generates patterns that can be used
both by learners and educators. By using EDM we can perform various
educational tasks.

 Research: Data mining techniques can perform predictions, classification,
clustering, association, and grouping of data with precision in the research area.
Rules generated by data mining are useful for finding results. In most technical
research in data mining, we create a training model and a testing model. The
train/test approach is a strategy to measure the precision of the proposed model:
we split the data set into two sets, a training data set and a testing data set. The
training data set is used to build the model, whereas the testing data set is used
to evaluate it.
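
A minimal train/test sketch with scikit-learn; the data set and model choice are invented for illustration.

```python
# Minimal train/test sketch (scikit-learn): split the data, fit on the training
# set, and measure accuracy on the held-out testing set.
# The data set below is invented for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[i, i % 3] for i in range(30)]
y = [1 if i % 3 == 0 else 0 for i in range(30)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # training model
print(accuracy_score(y_test, model.predict(X_test)))      # testing model score
```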

 Healthcare and Insurance: The pharmaceutical sector can examine its sales
force activity and outcomes to improve the targeting of high-value physicians and
figure out which marketing activities will have the best effect in the upcoming
months. In the insurance sector, data mining can help predict which customers
will buy new policies, identify behavior patterns of risky customers, and identify
fraudulent behavior.

 Fraud Detection: Billions of dollars have been lost to fraud. Traditional methods
of fraud detection are time-consuming and complex. Data mining aids in
providing meaningful patterns and turning data into information; any information
that is valid and useful is knowledge. A good fraud detection system should also
protect the information of all users. A supervised method involves collecting a
sample of records classified as fraudulent or non-fraudulent; a model is built from
this data, and the algorithm then identifies whether a new record is fraudulent or
not.

 Energy Industry: Big Data is available even in the energy sector nowadays,
which points to the need for appropriate data mining techniques. Decision tree
models and support vector machines are among the most popular approaches
in the industry, providing feasible solutions for decision-making and
management. Additionally, data mining can achieve productive gains by
predicting power outputs and the clearing price of electricity.

 Telecommunication Industry: The telecommunication industry is expanding and
growing at a fast pace, especially since the advent of the internet. Data mining
can enable key industry players to improve their service quality to stay ahead in
the game.

 Retail Industry: The organized retail sector holds sizable quantities of data
points covering sales, purchasing history, delivery of goods, consumption, and
customer service. The databases have become even larger with the arrival of e-
commerce marketplaces.
