Data Mining Chapter 1


What Is Data Mining?

Data mining is the practice of automatically searching large stores of data
to discover patterns and trends that go beyond simple analysis. Data
mining uses sophisticated mathematical algorithms to segment the data
and evaluate the probability of future events. Data mining is also known as
Knowledge Discovery in Data (KDD).

The key properties of data mining are:

• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large data sets and databases

Data mining can answer questions that cannot be addressed through
simple query and reporting techniques.

Why is data mining important?


So why is data mining important? You’ve seen the staggering numbers –
the volume of data produced is doubling every two years. Unstructured
data alone makes up 90 percent of the digital universe. But more
information does not necessarily mean more knowledge.

Data mining allows you to:

• Sift through all the chaotic and repetitive noise in your data.
• Understand what is relevant and then make good use of that
  information to assess likely outcomes.
• Accelerate the pace of making informed decisions.
1.3 What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of
data as long as the data are meaningful for a target application. The
most basic forms of data for mining applications are database data
(Section 1.3.1), data warehouse data (Section 1.3.2), and
transactional data (Section 1.3.3). The concepts and techniques
presented in this book focus on such data. Data mining can also be
applied to other forms of data (e.g., data streams, ordered/sequence
data, graph or networked data, spatial data, text data, multimedia
data, and the WWW). We present an overview of such data in Section
1.3.4. Techniques for mining of these kinds of data are briefly
introduced in Chapter 13.
1.4 What Kinds of Patterns Can Be Mined?
We have observed various types of data and information repositories on
which data mining can be performed. Let us now examine the kinds of
patterns that can be mined. There are a number of data mining
functionalities. These include characterization and discrimination
(Section 1.4.1); the mining of frequent patterns, associations, and
correlations (Section 1.4.2); classification and regression (Section 1.4.3);
clustering analysis (Section 1.4.4); and outlier analysis (Section 1.4.5).
Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks. In general, such tasks can be classified into
two categories: descriptive and predictive. Descriptive mining tasks
characterize properties of the data in a target data set. Predictive mining
tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover,
are described below. In addition, Section 1.4.6 looks at what makes a
pattern interesting. Interesting patterns represent knowledge.

KDD Process in Data Mining
Last Updated: 20-08-2019
Data Mining – Knowledge Discovery in Databases (KDD).
Why do we need data mining?
The volume of information we have to handle is increasing every day:
business transactions, scientific data, sensor data, pictures, videos, etc.
So, we need a system that is capable of extracting the essence of the
available information and that can automatically generate reports,
views, or summaries of the data for better decision-making.
Why is data mining used in business?
Data mining is used in business to make better managerial decisions by:
• Automatically summarizing data.
• Extracting the essence of the stored information.
• Discovering patterns in raw data.
Data mining, also known as Knowledge Discovery in Databases, refers
to the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data stored in databases.
Steps Involved in KDD Process:

1. Data Cleaning: Data cleaning is defined as the removal of noisy and
   irrelevant data from the collection.
   • Cleaning in case of missing values.
   • Cleaning noisy data, where noise is a random error or variance in a
     measured variable.
   • Cleaning with data discrepancy detection and data
     transformation tools.
2. Data Integration: Data integration is defined as combining
   heterogeneous data from multiple sources into a common source
   (data warehouse).
   • Data integration using data migration tools.
   • Data integration using data synchronization tools.
   • Data integration using the ETL (Extract-Transform-Load)
     process.
3. Data Selection: Data selection is defined as the process where data
   relevant to the analysis is decided on and retrieved from the data
   collection.
   • Data selection using neural networks.
   • Data selection using decision trees.
   • Data selection using Naive Bayes.
   • Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the
   process of transforming data into the appropriate form required by
   the mining procedure.
   Data transformation is a two-step process:
   • Data mapping: assigning elements from the source base to the
     destination to capture transformations.
   • Code generation: creation of the actual transformation
     program.
5. Data Mining: Data mining is defined as the application of clever
   techniques to extract potentially useful patterns.
   • Transforms task-relevant data into patterns.
   • Decides the purpose of the model using classification or
     characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying
   the truly interesting patterns representing knowledge, based on given
   interestingness measures.
   • Finds the interestingness score of each pattern.
   • Uses summarization and visualization to make the data
     understandable by the user.
7. Knowledge Representation: Knowledge representation is defined
   as a technique that uses visualization tools to represent data
   mining results.
   • Generate reports.
   • Generate tables.
   • Generate discriminant rules, classification rules,
     characterization rules, etc.
Note:
• KDD is an iterative process: evaluation measures can be enhanced,
  mining can be refined, and new data can be integrated and
  transformed in order to get different and more appropriate results.
• Preprocessing of databases consists of data cleaning and data
  integration.
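
The seven steps above can be sketched end to end on a toy example. The
following is a minimal illustration in Python, not a reference
implementation: the two record sources, the field names (customer, items),
the helper function names, and the 0.6 support threshold are all invented
for the example, and counting frequent item pairs merely stands in for
whatever mining method a real task would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources (all names and values are invented).
store_a = [
    {"customer": "c1", "items": ["Bread", "milk"]},
    {"customer": "c2", "items": ["bread", "butter", "milk"]},
    {"customer": "c3", "items": None},                 # missing value
]
store_b = [
    {"customer": "c4", "items": ["milk", "bread "]},   # stray whitespace
    {"customer": "c5", "items": ["butter"]},
    {"customer": "c2", "items": ["bread", "butter", "milk"]},  # also in store_a
]

def clean(records):
    """Step 1, data cleaning: drop missing item lists, normalise item names."""
    cleaned = []
    for rec in records:
        if not rec["items"]:
            continue
        items = sorted({item.strip().lower() for item in rec["items"]})
        cleaned.append({"customer": rec["customer"], "items": items})
    return cleaned

def integrate(*sources):
    """Step 2, data integration: merge sources, dropping duplicate records."""
    seen, merged = set(), []
    for source in sources:
        for rec in source:
            key = (rec["customer"], tuple(rec["items"]))
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

def select(records, min_items=2):
    """Step 3, data selection: keep baskets that can contain an item pair."""
    return [rec for rec in records if len(rec["items"]) >= min_items]

def transform(records):
    """Step 4, data transformation: reduce each record to a set of items."""
    return [frozenset(rec["items"]) for rec in records]

def mine_pairs(baskets):
    """Step 5, data mining: count how often each item pair occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.6):
    """Step 6, pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine_pairs(baskets), len(baskets))

# Step 7, knowledge representation: a plain-text report.
for (first, second), support in sorted(patterns.items()):
    print(f"{first} + {second}: support {support:.2f}")
```

Here support (the fraction of baskets containing a pair) plays the role of
the interestingness measure. Because KDD is iterative, a real analysis
would typically revisit the threshold or the selection step after looking
at the report.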

What is Machine Learning?

Machine learning is a subset of artificial intelligence. It focuses mainly on
designing systems that can learn from data and make predictions based on
what they have learned.

How does Machine Learning work?

In one common approach, a machine learning (ML) algorithm is trained on a
labelled or unlabelled training data set to produce a model. New input data
is then given to the model, which makes a prediction. The prediction is
evaluated for accuracy, and if the accuracy is acceptable the model is
deployed.

If the accuracy is not acceptable, the ML algorithm is trained again,
possibly several times, with an augmented training data set. This is only a
high-level example, as other steps are involved in practice. The sections
below go through the different types of machine learning, how each of them
works, and how each is used in various fields.
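
The train, evaluate, and retrain cycle described above can be sketched as
follows. This is a minimal illustration under stated assumptions: the
nearest-neighbour "model", the toy data, the 0.95 acceptance threshold,
and all function names are hypothetical and chosen only to keep the
example short.

```python
def train(training_set):
    """'Training' a 1-nearest-neighbour model just stores the labelled examples."""
    return list(training_set)

def predict(model, x):
    """Predict the label of the stored example closest to x."""
    nearest = min(model, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(model, labelled_data):
    """Fraction of labelled examples the model predicts correctly."""
    correct = sum(predict(model, x) == label for x, label in labelled_data)
    return correct / len(labelled_data)

# Toy task (invented): the label is "high" when x >= 7, otherwise "low".
validation_set = [(x, "high" if x >= 7 else "low") for x in range(10)]
training_set = [(0, "low"), (9, "high")]          # deliberately too small
extra_batches = [[(3, "low"), (8, "high")],       # data added on each retraining
                 [(6, "low"), (7, "high")]]

ACCEPTABLE = 0.95                                 # assumed acceptance threshold
model = train(training_set)
history = [accuracy(model, validation_set)]

for batch in extra_batches:
    if history[-1] >= ACCEPTABLE:
        break                                     # accurate enough: deploy
    training_set = training_set + batch           # augment the training data
    model = train(training_set)                   # train again
    history.append(accuracy(model, validation_set))

print(history)   # accuracy after the initial training and each retraining
```

The loop stops as soon as the measured accuracy reaches the threshold. In
practice the data used for this accept-or-retrain decision should be kept
separate from the data used for the final accuracy estimate.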
 
 
Types of Machine Learning
 
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Supervised Machine Learning

In supervised learning, you train your model on a labelled dataset, meaning
that you have both the raw input data and its results. The data is split into
a training dataset and a test dataset: the training dataset is used to train
the model, whereas the test dataset acts as new data for predicting results
and checking the accuracy of the model.

Hence, in supervised learning, the model learns from known results, much as
a teacher who already knows the answers teaches students. Because the
correct results are known, the accuracy of a supervised model can be
measured directly, and it is usually high.

Some algorithms for supervised learning


 
1. Linear Regression
2. Random Forest
3. Support Vector Machines (SVM)
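
As a small illustration of the training/test split, the sketch below fits
linear regression, the first algorithm listed above, by ordinary least
squares and then checks it on held-out examples. The data values, the
choice to hold out every fifth example, and the function names are
assumptions made for this example only.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Average absolute difference between predicted and known results."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)

# Invented labelled data: inputs x with known results y (roughly y = 2x + 1).
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1),
        (5, 10.9), (6, 13.2), (7, 14.8), (8, 17.1), (9, 18.9)]

# Hold out every fifth example as the test set and train on the rest.
test_set = data[4::5]
training_set = [point for point in data if point not in test_set]

model = fit_line(training_set)                  # learn from the known results
test_error = mean_abs_error(model, test_set)    # evaluate on unseen examples

print("slope, intercept:", model)
print("mean absolute error on test set:", test_error)
```

The test examples play no part in fitting the line, so the error measured
on them indicates how the model behaves on new data.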

Unsupervised Learning

In unsupervised learning, the information used to train is neither classified
nor labelled. Unsupervised learning studies how systems can infer a function
that describes hidden structure in unlabelled data. The main task of
unsupervised learning is to find patterns in the data.

Once a model has learned these patterns, it can assign new data to them, for
example in the form of clusters. The system does not figure out a "right"
output; instead, it explores the data and draws inferences that describe the
hidden structure of the unlabelled data.

Some algorithms available for unsupervised learning are


 
1. Principal Component Analysis Algorithm
2. K-means Algorithm
3. Singular Value Decomposition Algorithm
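
To make the idea of finding clusters in unlabelled data concrete, here is
a minimal k-means sketch. It is a simplified illustration rather than a
production implementation: the sample points are invented, and seeding
the centroids with the first k points is an assumption made to keep the
result reproducible (real implementations usually seed more carefully).

```python
def k_means(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)     # empty cluster: keep old centroid
        if new_centroids == centroids:
            break                             # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters

# Unlabelled points (invented): no class information is supplied.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0),
          (9.0, 9.5), (0.5, 1.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are given to the algorithm; the two groups emerge only from the
distances between the points.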

Reinforcement Learning

Reinforcement learning is a type of machine learning that allows software
agents and machines to automatically determine the ideal behaviour within a
specific context in order to maximize their performance. There is no labelled
dataset and there are no results associated with the data, so the only way to
perform a given task is to learn from experience.

For every correct action or decision, the algorithm receives positive
reinforcement (a reward), whereas for every incorrect action it receives
negative reinforcement (a penalty). In this way, it learns which actions to
perform and which to avoid. Reinforcement learning can therefore help
primarily in industrial automation and in the gaming sector.
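
The reward-and-penalty idea can be illustrated with tabular Q-learning,
one common reinforcement learning method, on a tiny invented task: an
agent in a five-cell corridor is rewarded for reaching the last cell and
penalised for walking into the left wall. Everything here (the corridor,
the reward values, the learning rate and discount factor) is an assumption
for the sketch. To keep the result reproducible, it also sweeps over every
state and action in a fixed order instead of exploring at random as a real
agent would.

```python
# Corridor of 5 cells; cell 4 is the goal. All numbers below are assumptions.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)            # step left, step right
ALPHA, GAMMA = 0.5, 0.9       # learning rate, discount factor

def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    nxt = state + action
    if nxt < 0:
        return state, -1.0, False     # walked into the left wall: penalty
    if nxt == GOAL:
        return nxt, 1.0, True         # reached the goal: reward
    return nxt, 0.0, False

# Q[state][i] estimates the long-run value of taking ACTIONS[i] in that state.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(200):
    # Fixed sweep over every (state, action) pair in place of random exploration.
    for state in range(GOAL):
        for i, action in enumerate(ACTIONS):
            nxt, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(Q[nxt])
            Q[state][i] += ALPHA * (target - Q[state][i])

# The learned policy picks the highest-valued action in each non-goal cell.
policy = [ACTIONS[row.index(max(row))] for row in Q[:GOAL]]
print(policy)
```

After training, the policy is to step right (+1) in every cell: moves that
lead towards the reward have accumulated higher values than moves that
lead away from it or into the wall.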

Major Issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major issues
regarding −

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be
  interested in different kinds of knowledge. Therefore it is necessary for
  data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The
  data mining process needs to be interactive because interaction allows
  users to focus the search for patterns, providing and refining data mining
  requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used
  to guide the discovery process and to express the discovered patterns, not
  only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query
  language that allows the user to describe ad hoc mining tasks should be
  integrated with a data warehouse query language and optimized for
  efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are
  discovered, they need to be expressed in high-level languages and visual
  representations. These representations should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to
  handle noise and incomplete objects while mining data regularities.
  Without data cleaning methods, the accuracy of the discovered patterns
  will be poor.
• Pattern evaluation − The patterns discovered may be uninteresting, either
  because they represent common knowledge or because they lack novelty, so
  measures of interestingness are needed to evaluate them.

Performance Issues
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to
  effectively extract information from the huge amounts of data in
  databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as
  the huge size of databases, the wide distribution of data, and the
  complexity of data mining methods motivate the development of parallel and
  distributed data mining algorithms. These algorithms divide the data into
  partitions, which are then processed in parallel, and the results from the
  partitions are merged. Incremental algorithms incorporate database updates
  without mining the data again from scratch.
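
The partition, process, and merge pattern, followed by an incremental
update, can be sketched as below. This is only an illustration of the
structure: the transactions are invented, simple item counting stands in
for a real mining algorithm, and the thread pool is an assumption (a real
system would more likely use separate processes or machines).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def split(transactions, n_parts):
    """Divide the data into n_parts roughly equal partitions."""
    return [transactions[i::n_parts] for i in range(n_parts)]

# Invented transaction data.
transactions = [["bread", "milk"], ["bread", "butter"], ["milk"],
                ["bread", "milk", "butter"], ["butter"], ["bread"]]

# Parallel step: each partition is counted independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_items, split(transactions, 3)))

# Merge step: combine the per-partition results into one.
total = sum(partial_counts, Counter())

# Incremental step: fold in new transactions without recounting the old ones.
new_transactions = [["milk", "eggs"]]
total.update(count_items(new_transactions))

print(total.most_common())
```

The merge works because item counts from separate partitions can simply be
added. Mining tasks whose partial results cannot be combined this directly
need a more elaborate merge step.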

Diverse Data Types Issues

• Handling of relational and complex types of data − The database may
  contain complex data objects, multimedia data objects, spatial data,
  temporal data, etc. It is not possible for one system to mine all these
  kinds of data.
• Mining information from heterogeneous databases and global information
  systems − The data is available at different data sources on a LAN or WAN.
  These data sources may be structured, semi-structured, or unstructured.
  Therefore mining knowledge from them adds challenges to data mining.

What is data mining?


In your answer, address the following:
Data mining is the process of extracting ("mining") interesting information
or patterns from large amounts of data so that decisions can be based on
them.
(a) Is it another hype?
Data mining is not another hype. The need for data mining arises from the
wide availability of huge amounts of data and the need to transform such
data into useful information on which decisions or analysis can be based.
Data mining is thus the result of the evolution of information technology.
(b) Is it a simple transformation of technology developed from databases,
statistics, and machine learning?
No. Data mining is more than a simple transformation of technology developed
from databases, statistics, and machine learning. It involves an
integration, rather than a simple transformation, of techniques from
multiple disciplines such as database technology, statistics, machine
learning, high-performance computing, pattern recognition, neural networks,
data visualization, information retrieval, image and signal processing, and
spatial data analysis.
(c) We have presented a view that data mining is the result of the evolution
of database technology. Do you think that data mining is also the result of
the evolution of machine learning research? Can you present such views based
on the historical progress of this discipline? Address the same for the
fields of statistics and pattern recognition.
Database technology began with the development of data collection and
database creation mechanisms that led to the development of effective
mechanisms for data management, including data storage, retrieval, query
and transaction processing. The large number of database systems offering
query and transaction processing eventually and naturally led to the need
for data analysis and understanding. Hence, data mining began its
development out of this necessity.
(d) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.
The steps of data mining as knowledge discovery are as follows:
- Data cleaning, a process that removes or transforms noise and
  inconsistent data
- Data integration, where multiple data sources may be combined
- Data selection, where data relevant to the analysis task are retrieved
  from the database
- Data transformation, where data are transformed or consolidated into
  forms appropriate for mining
- Data mining, an essential process where intelligent and efficient methods
  are applied in order to extract patterns
