
MODULE 2

Introduction to data mining (DM):


Motivation for Data Mining - Data Mining-Definition and Functionalities –
Classification of DM Systems - DM task primitives - Integration of a Data Mining
system with a Database or a Data Warehouse - Issues in DM – KDD Process
Data Pre-processing: Why to pre-process data? - Data cleaning: Missing Values,
Noisy Data - Data Integration and transformation - Data Reduction: Data cube
aggregation, Dimensionality reduction - Data Compression - Numerosity Reduction
- Data Mining Primitives - Languages and System Architectures: Task relevant data
- Kind of Knowledge to be mined - Discretization and Concept Hierarchy
Motivation for Data Mining
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting
through large amounts of data stored in repositories, using pattern recognition technologies
as well as statistical and mathematical techniques. It is the analysis of observational data sets to
discover unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner.
It is the process of selecting, exploring, and modeling large quantities of data to find
regularities or relations that are at first unknown, with the aim of obtaining clear and useful
results for the owner of the database.
It is not limited to the use of computer algorithms or statistical techniques. It is a process of
business intelligence that can be used together with information technology to support company
decisions.
Data mining is closely related to data science. It is carried out by a person, in a particular
situation, on a specific data set, with an objective in mind. It encompasses several kinds of tasks,
including text mining, web mining, audio and video mining, image mining, and social media
mining, and it is performed with software that ranges from general-purpose to highly specialized.
Data mining has attracted a great deal of attention in the information industry and in society as a
whole in recent years, because of the wide availability of huge amounts of data and the imminent
need for turning such data into useful information and knowledge. The information and knowledge
gained can be used for applications ranging from market analysis, fraud detection, and customer
retention, to production control and science exploration.
Data mining can be viewed as a result of the natural evolution of information technology. The
database industry has followed an evolutionary path in the development of the following
functionalities: data collection and database creation, data management, and advanced
data analysis.
For example, the early development of data collection and database creation mechanisms served as
a prerequisite for the later development of effective mechanisms for data storage and retrieval,
and for query and transaction processing. With numerous database systems offering query and
transaction processing as common practice, advanced data analysis has naturally become the next
target.
Data can be stored in many types of databases and data repositories. One data repository
architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data
sources organized under a unified schema at a single site in order to facilitate management
decision making.
Data warehouse technology includes data cleaning, data integration, and online analytical
processing (OLAP), that is, analysis techniques with functionalities such as
summarization, consolidation, and aggregation, as well as the ability to view information from
multiple angles.
What motivated data mining? Why is it important?

The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and
science exploration.
The evolution of database technology

Data collection and database creation (1960s and earlier)
- Primitive file processing

Database management systems (1970s - early 1980s)
1) Hierarchical and network database systems
2) Relational database systems
3) Data modeling tools: entity-relationship models, etc.
4) Indexing and accessing methods: B-trees, hashing, etc.
5) Query languages: SQL, etc.; user interfaces, forms, and reports
6) Query processing and query optimization
7) Transactions, concurrency control, and recovery
8) Online transaction processing (OLTP)

Advanced database systems (mid-1980s - present)
1) Advanced data models: extended-relational, object-relational, etc.
2) Advanced applications: spatial, temporal, multimedia, active, stream and sensor,
knowledge-based

Data warehousing and data mining (late 1980s - present)
1) Data warehouse and OLAP
2) Data mining and knowledge discovery: generalization, classification, association, clustering,
frequent pattern and outlier analysis, etc.
3) Advanced data mining applications: stream data mining, bio-data mining

Web-based databases (1990s - present)
1) XML-based database systems
2) Integration with information retrieval
3) Data and information integration

New generation of integrated data and information systems (present - future)
What is data mining?

Data mining refers to extracting or "mining" knowledge from large amounts of data.
There are many other terms related to data mining, such as knowledge mining,
knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
"Knowledge Discovery in Databases", or KDD.

Data Mining means “knowledge mining from data”.


Data mining means processing data to identify patterns and establish relationships. It is the
process of analyzing large amounts of data stored in a data warehouse for useful information;
it makes use of artificial intelligence techniques, neural networks, and advanced statistical
tools to reveal trends, patterns, and relationships that might otherwise remain undetected.
In addition, many other terms have a similar meaning to data mining—for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in the
process of knowledge discovery.
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present mined knowledge to users)
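
These seven steps form a pipeline in which each stage feeds the next. The following minimal,
self-contained Python sketch walks toy records through all seven steps; the data, the quantity
threshold, and the helper logic are invented purely for illustration and are not part of any
standard KDD tooling.

# Toy sources; a None item name plays the role of an inconsistency.
sales_db = [("bread", 2), ("milk", 1), (None, 3), ("bread", 1)]
web_log = [("bread", 1), ("jam", 5), ("milk", 2)]

# 1. Data cleaning: drop records with missing item names.
cleaned = [r for r in sales_db if r[0] is not None]

# 2. Data integration: combine both sources.
combined = cleaned + web_log

# 3. Data selection: keep only records relevant to the task.
selected = [r for r in combined if r[1] >= 1]

# 4. Data transformation: consolidate quantities per item.
totals = {}
for item, qty in selected:
    totals[item] = totals.get(item, 0) + qty

# 5. Data mining: extract a trivial "frequent item" pattern.
patterns = {item: qty for item, qty in totals.items() if qty >= 3}

# 6. Pattern evaluation and 7. knowledge presentation.
for item, qty in sorted(patterns.items()):
    print(f"frequent item: {item} (total quantity {qty})")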

Data Mining-Definition and Functionalities

Data mining is a methodology for discovering information in huge data sets.
Its main objective is to identify patterns, trends, or rules that
explain data behavior contextually. The data mining method uses mathematical
analysis to deduce patterns and trends that were not discoverable through older
methods of data exploration, which makes it a handy and extremely convenient
methodology for dealing with huge volumes of data. This section
explores the data mining functionalities that are used to specify the kinds of
patterns to be found in data sets.

Data mining functionalities are used to specify the kinds of patterns to be
discovered in data mining tasks. In general, data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of
the data in the database, while predictive mining tasks perform inference on the current
data in order to make predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes. The
target and contrasting classes can be specified by the user, and the corresponding data objects
retrieved through database queries.
Association analysis − It analyses the set of items that frequently occur together in a
transactional dataset. Two parameters are used for determining the association
rules −
Support, which identifies how frequently the item set appears in the database.
Confidence, which is the conditional probability that an item occurs in a transaction given that
another item occurs.
Classification − Classification is the process of discovering a model that describes and
distinguishes data classes or concepts, for the purpose of using the model to
predict the class of objects whose class label is unknown. The derived model is based
on the analysis of a set of training data (i.e., data objects whose class label is known).
Prediction − It is used to predict missing or unavailable data values or pending trends. An object
can be predicted based on the attribute values of the object and the attribute values of the
classes. It can be a prediction of missing numerical values or of increase/decrease trends in
time-related data.
Clustering − It is similar to classification, but the classes are not predefined; they are
derived from the data attributes. It is a form of unsupervised learning. The objects are clustered
or grouped based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or
cluster. These are data objects whose behaviour deviates from the general behaviour
of other data objects, and analyzing them can be essential for mining knowledge.
Evolution analysis − It describes the trends of objects whose behaviour changes over
time.

Classification of DM Systems –

DM task primitives
Data Mining Primitives:
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during the
discovery of knowledge.
The data mining task primitives include the following:
 Task-relevant data
 Kind of knowledge to be mined
 Background knowledge
 Interestingness measurement
 Presentation for visualizing the discovered patterns
Task-relevant data
This specifies the portions of the database or the set of data in which the user is interested.
It includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
The kind of knowledge to be mined
This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process
Knowledge about the domain is useful for guiding the knowledge discovery process and for
evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
An example of a concept hierarchy for the attribute (or dimension) age is discussed later in this
section. User beliefs regarding relationships in the data are another form of background
knowledge.
The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, charts, graphs, decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to
flexibly interact with data mining systems.
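
To make the five primitives concrete, one way a mining request might bundle them is as a plain
data structure, as in the Python sketch below. The field names, the sales table, and the
thresholds are illustrative assumptions, not a standard data mining query API.

mining_query = {
    # 1. Task-relevant data: which portion of the database to mine.
    "task_relevant_data": {
        "table": "sales",                          # hypothetical table
        "attributes": ["item", "price", "region"],
        "condition": "region = 'Canada'",
    },
    # 2. Kind of knowledge to be mined.
    "kind_of_knowledge": "association",
    # 3. Background knowledge: a concept hierarchy for age.
    "background_knowledge": {
        "age": ["value", "range", "young / middle_aged / senior"],
    },
    # 4. Interestingness measures and thresholds.
    "interestingness": {"min_support": 0.05, "min_confidence": 0.70},
    # 5. Presentation of discovered patterns.
    "presentation": ["rules", "tables", "charts"],
}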

Data Mining Task Primitives


A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during
discovery to direct the mining process or examine the findings from different angles or depths.
The data mining primitives specify the following,

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because data mining covers a
wide spectrum of tasks, from data characterization to evolution analysis. Each task has different
requirements. The design of an effective data mining query language requires a deep
understanding of the power, limitation, and underlying mechanisms of the various kinds of data
mining tasks. This facilitates a data mining system's communication with other information
systems and integrates with the overall information processing environment.

List of Data Mining Task Primitives


A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes
or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.

The data collection process results in a new data relation, called the initial data relation. The
initial data relation can be ordered or grouped according to the conditions specified in the query.
This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to a physical relation in the database. Since
virtual relations are called views in the field of databases, the set of task-relevant data for
data mining is called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allows data to be mined at multiple levels of abstraction.

A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level,
more general concepts.
o Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and explicit
abstractions, making it easier to understand. It also compresses the data, so that mining
requires fewer input/output operations.
o Drilling Down - Specialization of data: Higher-level concept values are replaced by lower-level
concepts. Based on different user viewpoints, there may be more than one concept
hierarchy for a given attribute or dimension.

A typical concept hierarchy for the attribute (or dimension) age maps raw values to ranges such
as 20-29, and ranges to more general groups such as young, middle-aged, and senior; a small
sketch follows below. User beliefs regarding relationships in the data are another form of
background knowledge.
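
As a small illustration of rolling up along such a hierarchy, the Python sketch below maps raw
ages to ten-year ranges and then to more general groups. The cut points (40 and 60) are
assumptions chosen for the example.

def age_to_range(age):
    # Level 1 -> level 2: map a raw age to a ten-year interval label.
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def range_to_group(label):
    # Level 2 -> level 3: roll an interval up to a more general concept.
    lo = int(label.split("-")[0])
    if lo < 40:
        return "young"
    if lo < 60:
        return "middle_aged"
    return "senior"

for age in (23, 45, 67):
    r = age_to_range(age)
    print(age, "->", r, "->", range_to_group(r))
# 23 -> 20-29 -> young, 45 -> 40-49 -> middle_aged, 67 -> 60-69 -> senior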

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interestingness measures. They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. For example,
interestingness measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.

o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall
simplicity for human comprehension. For example, the more complex the structure of a
rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be.
Objective measures of pattern simplicity can be viewed as functions of the pattern
structure, defined in terms of the pattern size in bits or the number of attributes or operators
appearing in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty
measure for association rules of the form "A => B", where A and B are sets of items, is
confidence. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
o Utility (Support): The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such as support. The support of
an association pattern refers to the percentage of task-relevant data tuples (or transactions)
for which the pattern is true:
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
o Novelty: Novel patterns are those that contribute new information or increased
performance to the given pattern set, for example, a data exception. Another strategy
for detecting novelty is to remove redundant patterns.
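
The two formulas above can be computed directly from a transaction set. The following Python
sketch does so for a toy basket of four transactions (the data is invented for illustration):

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(a, b):
    # Fraction of all transactions containing both items.
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)

def confidence(a, b):
    # Fraction of the transactions containing a that also contain b.
    has_a = sum(1 for t in transactions if a in t)
    both = sum(1 for t in transactions if a in t and b in t)
    return both / has_a

print(support("bread", "milk"))     # 2 / 4 = 0.5
print(confidence("bread", "milk"))  # 2 / 3, about 0.67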

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are good
for presenting characteristic descriptions, whereas decision trees are common for classification.

Integration of a Data Mining System with a Database or a Data Warehouse

A data mining system is integrated with a database or data warehouse system so that it can carry
out its tasks effectively. A data mining system operates in an environment that requires
it to communicate with other data systems, such as a database system. The possible
schemes for integrating these systems are as follows −
No coupling − No coupling means that the data mining system does not use any function of a
database or data warehouse system. It retrieves data from a particular source (such as a file
system), processes the data using some data mining algorithms, and then stores the mining results
in another file.
Such a system, though simple, suffers from several limitations. First, a database system
offers a great deal of flexibility and efficiency at storing, organizing, accessing, and processing
data. Without using a database/data warehouse system, a data mining system may spend a
substantial amount of time finding, collecting, cleaning, and transforming data.
Loose coupling − In this scheme, the data mining system uses some facilities of a database or data
warehouse system. The data is fetched from a data repository managed by these systems, data mining
approaches are used to process the data, and the processed results are then stored either in a
file or in a designated place in a database or data warehouse. Loose coupling is better than no
coupling because it can fetch any portion of the data stored in databases by using query
processing or other system facilities (a short sketch after these four schemes illustrates the
idea).
Semitight coupling − In this scheme, efficient implementations of a few essential data mining
primitives are provided in the database/data warehouse system. These primitives can include
sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some
important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling means that the data mining system is smoothly integrated into
the database/data warehouse system. The data mining subsystem is treated as one functional
component of the information system.
Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of the database/data warehouse system.
Tight coupling is highly desirable because it facilitates efficient implementation of data mining
functions, high system performance, and an integrated information processing environment.
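
The loose coupling scheme can be pictured with a few lines of Python: the database system answers
a query, the mining happens outside it, and the results go to a separate file. The sales.db file
and its sales(item, region) table are assumptions made for this sketch, and the "mining" step is
just a frequency count.

import sqlite3
from collections import Counter

# Fetch task-relevant data through the database system's query processing.
con = sqlite3.connect("sales.db")                       # assumed DB file
rows = con.execute(
    "SELECT item FROM sales WHERE region = 'Canada'"    # assumed schema
).fetchall()
con.close()

# Process the data with a (trivial) mining step outside the DBMS.
counts = Counter(item for (item,) in rows)

# Store the mining results in a designated file.
with open("mining_results.txt", "w") as f:
    for item, n in counts.most_common():
        f.write(f"{item}\t{n}\n")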

Issues in DM

Major Issues in Data Mining


1. Mining Methodology
Mining various and new kinds of knowledge
Data mining covers a wide spectrum of data analysis and knowledge discovery tasks, from data
characterization and discrimination to association and correlation analysis, classification,
regression, clustering, outlier analysis, sequence analysis, and trend and evolution analysis. These
tasks may use the same database in different ways and require the development of numerous data
mining techniques.
Mining knowledge in multidimensional space
When searching for knowledge in large data sets, we can explore the data in multidimensional
space. That is, we can search for interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as (exploratory)
multidimensional data mining.
Data mining - an interdisciplinary effort
The power of data mining can be substantially enhanced by integrating new methods from
multiple disciplines. For example, to mine data with natural language text, it makes sense to fuse
data mining methods with methods of information retrieval and natural language processing. As
another example, consider the mining of software bugs in large programs. This form of mining,
known as bug mining, benefits from the incorporation of software engineering knowledge into
the data mining process.
Boosting the power of discovery in a networked environment
Most data objects reside in a linked or interconnected environment, whether it be the Web,
database relations, files, or documents. Semantic links across multiple data objects can be used to
advantage in data mining. Knowledge derived in one set of objects can be used to boost the
discovery of knowledge in a “related” or semantically linked set of objects.
Handling uncertainty, noise, or incompleteness of data
Data often contain noise, errors, exceptions, or uncertainty, or are incomplete. Errors and noise
may confuse the data mining process, leading to the derivation of erroneous patterns. Data
cleaning, data preprocessing, outlier detection and removal, and uncertainty reasoning are
examples of techniques that need to be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining
Not all the patterns generated by data mining processes are interesting. What makes a pattern
interesting may vary from user to user. Therefore, techniques are needed to assess the
interestingness of discovered patterns based on subjective measures. These estimate the value of
patterns with respect to a given user class, based on user beliefs or expectations. Moreover, by
using interestingness measures or user-specified constraints to guide the discovery process, we
may generate more interesting patterns and reduce the search space.
2. User Interaction
Interactive mining
Interactive mining should allow users to dynamically change the focus of a search, to refine
mining requests based on returned results, and to drill, dice, and pivot through the data and
knowledge space interactively, dynamically exploring “cube space” while mining.
Incorporation of background knowledge
Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process. Such knowledge can be used
for pattern evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages
High-level data mining query languages or other high-level flexible user interfaces will give
users the freedom to define ad hoc data mining tasks. This should facilitate specification of the
relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and
the conditions and constraints to be enforced on the discovered patterns. Optimization of the
processing of such flexible mining requests is another promising area of study.
Presentation and visualization of data mining results
How can a data mining system present data mining results, vividly and flexibly, so that the
discovered knowledge can be easily understood and directly usable by humans? This is especially
crucial if the data mining process is interactive. It requires the system to adopt expressive
knowledge representations, user-friendly interfaces, and visualization techniques.
3. Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Data mining algorithms must be efficient and scalable in order to effectively extract information
from huge amounts of data in many data repositories or in dynamic data streams.
Parallel, distributed, and incremental mining algorithms
Such algorithms first partition the data into “pieces.” Each piece is processed, in parallel, by
searching for patterns. The parallel processes may interact with one another. The patterns from
each partition are eventually merged. The high cost of some data mining processes and the
incremental nature of input promote incremental data mining, which incorporates new data
updates without having to mine the entire data “from scratch.” Such methods perform knowledge
modification incrementally to amend and strengthen what was previously discovered.
4. Diversity of Database Types
Handling complex types of data
Diverse applications generate a wide spectrum of new data types, from structured data such as
relational and data warehouse data to semi-structured and unstructured data; from stable data
repositories to dynamic data streams; from simple data objects to temporal data, biological
sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web
data, and social network data.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of networks, forming
gigantic, distributed, and heterogeneous global information systems and networks. The discovery
of knowledge from different sources of structured, semi-structured, or unstructured yet
interconnected data with diverse data semantics poses great challenges to data mining.
5. Data Mining and Society
Social impacts of data mining
With data mining penetrating our everyday lives, it is important to study the impact of data mining
on society. How can we use data mining technology to benefit society? How can we guard against
its misuse? The improper disclosure or use of data and the potential violation of individual privacy
and data protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining
Data mining will help scientific discovery, business management, economic recovery, and security
protection (e.g., the real-time discovery of intruders and cyberattacks). At the same time, it
poses the risk of disclosing individuals' personal information; privacy-preserving data mining
methods aim to obtain valid mining results without disclosing the underlying sensitive data
values.
Invisible data mining
We cannot expect everyone in society to learn and master data mining techniques. More and more
systems should have data mining functions built within so that people can perform data mining or
use data mining results simply by mouse clicking, without any knowledge of data mining
algorithms. Intelligent search engines and Internet-based stores perform such invisible data
mining by incorporating data mining into their components to improve their functionality and
performance.

KDD Process
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.

Knowledge Discovery in Databases is considered an automated, exploratory analysis and
modeling of vast data repositories. KDD is the organized procedure of identifying valid, useful,
and understandable patterns from huge and complex data sets. Data mining is the core of the KDD
process, involving the application of algorithms that explore the data, develop models, and
discover previously unknown patterns. The models are used for extracting knowledge from the data,
analyzing the data, and predicting new data.

The availability and abundance of data today make knowledge discovery and data mining a
matter of impressive significance and need. Given the recent growth of the field, it is not
surprising that a wide variety of techniques is now available to specialists and experts.
The KDD Process
The knowledge discovery process is iterative and interactive, and comprises nine steps. The
process is iterative at each stage, implying that moving back to previous steps might be
required. The process has many imaginative aspects, in the sense that one cannot present a single
formula or a complete scientific categorization for the correct decisions at each step and for
each application type. Thus, it is necessary to understand the process and the different
requirements and possibilities at each stage.

The process begins with determining the KDD objectives and ends with the implementation of
the discovered knowledge. At that point, the loop is closed, and active data mining starts.
Subsequently, changes would need to be made in the application domain (for example, offering
various features to cell phone users in order to reduce churn). This closes the loop, the effects
are then measured on the new data repositories, and the KDD process is run again. Following is a
concise description of the nine-step KDD process, beginning with a managerial step:

1. Building up an understanding of the application domain

This is the initial preparatory step. It sets the scene for understanding what should be done
with the many decisions to follow (transformation, algorithms, representation, etc.). The people
who are in charge of a KDD venture need to understand and characterize the objectives of the
end-user and the environment in which the knowledge discovery process will take place (including
relevant prior knowledge).

2. Choosing and creating a data set on which discovery will be performed

Once the objectives are defined, the data that will be used for the knowledge discovery process
should be determined. This includes discovering what data is available, obtaining additional
necessary data, and then integrating all the data for the knowledge discovery into one data set,
including the attributes that will be considered for the process. This step is important because
data mining learns and discovers from the available data, which is the evidence base for
constructing the models. If some important attributes are missing, the entire study may fail; in
this respect, the more attributes that are considered, the better. On the other hand, organizing,
collecting, and operating advanced data repositories is expensive, so there is a trade-off between
thoroughness and cost. This trade-off is one place where the interactive and iterative nature of
KDD comes into play: one starts with the best available data sets and later expands them,
observing the effect on the knowledge discovery and modeling.

3. Preprocessing and cleansing

In this step, data reliability is enhanced. This includes data cleaning, for example, handling
missing values and removing noise or outliers. It might involve complex statistical methods or
using a data mining algorithm in this context. For example, when one suspects that a certain
attribute is of insufficient reliability or has many missing values, this attribute could become
the target of a supervised data mining algorithm: a prediction model for the attribute is created,
and the missing values can then be predicted. The extent to which one pays attention to this step
depends on many factors. In any case, studying these aspects is important and often revealing in
itself regarding enterprise information systems.

4. Data Transformation

In this stage, appropriate data for the data mining is prepared and developed.
Techniques here include dimension reduction (for example, feature selection and extraction,
and record sampling) and attribute transformation (for example, discretization of numerical
attributes and functional transformations). This step can be crucial for the success of the entire
KDD project, and it is usually very project-specific. For example, in medical examinations, the
quotient of attributes may often be the most significant factor, rather than each attribute by
itself. In business, we may need to consider effects beyond our control as well as efforts and
temporal issues, such as studying the effect of advertising accumulation. However, even if we do
not use the right transformation at the start, we may obtain a surprising effect that hints at the
transformation needed in the next iteration. Thus, the KDD process reflects upon itself and leads
to an understanding of the transformation required.

5. Prediction and description


We are now ready to decide which kind of data mining to use: for example, classification,
regression, or clustering. This decision depends mostly on the KDD objectives and on the previous
steps. There are two major goals in data mining: the first is prediction, and the
second is description. Prediction is often referred to as supervised data mining, while
descriptive data mining includes the unsupervised and visualization aspects of data mining.
Most data mining techniques are based on inductive learning, where a model is constructed
explicitly or implicitly by generalizing from a sufficient number of training examples. The
underlying assumption of the inductive approach is that the trained model is applicable to future
cases. The strategy also takes into account the level of meta-learning for the particular set of
available data.

6. Selecting the Data Mining algorithm

Having chosen the technique, we now decide on the tactics. This stage includes selecting the
specific method to be used for searching for patterns, which may involve multiple inducers. For
example, considering precision versus understandability, the former is better achieved with neural
networks, while the latter is better achieved with decision trees. For each strategy of
meta-learning, there are several possibilities for how it can be applied. Meta-learning focuses on
explaining what causes a data mining algorithm to be successful or not on a particular problem;
thus, this methodology attempts to understand the conditions under which a data mining algorithm
is most appropriate. Each algorithm also has parameters and strategies of learning, such as
ten-fold cross-validation or another division into training and testing sets.

7. Utilizing the Data Mining algorithm

At last, we reach the implementation of the data mining algorithm. In this stage, we may need
to apply the algorithm several times until a satisfying result is obtained: for example, by
tuning the algorithm's control parameters, such as the minimum number of instances in a single
leaf of a decision tree.

8. Evaluation

In this step, we assess and interpret the mined patterns and rules with respect to the objectives
defined in the first step. Here we also consider the preprocessing steps with respect to their
effect on the data mining algorithm's results (for example, adding a feature in step 4 and
repeating from there). This step focuses on the comprehensibility and utility of the induced
model, and the discovered knowledge is also documented for further use. The last step is the use
of, and overall feedback on, the discovery results obtained by data mining.

9. Using the discovered knowledge

Now we are ready to incorporate the knowledge into another system for further action. The
knowledge becomes active in the sense that we may make changes to the system and measure
the effects. The success of this step determines the effectiveness of the entire KDD process.
There are many challenges in this step, such as losing the "laboratory conditions" under which
we have operated. For example, the knowledge was discovered from a certain static snapshot
(usually a set of data), but now the data becomes dynamic: data structures may change (certain
attributes may become unavailable), and the data domain may be modified (an attribute may take
values that were not anticipated previously).

Data Pre-processing: Why to pre-process data? - Data cleaning: Missing Values,


Noisy Data - Data Integration and transformation - Data Reduction: Data cube
aggregation, Dimensionality reduction - Data Compression - Numerosity Reduction
- Data Mining Primitives - Languages and System Architectures: Task relevant data
- Kind of Knowledge to be mined - Discretization and Concept Hierarchy

Preprocessing in Data Mining:


Data preprocessing is a data mining technique that is used to transform raw data into a
useful and efficient format.
Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It
is also an important step in data mining as we cannot work with raw data. The quality of the
data should be checked before applying machine learning or data mining algorithms.

Why is Data preprocessing important?

Preprocessing of data is mainly concerned with checking data quality. Quality can be assessed on
the following dimensions:

 Accuracy: whether the data entered is correct.
 Completeness: whether all required data is recorded and available.
 Consistency: whether the same data kept in different places matches.
 Timeliness: whether the data is updated in a timely fashion.
 Believability: whether the data is trustworthy.
 Interpretability: how easily the data can be understood.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
performed. It involves handling missing data, noisy data, etc.

 (a). Missing Data:

This situation arises when some values are missing in the data. It can be handled in
various ways (a short code sketch follows this data cleaning list).
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the missing values:

There are various ways to do this task. You can choose to fill the
missing values manually, by the attribute mean, or by the most probable
value.
 (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated by faulty data collection, data entry errors, etc. It can be handled in the
following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size, and each segment is then
smoothed separately: all values in a segment can be replaced by the
segment's mean, or boundary values can be used (a worked example
appears under Data Reduction below).

2. Regression:
Here data can be smoothed by fitting it to a regression
function. The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may then be
detected as values that fall outside the clusters.
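
As promised above, here is a short pandas sketch of handling missing data: tuples can be dropped,
or missing values filled with the attribute mean. The small data frame is invented for
illustration.

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 38, None],
    "income": [30000, 42000, None, 51000, 39000],
})

# Option 1: ignore the tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill missing values with the attribute mean.
filled = df.fillna({"age": df["age"].mean(), "income": df["income"].mean()})
print(filled)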
2. Data Integration:
Data integration merges data from multiple sources into a coherent data store, dealing
with issues such as entity identification and redundancy.

3. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g., -1.0 to 1.0,
or 0.0 to 1.0).
Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process, may also be applied here.
Min-max normalization performs a linear transformation on the original data. Suppose that minA
and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps
a value, v, of A to v' in the range [new_minA, new_maxA] by computing
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA.
In z-score normalization (or zero-mean normalization), the values of an attribute A are
normalized based on the mean and standard deviation of A: a value, v, of A is normalized to
v' = (v - meanA) / std_devA.
Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute
A. The number of decimal points moved depends on the maximum absolute value of A: a value, v, of A
is normalized to v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. A code
sketch after this transformation list illustrates all three.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes, and
the most relevant ones are selected to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute "city" can be converted to "country".
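
Here is the code sketch promised under Normalization: the three formulas applied to a small
invented attribute using NumPy.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score (zero-mean) normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that every scaled value has absolute value below 1.
j = 0
while np.abs(v).max() / 10 ** j >= 1:
    j += 1
decimal = v / 10 ** j

print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(decimal)  # j = 4, so [0.02 0.03 0.04 0.06 0.1 ]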

4. Data Reduction:
Data mining is applied to huge volumes of data, and analysis becomes harder as the
volume grows. Data reduction techniques address this: they aim to increase storage
efficiency and reduce data storage and analysis costs, while yielding essentially the
same analytical results.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data in the construction of a data cube
(see the sketch after this list).

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and the p-value of
the attribute: an attribute whose p-value is greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example,
regression models.

4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless:
if the original data can be retrieved after reconstruction from the compressed data,
the reduction is called lossless; otherwise it is called lossy. Two effective methods
of dimensionality reduction are wavelet transforms and PCA (Principal Component
Analysis).
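
The sketch referenced under Data Cube Aggregation: a pandas groupby can play the role of
aggregating quarterly sales up to the yearly cells of a data cube. The sales figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 380, 510],
})

# Aggregate away the quarter dimension: one cell per year.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)
#    year  amount
# 0  2022    1568
# 1  2023    1602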

Binning method - Example
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning method - Example (Cont..)
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
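
A small Python sketch that reproduces both smoothing runs above; ties in the boundary rule are
broken toward the lower boundary, which matches the example's output.

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary.
def smooth_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [smooth_boundaries(b) for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]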

Data Mining Primitives

A data mining query is defined in terms of the following primitives


Task-relevant data: This is the database portion to be investigated. For example,
suppose that you are a manager of AllElectronics in charge of sales in the United
States and Canada, and in particular you would like to study the buying trends of
customers in Canada. Rather than mining on the entire database, you can specify only
the relevant portion; the attributes involved are referred to as the relevant attributes.
The kinds of knowledge to be mined: This specifies the data mining functions to
be performed, such as characterization, discrimination, association, classification,
clustering, or evolution analysis. For instance, if studying the buying habits of
customers in Canada, you may choose to mine associations between customer
profiles and the items that these customers like to buy
Background knowledge: Users can specify background knowledge, or knowledge
about the domain to be mined. This knowledge is useful for guiding the knowledge
discovery process, and for evaluating the patterns found. There are several kinds of
background knowledge.
Interestingness measures: These functions are used to separate uninteresting
patterns from knowledge. They may be used to guide the mining process or, after
discovery, to evaluate the discovered patterns. Different kinds of knowledge may
have different interestingness measures.

Languages and System Architectures:

Architecture of a typical data mining system/Major Components

Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components:

1. A database, data warehouse, or other information repository, which
consists of the set of databases, data warehouses, spreadsheets, or other
kinds of information repositories containing the data to be mined.
2. A database or data warehouse server which fetches the relevant data
based on users’ data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the
search or to evaluate the interestingness of resulting patterns. For
example, the knowledge base may contain metadata which describes data
from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for
tasks such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining
modules by employing interestingness measures to help focus the search
towards interesting patterns.
6. A graphical user interface that allows the user to interact with the
data mining system.

Architecture of a typical data mining system (layered, top to bottom): graphical user interface;
pattern evaluation; data mining engine (both consulting the knowledge base); database or data
warehouse server; data cleaning, data integration, and filtering; and, at the bottom, the
databases and data warehouses from which the task-relevant data is drawn.


Data mining tasks are designed to be semi-automatic or fully automatic and are run on large data
sets to uncover patterns such as groups or clusters (cluster analysis), unusual records (anomaly
detection), and dependencies (association and sequential pattern mining). Once patterns are
uncovered, they can be thought of as a summary of the input data, and further analysis may be
carried out using machine learning and predictive analytics. For example, the data mining step
might identify multiple groups in the data that a decision support system can then use. Note that
data collection, preparation, and reporting are not part of the data mining step itself.

There is a lot of confusion between data mining and data analysis. Data analysis is used
to test statistical models that fit a dataset (for example, the analysis of a marketing campaign),
whereas data mining uses machine learning and mathematical and statistical models to discover
patterns hidden in the data. Data mining activities can be divided into two categories:

o Descriptive Data Mining: It describes what is happening within the data, without any
prior hypothesis, by highlighting the common features of the data set, for example,
counts and averages.
o Predictive Data Mining: It provides predicted values for unlabeled attributes.
Using previously available or historical data, data mining can be used to make
predictions about critical business metrics, for example, predicting the volume of
business next quarter based on performance in the previous quarters over several
years, or judging from the findings of a patient's medical examinations whether the
patient is suffering from a particular disease.

Kind of Knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.

Discretization and Concept Hierarchy

Data Discretization

 Dividing the range of a continuous attribute into intervals.


 Interval labels can then be used to replace actual data values.
 Reduce the number of values for a given continuous attribute.
 Some classification algorithms only accept categorical attributes.
 This leads to a concise, easy-to-use, knowledge-level representation of mining
results.
 Discretization techniques can be categorized based on whether or not class
information is used, as follows:
o Supervised Discretization - This discretization process uses class
information.
o Unsupervised Discretization - This discretization process does not use class
information.
 Discretization techniques can be categorized based on the direction in which they
proceed, as follows:

Top-down Discretization -

 The process starts by finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals.

Bottom-up Discretization -

 Starts by considering all of the continuous values as potential split points.
 Removes some by merging neighboring values to form intervals, and then
recursively applies this process to the resulting intervals.

Concept Hierarchies

 Discretization can be performed recursively on an attribute to provide a hierarchical
partitioning of the attribute values, known as a Concept Hierarchy.
 Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
 In the multidimensional model, data are organized into multiple dimensions, and
each dimension contains multiple levels of abstraction defined by concept
hierarchies.
 This organization provides users with the flexibility to view data from different
perspectives.
 Data mining on a reduced data set means fewer input and output operations and is
more efficient than mining on a larger data set.
 Because of these benefits, discretization techniques and concept hierarchies are
typically applied before data mining, rather than during mining.
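
A brief sketch of unsupervised, equal-width discretization with pandas: raw ages are replaced by
interval labels that could serve as the lowest level of a concept hierarchy. The ages and the
three labels are assumptions made for the example.

import pandas as pd

ages = pd.Series([23, 31, 35, 42, 47, 55, 61, 68])

# Equal-width binning into 3 intervals; labels replace the raw values.
labels = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])
print(list(labels))
# ['young', 'young', 'young', 'middle_aged', 'middle_aged',
#  'senior', 'senior', 'senior']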
