
UNIT 1: INTRODUCTION

DATA MINING

The process of extracting information from huge sets of data in order to identify patterns, trends, and useful data that allow a business to take data-driven decisions is called Data Mining.

In other words, we can say that Data Mining is the process of investigating hidden patterns of information from various perspectives and categorizing them into useful data, which is collected and assembled in particular areas such as data warehouses and then used for efficient analysis and data mining algorithms, helping decision-making and other data requirements, to eventually cut costs and generate revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery in Data (KDD).

Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that is simple or highly specific.

Types of Data Mining

Data mining can be performed on the following types of data:

1. Relational Database: A relational database is a collection of multiple data sets formally organized by tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.

2. Data Warehouses: A Data Warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization.

3. Data Repositories: The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure.

4. Object-Relational Database: A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc. One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented modeling practices frequently used in many programming languages.

5. Transactional Database: A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.

Advantages of Data Mining

o The data mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
o It can be introduced in new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

Disadvantages of Data Mining

o There is a probability that organizations may sell useful customer data to other organizations for money. It has been reported, for example, that American Express sold the credit card purchases of its customers to other organizations.
o Much data mining analytics software is difficult to operate and requires advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very challenging task.
o Data mining techniques are not precise, which may lead to severe consequences in certain conditions.

DATA MINING APPLICATIONS

Data Mining is primarily used by organizations with intense consumer demands (retail, communication, financial, and marketing companies) to determine price, consumer preferences, product positioning, and impact on sales, customer satisfaction, and corporate profits.

1. Data Mining in Healthcare: Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category.

2. Data Mining in Market Basket Analysis: Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer, assisting the retailer in understanding the requirements of the buyer and altering the store's layout accordingly.

3. Data Mining in Education: The objectives of educational data mining are recognized as predicting students' future learning behavior, studying the impact of educational support, and promoting learning science. An organization can use data mining to make precise decisions and also to predict students' results. With the results, the institution can concentrate on what to teach and how to teach.

4. Data Mining in CRM (Customer Relationship Management): Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with the customer, a business organization needs to collect data and analyze it.

CHALLENGES IN DATA MINING

Although data mining is very powerful, it faces many challenges during its execution. The various challenges could be related to performance, data, methods, techniques, etc.

1. Incomplete and Noisy Data: The process of extracting useful data from large volumes of data is data mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to data measuring instruments or because of human errors.

2. Data Distribution: Real-world data is usually stored on various platforms in a distributed computing environment. It might be in a database, in individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data to a centralized data repository, mainly due to organizational and technical concerns.

3. Complex Data: Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images, complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information is a tough task.

4. Performance: The data mining system's performance relies primarily on the efficiency of the algorithms and techniques used. If the designed algorithms and techniques are not up to the mark, the efficiency of the data mining process will be affected adversely.

5. Data Visualization: In data mining, data visualization is a very important process, because it is the primary method that shows the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express, but many times, representing the information to the end-user in a precise and easy way is difficult. With the input data and the output information being complicated, very efficient and successful data visualization processes need to be implemented.

DATA MINING METRICS

The user of the data mining tools will have to direct the machine with rules, preferences, and even experiences to obtain decision support. Data mining metrics are as follows:

Usefulness: Usefulness involves several metrics that tell us whether the model provides useful information. For instance, a data mining model that correlates store location with sales can be both accurate and reliable but still not useful, because it cannot generalize that result by inserting more stores at the same location. Furthermore, it does not answer the fundamental business question of why specific locations have more sales. A model that appears successful may also turn out to be meaningless, because it depends on cross-correlations in the data.

Return on Investment (ROI): Data mining tools will find interesting patterns buried inside the data and develop predictive models. These models will have several measures denoting how well they fit the records. It is not always clear how to make a decision based on some of the measures reported as part of a data mining analysis.

Access to Financial Information during Data Mining: The simplest way to frame decisions in financial terms is to augment the raw information that is generally mined so that it also contains financial data. Some organizations are investing in and developing data warehouses and data marts. The design of a warehouse or mart involves considerations about the types of analyses and data needed for the expected queries. Designing warehouses in a way that allows access to financial information, along with access to more typical data on product attributes, user profiles, etc., can be useful.

DATA MINING FROM A DATABASE PERSPECTIVE

Data mining can be studied from many different perspectives, and researchers in many different fields have shown great interest in it. An information retrieval researcher would probably concentrate on the use of data mining techniques to access text data; a statistician might look primarily at the historical techniques, including time series analysis, hypothesis testing, and applications of Bayes' theorem; a machine learning specialist might be interested primarily in data mining algorithms that learn; and an algorithms researcher would be interested in studying and comparing algorithms based on type and complexity. The study of data mining from a database perspective involves looking at all types of data mining applications and techniques.

DATA MINING ARCHITECTURE

Data Source: The actual source of data is the database, the data warehouse, the World Wide Web (WWW), text files, and other documents. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data.

Different Processes: Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure, because the data may not be complete and accurate. So, the data first needs to be cleaned and unified.

Database or Data Warehouse Server: The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.

Data Mining Engine: The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc.

Pattern Evaluation Module: The pattern evaluation module is primarily responsible for measuring the interestingness of patterns by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.

Graphical User Interface: The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps the user to use the system easily and efficiently without knowing the complexity of the process. It cooperates with the data mining system when the user specifies a query or a task, and it displays the results.

Knowledge Base: The knowledge base is helpful in the entire process of data mining. It might be used to guide the search or to evaluate the interestingness of the result patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable, and the pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
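To make the flow between these components concrete, the following minimal Python sketch wires a toy data source, a cleaning step, a mining engine, and a pattern-evaluation threshold into one pipeline. Every name and number in it is an illustrative assumption, not part of any actual data mining system:

def clean(records):
    # "Different processes": drop incomplete rows before mining.
    return [r for r in records if None not in r.values()]

def mine_frequent_items(records):
    # Data mining engine: count item occurrences (a toy "association" task).
    counts = {}
    for r in records:
        item = r["item"]
        counts[item] = counts.get(item, 0) + 1
    return counts

def evaluate_patterns(counts, threshold):
    # Pattern evaluation module: keep only patterns above the threshold.
    return {item: c for item, c in counts.items() if c >= threshold}

if __name__ == "__main__":
    data_source = [                      # stands in for a database or warehouse
        {"item": "milk"}, {"item": "bread"}, {"item": None},
        {"item": "milk"}, {"item": "bread"}, {"item": "milk"},
    ]
    cleaned = clean(data_source)
    patterns = mine_frequent_items(cleaned)
    print(evaluate_patterns(patterns, threshold=2))  # {'milk': 3, 'bread': 2}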
UNIT 2: DATA MINING TECHNIQUES

A STATISTICAL PERSPECTIVE ON DATA MINING

The recent upsurge of interest in the field variously known as data mining, knowledge discovery, or machine learning has taken many statisticians by surprise. Data mining attacks such problems as obtaining efficient summaries of large amounts of data, identifying interesting structures and relationships within a data set, and using a set of previously observed data to construct predictors of future observations. Statisticians have well-established techniques for attacking all of these problems. Exploratory data analysis, a field particularly associated with J. W. Tukey [18], is a collection of methods for summarizing and identifying patterns in data. Many statistical models exist for explaining relationships in a data set or for making predictions: cluster analysis, discriminant analysis, and nonparametric regression can be used in many data mining problems. It is therefore tempting for a statistician to regard data mining as no more than a branch of statistics.

Nonetheless, the problems and methods of data mining have some distinct features of their own. Data sets can be very much larger than is usual in statistics, running to hundreds of gigabytes or terabytes. Data analyses are on a correspondingly larger scale, often requiring days of computer time to fit a single model. There are differences of emphasis in the approach to modelling: compared with statistics, data mining pays less attention to the large-sample asymptotic properties of its inferences and more to the general philosophy of "learning", including consideration of the complexity of models and of the computations that they require. Some modeling techniques, such as rule-based methods, are difficult to fit into the classical statistical framework, and others, such as neural networks, have an extensive methodology and terminology that has developed largely independently of input from statisticians.

BENEFITS OF DATA MINING:

1. Marketing/Retail: Marketing companies use data mining to create models based on historical data to forecast who will respond to new marketing campaigns such as direct mail, online marketing, etc. This means that marketers can sell profitable products to targeted customers.

2. Finance/Banking: Since data extraction provides financial institutions with information on loans and credit reports, data can determine good or bad credit by building a model from historical customers. It also helps banks detect fraudulent credit card transactions, which protects the credit card owner.

3. Helps in Decision Making: People use data mining techniques to help them make decisions in marketing or business. Today, with the use of this technology, all the relevant information can be determined, and one can decide precisely what is unknown and unexpected.

4. To Predict Future Trends: All the information factors are part of the working nature of the system, and data mining systems can derive them. They can help you predict future trends, which is entirely possible with the help of this technology, and people can also adapt to behavioural changes.

5. Increases Website Optimization: We use data mining to find all kinds of unseen element information, and data mining helps to optimize the website accordingly. Similarly, this data mining provides information that may use the technology of data mining.

MEASURE OF DISTANCE IN DATA MINING

Clustering consists of grouping certain objects that are similar to each other; it can be used to decide if two items are similar or dissimilar in their properties.

In a data mining sense, the similarity measure is a distance with dimensions describing object features. That means that if the distance between two data points is small, then there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.

Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects. The most popular distance measures used are:

1. Euclidean Distance: Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points. It is one of the most used metrics in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the root of the squared differences between the coordinates of two objects.

2. Manhattan Distance: This determines the absolute difference between pairs of coordinates. Suppose we have two points P and Q; to determine the distance between these points, we simply calculate the perpendicular distance of the points from the X-axis and Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|

3. Jaccard Index: The Jaccard distance measures the similarity of two data set items as the intersection of those items divided by the union of the data items.

Figure – Jaccard Index

4. Minkowski Distance: It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN). Consider two points P1: (X1, X2, ..., XN) and P2: (Y1, Y2, ..., YN). Then the Minkowski distance between P1 and P2 is given as:

(|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)

o When p = 2, the Minkowski distance is the same as the Euclidean distance.
o When p = 1, the Minkowski distance is the same as the Manhattan distance.

5. Cosine Index: The cosine distance measure for clustering determines the cosine of the angle between two vectors, given by the following formula:

cos(θ) = (A · B) / (||A|| ||B||)

Here θ (theta) gives the angle between the two vectors, and A and B are n-dimensional vectors.

Figure – Cosine Distance
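All of the distance measures above can be computed in a few lines. The following Python sketch is a minimal illustration (toy points, equal-length numeric coordinates assumed; the parameter r plays the role of p in the Minkowski formula):

import math

def euclidean(p, q):
    # root of the squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=2):
    # generalizes both: r = 1 gives Manhattan, r = 2 gives Euclidean
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(s, t):
    # similarity of two sets: |intersection| / |union|
    return len(s & t) / len(s | t)

P, Q = (1, 2), (4, 6)
print(euclidean(P, Q))                              # 5.0
print(manhattan(P, Q))                              # 7
print(minkowski(P, Q, r=1))                         # 7.0 (same as Manhattan)
print(round(cosine_similarity((1, 0), (1, 1)), 3))  # 0.707
print(jaccard({"milk", "bread"}, {"milk", "jam"}))  # 0.333...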
DECISION TREE

A decision tree is a supervised learning method used in data mining for classification and regression. It is a tree that helps us in decision-making. The decision tree creates classification or regression models as a tree structure: it separates a data set into smaller and smaller subsets while, at the same time, the decision tree is steadily developed. The final tree is a tree with decision nodes and leaf nodes. A decision node has at least two branches; the leaf nodes show a classification or decision, and no further splits can be made on them. The uppermost decision node in a tree, corresponding to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.

Key factors:

Entropy: Entropy refers to a common way to measure impurity. In a decision tree, it measures the randomness or impurity in data sets.

Information Gain: Information gain refers to the decline in entropy after the dataset is split; it is also called entropy reduction. Building a decision tree is all about discovering attributes that return the highest information gain. (A small numeric sketch of both measures is given at the end of this unit.)

In short, a decision tree is just like a flow chart diagram, with the terminal nodes showing decisions. Starting with the dataset, we can measure the entropy to find a way to segment the set until all the data belongs to the same class.

NEURAL NETWORKS

As businesses continue to accrue exponentially larger quantities of data, there is a corresponding and critical need for automated processes to handle and make sense of such volumes of information. For companies keen to mine big data effectively and understand it, neural networks in data mining are an inspired choice. The importance of neural networks, or nodes, is clear in their ability to detect and assimilate relationships between a range of variables.

A neural network is a series of algorithms that recognize underlying relationships in a set of data through a process that imitates the way the human brain operates.

The artificial neural network (ANN) assimilates data in the same way the human brain processes information. The brain's neurons process information in the form of electric signals: external information, or stimuli, is received and processed, and the brain then produces an output.

Similarly, neural networks reflect the behavior of the human brain, allowing computer programs to recognize patterns and solve common problems in the fields of artificial intelligence (AI), machine learning, and deep learning.

GENETIC ALGORITHM

A genetic algorithm in data mining is an advanced method of data classification. Data classification incorporates two steps, i.e. the learning step and the classification step. The classification model is constructed in the learning step, and in the classification step the model predicts the output for the provided input.

A genetic algorithm is based on the basic principle of natural evolution, where the fittest individuals survive in the end. We use the algorithm for solving optimization problems.

The genetic algorithm applies the same technique in data mining: it iteratively performs the selection, crossover, mutation, and encoding processes to evolve successive generations of models.

The components of genetic algorithms consist of:

o A population incorporating individuals.
o An encoding or decoding mechanism for the individuals.
o An objective function and an associated fitness evaluation criterion.
o A selection procedure.
o Genetic operators like recombination or crossover, and mutation.
o Probabilities to perform the genetic operations.
o A replacement technique.
o A termination condition.
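A minimal sketch of the genetic algorithm loop just described: the fitness function, bit-string encoding, population size, and rates below are all toy assumptions chosen only to show selection, crossover, mutation, and replacement working together:

import random

def fitness(bits):
    # toy objective: maximize the number of 1-bits in the chromosome
    return sum(bits)

def crossover(a, b):
    cut = random.randrange(1, len(a))   # recombination at a random point
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

random.seed(0)
population = [[random.randint(0, 1) for _ in range(12)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]           # selection of the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children     # replacement
print(fitness(max(population, key=fitness)))  # typically 12 on this toy problem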

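Returning to the decision tree key factors above, entropy and information gain can be computed directly. A small numeric sketch on a toy label list (not tied to any particular library):

import math
from collections import Counter

def entropy(labels):
    # impurity of a set of class labels: -sum(p * log2(p))
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, subsets):
    # entropy reduction: the parent's entropy minus the size-weighted
    # entropy of the subsets produced by a split
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

labels = ["yes", "yes", "yes", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no"]]     # a perfect split
print(round(entropy(labels), 3))                  # 0.971
print(round(information_gain(labels, split), 3))  # 0.971 (all impurity removed)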
UNIT 3: CLASSIFICATION

STATISTICAL BASED ALGORITHMS

There are two types of statistical based algorithms, which are as follows:

o Regression: Regression issues deal with the estimation of an output value based on input values. When used for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification issues, but it is also used for other applications, such as forecasting. The elementary form of regression is simple linear regression, which includes only one predictor and a prediction (a short least-squares sketch follows the Bayes theorem discussion below). Regression can be used to implement classification using two different methods:
  - Division: The data are divided into regions based on class.
  - Prediction: Formulas are created to predict the output class's value.

o Bayesian Classification: Statistical classifiers are used for the classification. Bayesian classification is based on the Bayes theorem. Bayesian classifiers display high accuracy and speed when applied to large databases.

Bayes Theorem: Let X be a data tuple. In the Bayesian method, X is treated as "evidence." Let H be some hypothesis, such as that the data tuple X belongs to a particular class C. The probability P(H|X) is determined in order to classify the data. This probability P(H|X) is the probability that hypothesis H holds given the "evidence," or observed data tuple X.

P(H|X) is the posterior probability of H conditioned on X. For instance, suppose the world of data tuples is confined to users described by the attributes age and income, and that X is a 30-year-old user with Rs. 20,000 income. Assume that H is the hypothesis that the user will purchase a computer. Then P(H|X) reflects the probability that user X will purchase a computer given that the user's age and income are known.

P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age, income, or any other data. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X.

Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.

P(H), P(X|H), and P(X) can be estimated from the given data. Bayes' theorem provides a method of computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by:

P(H|X) = P(X|H) P(H) / P(X)
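Plugging numbers into Bayes' theorem makes the computation concrete. The probabilities below are invented purely for illustration:

# Toy numbers (assumptions for illustration only):
# H = "user buys a computer", X = "user is 30 years old with Rs. 20,000 income"
p_h = 0.40          # prior P(H): fraction of all users who buy a computer
p_x = 0.10          # P(X): fraction of users matching X's age/income profile
p_x_given_h = 0.20  # P(X|H): fraction of buyers matching that profile

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.8 -> knowing age/income raises P(buys) from 0.40 to 0.80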

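The simple linear regression mentioned above (a single predictor and a prediction) can likewise be fitted by least squares in a few lines; the data points here are toy values invented for the example:

# Simple linear regression with one predictor, fitted by least squares.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))  # 1.99 0.09 -> close to y = 2x
print(slope * 6 + intercept)                 # prediction for a new input x = 6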
DECISION TREE INDUCTION ALGORITHM

A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D:

Algorithm: Generate_decision_tree

Input:
    Data partition D, which is a set of training tuples and their associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a splitting point or a splitting subset.

Output: A decision tree.

Method:

    create a node N;
    if tuples in D are all of the same class, C, then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D; // majority voting
    apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j; // a partition
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;

NEURAL NETWORK BASED ALGORITHM

Gradient Descent

We use the gradient descent algorithm to find the local minimum of a function. The algorithm converges to the local minimum by taking steps proportional to the negative of the gradient of the function. To find a local maximum, take steps proportional to the positive gradient of the function; that is the gradient ascent process. (A short numeric sketch of gradient descent is given after the rule-based classifier below.)

In linear models, the error surface is a well-defined and well-known mathematical object in the shape of a parabola, and the lowest point can be found by calculation. Unlike linear models, neural networks are complex nonlinear models, and here the error surface has an irregular layout, crisscrossed with hills, valleys, plateaus, and deep ravines. To find the lowest point on this surface, for which no maps are available, the user must explore it.

In this neural network algorithm, you move over the error surface by following the line with the greatest slope, which offers the possibility of reaching the lowest possible point. You then have to work out the optimal rate at which to travel down the slope. The correct speed is proportional to the slope of the surface and to the learning rate. The learning rate controls the extent of the modification of the weights during the learning process. Hence, the momentum of a neural network can affect the performance of a multilayer perceptron.

Evolutionary Algorithms

This algorithm is based on the concept of natural selection, or survival of the fittest, in biology. The concept of natural selection states that, for a given population, environmental conditions exert a pressure that results in the rise of the fittest in that population. To measure fitness in a given population, you can apply a fitness function as an abstract measure.

In the context of evolutionary algorithms, recombination is referred to as an operator: it is applied to two or more candidates, known as parents, and results in one or more new candidates, known as children. Mutation is applied to a single candidate and results in a new candidate. By applying recombination and mutation, we can obtain a set of new candidates to place in the next generation, based on their fitness measure.

The two basic elements of evolutionary algorithms in neural networks are:

o Variation operators (recombination and mutation)
o Selection process (selection of the fittest)

The common features of evolutionary algorithms are:

o Evolutionary algorithms are population-based.
o Evolutionary algorithms use recombination to mix candidates of a population and create new candidates.
o Evolutionary algorithms are based on random selection.

Among all these neural network algorithms, the genetic algorithm is the most common evolutionary algorithm.

RULE BASED CLASSIFIER

Rule-based classifiers are just another type of classifier which makes the class decision by using various "if...else" rules. These rules are easily interpretable, and thus these classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent, and the predicted class of each rule is called the consequent.

Properties of rule-based classifiers:

o Coverage: the percentage of records which satisfy the antecedent conditions of a particular rule.
o The rules generated by rule-based classifiers are generally not mutually exclusive, i.e. many rules can cover the same record.
o The rules generated by rule-based classifiers may not be exhaustive, i.e. there may be some records which are not covered by any of the rules.
o The decision boundaries created by them are linear, but they can be much more complex than those of a decision tree, because many rules can be triggered by the same record.

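A rule-based classifier of the kind described above can be written directly as an ordered list of (antecedent, consequent) pairs. This toy sketch uses made-up weather rules and resolves conflicts by firing the first matching rule, which is one common strategy:

# Each rule: (antecedent predicate, consequent class). Rules need not be
# mutually exclusive or exhaustive, as noted above.
rules = [
    (lambda r: r["outlook"] == "sunny" and r["humidity"] > 75, "no"),
    (lambda r: r["outlook"] == "overcast", "yes"),
    (lambda r: r["wind"] == "weak", "yes"),
]

def classify(record, default="no"):
    for antecedent, consequent in rules:
        if antecedent(record):    # the first matching rule fires
            return consequent
    return default                # fallback: the rules are not exhaustive

print(classify({"outlook": "sunny", "humidity": 80, "wind": "weak"}))  # no
print(classify({"outlook": "rain", "humidity": 60, "wind": "weak"}))   # yes

Note that the first record is covered by both the first and the third rule; first-match ordering is what decides the class here.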

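The gradient descent procedure described earlier in this section is easiest to see on the simplest possible error surface, a parabola, where the minimum is known to be at x = 0. The learning rate and starting point below are arbitrary choices:

def gradient(x):
    # derivative of the error surface f(x) = x**2
    return 2 * x

x = 5.0               # arbitrary starting point on the surface
learning_rate = 0.1   # controls the extent of each modification
for step in range(50):
    x = x - learning_rate * gradient(x)   # step against the gradient
print(round(x, 6))    # ~0: converged near the minimum of the parabola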
COMBINING TECHNIQUES

It is the process of combining two or more similar records into a single one. Merging is done to add variables to a dataset, to append or add cases or observations to a dataset, or to remove duplicates and other incorrect information. This process makes it easier and faster to analyze data stored in multiple locations, worksheets, or data tables. Merging data into a single point is necessary in certain situations, especially when an organization needs to add new cases, variables, or data based on lookup values. However, data merging needs to be performed with caution; otherwise, it can lead to duplication, inaccuracy, or inconsistency issues.

Data from multiple sources is merged in a number of scenarios:

o Digital transformation initiatives
o Driving business intelligence
o Integration after mergers and acquisitions, when data from different organizations are merged into one dataset
o Different applications, including customer relationship management, marketing automation tools, and website analytics tools, are merged for analysis, processing, and predictions

UNIT 4: CLUSTERING

Clustering is an unsupervised machine-learning-based algorithm that groups data points into clusters so that similar objects belong to the same group. Clustering helps to split data into several subsets; each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from our customer base is divided into clusters, for example, we can make an informed decision about who we think is best suited for a product.

Applications of cluster analysis in data mining:

o In many applications, clustering analysis is widely used, such as data analysis, market research, pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base; based on the purchasing patterns, they can characterize their customer groups.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to analyze the characteristics of each cluster.
o In terms of biology, it can be used to determine plant and animal taxonomies and the categorization of genes with the same functionalities, and to gain insight into structures inherent to populations.
o It helps in the identification of areas of similar land use in an earth observation database, and in the identification of house groups in a city according to house type, value, and geographical location.

A good clustering approach should satisfy the following requirements:

1. Scalability: Scalability in clustering implies that as we boost the amount of data objects, the time to perform clustering should approximately scale with the complexity order of the algorithm. For example, if we perform K-means clustering, we know it is O(n), where n is the number of objects in the data. If we raise the number of data objects 10-fold, then the time taken to cluster them should also increase approximately 10 times; that is, there should be a linear relationship. If that is not the case, then there is some error in our implementation.

2. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with arbitrary shape: The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measurements that tend to discover spherical clusters of small size.

4. Ability to deal with different types of attributes: Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.

5. Ability to deal with noisy data: Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor quality clusters.

6. High dimensionality: The clustering tools should be able to handle not only high-dimensional data spaces but also low-dimensional spaces.

CLUSTERING METHODS

1. Partitioning Clustering Method

In this method, let us say that "m" partitions are made on the "p" objects of the database; each partition will represent a cluster, and m < p. K is the number of groups after the classification of the objects. There are some requirements which need to be satisfied by this partitioning clustering method:

1. One object should belong to only one group.
2. There should be no group without even a single object.

There are some points which should be remembered about this type of partitioning clustering method (a short K-means sketch is given below, after the density-based method):

1. There will be an initial partitioning if we already give the number of partitions (say m).
2. There is a technique called iterative relocation, which means an object will be moved from one group to another to improve the partitioning.

2. Hierarchical Clustering Methods

Among the many different types of clustering in data mining, in this hierarchical clustering method the given set of data objects is formed into a kind of hierarchical decomposition. The formation of the hierarchical decomposition decides the purposes of the classification. There are two types of approaches for the creation of the hierarchical decomposition:

1. Divisive Approach: Another name for the divisive approach is the top-down approach. At the beginning of this method, all the data objects are kept in the same cluster. Smaller clusters are created by splitting the group using continuous iteration, and the iteration keeps going until the termination condition is met. One cannot undo a split or merge once it has been made, and that is why this method is not so flexible.

2. Agglomerative Approach: Another name for this approach is the bottom-up approach. All the groups are separated at the beginning; then the method keeps on merging groups until all the groups are merged, or the termination condition is met.

There are two approaches which can be used to improve the quality of hierarchical clustering in data mining:

1. One should carefully analyze the linkages of the objects at every partitioning of hierarchical clustering.
2. One can use a hierarchical agglomerative algorithm in an integrated way: first, the objects are grouped into micro-clusters, and after grouping the data objects into micro-clusters, macro-clustering is performed on the micro-clusters.

3. Density-Based Clustering Method

In this method of clustering in data mining, density is the main focus. The notion of mass is used as the basis for this clustering method, and the cluster keeps on growing continuously: for each point of data, at least a minimum number of points should lie within the radius of the group.
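The core density idea above, counting how many points fall within a given radius of each point, can be sketched directly (the radius and minimum-points values are toy choices, in the spirit of DBSCAN-style algorithms):

import math

def neighbors(points, p, radius):
    # all points within the given radius of p (including p itself)
    return [q for q in points if math.dist(p, q) <= radius]

points = [(1, 1), (1.2, 1.1), (0.9, 1.3), (8, 8), (8.1, 8.2), (5, 1)]
radius, min_points = 1.0, 3

# A point is "dense" (a core point) if enough points lie within its radius.
for p in points:
    dense = len(neighbors(points, p, radius)) >= min_points
    print(p, "core" if dense else "sparse")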

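For the partitioning clustering method described above, K-means is the classic example of iterative relocation between m groups. A compact sketch on toy data (the fixed iteration count and seeded initialization are simplifying assumptions):

import math
import random

def kmeans(points, m, iterations=10):
    random.seed(1)
    centers = random.sample(points, m)          # initial partitioning
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(m)]
        for p in points:                        # iterative relocation:
            i = min(range(m), key=lambda j: math.dist(p, centers[j]))
            groups[i].append(p)                 # assign p to nearest center
        for j, g in enumerate(groups):
            if g:                               # recompute each center
                centers[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return centers, groups

points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centers, groups = kmeans(points, m=2)
print(centers)  # one center near (1.5, 1.3), the other near (8.5, 8.3)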
4. Grid-Based Clustering Method

In this type of grid-based clustering method, a grid is formed over the objects: a grid structure is formed by quantizing the object space into a finite number of cells.

Advantages of the grid-based clustering method:

1. Faster processing time: The processing time of this method is much quicker than that of other methods, and thus it can save time.
2. The processing time depends only on the number of cells in the quantized space of each dimension.

5. Model-Based Clustering Methods

In this type of clustering method, every cluster is hypothesized so that the method can find the data which best fits the model. The density function is clustered to locate the groups in this method.

6. Constraint-Based Clustering Method

Application- or user-oriented constraints are incorporated to perform the clustering. The expectation of the user is referred to as the constraint. In this process of grouping, communication is very interactive, and it is guided by the constraints.

UNIT 5: ASSOCIATION RULE

Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently. Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. (A small sketch of the support and confidence computations behind such rules is given later in this unit.)

Types of Association Rules:

There are various types of association rules in data mining:

o Multi-relational association rules
o Generalized association rules
o Quantitative association rules
o Interval information association rules

1. Multi-relational association rules: Multi-Relational Association Rules (MRAR) are a new class of association rules, different from the original, simple, and even multi-relational association rules (usually extracted from multi-relational databases); each rule element consists of one entity but several relationships. These relationships represent indirect relationships between the entities.

2. Generalized association rules: Generalized association rule extraction is a powerful tool for getting a rough idea of the interesting patterns hidden in data. However, since patterns are extracted at each level of abstraction, the mined rule sets may be too large to be used effectively for decision-making. Therefore, in order to discover valuable and interesting knowledge, post-processing steps are often required. Generalized association rules should have categorical (nominal or discrete) properties on both the left and right sides of the rule.

3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must contain numeric attributes.

Uses of Association Rules

Some of the uses of association rules in different fields are given below:

o Medical Diagnosis: Association rules in medical diagnosis can be used to help doctors cure patients. As all of us know, diagnosis is not an easy thing, and there are many errors that can lead to unreliable end results. Using the multi-relational association rule, we can determine the probability of disease occurrence associated with various factors and symptoms.
o Market Basket Analysis: It is one of the most popular examples and uses of association rule mining. Big retailers typically use this technique to determine the associations between items.

PARALLEL ALGORITHM

An algorithm is a sequence of steps that takes inputs from the user and, after some computation, produces an output. A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result.

Concurrent Processing

The easy availability of computers, along with the growth of the internet, has changed the way we store and process data. We are living in a day and age where data is available in abundance. Every day we deal with huge volumes of data that require complex computing, and that too in quick time. Sometimes we need to fetch data from similar or interrelated events that occur simultaneously. This is where we require concurrent processing, which can divide a complex task and process its parts on multiple systems to produce the output in quick time.

Concurrent processing is essential where the task involves processing a huge bulk of complex data. Examples include accessing large databases, aircraft testing, astronomical calculations, atomic and nuclear physics, biomedical analysis, economic planning, etc.

Parallelism

Parallelism is the process of processing several sets of instructions simultaneously. It reduces the total computational time. Parallelism can be implemented by using parallel computers, i.e. computers with many processors. Parallel computers require parallel algorithms, programming languages, compilers, and operating systems that support multitasking.

DISTRIBUTED ALGORITHM

A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in different application areas of distributed computing, such as telecommunications, scientific computing, distributed information processing, and real-time process control. Standard problems solved by distributed algorithms include leader election, consensus, distributed search, spanning tree generation, mutual exclusion, and resource allocation.[1]

Distributed algorithms are a sub-type of parallel algorithms, typically executed concurrently, with separate parts of the algorithm being run simultaneously on independent processors and having limited information about what the other parts of the algorithm are doing. One of the major challenges in developing and implementing distributed algorithms is successfully coordinating the behavior of the independent parts of the algorithm in the face of processor failures and unreliable communications links. The choice of an appropriate distributed algorithm to solve a given problem depends on both the characteristics of the problem and the characteristics of the system the algorithm will run on, such as the type and probability of processor or link failures, the kind of inter-process communication that can be performed, and the level of timing synchronization between separate processes.
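In Python, the simplest way to see this kind of parallelism is the standard library's multiprocessing pool, which divides a task across several worker processes and combines the individual outputs. The function here is deliberately trivial:

from multiprocessing import Pool

def square(n):
    # stand-in for an expensive per-item computation
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # 4 workers process chunks
        results = pool.map(square, range(10))  # in parallel, order preserved
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]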

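Returning to association rules: the two standard measures behind market basket analysis, support and confidence, can be computed directly from a transaction list. A toy sketch (the itemsets are hard-coded rather than discovered Apriori-style):

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent): how often the rule holds when it applies
    return support(antecedent | consequent) / support(antecedent)

# Rule: {milk} -> {bread}
print(support({"milk", "bread"}))       # 0.6  (3 of 5 transactions)
print(confidence({"milk"}, {"bread"}))  # 0.75 (3 of the 4 milk baskets)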
INCREMENTAL RULE OF DATA MINING

The mining of association rules on a transactional database is usually an offline process, since it is costly to find the association rules in large databases. With usual market-basket applications, new transactions are generated and old transactions may become obsolete as time advances. As a result, incremental updating techniques should be developed for the maintenance of the discovered association rules, to avoid redoing the mining on the whole updated database. A database may allow frequent or occasional updates, and such updates may not only invalidate existing association rules but also activate new rules; thus it is nontrivial to maintain the discovered rules in large databases. Considering an original database and newly inserted transactions, the following four cases may arise:

Case 1: An itemset is large in the original database and in the newly inserted transactions.
Case 2: An itemset is large in the original database, but is not large in the newly inserted transactions.
Case 3: An itemset is not large in the original database, but is large in the newly inserted transactions.
Case 4: An itemset is not large in the original database nor in the newly inserted transactions.

Since itemsets in Case 1 are large in both the original database and the new transactions, they will still be large after the weighted average of the counts. Similarly, itemsets in Case 4 will still be small after the new transactions are inserted. Thus Cases 1 and 4 will not affect the final association rules. Case 2 may remove existing association rules, and Case 3 may add new association rules. A good rule-maintenance algorithm should thus accomplish the following:

1. Evaluate large itemsets in the original database and determine whether they are still large in the updated database;
2. Find out whether any small itemsets in the original database may become large in the updated database;
3. Seek itemsets that appear only in the newly inserted transactions and determine whether they are large in the updated database.

Measures for Quality of Clustering:

If all the data objects in a cluster are highly similar, then the cluster has high quality. We can measure the quality of clustering by using a dissimilarity/similarity metric in most situations, but there are some other methods to measure the quality of a good clustering when the clusters are alike.

1. Dissimilarity/Similarity metric: The similarity between clusters can be expressed in terms of a distance function, which is represented by d(i, j). Distance functions are different for various data types and data variables: the distance measure differs for continuous-valued variables, categorical variables, and vector variables. The distance function can be expressed as Euclidean distance, Mahalanobis distance, or cosine distance for different types of data.

2. Cluster completeness: Cluster completeness is the essential parameter for good clustering: if any two data objects have similar characteristics, then they should be assigned to the same cluster according to ground truth. Cluster completeness is high if the objects of the same category are in the same cluster.

Let us consider the clustering C1, which contains the sub-clusters s1 and s2, where the members of s1 and s2 belong to the same category according to ground truth. Let us consider another clustering C2, which is identical to C1 except that s1 and s2 are now merged into one cluster. Then, for a clustering quality measure Q based on cluster completeness, C2 will have higher cluster quality than C1; that is, Q(C2, Cg) > Q(C1, Cg).

ADVANCED TECHNIQUES OF DATA MINING

WEB MINING

Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is discovering useful information from the World Wide Web and its usage patterns.

Applications of Web Mining:

1. Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g., FatLens, Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for a particular website and e-service, e.g., landing page optimization.

Web mining can be broadly divided into three different types of techniques:

1. Web Content Mining: Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the group of facts that a web page is designed around, and it can provide effective and interesting patterns about user needs. Text documents are related to text mining, machine learning, and natural language processing, and this mining is also known as text mining. This type of mining performs scanning and mining of the text, images, and groups of web pages according to the content of the input.

2. Web Structure Mining: Web structure mining is the application of discovering structure information from the web. The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining basically shows the structured summary of a particular website. It identifies relationships between web pages linked by information or by direct link connections. Web structure mining can be very useful, for example, to determine the connection between two commercial websites.

3. Web Usage Mining: Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets, patterns which enable you to understand user behaviors. In web usage mining, user access data on the web is collected in the form of logs, so web usage mining is also called log mining.

SPATIAL DATA MINING

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands the unification of data mining with spatial database technologies. It can be used for learning spatial records, discovering spatial relationships and relationships between spatial and nonspatial records, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.

It is expected to have broad applications in geographic data systems, marketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

A central challenge to spatial data mining is the exploration of efficient spatial data mining techniques, because of the large amount of spatial data and the complexity of spatial data types and spatial access methods. Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic information.

The term geostatistics is often associated with continuous geographic space, whereas the term spatial statistics is often associated with discrete space. In a statistical model that manages non-spatial records, one generally assumes statistical independence among different parts of the data. There is no such independence among spatially distributed records, because spatial objects are in fact interrelated, or more exactly spatially co-located, in the sense that the closer two objects are placed, the more likely they are to share the same properties. For example, natural resources, climate, temperature, and economic situations are likely to be similar in geographically closely located regions.

Such a property of close interdependency across nearby space leads to the notion of spatial autocorrelation. Based on this notion, spatial statistical modeling methods have been developed with success. Spatial data mining will develop spatial statistical analysis methods and extend them for large amounts of spatial data, with more emphasis on effectiveness, scalability, cooperation with database and data warehouse systems, enhanced user interaction, and the discovery of new kinds of knowledge.

TEMPORAL MINING

Temporal data mining refers to the process of extracting non-trivial, implicit, and potentially essential information from large sets of temporal data. Temporal data are a series of primary data types, generally numerical values, and temporal data mining deals with gathering beneficial knowledge from temporal data.

The objective of temporal data mining is to find temporal patterns, unexpected trends, or other hidden relations in sequential data, which is composed of a sequence of nominal symbols from an alphabet (referred to as a temporal sequence) and a sequence of continuous real-valued components (called a time series), by utilizing a set of approaches from machine learning, statistics, and database technologies.

Temporal data mining is composed of three major tasks: the description of temporal data, the representation of similarity measures, and the mining operations.

Temporal data mining includes processing time series, generally sequences of data which record values of the same attribute at a sequence of multiple time points. Pattern matching using such data, where one searches for particular patterns of interest, has attracted considerable interest in recent years.

Temporal data mining can include the exploitation of efficient techniques of data storage, quick processing, and quick retrieval methods that have been developed for temporal databases.

An individual step in the process of knowledge discovery in temporal databases that enumerates temporal patterns from, or fits models to, temporal data is called a temporal data mining algorithm.

Temporal data mining is concerned with the analysis of temporal data and with discovering temporal patterns and regularities in sets of temporal data. It also allows the possibility of computer-driven, automatic exploration of the data. There are various tasks in temporal mining, which are as follows:

o Data characterization and comparison
o Clustering analysis
o Classification
o Association rules
o Pattern analysis
o Prediction and trend analysis